Step 1: Select a source
Because red lists are often similar within a region, it makes sense to assign countries to specific data extractors. This keeps the work consistent and lets each person focus on the countries they know best, which saves time and improves data quality. Preferences can be set at the start of the project to make assignments more efficient.
1.1 Pick one country
Select a country to extract data from
Data extractors will select a country based on language proficiency, familiarity with the country or region, and/or personal preference and motivation.
The number of countries assigned to each data extractor will vary depending on the complexity and volume of sources, as not all countries/sources require equal effort.
- Do not select multiple countries at once: pick only one.
- There are no fixed quotas: as you finish one country, you can go on to the next one.
As the project progresses, some sources may remain unclaimed due to complexity or lack of familiarity/motivation (e.g., “Austria has too many red lists, it’s becoming demotivating”). In such cases:
- A meeting will be held to discuss and redistribute these sources if needed.
- If consensus cannot be reached, random assignment may be used.
1.2 Pick a source and report progress
Work on one source
After you select a country, you will start working on one source at a time.
Always track your progress
Indicate your progress by filling in the following columns in the table of sources:
dataExtractor: The name of the team member assigned to and responsible for extracting data from the source. | values:Ivo,Adam,Káča,Sofi.
progress: The current status of the data extraction process for the source. | values:in progress,done.
redlistTypeOfContent: Describes the format in which the red list species and their statuses are presented within the source. For example: text (data is in text), table (data is in structured tables), or table+text (a combination of both). This information is key for deciding between manual and automated extraction methods. | values:table,text,table+text,image.
extractionType: Describes how the data was extracted: manually by a data extractor, in an automated way, or by the NRL and then translated to RegRed’s system. | values:EcoParse,Manual,NRL.
numberOfRedlists: The number of identified redlists in the source. | values: integer numbers ≥ 1.
dateStarted: The date on which the data extraction for the source officially began. | format:YYYY-MM-DD.
fullyExtracted: Indicates whether all relevant red list data from the source has been completely extracted. For example, the value is No if only vertebrates are extracted but the source also contains data for plants. | values:Yes,No.
extractionRemarks: A field for any relevant notes about the extraction process for a specific source, e.g. broken links, data inconsistencies, or other challenges encountered. | values: free text.
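The allowed values listed above can be checked automatically before a row is saved. The following is a minimal sketch, not part of the project's tooling: the helper name check_progress_row and the idea of representing a row as a dictionary are assumptions made for illustration. It also assumes that numberOfRedlists may be 1 or more, since a source can contain a single red list.

```python
import re
from datetime import date

# Allowed values copied from the column descriptions above.
ALLOWED = {
    "dataExtractor": {"Ivo", "Adam", "Káča", "Sofi"},
    "progress": {"in progress", "done"},
    "redlistTypeOfContent": {"table", "text", "table+text", "image"},
    "extractionType": {"EcoParse", "Manual", "NRL"},
    "fullyExtracted": {"Yes", "No"},
}


def _is_iso_date(text: str) -> bool:
    """True if text is a valid calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


def check_progress_row(row: dict) -> list[str]:
    """Return a list of problems found in one row (hypothetical helper).

    Empty or missing fields are skipped, because a row is filled in
    gradually as the extraction progresses.
    """
    problems = []
    for column, allowed in ALLOWED.items():
        value = row.get(column, "")
        if value and value not in allowed:
            problems.append(f"{column}: unexpected value {value!r}")

    count = row.get("numberOfRedlists", "")
    if count != "":
        try:
            valid_count = int(count) >= 1  # assumption: 1 is a valid value
        except (TypeError, ValueError):
            valid_count = False
        if not valid_count:
            problems.append(f"numberOfRedlists: expected an integer >= 1, got {count!r}")

    started = row.get("dateStarted", "")
    if started and not _is_iso_date(started):
        problems.append(f"dateStarted: expected YYYY-MM-DD, got {started!r}")
    return problems
```

An empty list means the row uses only the documented values; anything else can be corrected or noted in extractionRemarks.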
Always check that the file corresponds to the source in the sourceIdentifier, and the sourceTitle. Guide yourself by the sourceTitle, the taxa covered and the year of publication.
You may find the document (PDF, website, dataset) associated with the source under the field fileName in the folder all_sources_to_extract. However, as these were created following a different convention (when the metadata was created), this column may be empty.
If there is no fileName, find the document associated with the source by checking the sourceIdentifier. If the sourceIdentifier takes you to a broken URL, use the sourceTitle to search for the source online. If you still do not find it, check with Ivo and Adam, because if the source is listed, there should be a digital object related to it.
If you find a more complete file for the source (PDF or any digital format) you are extracting, you should replace the current version in the all_sources_to_extract folder with the one you have.
If you find a new file for the source that did not have one (PDF or any digital format), add it to the folder new_sources using the name of the sourceID.
The content in the table tetrapods_sources_to_extract should never be edited. This is because we don’t want to spend time correcting the information in this file as all the correct information will be stored in the file you are extracting.
You should only add details to the fields: dataExtractor, progress, redlistTypeOfContent, extractionType, numberOfRedlists, dateStarted, fullyExtracted, and extractionRemarks.
If there are any comments you want to make about any of the metadata being wrong (e.g., the year is incorrect), you should make them in the field extractionRemarks.
If you find a new source that is not in the table of sources, first check with those involved in the metadata creation (Ivo and Adam) to see why it is not in the list (maybe there is a reason why it was skipped).
If it needs to be added, fill in the details about the new source in the new_sources file and store any associated files in new_sources folder.
If the source should be included in the table of sources, it will need to be added by creating a new line at the end of the file and creating the sourceID as a consecutive number. Ivo or Flo will be in charge of creating it.
- Each member reports on their progress, including any difficulties encountered.
- Plans for the upcoming week are shared, including which country or source they intend to pick next.
1.3 Check if the source has already been extracted by the NRL
Check before starting to extract data.
The National Red List project (https://www.nationalredlist.org/) is one of RegRed’s partners. They have extracted data from many different sources already, and data extractors should make sure not to extract these data again. However, in order for the NRL data to be useful, it will still need to be harmonised by the NRL data harmoniser.
- Check the NRL data and assess whether the source you have is in the NRL database.
- Double-check that the source is actually fully extracted, as sometimes the NRL has a source but covers only part of it (e.g., only mammals, not all tetrapods).
If the source has already been fully extracted by NRL:
- Indicate this in the table of sources by filling in extractionType = NRL.
- The NRL Data Harmoniser will have to adapt the source to RegRed’s structure. You will be done with the source, and you will start over from step 1.2 (Pick a source and report progress).
Else, if the source has not already been fully extracted by NRL:
- Keep working on the following steps.
1.4 Evaluate the suitability of the source for manual extraction
Manual or automated extraction
RegRed uses EcoParse, a tool that streamlines the AI automated method, but not all sources yield reliable results through automation. Before beginning to extract any data, Data Extractors will have to assess whether the source is appropriate for manual or automated extraction.
AI-based extraction using large language models (LLMs) is useful for sources that are difficult to extract manually, e.g., because the information is provided inside text descriptions (and not tables). However, automated extraction depends on scientific species names and threat information being present on the same page.
If the source can be extracted automatically
Indicate this in the table of sources by filling in the extractionType = EcoParse.
Else, if the source will be extracted manually
Indicate in the table of sources that the extractionType = Manual.
If the source provides text descriptions and does not contain tables that are easy to copy and paste.
If the source contains only a few species (e.g., a reptile red list from a country with ~10 species).
If scientific species names and threat status information are not present on the same page of the text (as the LLM will fail to detect this).
More information about source types and the extraction method suited to each is available in the CheatSheet. If you are still in doubt, ask Adam or Flo.
1.5 Create a folder and name it using the sourceID
A unique folder for the source
Whether the data extractor is doing the extraction manually or automatically, they will create a dedicated folder in the tetrapods_extracted_sources folder to store all related files and outputs for the selected source. This folder must represent a unique source. However, it may contain one or more red lists (see next step for details).
Folder Naming Convention
Name the folder using the sourceID provided in the table of sources. This should be a numeric value. For example, if sourceID = 2, the folder will be named “2”.
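For those who prefer to create the folder from a script, the naming convention can be sketched as follows. This is only an illustration: the function name create_source_folder and the root argument (the local path that contains tetrapods_extracted_sources) are assumptions, not part of the project's tooling.

```python
from pathlib import Path


def create_source_folder(root: Path, source_id: int) -> Path:
    """Create (if needed) the folder named after the sourceID and return it.

    `root` is assumed to be the directory that holds
    tetrapods_extracted_sources on your machine.
    """
    if not isinstance(source_id, int) or source_id < 1:
        raise ValueError("sourceID should be a positive integer")
    folder = root / "tetrapods_extracted_sources" / str(source_id)
    # exist_ok=True makes the call safe to repeat for the same source.
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```

For example, create_source_folder(Path("."), 2) produces tetrapods_extracted_sources/2 under the current directory.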
1.6 Identify the number of red lists in the source and report progress
Multiple red lists per source
Each source may contain multiple red lists, which are loosely defined as lists of species assessed for extinction risk within a specific geographic and taxonomic scope.
A red list is uniquely identified by the combination of:
taxonomicScope: Each red list must have a unique taxonomic scope. As the project focuses on tetrapods, begin by identifying the class of organisms in the source (e.g., Amphibia, Reptilia, Aves, Mammalia). The maximum taxonomicScope is class; the minimum can be any lower level. To determine this, consider the following:
- Primary division: Split by class (Amphibia, Reptilia, Aves, Mammalia) as the default level of taxonomicScope.
- Lower-level division: If the source provides distinct assessments within the same class (e.g., separate evaluations for breeding vs. non-breeding birds), you must divide further into order, family, or other relevant groups. However, do not divide into lower taxonomic levels unless the source clearly distinguishes them with separate assessments (e.g., a species has “LC” for a breeding population and “EN” for a non-breeding population).
redlistLocation: Each red list must have a unique location. This can be a country, stateProvince, county, locality, or a named custom region.
redlistDate: Each red list must have a unique date. We will use the year the assessment was conducted.
If you are uncertain about how to define the taxonomicScope for a source, consult Flo for guidance.
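The idea that a red list is identified by the combination of the three fields above can be illustrated with a small sketch. The names RedlistKey and duplicated_keys are hypothetical, and the values in the usage note are illustrative only, not taken from a real source.

```python
from collections import Counter
from typing import NamedTuple


class RedlistKey(NamedTuple):
    """The combination of fields that identifies one red list."""
    taxonomicScope: str   # e.g. a class such as "Aves"
    redlistLocation: str  # country, stateProvince, county, locality, or custom region
    redlistDate: int      # year the assessment was conducted


def duplicated_keys(keys: list[RedlistKey]) -> list[RedlistKey]:
    """Return every key that appears more than once in the list."""
    counts = Counter(keys)
    return [key for key, count in counts.items() if count > 1]
```

If two red lists from the same source end up with the same key, for instance RedlistKey("Aves", "Austria", 2017) twice, duplicated_keys returns that key. This usually means the taxonomicScope needs to be divided further or the lists are in fact one red list.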
Some sources (e.g., Fauna, Vertebrates, Amphibians and Reptiles) may contain unsorted species tables that mix several taxonomic classes. Manually separating these can be time-consuming.
If the source is easy to extract as a whole but difficult to split by class, you can use our split_script to automatically divide the Excel file into multiple class-specific files.
However, to be able to do this, you must first have extracted all the information in the red list. Once you have done so, see the instructions in section 2.4 Filling the taxon_assessment table to use the script to divide the file into classes.
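To make clear what splitting by class means, here is a minimal sketch of the operation. It is not the project's split_script and does not read or write Excel files: it only groups already extracted rows, represented as dictionaries, by a column that holds the taxonomic class. The function name and the column name "class" are assumptions for illustration.

```python
from collections import defaultdict


def split_rows_by_class(rows: list[dict], class_column: str = "class") -> dict[str, list[dict]]:
    """Group extracted rows by taxonomic class (hypothetical helper).

    Each key of the result would correspond to one class-specific file.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[class_column]].append(row)
    return dict(groups)
```

A mixed table containing Bufo bufo (Amphibia), Lutra lutra (Mammalia), and Bubo bubo (Aves) would be returned as three groups, one per class, which is why the complete table has to be extracted before it is split.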
Report your progress
To help track the scope and complexity of each source and inform extraction planning, after you have identified the number of red lists in the source, indicate your progress by filling in the following column in the table of sources:
numberOfRedlists: The number of identified redlists in the source | values:1,2,3,4, …,N.
This number will also be used to name the red list files (see next step).
1.7 Create and name red list files using redlistID
Download the templates and work on them on your local machine. Do not work online in the folder of the source: the documents stored there should be the fully extracted versions.
Be aware that template files may be updated over time, therefore you should not reuse previously downloaded versions. You should always check the status of the file online and download the latest version of the template.
Instructions for red list files
For each red list identified within a source, create a copy of the template for data extraction following these steps:
File Creation
Make a local copy on your computer of the template for data extraction for each red list identified in the source.
File Naming Convention
Name each file using the sourceID followed by an underscore and a sequential number. For example: sourceID_1, sourceID_2, …, sourceID_n, where n is the total number of red lists in that source.
File Location
Once you have finished extracting the source (and your peer has checked the file), you will save all files in the folder named after the sourceID.
Double-Check
Make sure the numbering matches the total number of red lists recorded in the table of sources under numberOfRedlists.
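The naming convention and the double-check can be sketched in a few lines. The function names are hypothetical, and file extensions are deliberately left out because this guide does not specify them here.

```python
def redlist_file_names(source_id: int, number_of_redlists: int) -> list[str]:
    """Build the expected file names sourceID_1 ... sourceID_n (no extension)."""
    if number_of_redlists < 1:
        raise ValueError("a source must contain at least one red list")
    return [f"{source_id}_{i}" for i in range(1, number_of_redlists + 1)]


def numbering_matches(file_names: list[str], source_id: int, number_of_redlists: int) -> bool:
    """True if the given names are exactly the expected ones, in any order."""
    return sorted(file_names) == sorted(redlist_file_names(source_id, number_of_redlists))
```

For a source with sourceID = 2 and numberOfRedlists = 3, the expected names are 2_1, 2_2, and 2_3. If numbering_matches returns False, either a file is missing or the value recorded in the table of sources needs to be updated.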