Step 2: Extract the data

Everyone is involved in the extraction
The RegRed database has six tables that relate to each other through IDs. They will be filled as follows.

The data extractors will extract data manually using the data_extraction_template for the tables source, redlist, and location, and will extract data either manually or automatically, depending on the extractionType detected, for the taxon_assessment table. They will also have to create a file for mapping verbatimStatusCode to a standardised statusCode, following the mapping_sources_template.
The spatial data generator will be in charge of verifying the location table, and generating the spatial polygons for all redlistLocations following the geographic_entities standard.
The taxonomy harmoniser will be in charge of the taxonomy table, and will have to reconcile the scientific names in the extracted data to match a taxonomic backbone.

Golden rules when using the data extraction template

Do not modify the data_extraction_template under any circumstances. If you find an error and changes are needed, report it to Flo.
Always refer to this protocol to fill the template. Do not assume any decisions that are not contemplated in this document. If you have doubts about the protocol, contact Flo.
Always check the “Definitions” tab and do not assume any definitions on your side.
Follow the respective controlled vocabulary. If the value you want to add in a column is not there, you have to report this to Káča or Flo. If needed, the option will be added to the template.
Paste values by using Ctrl+Shift+V or “Paste option: Keep Text Only” to avoid any formatting in the Excel files.
Do not copy and paste values blindly. Always verify what you are copying. Remove double spaces and any spaces at the start or end of values (e.g., “ Mammals” or “Mammals ” should both become “Mammals”).
Do not use the value “Unknown” for information that is missing. Instead, just leave the field empty. Use Unknown only in those cases that have the value as part of the controlled vocabulary.
Verify all the data you extract. Errors can occur, so a careful review is essential to ensure data integrity.
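
The whitespace rule above can also be checked programmatically before you paste a value. This is a minimal sketch, not part of the template or of any RegRed tool; the function name `clean_value` is hypothetical.

```python
import re


def clean_value(value: str) -> str:
    """Collapse runs of whitespace into one space and trim both ends."""
    # \s also matches tabs and non-breaking spaces that often come from PDFs.
    return re.sub(r"\s+", " ", value).strip()


print(clean_value("  Mammals "))                  # Mammals
print(clean_value("Amphibians  and   Reptiles"))  # Amphibians and Reptiles
```

A cleaned value still has to be compared against the controlled vocabulary by hand; the sketch only removes stray spaces.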

Notes on extracting text/tables from PDFs

If the text in the PDF is readable (copy/pasteable), you can use Tabula, Excel, or Ctrl+C.
If the text in the PDF is not readable, you can use OCR-Software.com to make it readable.
If the text in the PDF is not copyable (e.g., it’s encrypted or corrupted), you can ask Adam to help you convert the full PDF to a readable version.

2.2 Filling the `source` table

What happens if I have multiple classes in one source?

Some sources might include unsorted tables of species (e.g. Fauna, Vertebrates, Amphibians and Reptiles). Separating species into classes by hand is a difficult process. If you encounter a source that is simple to extract all at once, but difficult to separate by class, you can extract the data for all species as a single redlist and use our split_script to split the xlsx file into multiple files.

Instructions to use the script
1. Copy the script into the working directory where your xlsx file with the unsorted taxon_assessment sheet is located.
2. The script is designed to work with files based on our data extraction template. Fill out all other sheets beforehand.
3. Follow the instructions in the script. The output will be multiple files named filename_Class.xlsx.
4. The only sheet modified by the script is taxon_assessment; all other sheets stay the same.
5. Finally, manually fix the taxonomicScope column and the fileName.

How do I fill the sourceIdentifier?

The identifier is a resolvable HTTP URI for DOIs and URLs, and a URN for ISBNs and ISSNs.

DOI: https://doi.org/<doi>, e.g., https://doi.org/10.2909/9a752c28-cb5f-4ead-9922-2a8173e0306b
ISBN: urn:isbn:<isbn>, e.g., urn:isbn:0-486-27557-4
ISSN: urn:issn:<issn>, e.g., urn:issn:0953-4563
URL to catalog: e.g., https://portals.iucn.org/library/node/10315
URL to source: e.g., https://example.org/resource.pdf

What happens if there are several possibilities for a sourceIdentifier?

You may have more than one potential identifier, e.g., a DOI and also a URL.

Always follow this hierarchy of preferred identifiers:

DOI
ISSN/ISBN
URL to the electronic catalog entry (e.g., a national library)
URL to the PDF or interactive resource.

URLs should only be used when the source has neither a DOI nor an ISSN/ISBN. If you use a URL, make sure it resolves correctly.
If there are additional ISBNs or URLs directly linked to the PDF or interactive resource, please include all of them in the Zotero shared library when adding the bibliographic citation (e.g., under ISBN, URL, or in Extra).
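
The formats and the hierarchy above can be summarised in a short sketch. It is only an illustration: the dictionary keys (`doi`, `issn`, `isbn`, `catalog_url`, `source_url`) and both function names are hypothetical, and because the protocol lists ISSN and ISBN at the same level, the order between those two is an assumption.

```python
from typing import Optional

# Preferred order from the hierarchy above (ISSN before ISBN is an assumption).
PREFERENCE = ["doi", "issn", "isbn", "catalog_url", "source_url"]


def format_identifier(kind: str, value: str) -> str:
    """Format one raw value following the patterns listed in this protocol."""
    value = value.strip()
    if kind == "doi":
        return f"https://doi.org/{value}"
    if kind in ("isbn", "issn"):
        return f"urn:{kind}:{value}"
    return value  # URLs are used as they are


def choose_identifier(candidates: dict) -> Optional[str]:
    """Return the formatted identifier with the highest preference, if any."""
    for kind in PREFERENCE:
        value = candidates.get(kind)
        if value:
            return format_identifier(kind, value)
    return None


print(choose_identifier({
    "isbn": "0-486-27557-4",
    "source_url": "https://example.org/resource.pdf",
}))  # urn:isbn:0-486-27557-4
```

The sketch does not check that a URL resolves; that check remains a manual step.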

Check the definition, examples and recommendations for each field in the Definitions tab of the template for data extraction.

`sourceID`	find the value in the column `sourceID` of the table of sources, check it, and add it. This should also be the name of the folder you created in step 1.5 (Create a folder and name it using the `sourceID`)
`sourceTitle`	find the value in the column `sourceTitle` of the table of sources, check it, and add it
`sourceIdentifier`	find the value in the column `sourceIdentifier` of the table of sources, check that it is correct, and add it following the recommendations under “What happens if there are several possibilities for a sourceIdentifier?”
`sourceDate`	the table of sources will only have the year as `sourceDate`, you must check if there is a full date available, and add the value. Full dates are preferred (in the format `YYYY-MM-DD`), but you can also add year (`YYYY`), or month and year (`YYYY-MM`)
`sourceLanguage`	find the value in the column `sourceLanguage` of the table of sources, and select values from the list of controlled vocabulary (this could be a list)
`sourceLicense`	select the value from the list of controlled vocabulary (could be unknown, but it is very important that you try your best to find it). This value will be used for the citation of the source
`sourceType`	select the value from the list of controlled vocabulary
`sourceFormat`	select one or more values from the list of controlled vocabulary
`extractionType`	select the value from the list of controlled vocabulary
`sourceCategory`	select the value from the list of controlled vocabulary
`sourcePublisher`	find the publisher or publishers in the source, and add them separated by “`|`”. Use the full name of each publisher followed by the acronym in brackets. A publisher can be an institution or a person
`sourcePublisherType`	select from the list of controlled vocabulary. If there are multiple publishers, separate the values by “`|`” following the same order as `sourcePublisher`
`bibliographicCitation`	use Zotero to create this value by entering the source in the shared RegRed Zotero library (Sources Collection). Export the bibliographic citation in APA 7 style (English UK) and use that value (more information in the Citation Rules appendix)
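
Two of the rules in this table are easy to check mechanically: the accepted `sourceDate` formats, and the requirement that `sourcePublisher` and `sourcePublisherType` list the same number of entries in the same order. The sketch below is an illustration under those assumptions; the function names are hypothetical and nothing here replaces the Definitions tab.

```python
import re
from datetime import date

# Accepts YYYY, YYYY-MM or YYYY-MM-DD, as described for sourceDate.
DATE_PATTERN = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def is_valid_source_date(value: str) -> bool:
    """Check the format and that the month and day exist in the calendar."""
    match = DATE_PATTERN.fullmatch(value)
    if not match:
        return False
    year, month, day = match.groups()
    try:
        # Missing parts default to 1 so partial dates can still be checked.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True


def same_number_of_entries(publishers: str, publisher_types: str) -> bool:
    """Pipe-separated lists must have one type per publisher."""
    return len(publishers.split("|")) == len(publisher_types.split("|"))


print(is_valid_source_date("2012-11-30"))  # True
print(is_valid_source_date("30/11/2012"))  # False
```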

2.3 Filling the `redlist` table

How to fill the redlistIdentifier or statusAssessmentTypeIdentifier when there is more than one file or document?

See the earlier questions “How do I fill the sourceIdentifier?” and “What happens if there are several possibilities for a sourceIdentifier?”. The same rules apply to these identifiers.

How to deal with the fields statusAssessmentType and statusMappingID

The statusAssessmentType is the type of assessment system used to define the status codes for the taxa in the redlist. Most redlists follow IUCN criteria; however, many countries create their own protocols to assign conservation statuses. This is why we need to translate all statuses to a common standard.
The statusMappingID is the ID for the statusMappingSource that has the mapping (i.e., translation) between statuses.

All the verbatimStatusCodes for the species in our database will be mapped to the latest version of the IUCN regional statusCodes. This is “IUCN. (2012). Guidelines for application of IUCN Red List criteria at regional and national levels: Version 4.0. IUCN. https://portals.iucn.org/library/node/10336”, and has statusMappingSourceID = 1.

If the statusAssessmentType = IUCN but it is not the standard version, or if the statusAssessmentType = Non-IUCN, you will also have to create the status mapping source (see below 2.3.1 Handling Status Mapping Sources).

Check the definition, examples and recommendations for each field in the Definitions tab of the template for data extraction.

`redlistID`	fill this value according to the source’s `sourceID` and the number of red lists the source has (see `numberOfRedlists` in the table of sources), e.g., if `sourceID` = `2`, and you are filling the information for the first red list in the source, then the `redlistID` should be `2_1`
`redlistTitle`	this field will be generated automatically as “Redlist of <`taxonomicScope`> of <`redlistLocation`> (<year of (`redlistDate`)>)”. Therefore, these fields must be filled first to build the `redlistTitle`
`redlistIdentifier`	this may be the same value as `sourceIdentifier`, but if the red list has a different identifier, use it. Check that it is correct and add it following the recommendations suggested for other identifiers.
`redlistDate`	this may be the same value as `sourceDate`, but if each red list has a different date, add that value. Use only year here (`YYYY`)
`geospatialScope`	select from the list of controlled vocabulary
`taxonomicScope`	select from the list of controlled vocabulary
`isTaxonomicScopeFullyReported`	select from the list of controlled vocabulary. Fill with `Yes` if the number of species reported on the redlist is the same as the number of species in the `taxonomicScope`
`redlistLocation`	this must be a unique value, not a list. Check if the location refers to a `country`, `stateProvince`, or `county`, if it is a location smaller than a `county`, if the location is a protected area or if it is an island, and search for the correct name in the table of geographic_entities
`statusAssessmentType`	select from the list of controlled vocabulary. If it is `Non-IUCN`, you must create the status mapping source following the template
`statusAssessmentTypeCitation`	use Zotero to create this field by entering the source in the shared RegRed Zotero library (Assessment systems Collection). Export the bibliographic citation in APA style and use that value (see Appendix)
`statusAssessmentTypeIdentifier`	check if the mapping source has an identifier and add the value
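`statusMappingSourceID`	use `1` if the red list follows the latest IUCN regional guidelines (v4.0). Otherwise, look for an existing ID in the mapping_sources_used file, and create a new one only if none exists (see 2.3.1 Handling Status Mapping Sources)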
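
The patterns for `redlistID` and `redlistTitle` described in this table can be illustrated as follows. This is only a sketch of the naming patterns: the title is generated automatically by the project, so the function below is not that generator, and both function names are hypothetical.

```python
def build_redlist_id(source_id, redlist_number) -> str:
    """Join the sourceID and the position of the red list in the source."""
    return f"{source_id}_{redlist_number}"


def build_redlist_title(taxonomic_scope: str, redlist_location: str, redlist_date) -> str:
    """Follow the pattern 'Redlist of <scope> of <location> (<year>)'."""
    year = str(redlist_date)[:4]  # keep only the year part
    return f"Redlist of {taxonomic_scope} of {redlist_location} ({year})"


print(build_redlist_id(2, 1))                               # 2_1
print(build_redlist_title("Mammals", "Belarus", "1993"))    # Redlist of Mammals of Belarus (1993)
```

The example shows why `taxonomicScope`, `redlistLocation`, and `redlistDate` must be filled before the title can be built.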

2.3.1 Handling Status Mapping Sources

The following details will help you define when and how to select or create a status mapping source so that all verbatimStatusCodes in RegRed are consistently mapped to the latest IUCN regional status codes.

Decision tree

All verbatimStatusCodes will be mapped to the IUCN regional criteria, and this mapping has statusMappingSourceID = 1:

IUCN (2012). Guidelines for application of IUCN Red List criteria at regional and national levels: Version 4.0. IUCN. https://portals.iucn.org/library/node/10336

If the assessment does not use the latest IUCN regional guidelines (v4.0), you must find or create an appropriate mapping source (statusAssessmentTypeCitation and statusMappingSourceID). Before creating a new source, always check if the mapping source already exists in RegRed. If it does, reuse its existing statusMappingSourceID.
If the mapping source does not exist in the mapping_sources_used, you will have to create a new one. To do this, you need to first check if the assessment uses IUCN criteria or not, and then choose one of the following five different scenarios to create a statusMappingSourceID.

flowchart TD
  Start(["assessment"]) --> IUCN{"Does it use IUCN v4.0 (2012)?"}
  IUCN -- "Yes" --> Use1(["Use statusMappingSourceID = 1"])
  IUCN -- "No" --> Exists{"Does the mapping source exist in RegRed?"}
  Exists -- "Yes" --> Reuse(["Reuse existing statusMappingSourceID"])
  Exists -- "No" --> IsIUCN{"Does the source use IUCN criteria?"}
  IsIUCN -- "Yes" --> Choose1{"Choose a scenario"}
  IsIUCN -- "No" --> Choose2{"Choose a scenario"}

  %% Scenario D
  Choose1 -- "(1) IUCN old version" --> D0
  subgraph D0["IUCN with modifications"]
    direction TB
    D1(["Create Excel and PDF"])
    D2(["statusAssessmentTypeCitation = IUCN version"])
    D3(["statusAssessmentTypeIdentifier = DOI or URL or catalog ID"])
    D4(["Create new statusMappingSourceID"])
    D4 --> D1 --> D2 --> D3
  end  
  
   %% Scenario E
  Choose1 -- "(2) IUCN unknown version" --> E0
  subgraph E0["Unknown IUCN"]
    direction TB
    E1(["Create Excel and PDF"])
    E2(["statusAssessmentTypeCitation = Unknown"])
    E3(["statusAssessmentTypeIdentifier = leave this field empty"])
    E4(["Create new statusMappingSourceID"])
    E4 --> E1 --> E2 --> E3
  end  

  %% Scenario A
  Choose2 -- "(3) the source provides mapping" --> A0
  subgraph A0["Mapping in the same source"]
    direction TB
    A1(["Create Excel and PDF"])
    A2(["statusAssessmentTypeCitation = cite redlist source"])
    A3(["statusAssessmentTypeIdentifier = DOI or URL or catalog ID"])
    A4(["Create new statusMappingSourceID"])
    A4 --> A1 --> A2 --> A3
  end

  %% Scenario B
  Choose2 -- "(4) an official mapping was published" --> B0
  subgraph B0["Official mapping source"]
    direction TB
    B1(["Create Excel and PDF"])
    B2(["statusAssessmentTypeCitation = cite official document"])
    B3(["statusAssessmentTypeIdentifier = DOI or URL"])
    B4(["Create new statusMappingSourceID"])
    B4 --> B1 --> B2 --> B3
  end

  %% Scenario C
  Choose2 -- "(5) no official or mixed sources" --> C0
  subgraph C0["Custom mapping"]
    direction TB
    C1(["Create Excel and PDF"])
    C2(["statusAssessmentTypeCitation = cite yourself"])
    C3(["statusAssessmentTypeIdentifier = GitHub URL"])
    C4(["Create new statusMappingSourceID"])
    C4 --> C1 --> C2 --> C3
  end
  
  %% --- CUSTOM COLORS ---  
  %% Define a class for decision nodes (Exists + Choose)  
  classDef decision_no fill:#FFE9B3,stroke:#CC9900,stroke-width:2px,color:#000;  
  classDef decision_yes fill:#DFF5D4,stroke:#4CAF50,stroke-width:2px,color:#000; 
  classDef citation fill:#F88379,stroke:#811331,stroke-width:2px,color:#000; 
  classDef boxes fill:#f2f3f5,stroke:#363636,stroke-width:0.5px,color:#000; 
  
  %% Apply the class to the nodes  
  class IsIUCN,Exists,Choose2 decision_no;
  class Use1,Reuse,Choose1 decision_yes;
  class A2,B2,C2,D2,E2 citation;
  %%class A0,A1,A2,A3,A4,B0,B1,B2,B3,B4,C0,C1,C2,C3,C4 boxes;

Where do I find the statusMappingSourceID used?

You can find the statusMappingSourceID in the mapping_sources_used file.

Instructions to create a new statusMappingSourceID
In all scenarios, you will have to create a separate Excel file with its respective documentation PDF.

Scenario 1: The assessment was done following IUCN guidelines that are not the latest version. Some modifications may be needed (e.g., EX = RE). You will cite the version of IUCN used in the statusAssessmentTypeCitation (and the corresponding statusAssessmentTypeIdentifier), and you will explain any modifications or adaptations in the mapping document.
Scenario 2: The assessment was done following IUCN guidelines, but it’s not evident from the source which version was used. The statusAssessmentTypeCitation and the statusAssessmentTypeIdentifier should be Unknown.
Scenario 3: The assessment was not done following IUCN guidelines, yet the same source provides the mapping of the verbatimStatusCodes to the statusCodes. You will use the source’s citation (bibliographicCitation) as the statusAssessmentTypeCitation, and the source’s identifier (sourceIdentifier) as the statusAssessmentTypeIdentifier.
Scenario 4: The assessment was not done following IUCN guidelines, yet there is a publication (e.g., a manuscript or an official document) that has already done the mapping to the IUCN categories. You will create the citation for this document in Zotero and use it as the statusAssessmentTypeCitation (including the identifier as the statusAssessmentTypeIdentifier).
Scenario 5: The assessment was not done following IUCN guidelines, and there is no official document that maps the statuses to the IUCN criteria. You may also have a mix of different sources to define the mapping. In this case, you will cite yourself and use the URL from our GitHub repository as the identifier. Contact Flo to store the Excel and PDF in RegRed’s GitHub repository. You will need this to get a statusAssessmentTypeIdentifier.

If you have doubts about how to create the mapping, you can check with Flo how to define the statuses. Always try to come up with something, so that we can finalise together.
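
The decision tree and the five scenarios can be read as a single chain of checks. The sketch below restates that logic for illustration only; the function name and its boolean parameters are hypothetical, and a mapping built from mixed sources is treated as Scenario 5 simply by leaving the other flags false.

```python
def choose_mapping_route(
    uses_iucn_v4: bool,
    existing_id=None,
    uses_iucn_criteria: bool = False,
    version_known: bool = False,
    source_provides_mapping: bool = False,
    official_mapping_published: bool = False,
) -> str:
    """Return the action or scenario that the decision tree leads to."""
    if uses_iucn_v4:
        return "Use statusMappingSourceID = 1"
    if existing_id is not None:
        # The mapping source is already listed in mapping_sources_used.
        return f"Reuse statusMappingSourceID = {existing_id}"
    if uses_iucn_criteria:
        return "Scenario 1" if version_known else "Scenario 2"
    if source_provides_mapping:
        return "Scenario 3"
    if official_mapping_published:
        return "Scenario 4"
    return "Scenario 5"


print(choose_mapping_route(uses_iucn_v4=True))
# Use statusMappingSourceID = 1
print(choose_mapping_route(uses_iucn_v4=False, uses_iucn_criteria=True))
# Scenario 2
```

Scenarios 1 to 5 all end the same way: create the Excel file and the PDF, then assign a new statusMappingSourceID.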

Instructions to create the Excel table and PDF document for the mapping source
Creating the mapping source involves preparing two documents:

The Excel table status_mapping_source_X

This file will store the equivalent values of statusCode for every verbatimStatusCode.
You will create it using the mapping_sources_template_xlsx.
You should name the file as follows: <status_mapping_source_X.xlsx>, with X = statusMappingSourceID.
You should store the file in the status_mapping_sources folder, and name the folder after the statusMappingSourceID.

Fields to be filled in:

statusMappingID - fill this with the ID of the status_mapping_source_X (it should be the same in every row).
verbatimStatusCode - fill this with the original status code used in the source (slightly modified only if needed).
statusCode - the IUCN status codes (e.g., VU) are pre-filled in the template.
statusCategory - the names of the IUCN status categories (e.g., Vulnerable) are pre-filled in the template.
statusMappingBy - select your name from the list of controlled vocabulary.
statusMappingRemarks - fill with any specifications about your decisions.

The PDF document status_mapping_source_X

This file will have a verbal description and a list of the sources used and decisions made to define equivalent values of statusCode for verbatimStatusCode (e.g., to determine that X1 = CR).
You will create it using the mapping_sources_template_docx, and then save it as a PDF.
You should name the file as follows: <status_mapping_source_X.pdf>, with X = statusMappingSourceID.
You should store the file in the status_mapping_sources folder with the correct statusMappingSourceID.
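
The naming rules for the two files can be sketched as below. The layout is an assumption: the sketch takes "name the folder after the statusMappingSourceID" to mean a subfolder named exactly like the ID inside status_mapping_sources. The function name is hypothetical.

```python
from pathlib import PurePosixPath


def mapping_source_paths(status_mapping_source_id, root: str = "status_mapping_sources"):
    """Return the expected paths of the Excel table and the PDF document."""
    # Assumed layout: <root>/<ID>/status_mapping_source_<ID>.<extension>
    folder = PurePosixPath(root) / str(status_mapping_source_id)
    stem = f"status_mapping_source_{status_mapping_source_id}"
    return folder / f"{stem}.xlsx", folder / f"{stem}.pdf"


xlsx_path, pdf_path = mapping_source_paths(7)
print(xlsx_path)  # status_mapping_sources/7/status_mapping_source_7.xlsx
print(pdf_path)   # status_mapping_sources/7/status_mapping_source_7.pdf
```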

Content of the document:

Author: Modelling of Biodiversity Lab (MOBI Lab), Faculty of Environmental Sciences (FZP CZU), <your name>, and <the names of anyone else who assisted you>.
Place and date: e.g., Prague, 30 November 2025.
Title: “Mapping of <place> status codes from <year> to the standard of IUCN (2012) version 3.1 (2nd edn)” (e.g. “Mapping of Belarus status codes from 1993 to the standard of IUCN (2012) version 3.1 (2nd edn)”).
Summary: A brief description that clarifies the title or adds more information and explains whether you are using any bibliographical material to guide your interpretation of the categories. In some cases, two or more geographical regions for which mappings already exist use the same categories with identical descriptions but assign them different numbers or letters.
Bibliography: The documents and sources you used to define the criteria.

Final step:
When you are done with the mapping source (Excel and PDF), contact Flo to store the folder in RegRed’s GitHub repository. This repository will be archived in the future in Zenodo.

If you got this far and you still have doubts about status mapping sources, contact Sofi.

2.4 Filling the `location` table

What sources does RegRed use for geographic locations?

The geographic_entities are based on the following databases:

geoBoundaries: A global database of political administrative boundaries. See https://www.geoboundaries.org/visualize.html.
WDPA: The World Database on Protected Areas, the most comprehensive global database on terrestrial and marine protected areas. See https://www.protectedplanet.net/en/search-areas?geo_type=region.
Global Islands: The Global Island Database is a global shoreline and associated global islands database. See https://doi.org/10.1080/1755876X.2018.1529714.

Where do I find the standardised information of geographic locations?

Always refer to the geographic_entities website: https://regred-project.github.io/geographic_entities/.

What is a custom region?

A region for RegRed is an official administrative unit. It could be a country (adm0), stateProvince (adm1), county (adm2), or any administrative unit beyond that, which is officially recognised by a country. A custom region is a location that does not correspond to such a unit, for example, one that spans or combines several administrative units.

Since geoBoundaries provides spatial geometries only for adm0, adm1, and adm2, regions representing adm3 or higher may be confused with custom regions. If in doubt, contact Gabriel early in the extraction period to make a decision.

Examples of a custom region
“Spanish territory without the Islas Canarias”. This group of administrative units does not exist as an administrative unit; therefore, it is a custom region.

Examples of locations that look like custom regions but are not
“Provincia Antártica Chilena” is a county (adm2), with two municipalities (adm3): “Cabo de Hornos” and “Antártica”. Separately, they are official administrative regions absent from geoBoundaries; therefore, they are not custom regions.

Instructions to use the geographic entities file
If you are looking for

a country (ADM0): geoboundaries_country.
a stateProvince or second order administrative region (ADM1): geoboundaries_stateProvince.
a county or the lowest administrative region possible (ADM2): geoboundaries_county.
a locality that is a protected area: wdpa_sep2025.
an island that is a country: geoboundaries_country.
an island that is a stateProvince or second order administrative region: geoboundaries_stateProvince.
an island that is a county or the lowest administrative region possible: geoboundaries_county.
an island that does not fit any of the above: global_islands.
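
The list above works like a lookup: the administrative level decides the table first, and global_islands is the fallback for islands only. A minimal sketch of that lookup, with a hypothetical function name and hypothetical location-type labels:

```python
# Table names are taken from the list above; the keys are illustrative labels.
ENTITY_TABLE = {
    "country": "geoboundaries_country",
    "stateProvince": "geoboundaries_stateProvince",
    "county": "geoboundaries_county",
    "protected area": "wdpa_sep2025",
}


def entity_table_for(location_type: str, is_island: bool = False):
    """Return the geographic_entities table to search, or None if nothing fits."""
    if location_type in ENTITY_TABLE:
        # An island that is also a country, stateProvince or county lands here.
        return ENTITY_TABLE[location_type]
    if is_island:
        return "global_islands"
    return None  # no option fits: follow the instructions for unresolved cases


print(entity_table_for("country", is_island=True))  # geoboundaries_country
print(entity_table_for("other", is_island=True))    # global_islands
```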

What happens if I find an error or if none of these options fits my case?

Contact Gabriel. Include the name of the locality, a description, and a picture (screenshot) of an accompanying map if available. The description should be as precise as possible. For example, “The Spanish territory without the Canary Islands”.

Check the definition, examples and recommendations for each field in the Definitions tab of the template for data extraction.

`locationID`	fill this value according to the geographic_entities `locationID` value
`geometry`	do not fill this field
`geometrySource`	do not fill this field
`verbatimSRS`	do not fill this field
`footprintSRS`	do not fill this field
`continent`	fill this value according to the geographic_entities `continent` value
`country`	fill this value according to the geographic_entities `country` value
`countryCode`	fill this value according to the geographic_entities `countryCode` value
`stateProvince`	fill this value according to the geographic_entities `stateProvince` value
`county`	fill this value according to the geographic_entities `county` value
`locality`	fill this value according to the geographic_entities `locality` value
`isCustomRegion`	select from the list of controlled vocabulary. This will be `Yes` if the `redlistLocation` spans multiple administrative regions and therefore a custom polygon is needed for the geometry, e.g., “The Carpathians”

2.4 Filling the `taxon_assessment` table

Check the definition, examples and recommendations for each field in the Definitions tab of the template for data extraction.

`statusID`	do not fill this field
`verbatimIdentification`	fill this value with the exact name as it appears in the red list. This could be a species or a subspecies name
`verbatimStatusCode`	fill this value with the original abbreviated status category
`statusCriteria`	fill this value with the criteria used to define the `verbatimStatusCode` of the `verbatimIdentification` (if provided in the red list). This value should be the original; you shouldn’t modify it

How to deal with a red list with multiple taxonomic classes?

If you have a red list that has multiple classes in one (e.g., Fauna, Vertebrates, Amphibians & Reptiles) you can use the split_script to automatically divide the Excel file into multiple class-specific files.

Instructions to use the split_script
The script is located in our shared OneDrive at code/Splitter/split_script.R.

Copy the script into the working directory where your .xlsx file with the unsorted taxon_assessment sheet is located.
The script is designed for files that follow our data extraction template. Make sure ALL fields are completed beforehand; this should be the last step.
Follow the instructions provided within the script.
The script will generate multiple files named filename_<class>.xlsx. Only the taxon_assessment sheet will be modified; all other sheets remain unchanged.
After the split, manually correct the taxonomicScope column and update the filename field as needed.
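
To show what the split does conceptually, here is a small sketch of grouping taxon_assessment rows by class and naming one output file per class. It is not the split_script (which is an R script) and it does not read or write Excel files. The column name `class`, the function name, and the example rows with placeholder status codes are all assumptions made for illustration.

```python
from collections import defaultdict


def split_by_class(rows, base_name: str, class_column: str = "class"):
    """Group rows by taxonomic class and map each group to an output file name."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[class_column]].append(row)
    # One file per class, following the filename_<class>.xlsx pattern.
    return {f"{base_name}_{cls}.xlsx": group for cls, group in groups.items()}


# Invented example rows; X1 and X2 are placeholder status codes.
rows = [
    {"verbatimIdentification": "Rana temporaria", "verbatimStatusCode": "X1", "class": "Amphibia"},
    {"verbatimIdentification": "Lacerta agilis", "verbatimStatusCode": "X2", "class": "Reptilia"},
    {"verbatimIdentification": "Bufo bufo", "verbatimStatusCode": "X1", "class": "Amphibia"},
]
for file_name, group in split_by_class(rows, "filename").items():
    print(file_name, len(group))
```

The manual correction of taxonomicScope and the file name afterwards is still required, because the split itself does not update those values.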

2.4.1 Instructions to extract the `verbatimIdentification` and `verbatimStatusCode` automatically

Note

This section needs to be developed, but if you have questions, ask Adam.
