Some of the values in geographic distribution are incorrect. This happens for two reasons.
First of all, they include countries gathered from “country checklists”.
For example, https://eol.org/pages/326384/data?predicate_id=941 lists Argentina and Brazil, which should not be there.
These two countries are not included in traits.csv, and they should not appear on the EOL page either. The country checklists are outdated and should not be used for geographic distribution; the newer method, which uses the countries already listed on GBIF, works better.
Secondly, geographic distribution includes occurrences of “Preserved specimens” and presumably also “Fossil specimens”. As a result, the example above also lists the United States as a country of geographic distribution, even though the occurrence recorded there is a preserved specimen.
Occurrences flagged as suspicious and “absent” occurrences should also be excluded. Optionally, even more data can be excluded, as described here: Species occurrence cubes :: Technical Documentation.
I know you are probably busy with the server update, but I hope you can take a look anyway.
Thanks very much for reporting, @Bjoe ! Yes, we may repopulate all our structured data from scratch rather than migrating it wholesale when cypher is set up in its new container. The national checklists are a vintage datamining project by now, so it’s a good time to reconsider that method.
Is it the data cubes process in general that you are recommending? That does look promising for both nation checklists and the GBIF maps we display. Excluding data within the query will indeed make it easier to filter. I am much obliged for the lead.
I think the SQL downloads are the best option. I was able to get the data using this query:
SELECT
    specieskey,
    countrycode
FROM
    occurrence
WHERE
    specieskey IS NOT NULL
    AND countrycode IS NOT NULL
    AND occurrencestatus = 'PRESENT'
    AND (
        basisofrecord = 'HUMAN_OBSERVATION'
        OR basisofrecord = 'MACHINE_OBSERVATION'
        OR basisofrecord = 'OCCURRENCE'
        OR basisofrecord = 'LIVING_SPECIMEN'
        OR basisofrecord = 'MATERIAL_SAMPLE'
    )
    AND NOT ARRAY_CONTAINS(issue, 'ZERO_COORDINATE')
    AND NOT ARRAY_CONTAINS(issue, 'COORDINATE_OUT_OF_RANGE')
    AND NOT ARRAY_CONTAINS(issue, 'COUNTRY_COORDINATE_MISMATCH')
GROUP BY
    specieskey,
    countrycode
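
In case it helps, here is a minimal Python sketch of how the result of a query like this could be turned into a list of countries per species. It assumes the download has been saved as tab-separated text with a header row naming the specieskey and countrycode columns; the function name countries_by_species and the sample rows are made up for illustration and are not real GBIF data.

```python
import csv
import io
from collections import defaultdict


def countries_by_species(tsv_text):
    """Group country codes by species key from tab-separated rows.

    Assumes a header row with 'specieskey' and 'countrycode' columns
    (an assumption about how the download was saved).
    """
    grouped = defaultdict(set)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        grouped[int(row["specieskey"])].add(row["countrycode"])
    # Sort the codes so the output is stable between runs.
    return {key: sorted(codes) for key, codes in grouped.items()}


# Made-up sample rows, for illustration only.
sample = "specieskey\tcountrycode\n1001\tDE\n1001\tFR\n1002\tBR\n"
print(countries_by_species(sample))
```

Because the query already groups by specieskey and countrycode, each pair should appear only once; using a set here just keeps the sketch safe if several result files are concatenated.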