Posted by Cam on 2019-05-01
Hi again! Back to names after our recent foray into phenotypes.
Having got hold of most of the names “engines”, taken them apart, laid out the parts, and tried to jam them together several times, I’m now ready to document the details of assembling our integrated names list. Work over the past months has been “two steps forward, one-and-a-half steps back” as I learned about the complexities and limitations of each source database and the challenges of integrating them. Actually, I think that’s how the work will be all the way through this project! But I think I now have a good base canonical names list. This is the definition of our master list of names:
The list itself makes no note of status (currently accepted vs. synonym); that comes next.
The steps to assemble this list were to:
[…] matchnames (Blog 3).
[…] matchnames, to remove the Tropicos duplicates of IPNI names.
[…] matchnames.

The resulting list contains 19,452 names (13,775 with an IPNI GUID, 4,890 with a Tropicos GUID, and 787 with a PlantList GUID). Here’s a diagram of the process:
Remember, this canonical list of names includes both accepted names and synonyms. As we move towards the goal of an accepted names list, we can take our lists of names with taxonomic status (e.g., the ALA checklist and the PAF checklist) and reconcile these to the canonical list to get a “clean” name (i.e., as originally spelled, and with standardized authors) for each name we wish to accept and for its synonyms. In this way, we were able to get clean names for 3,358 out of 3,740 names from the ALA checklist, and 2,405 out of 2,823 names from the Alaskan plants in PAF. The non-matching names are generally new names, or old, obscure names not yet captured in the mainline online databases.
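To make the reconciliation idea concrete, here is a minimal sketch in Python. It is not how the project's matchnames tool works: it stands in a simple exact match on a normalized key (lowercased, punctuation dropped, whitespace collapsed), and the function names `normalize` and `reconcile`, the tiny example lists, and the made-up name "Carex hypothetica Smith" are all assumptions for illustration only.

```python
import re


def normalize(name: str) -> str:
    """Build a crude matching key: lowercase, no punctuation, single spaces."""
    key = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(key.split())


def reconcile(checklist, canonical):
    """Map each checklist name to its clean canonical spelling, if one exists.

    Returns (matched, unmatched): a dict of checklist name -> canonical name,
    and a list of checklist names with no counterpart in the canonical list.
    """
    index = {normalize(c): c for c in canonical}
    matched, unmatched = {}, []
    for name in checklist:
        clean = index.get(normalize(name))
        if clean is None:
            unmatched.append(name)
        else:
            matched[name] = clean
    return matched, unmatched


# Illustrative data only; the last checklist entry is a made-up name.
canonical = ["Salix alaxensis (Andersson) Coville", "Betula nana L."]
checklist = [
    "salix alaxensis (Andersson)  Coville",  # case and spacing differ
    "Betula nana L",                         # missing full stop
    "Carex hypothetica Smith",               # hypothetical, not in the list
]

matched, unmatched = reconcile(checklist, canonical)
match_rate = len(matched) / len(checklist)
```

Applied to the counts quoted above, the same ratio gives roughly 90% of ALA checklist names and 85% of PAF Alaskan names reconciled. A real run would need fuzzier matching than this (author abbreviations, orthographic variants), which is presumably where most of the manual effort goes.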
In line with the project goals of maximum transparency and repeatability, all scripts and code needed to build this canonical list, and (later) to generate the accepted names and taxon concept databases, are available via the project GitHub repo. Organizing principles include:
[…] LICENSE file gives the licensing information for each of these source lists, should anyone want to use them.
[…] bash script called README.sh.
[…] README.sh script in the canonical directory, with the output file called canon. The numbered steps (1 to 5) above correspond to the same numbered sections in the README.sh script.
[…] README.sh script in the dataflow directory.
[…] matchnames). The automated scripts leave these steps out, substituting in pre-calculated files and patches. The user is alerted to these manual phases as the scripts run, and the manual steps can be uncommented if the user wants to re-run them manually.

OK, so I’m not expecting anyone to clone the repo and rebuild the data! But by building this documented resource I hope to answer any questions people may have downstream about exactly how the Flora of Alaska database was assembled.