Posted by Cam on 2018-10-31
(In which I discuss issues with matching taxonomic names in different databases, and compare the universe of online name resources.)
Taxonomic names are, among other things, the identifiers, or labels, that traditionally link together data in different information resources; for example, from the publication that contains the original description of a species… to an ecological record of that species occurring at a site. Being vital informatics links, names have been called “key to the big new biology” and a wide array of online resources offer information structured around biological names (e.g., the Encyclopedia of Life and Wikispecies). Many sites also offer information about the names themselves, both as primary representations of taxonomic research and literature, and as aggregators and integrators of others’ lists (e.g., IPNI and The Plant List).
The eventual usefulness of our flora project will partially depend on how well we link to and integrate the data already in these online resources, especially the taxonomic ones. As we set out, it is thus important to step back and ask how we can best use these resources. Which should we focus on? (There are too many to attempt to integrate every one.) And how should we link to them? This blog post outlines the issues and how we intend to move forward with names.
I’ll start with online taxonomic name resources (including sites that are primarily checklists), before moving on to more general biological resources (which will also be integrated via names).
These seem to be the major resources for plant names themselves, and relevant to our project (i.e., leaving out APNI):
Before I compare the above resources, I’ll step back a bit and consider what a comprehensive, generic taxonomic name resource would consist of. A minimal list would be something like:
A graphical template for how these elements fit together might then look like this:
I’ve also added basic distribution data to the above graph (D1, D2). In this first phase our project is not yet integrating occurrences, but since we are working on a regional flora it will be convenient to restrict the list of taxa to just those occurring in Alaska and nearby lands. A number of online name resources offer presence or absence by geographical region (BONAP, USDA-PLANTS).
I’ve divided the above name resources into two classes: 1) those primarily containing original online representations of names databases and/or primary taxonomic literature, and 2) aggregators that primarily integrate and re-serve data from the first class. This classification is imperfect, since many resources in class 2 also incorporate primary data (e.g., The Plant List, the Catalog of Life, uBIO, ?WFO) and involve manual checking of names, which is itself a form of primary data generation. The classification applies only to the serving of taxonomic names; “aggregators” for names may be primary sources for other data (e.g., distribution data in USDA PLANTS).
Resource | GUIDs (A2, C4) | Orig. publ. (A3) | Status (B1, C1, C3) | Status ref. (B2, C2) | Distrib.(D1)/(D2) | API | Source desc. |
---|---|---|---|---|---|---|---|
1. Primary data sources | | | | | | | |
BONAP→ | n | [1] | [3] | [1] | Y / [1] | [3] | Y |
GRIN→ | Y | Y | Y | [1] | Y / [1] | n | Y |
IPNI→ | Y | Y | n | n | n / n | Y | Y |
PAF→ [5] | [2] | Y | Y | ~ | Y / ~ | n | Y |
PLANTS→ | Y | [1] | Y | [1] | Y / [1] | [7] | n |
Tropicos→ | Y | Y | Y | Y | Y / Y | Y | n [6] |
VASCAN→ | [2] | n | Y | Y | Y / n | Y | Y |
2. Aggregators | | | | | | | |
COL→ | Y | Y | Y | n | n / n | n | Y |
GBIF→ | Y | [4] | [4] | [4] | Y / Y | Y | Y: 1 2 |
GNA→ | Y | n | n | n | n / n | Y | ~ |
ITIS→ | Y | n | Y | Y | n / n | Y | Y |
NCBI→ | Y | n | n | n | n / n | Y | Y |
The Plant List→ | Y | Y | Y | n | n / n | [7] | Y |
uBio→ | Y | n | Y | n | n / n | Y | Y |
WFO→ | Y | Y | Y | n | n / n | n | n |
In deciding which taxonomic data to incorporate and how to use available GUIDs, it’s necessary to understand the flow of primary taxonomic data through the aggregators. Here’s a sketch of my current understanding of how these data flow, developed via reading i) the “How we made this” pages for each resource, and ii) the reference section for names in that resource. As with the rest of this blog post, I may have missed resources or misinterpreted what I read on the websites, and I invite you to please correct me with a comment below!
Besides online resources primarily focused on taxonomic names and their synonyms, there are of course many biological databases that might be integrated via taxonomic names. The resources we are already committed to drawing upon are:
Linking among databases would not be a difficult problem if the characters of a name (e.g., `Antennaria alpina var. media (Greene) Jeps.`) never varied. But these strings are prone to copying (typing) errors and other orthographic variation, e.g., in abbreviating and punctuating the author string. An author like “Jo Bloggs” might appear as `Bloggs`, `Blog.`, `J. Bloggs` or `J.Bloggs`, despite official recommendations. I just learned that IPNI has a policy of removing spaces after periods, which differs from policies elsewhere, and from the natural inclination of many people. (FYI: this Wikipedia article on author citations in botanical names is super, and one I frequently revisit.)
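To make the spacing problem concrete, here is a minimal sketch of one rule-based normalization: collapsing whitespace and dropping the space after a period when a letter follows, in the spirit of the IPNI convention mentioned above. The function name `normalize_author` is hypothetical, and the rule is an illustration rather than any database's actual policy implementation.

```python
import re


def normalize_author(author: str) -> str:
    """Reduce spacing variants of an author string to one form (illustrative only)."""
    # Collapse runs of whitespace and trim the ends.
    s = " ".join(author.split())
    # Drop whitespace after a period only when a letter or digit follows,
    # so "J. Bloggs" becomes "J.Bloggs" but "Hook. & Arn." is left alone.
    return re.sub(r"\.\s+(?=\w)", ".", s)


variants = ["J. Bloggs", "J.Bloggs", "J.  Bloggs"]
print({normalize_author(v) for v in variants})
```

A rule like this only reconciles spacing; it does nothing for genuinely different abbreviations such as `Bloggs` versus `Blog.`, which need a lookup list of author abbreviations or fuzzy matching.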
Short of comparing every pair of names by hand, some sort of computational tool is needed to determine if a name in database A is the same as in database B. Much thought has gone into this problem (e.g., Boyle et al. 2013, Rees 2014, Horn 2016, Patterson et al. 2016). Solutions can involve both ‘fuzzy matching’ (finding a match when not all characters of a string are identical, similar to BLASTing nucleotides), and the application of rules (such as checking against a list of author abbreviations).
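As a rough illustration of fuzzy matching, the sketch below uses the similarity ratio from Python's standard `difflib` module as a stand-in for the more specialized algorithms in the papers cited above. The helper names `simplify` and `best_match`, the two-name reference list, and the cutoff of 0.85 are all assumptions chosen for the example.

```python
import difflib
import unicodedata
from typing import List, Optional


def simplify(name: str) -> str:
    """Lowercase, strip diacritics, and collapse whitespace before comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_marks.lower().split())


def best_match(query: str, candidates: List[str], cutoff: float = 0.85) -> Optional[str]:
    """Return the most similar candidate, or None if nothing reaches the cutoff."""
    lookup = {simplify(c): c for c in candidates}
    hits = difflib.get_close_matches(simplify(query), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


reference = [
    "Antennaria alpina var. media (Greene) Jeps.",
    "Antennaria monocephala DC.",
]
# A misspelled genus and a missing final period still find the intended name.
print(best_match("Antenaria alpina var. media (Greene) Jeps", reference))
# A name absent from the reference list yields None.
print(best_match("Salix arctica Pall.", reference))
```

The cutoff is the delicate part: set it too low and distinct names with similar spellings are merged, set it too high and ordinary typos are missed, so any real threshold would need checking against hand-verified matches.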
For locally matching a name to a list I’ve been using a home-cooked approach (I’ll post the code in a subsequent blog): normalizing strings (e.g., `and` to `&`, `Å` to `A`), then approximate matching with `agrep` (approximate `grep`) from the TRE library.
However, for matching to existing external lists (i.e., online databases) more capable solutions exist, and using them avoids the `n(n-1)/2` local comparisons of every downloaded name list with every other one.
Now to work! As I see it now, there are three intertwined elements of our reconciliation process:
Clear as mud? Let’s see how it goes…
(Phew! This blog is definitely TLDR. I’ll try to keep it shorter next time.)