Posted by Cam on 2019-01-03
A single taxonomic name often occurs in different forms in different publications. In terms of character string differences, the variation may be slight (e.g., a missing space after an author’s initials, or a single-character misspelling of a specific epithet), or major (e.g., a missing basionym). Consider: `Ciminalis prostrata (Haenke) Á. Löve & D. Löve` vs. `Ciminalis prostrata (Haenke) A.Love & D.Love` vs. `Ciminalis prostrata Love & Love`.
These variations sometimes arise via copying or data entry errors, but more often they result from the choices that authors citing an earlier name make about how to encode its history: whether to add the basionym, whether to treat a validly publishing author as an “ex” author or as the main author, and how to abbreviate the authors’ names. While the International Code of Nomenclature for algae, fungi, and plants contains precise rules for name citation (Articles 46-50), there are also many recommendations, and often several “correct” ways to cite a name. The Code has also evolved over time, and many citations were published before clear rules existed. Hence the same name is often cited validly in a variety of forms, as well as being cited in erroneous ways. These citations find their way into databases and therefore present problems for the integration of data, whether it is linking synonyms to valid names, or linking secondary data from different sources.
How, then, do we determine if two variants of a name are really “the same”? That is, despite the variation, do the two name strings refer to the same name as first published? No judgment about the taxon concept (or circumscription) of the names is being attempted at this step. The most reliable way to answer this question is to track down the usage(s) of both names in the primary literature and carefully determine if the authors were referring to the same name as originally published. While this approach is feasible in the case of a taxonomist performing a revision, it is too slow to incorporate when assembling large lists of names, as in the data integration step of our flora of Alaska project. What is needed is a sensible, transparent algorithm (set of rules) that can be applied by a machine, with human judgment being required in only a small number of cases. As long as the original names are preserved and the details of the automated judgment are recorded, then even if a few matches might later be found to be incorrect, no information is lost. This process of designing a taxonomic matching algorithm and associated software tool has occupied me for the past weeks.
At least two good online tools already exist to help match submitted names to various large names databases: the Global Names Resolver (GNR) of GNA and iPlant’s TNRS (Taxonomic Name Resolution Service). Both offer approximate (fuzzy) matching of submitted names to sources such as IPNI and Tropicos. TNRS returns an overall, numeric match score, as does GNR. GNR also returns one of five `match_type` values (Exact match, Exact match by canonical form of a name, Fuzzy match by canonical form, Partial exact match by species part of canonical form, Partial fuzzy match by species part of canonical form). However, where a non-exact match is found, both tools still require manual checking to confirm the exact nature of that match (e.g., misspelling, differences in author string).
[Added 2019-01-10: There is also the generic OpenRefine, a data-cleaning tool that can be applied to taxonomic names from any source. See these relevant posts: Rod Page, GBIF, Tersigni & Vaidya, Vaidya.]
The majority of mismatches are due to minor variations in the author string and often an exact match between intended names can be inferred using some “taxonomic logic.” This additional taxonomic resolution was not available in any pre-existing tool that I know of.
Because there are general rules that govern the construction of taxonomic names, some additional pattern-matching can be applied in comparing the different elements of two similar names. This can then produce an estimate of the likelihood that two name strings actually refer to the same name.
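To illustrate why a single fuzzy score is not enough, here is a minimal Python sketch using the standard library's `difflib`. It is not how GNR or TNRS compute their scores; `fuzzy_score` and `split_name` are hypothetical helpers, and the split assumes a plain binomial followed by its authors.

```python
from difflib import SequenceMatcher


def fuzzy_score(a: str, b: str) -> float:
    """Generic string similarity in [0, 1]; it knows nothing about taxonomy."""
    return SequenceMatcher(None, a, b).ratio()


def split_name(name: str) -> tuple[str, str]:
    """Naive split of a binomial into (genus + epithet, author string)."""
    parts = name.split(" ", 2)
    return " ".join(parts[:2]), parts[2] if len(parts) > 2 else ""


full = "Ciminalis prostrata (Haenke) Á. Löve & D. Löve"
short = "Ciminalis prostrata Love & Love"

# One opaque number for the whole string.
print(round(fuzzy_score(full, short), 2))

# Comparing the elements separately shows where the variation lies.
name_a, auth_a = split_name(full)
name_b, auth_b = split_name(short)
print(name_a == name_b)                       # True: the binomial is identical
print(round(fuzzy_score(auth_a, auth_b), 2))  # all variation is in the authors
```

The whole-string score says only that the names are similar. The element-wise comparison says that the binomials agree and that the difference lies entirely in the author string, which is the information a taxonomic rule needs.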
The most common elements of a (hypothetical) botanical name are these:

    Salix alaxensis subsp. glauca (Andersson ex DC.) R. Coville ex Jones in Smith
    <---> <-------> <----> <---->  <-------> <---->  <-------->    <--->    <--->
    gen   sp        irank  infr    basio     ex_bas  auth          ex_auth  in_auth
    <---------------------------> = taxon_name
    <------------->    +   <----> = canonical_form
                  author_string = <--------------------------------------------->

(The codes used in this blog are:
- `gen`: Genus.
- `sp`: Specific epithet.
- `irank`: Infraspecific rank.
- `infr`: Infraspecific epithet.
- `basio`: Basionym author(s): the author of the specific epithet before a change of genus or of infraspecific rank.
- `ex_bas`: ex Author(s) of basionym (see `ex_auth`).
- `auth`: Primary author of name: the author responsible for first publishing the combination of `gen` and `sp` (and `irank` and `infr` if they exist).
- `ex_auth`: ex Author(s) for primary author: if the publication of the name by `auth` was invalid, the `ex_auth` was the author who subsequently published the combination validly.
- `in_auth`: in Author(s) for primary author: if `auth` or `ex_auth` were responsible for the combination but were not actually the authors of the publication in which that combination first appeared, the author(s) of the publication should be added after “in”, usually with some bibliographic citation details following.
- `taxon_name`: `gen` and `sp` (and `irank` and `infr` if they exist).
- `canonical_form`: `gen` and `sp` and `infr`, without `irank`.
- `author_string`: The whole author string.)
The presence or absence of, or variation in, these elements in two names that are being matched indicates different kinds of choices or errors by the person citing the name, and thus can give an indication of the likelihood that the matched names refer to the same name. Here is our current logic, which is encoded into the `matchnames` software tool (below). The types of variation are approximately ranked in order of decreasing likelihood that the names are truly the same:
Variation | E.g. | Likely cause of variation | Likelihood that names refer to same name [3] | (Match code; see below) |
---|---|---|---|---|
Spacing differences, missing periods, differences in ASCII vs. non-ASCII characters (accents, etc.) in `author_string` | [A] | Minor formatting choices, variation in encoding | Very high | `auto_punct` |
Missing initials for an author, or different abbreviations of author’s surnames, in `author_string` | [B] | Choices and/or errors by the citing authors | Very high | `manual` |
Spelling variation in `gen`, `sp`, `infr` with (essentially) the same `auth` | [B1] | Copying errors | High | `manual` |
Missing `in_auth` in one of the names | [C, D2] | One of the citing authors failed to include the in author | High | `auto_in+`, `auto_in-` |
One name with a basionym, the other without | [D] | One of the citing authors failed to include the basionym | Medium | `auto_basio+`, `auto_basio-` |
Missing `ex_auth` (or `ex_bas`) in one of the names | [E] | One of the citing authors failed to include the ex author | Medium | `auto_exin+`, `auto_exin-` |
`auth` in one name is `ex_auth` in the other | [F] | Confusion by citing author [5] | Medium | `manual?` |
`auth` differs: 2nd author missing from a pair of authors (sep. by “&” or “et.”) | [G] | More likely to be an error (missing author) than a republication of the same `taxon_name` by a different author team [6] | Medium | `manual?` |
Different basionym author (`basio`), same primary author (`auth`) | [H] | This should not happen [1], and is likely to be an error | Medium | `auto_basexin` |
Obviously incomplete author list in `auth` | [I] | An error | Medium | `manual?` |
Same `basio`, missing `auth` | [J] | An error [7] | Low | `manual??` |
Same `canonical_form` and `auth`, different `irank` | [K] | An error by citing author [2] | Low | `auto_irank`, `manual??` |
Same `taxon_name`, different `auth` initials | [L] | Probably separate publications of the name, but could be error | Low | `manual??` |
Same `taxon_name`, missing first author in `auth` | [M] | Maybe separate publications of the name, but may also be error [6] | Low | `manual??` |
Same `taxon_name`, clearly different `auth` | [N] | Separate publications of the name [6] | Very low | `no_match` |
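
To apply rules like those in the table, a name string first has to be split into its elements. The following Python sketch is one way to do that. It is not the `matchnames` implementation: `parse_name`, `parse_authors`, `split_on`, `differing_elements` and the `RANKS` set are hypothetical, and the parser assumes a well-formed name with at least a genus and a specific epithet.

```python
import re

RANKS = {"subsp.", "ssp.", "var.", "f."}  # assumed infraspecific rank markers


def split_on(text: str, word: str) -> tuple[str, str]:
    """Split 'A ex B' (or 'A in B') into ('A', 'B'); 'B' is '' if absent."""
    head, _, tail = text.partition(f" {word} ")
    return head.strip(), tail.strip()


def parse_authors(author_string: str) -> dict[str, str]:
    parts = {"basio": "", "ex_bas": "", "auth": "", "ex_auth": "", "in_auth": ""}
    m = re.match(r"\((?P<inner>[^)]*)\)\s*(?P<rest>.*)", author_string)
    if m:  # a leading parenthesis holds the basionym author(s)
        parts["basio"], parts["ex_bas"] = split_on(m["inner"], "ex")
        author_string = m["rest"]
    author_string, parts["in_auth"] = split_on(author_string, "in")
    parts["auth"], parts["ex_auth"] = split_on(author_string, "ex")
    return parts


def parse_name(name: str) -> dict[str, str]:
    tokens = name.split()
    parts = {"gen": tokens[0], "sp": tokens[1], "irank": "", "infr": ""}
    rest = tokens[2:]
    if len(rest) >= 2 and rest[0] in RANKS:
        parts["irank"], parts["infr"] = rest[0], rest[1]
        rest = rest[2:]
    parts.update(parse_authors(" ".join(rest)))
    return parts


def differing_elements(a: str, b: str) -> list[str]:
    """Names of the elements whose values differ between two name strings."""
    pa, pb = parse_name(a), parse_name(b)
    return [key for key in pa if pa[key] != pb[key]]


print(parse_name(
    "Salix alaxensis subsp. glauca (Andersson ex DC.) R. Coville ex Jones in Smith"
))
print(differing_elements(
    "Ciminalis prostrata (Haenke) Á. Löve & D. Löve",
    "Ciminalis prostrata Love & Love",
))  # ['basio', 'auth']
```

Knowing which elements differ is what lets a rule distinguish, say, a missing basionym author from a different primary author. Deciding how much the `auth` values differ (initials, accents, abbreviations) would still need the finer comparisons listed in the table.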
`matchnames` program
To deal with this problem of i) needing more taxonomic precision in judging name similarity than that provided by a generic fuzzy-match score, while ii) not having time to check every fuzzy match (and definitely not having time to return to the primary literature for every name encountered), I created the `matchnames` tool. It applies a sequence of taxonomic, rule-based transformations to the names in two lists, a query list (A) and a reference list (B), and then, if a match is found, outputs the match and the kind of variation. If two names cannot be matched automatically, but do match approximately (“fuzzy regex matching”), they are presented to an operator for a human judgment. This is usually a small subset of the whole of list A.
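
The overall shape of such a cascade can be sketched in Python as follows. This is a simplified illustration, not the `matchnames` code: it covers only the exact and punctuation/encoding steps, it normalizes the whole name rather than just the author string, it swaps in `difflib` for the tool's fuzzy regex matching, and `needs_review` is a hypothetical label for names that would be handed to the operator. The function names and the 0.75 cutoff are likewise assumptions.

```python
import re
import unicodedata
from difflib import get_close_matches


def normalize(name: str) -> str:
    """Strip accents, periods and whitespace to build a comparison key."""
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[.\s]+", "", unaccented)


def match_name(query: str, reference: list[str], cutoff: float = 0.75):
    """Return (code, matched reference name or None) for one query name."""
    if query in reference:
        return "exact", query
    by_key = {normalize(ref): ref for ref in reference}
    key = normalize(query)
    if key in by_key:
        return "auto_punct", by_key[key]
    # The other automatic rules (basionym, in/ex authors, rank) would go here.
    close = get_close_matches(key, list(by_key), n=1, cutoff=cutoff)
    if close:
        return "needs_review", by_key[close[0]]  # hand over to a human
    return "no_match", None


reference = [
    "Ciminalis prostrata (Haenke) Á. Löve & D. Löve",
    "Salix alaxensis (Andersson) Coville",
]
for query in [
    "Ciminalis prostrata (Haenke) A.Love & D.Love",
    "Ciminalis prostrata Love & Love",
    "Picea glauca (Moench) Voss",
]:
    print(query, "->", match_name(query, reference))
```

Ordering the steps from strictest to loosest means each name receives the most specific code that explains its variation, and only the residue that no rule can explain reaches the operator.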
`matchnames` is available on GitHub. Full details of use are in the tool’s `man` page. The sequence of matching logic (and output match codes) follows the above framework of potential reasons for name mismatch:
1. `exact`.
2. `auto_punct`.
3. `auto_noauth`.
4. `auto_basio+`, `auto_basio-`.
5. `auto_in+`, `auto_in-`.
6. `auto_exin+`, `auto_exin-`.
7. `auto_basexin`.
8. `auto_irank`.
9. `manual`, `manual?` or `manual??`.
10. `no_match`.
The human operator can then apply further rules which are hard to program for automated decision-making. This is our current rule-set, following the above table:
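Accept a match at high confidence (`manual`):
- Missing initials for an author, or different abbreviations of author’s surnames, in `author_string`,
- Spelling variation in `gen`, `sp`, `infr` with (essentially) the same `auth`.
Accept a match at medium confidence (`manual?`):
- `auth` in one name is `ex_auth` in the other,
- `auth` differs: 2nd author missing from a pair of authors (sep. by “&” or “et.”),
- Obviously incomplete author list in `auth`.
Accept a match at low confidence (`manual??`):
- Same `basio`, missing `auth`,
- Same `taxon_name`, different `auth` initials,
- Same `taxon_name`, missing first author in `auth`.
Reject the match (`no_match`):
- Same `taxon_name`, clearly different `auth`.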
With large lists this manual checking can be a time-consuming phase, and is prone to operator errors due to lapses in concentration. Some of these decisions might yet be automated with clever code.
The potential reasons for a mismatch between two citations of the same name generate different match codes, which can at a later date be used to limit the accepted matches to certain levels of confidence:
Confidence of match | Match codes |
---|---|
Very High and High | `auto_punct`, `manual`, `auto_in+`, `auto_in-` |
Medium | `auto_basio+`, `auto_basio-`, `auto_exin+`, `auto_exin-`, `manual?`, `auto_basexin` |
Low | `auto_irank`, `manual??` |
Very low; not a match | `no_match` |
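
Filtering recorded matches by confidence then becomes a simple lookup. The Python sketch below transcribes the table above into a dictionary. The `Confidence` enum, the `accepted` function and the sample records are hypothetical and are not part of `matchnames`. The codes `exact` and `auto_noauth` do not appear in the table, so this sketch leaves them unmapped and they fall through to the lowest level; a real pipeline would need to decide where they belong.

```python
from enum import IntEnum


class Confidence(IntEnum):
    NONE = 0    # "Very low; not a match"
    LOW = 1
    MEDIUM = 2
    HIGH = 3    # the table's combined "Very High and High" row


CONFIDENCE_BY_CODE = {
    "auto_punct": Confidence.HIGH,
    "manual": Confidence.HIGH,
    "auto_in+": Confidence.HIGH,
    "auto_in-": Confidence.HIGH,
    "auto_basio+": Confidence.MEDIUM,
    "auto_basio-": Confidence.MEDIUM,
    "auto_exin+": Confidence.MEDIUM,
    "auto_exin-": Confidence.MEDIUM,
    "manual?": Confidence.MEDIUM,
    "auto_basexin": Confidence.MEDIUM,
    "auto_irank": Confidence.LOW,
    "manual??": Confidence.LOW,
    "no_match": Confidence.NONE,
}


def accepted(matches, minimum: Confidence):
    """Keep (query, reference, code) records at or above `minimum`."""
    return [
        m for m in matches
        if CONFIDENCE_BY_CODE.get(m[2], Confidence.NONE) >= minimum
    ]


# Placeholder records, for illustration only.
records = [
    ("query name 1", "reference name 1", "auto_punct"),
    ("query name 2", "reference name 2", "auto_basio+"),
    ("query name 3", "reference name 3", "manual??"),
    ("query name 4", None, "no_match"),
]
print(accepted(records, Confidence.MEDIUM))
```

Because the match code is stored rather than a yes/no decision, the acceptance threshold can be changed later without re-running the matching.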
Using the `matchnames` tool it is possible, in a matter of only a few hours, to compare a list of Alaskan plant names to other resources, such as IPNI and Tropicos. The type of orthographic match can be recorded, as can the original names, and the date and name of the operator. Thus if in the future, after further research, a particular match is deemed to actually be incorrect, no data has been lost and the match status can simply be corrected. In the next post I’ll talk about the data model we’re using to store these orthographic matches.