Digitizing Hultén’s Flora
The standardized layout of a taxon account
The digitization steps were:
- Scanning. We used Dateline Printing in Fairbanks, who have a Xerox Color 550. A copy of the book was sacrificed (the covers and spine were removed), and the pages were fed into the sheet feeder. The positioning of each page was quite consistent, which helped greatly with sectioning the page elements (below).
- Sectioning. Using a combination of ImageMagick tools and scripted algorithms, each page image was divided into subimages: one or two taxon accounts and/or identification keys, then each account into illustration, text, and two maps (see image above). About 25% of pages required some manual intervention.
- OCR. Each text block was processed with Tesseract to extract text, using a comprehensive list of expected genus and species names as input to improve the spelling accuracy of scientific names.
- Text sectioning. In a text editor, the raw OCR text was manually divided into six paragraphs: taxon names, synonyms, description, ecology, taxonomy notes, and uses.
- Specimen point extraction. The image of the Alaska specimen distribution map was processed (see details) to extract coordinates for each specimen dot. The extracted locations may be up to 50 km away from the corresponding points on Hultén’s maps.
- Collection. Scripts then scanned the appropriate directories and files to check for a known set of likely errors, and then assembled the text elements and point locations into a single XML file. Family information (Hultén and APG4 families) was also added to each taxon at this stage.
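As an illustration of the specimen point extraction step, here is a minimal Python sketch of one way to locate dots in a thresholded map image. It is not the code used for this project (see the details linked above for that). The function name `find_dot_centroids`, the `min_pixels` threshold, and the tiny text bitmap are all assumptions made for the example. It groups dark pixels into 4-connected blobs and returns the pixel centroid of each blob; converting those pixel positions to geographic coordinates would need a separate georeferencing step that is not shown.

```python
from collections import deque

def find_dot_centroids(bitmap, min_pixels=2):
    """Return (row, col) centroids of 4-connected groups of '#' pixels.

    bitmap is a list of equal-length strings; '#' marks a dark pixel.
    Groups smaller than min_pixels are treated as noise and skipped.
    """
    rows, cols = len(bitmap), len(bitmap[0])
    seen = [[False] * cols for _ in range(rows)]
    centroids = []
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != "#" or seen[r][c]:
                continue
            # Breadth-first flood fill collects one connected blob.
            queue = deque([(r, c)])
            seen[r][c] = True
            blob = []
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and bitmap[ny][nx] == "#" and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(blob) >= min_pixels:
                mean_row = sum(y for y, _ in blob) / len(blob)
                mean_col = sum(x for _, x in blob) / len(blob)
                centroids.append((mean_row, mean_col))
    return centroids

# Made-up 6 x 8 bitmap: two 2 x 2 dots and one isolated noise pixel.
SAMPLE_MAP = [
    "........",
    ".##.....",
    ".##.....",
    "......#.",
    "....##..",
    "....##..",
]

print(find_dot_centroids(SAMPLE_MAP))
```

On a real scan, the bitmap would come from thresholding the map image, and the size filter would need tuning so that coastlines and lettering are not mistaken for specimen dots.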
Steps 2-4 (sectioning, OCR, and text sectioning) required manual input, and were executed page-by-page, working through the book. Overall the quality of the OCR has been good, but some spelling errors remain. Misspelled taxonomic names will be reconciled to our canonical list of names before Hultén’s data is incorporated into the new online flora.
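Reconciling misspelled names against a canonical list could be done in several ways. The Python sketch below shows one simple approach, using approximate string matching from the standard `difflib` module. It is an illustration only, not the project's reconciliation code: the short `CANONICAL_NAMES` list, the function name `reconcile_name`, and the `cutoff` value are assumptions chosen for the example.

```python
import difflib

# Hypothetical stand-in for the canonical name list; the real list is much longer.
CANONICAL_NAMES = [
    "Salix alaxensis",
    "Salix arctica",
    "Betula nana",
    "Picea glauca",
]

def reconcile_name(ocr_name, canonical=CANONICAL_NAMES, cutoff=0.85):
    """Return the closest canonical name, or None if nothing is close enough."""
    if ocr_name in canonical:
        return ocr_name
    # get_close_matches ranks candidates by difflib's similarity ratio (0 to 1).
    matches = difflib.get_close_matches(ocr_name, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(reconcile_name("Salix alaxcnsis"))  # typical OCR confusion of 'e' and 'c'
print(reconcile_name("Quercus robur"))    # not in the list, so no match
```

A high cutoff keeps the matching conservative: a name that has no near neighbour in the list is returned as `None` and can be checked by hand rather than silently replaced.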
Finally, this web app was created to reassemble the various elements of each taxon account. The app uses a single Gawk script to reserialize the XML file. The search string is interpreted as a case-insensitive regular expression, and is applied to scientific name, Hultén family, APG4 family, synonyms, and common name. The Leaflet JavaScript library is used to display the specimen locations.
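To make the search behaviour concrete, here is a small Python sketch of a case-insensitive regular-expression search over the same five fields. The app itself is written in Gawk, so this is not its code. The record layout, the field names in `SEARCH_FIELDS`, and the two sample records are assumptions for illustration and are not taken from the XML file.

```python
import re

# Illustrative records only; field names are hypothetical.
TAXA = [
    {
        "name": "Salix arctica",
        "hulten_family": "Salicaceae",
        "apg4_family": "Salicaceae",
        "synonyms": [],
        "common_name": "Arctic willow",
    },
    {
        "name": "Acer glabrum",
        "hulten_family": "Aceraceae",
        "apg4_family": "Sapindaceae",
        "synonyms": ["Acer douglasii"],
        "common_name": "Douglas maple",
    },
]

SEARCH_FIELDS = ("name", "hulten_family", "apg4_family", "synonyms", "common_name")

def search_taxa(pattern, taxa=TAXA):
    """Return names of taxa where the pattern matches any searched field."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for taxon in taxa:
        for field in SEARCH_FIELDS:
            value = taxon.get(field, "")
            # Synonyms are stored as a list; other fields are single strings.
            values = value if isinstance(value, list) else [value]
            if any(rx.search(v) for v in values):
                hits.append(taxon["name"])
                break  # one matching field is enough for this taxon
    return hits

print(search_taxa("^salix"))   # anchored match on the start of a field
print(search_taxa("sapind"))   # found only through the APG4 family
```

Because the search string is a regular expression, characters such as `.`, `(`, and `|` have special meaning. For example, `willow|maple` would match both sample records.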
Please send any feedback to Cam Webb (cowebb@alaska.edu).
This is a digital representation of Eric Hultén’s ‘Flora of Alaska and Neighboring Territories: A Manual of the Vascular Plants’, which was published by Stanford University Press in 1968. The book was digitized by C. Webb (at UAMN) as part of the Flora of Alaska project, with funding from the US NSF (Grant 1759964 to Ickert-Bond & Webb), and with permission of Stanford University Press. Data and images © 1968 Board of Trustees of the Leland Stanford Jr. Univ. Usage licence: Creative Commons BY-NC-SA 4.0. NB: You may find OCR errors; please refer to the hard copy if in doubt.