Bridging Biodiversity and Biomedical Knowledge Through Literature
Indra Neil Sarkar, PhD
Divisions of Invertebrate Zoology & Library Services
American Museum of Natural History
Refactoring Digital Natural History Texts
University of Illinois at Urbana-Champaign
April 17, 2006
Life on Earth
Biomedical Literature
Medline
17 Million+ Records
Citations Back to the 1960s
Free, Web-based Interface
Organized/Indexed According to MeSH
Biodiversity Literature
Natural History Collections
PDF or Image Files
Free*, Institution-Specific Search
Minimal Meta-data
Exploring Literature
Meet Information Needs
Map to Other Knowledge
Extract Domain Specific Features
Perform Data Mining for Novel Correlations
Automated Methods
Biomedical (Yes!)
Biodiversity (No!)
Biological Data Revolution
"Modernizing" Biodiversity Literature
Digitization
Optical Character Recognition
First Pass Markup
Domain Specific Markup
TaxonX
Designed as a W3C XML Schema
Identify Taxonomic Treatments
Major Structural Components
Open Source (http://sourceforge.net/projects/taxonx)
Stable
Consistent
TaxonX
Iterative Markup
Identify Taxonomic Treatments
Identify Lower Level Meta-Data
TaxonX Goals
Describe Structure of Systematics Publications
Enable Specific Queries
Taxa: Ant Literature since 1995
Fauna: Ant Literature from Madagascar
(http://antbase.org/databases/xml_publications.htm)
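The two example queries above can be sketched against marked-up publications. This is a minimal illustration only: the element and attribute names (`publication`, `year`, `country`, `name`) and the sample records are hypothetical and are not taken from the TaxonX schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration; these element and attribute
# names are assumptions, not the actual TaxonX vocabulary.
SAMPLE = """<publications>
  <publication id="p1" year="1993" country="Madagascar">
    <treatment><name>Formicidae</name></treatment>
  </publication>
  <publication id="p2" year="1999" country="Madagascar">
    <treatment><name>Formicidae</name></treatment>
  </publication>
  <publication id="p3" year="2001" country="Brazil">
    <treatment><name>Formicidae</name></treatment>
  </publication>
  <publication id="p4" year="2003" country="Madagascar">
    <treatment><name>Apidae</name></treatment>
  </publication>
</publications>"""

ROOT = ET.fromstring(SAMPLE)


def find_publications(root, taxon, since=None, country=None):
    """Return ids of publications that treat `taxon`, optionally
    restricted by earliest year and by country."""
    hits = []
    for pub in root.iter("publication"):
        names = {n.text for n in pub.iter("name")}
        if taxon not in names:
            continue
        if since is not None and int(pub.get("year")) < since:
            continue
        if country is not None and pub.get("country") != country:
            continue
        hits.append(pub.get("id"))
    return hits


# "Taxa" query: ant literature since 1995
ants_since_1995 = find_publications(ROOT, "Formicidae", since=1995)
# "Fauna" query: ant literature from Madagascar
ants_madagascar = find_publications(ROOT, "Formicidae", country="Madagascar")
```

The point of the sketch is that once treatments and their metadata are marked up, both kinds of query reduce to simple filters over the document structure.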
TaxonX Integration
Currently Stand-Alone Schema
Designed to Link with Other Schemata
XHTML
NLM Journal Archiving
TEI
Publisher-specific
Mash-ups: e.g., http://www.ispecies.org
TaxonX Going Forward
Plausible Intermediate Step
Biodiversity Heritage Library
Amenable to Automated Methods
Linkages to Molecular, Distributional, and Nomenclatural Resources
TaxonX Summary
Lightweight Schema
Small Learning Curve
General Applicability to Heritage Literature
Allows External Linkages
MODS for File-Level Bibliographic Metadata
Enables Iterative Deployment
Accommodates Differing Granularities
Darwin Core, Linnaean Core, SDD, etc.
ITIS, GUID, LSID, etc.
More Details to Come...
taXMLit, INOTAXA
Biodiversity Heritage Library
Unifying Thread:
Bringing Legacy Data into A Digital Commons For Sharing Knowledge
Then?
Taxonomic Knowledge
Extracting Taxonomic Names
Named Entity Recognition
Taxonomic Name Recognition (TNR)
Rule-based TNR Tools
TaxonGrab (AMNH)
FindIT (uBio)
Rule-Based TNR
Reverse-Lexical Lookup
Non-Taxon Name Words
Genus Species Nomenclature
Genus Capitalized
Suffix Patterns (e.g., -ia, -us)
Other Features
Infra-species, Strain
Authority Designation
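The rules listed above can be combined into a small filter. The sketch below is not the TaxonGrab or FindIT implementation; it only illustrates the idea, assuming a tiny stand-in word list for the reverse-lexical lookup and an illustrative set of Latin-style suffixes. Infra-species ranks, strains, and authority strings are not handled.

```python
import re

# Stand-in for a full English lexicon (reverse-lexical lookup):
# words found here are assumed NOT to be part of a taxon name.
COMMON_WORDS = {
    "the", "this", "these", "in", "was", "found", "and", "of",
    "is", "a", "with", "from", "results", "section",
}

# Illustrative Latin-style endings; an assumption, not an exhaustive list.
SPECIES_SUFFIXES = ("ia", "us", "is", "um", "ae", "i", "a")

# Candidate binomial: capitalized word followed by a lowercase word.
CANDIDATE = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{2,})\b")


def find_binomials(text):
    """Return candidate 'Genus species' strings that pass all rules."""
    names = []
    for match in CANDIDATE.finditer(text):
        genus, species = match.group(1), match.group(2)
        # Rule 1: discard candidates built from ordinary English words.
        if genus.lower() in COMMON_WORDS or species in COMMON_WORDS:
            continue
        # Rule 2: the epithet should end in a Latin-style suffix.
        if not species.endswith(SPECIES_SUFFIXES):
            continue
        names.append(f"{genus} {species}")
    return names


sentence = ("The bacterium Escherichia coli was found with "
            "Brucella melitensis in these samples.")
found = find_binomials(sentence)
```

A rule set this small will both miss names and accept false positives, which is why a much larger lexicon and additional features matter in practice.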
Extracting Taxonomic Names
Medline
Digitized Text
Zoological Record
Chapin's Birds of the Congo
Why Not Just Look up Names?
Scientific Names
Escherichia coli
Federate Scientific Names
Collect Scientific Names
Digital Taxonomy Resources
Data Marts
Natural Language Text
Scientific Name Reconciliation
Many Names for Same Organism
Objective: Escherichia coli, Bacterium coli, Bacillus coli
Subjective: Brucella melitensis, Brucella canis, Brucella ovis
Many Organisms for Same Name
Agathis montana is both a wasp and a plant
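Both problems above (synonyms and homonyms) can be pictured as a many-to-many mapping between name strings and reconciliation groups. The following is a minimal sketch using only the examples from this slide; the group identifiers are invented for illustration and do not correspond to any real uBio identifiers.

```python
from collections import defaultdict

# Hypothetical reconciliation groups; the keys are made-up identifiers.
GROUPS = {
    "ecoli": ["Escherichia coli", "Bacterium coli", "Bacillus coli"],
    "brucella": ["Brucella melitensis", "Brucella canis", "Brucella ovis"],
    "agathis-wasp": ["Agathis montana"],
    "agathis-plant": ["Agathis montana"],
}


def build_index(groups):
    """Invert group -> names into name -> set of group ids."""
    index = defaultdict(set)
    for group_id, names in groups.items():
        for name in names:
            index[name.lower()].add(group_id)
    return index


def reconcile(name, index):
    """Return every group a name string belongs to (sorted for stability)."""
    return sorted(index.get(name.lower(), set()))


def same_organism(name_a, name_b, index):
    """True when two name strings share at least one reconciliation group."""
    groups_a = index.get(name_a.lower(), set())
    groups_b = index.get(name_b.lower(), set())
    return bool(groups_a & groups_b)


INDEX = build_index(GROUPS)
```

A synonym resolves to a single shared group, while a homonym such as Agathis montana resolves to more than one group and needs further context to disambiguate.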
uBio
6 Million+ Name Strings
Reconciliation Groups
http://www.ubio.org
Scientific Names in Medline
Medline Is Indexed With MeSH Terms
MeSH Contains Some Scientific Names
How Many Medline Documents Contain Scientific Names?
How Many Scientific Names are Contained in Medline?
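The two questions above are two different counts over the same data: documents that mention at least one name, and distinct names mentioned anywhere. A toy sketch, assuming plain substring matching and invented sample records (not real Medline data):

```python
def name_statistics(documents, names):
    """Return (documents containing any name, distinct names seen)."""
    docs_with_names = 0
    names_seen = set()
    for text in documents:
        lowered = text.lower()
        # Naive case-insensitive substring match; a real system would use
        # a name recognizer and a reconciled name list.
        found = {name for name in names if name.lower() in lowered}
        if found:
            docs_with_names += 1
        names_seen |= found
    return docs_with_names, len(names_seen)


# Invented sample records for illustration only.
DOCS = [
    "Plasmid transfer in Escherichia coli.",
    "Comparison of Brucella melitensis and Escherichia coli isolates.",
    "A survey of hospital staffing levels.",
]
NAMES = ["Escherichia coli", "Brucella melitensis", "Brucella canis"]

doc_count, name_count = name_statistics(DOCS, NAMES)
```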
Scientific Names in Medline
Medline vs. Biodiversity Literature
Medline vs. Catalogue Lists
Remaining Questions
How many species overlap?
Are there valid linkages to be drawn?
Biomedical (e.g., molecular)
Biodiversity (e.g., biogeography)
Tracking Prospective Knowledge
WSDL/SOAP Integration
Taxonomic Intelligence
RSS
http://portal.ubio.org
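Tracking new publications through RSS can be sketched as filtering feed items by a list of tracked names, including known synonyms. This is an illustration under stated assumptions: the feed content and links are invented, and the code does not use the uBio portal or its web services.

```python
import xml.etree.ElementTree as ET

# Invented RSS 2.0 feed for illustration only.
FEED = """<rss version="2.0"><channel>
<title>Example journal</title>
<item>
  <title>Genome of an Escherichia coli strain</title>
  <link>http://example.org/1</link>
</item>
<item>
  <title>Ants of Madagascar</title>
  <link>http://example.org/2</link>
</item>
<item>
  <title>Historical notes</title>
  <description>Early work on Bacterium coli.</description>
  <link>http://example.org/3</link>
</item>
</channel></rss>"""


def items_mentioning(feed_xml, tracked_names):
    """Return (name, link) pairs for feed items that mention a tracked name."""
    root = ET.fromstring(feed_xml)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        text = f"{title} {description}".lower()
        for name in tracked_names:
            if name.lower() in text:
                hits.append((name, item.findtext("link", default="")))
    return hits


# Tracking a name together with a synonym catches the third item as well.
tracked = ["Escherichia coli", "Bacterium coli"]
alerts = items_mentioning(FEED, tracked)
```

Supplying synonyms alongside the accepted name is the simplest form of the "taxonomic intelligence" named on this slide: items indexed under an older name are still found.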
Concluding Remarks
Retrospective Biological Knowledge
Not Just PDFs!
Biodiversity Heritage Library
Contemporary Biological Knowledge
Titles, Abstracts, Meta-Data (MeSH)
Medline
Prospective Biological Knowledge
Track New Publications via RSS
Services Integrated Into Interfaces
Acknowledgments
NSF/AMNH Team
Terry Catapano
Drew Koning
Bob Morris
Donat Agosti
Tom Moritz
Christie Stephenson
MBL/WHOI uBio
David Remsen
Patrick Leary
David Patterson
Cathy Norton
National Science Foundation (IIS-0241229)
Lewis B. & Dorothy Cullman Program for Molecular Systematics
Indra Neil Sarkar, PhD
sarkar@amnh.org
Pacific Symposium on Biocomputing - Biodiversity Informatics Track
http://psb.stanford.edu