The problem
The vast amount of data on life's biodiversity that scientists collect across the globe becomes truly powerful only when data from all the different sources can be integrated, queried, and aggregated with precision by machines. However, by far the most common way to link organismal data, using taxon names based on Linnaean nomenclature, is fundamentally fraught. The semantics of a Linnaean name, which are crucial for determining which organisms it does and does not refer to, change over time, are applied inconsistently, and, most importantly, are unavailable to machines. Moreover, the explosion of molecular data has uncovered many groups, not only microbial ones, that do not have and may never have a Linnaean name.
Our approach
We use ontologies and the Web Ontology Language (OWL), a formal ontology language standardized by the W3C, to enable users to create definitions with precise, unambiguous, and fully computable semantics for any organismal clade. We call such definitions phyloreferences, in analogy to georeferences, which have enabled countless geo-enabled applications. We employ machine reasoners to determine which elements in a phylogeny match a phyloreference.
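As a rough illustration of what it means for an element of a phylogeny to match a clade definition, the sketch below resolves a definition on a toy tree. It is written in plain Python rather than OWL and uses a simple tree traversal in place of a machine reasoner. The taxon names, the `PARENT` table, and the `resolve` function are all hypothetical and are not part of our ontologies or tooling.

```python
# Toy phylogeny as child -> parent links (hypothetical, for illustration only).
PARENT = {
    "n1": "root",
    "n2": "n1",
    "Homo": "n2",
    "Pan": "n2",
    "Gorilla": "n1",
    "Pongo": "root",
}


def ancestors(node):
    """Return the path from node up to the root, starting with node itself."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def descendants(node):
    """Return the set of leaf names under node (a leaf counts as its own descendant)."""
    children = [c for c, p in PARENT.items() if p == node]
    if not children:
        return {node}
    leaves = set()
    for child in children:
        leaves |= descendants(child)
    return leaves


def resolve(include, exclude=()):
    """Find the smallest clade that contains every taxon in `include`.

    Returns the name of the matching node, or None if that clade also
    contains a taxon listed in `exclude`.
    """
    # Walk up from the first included taxon; the first ancestor whose leaves
    # cover all included taxa is their most recent common ancestor.
    for node in ancestors(include[0]):
        leaves = descendants(node)
        if set(include) <= leaves:
            return None if leaves & set(exclude) else node
    return None


print(resolve(["Homo", "Pan"]))                          # smallest clade with both
print(resolve(["Homo", "Gorilla"], exclude=["Pongo"]))   # clade that must omit Pongo
```

Unlike a taxon name, a definition of this kind can be evaluated against any tree that contains the named taxa, and it yields either a single node or no match at all. An OWL-based definition expresses the same idea declaratively, so that a reasoner, rather than hand-written traversal code, determines the match.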
What we do
We are developing the OWL ontologies, OWL data models, and tooling infrastructure needed to put our approach into practice, and to validate that it produces correct results and scales. We will also develop web applications that allow users to create, find, and reuse phyloreferences, and to apply them to phylogenies ranging from small trees to the scale of the Tree of Life, for the purpose of integrating biodiversity data of interest. To accomplish this, we will be working with other biodiversity e-Science projects (including Open Tree of Life and the Encyclopedia of Life) as well as large organismal research collaborations.