Introduction to BioGUID

Edited on: 28 Aug 2015

BioGUID.org is a service for indexing and cross-linking identifiers for data objects within the realm of Biodiversity informatics. As you might suspect, you can access BioGUID.org at http://bioguid.org.

The importance of reliable globally unique identifiers for mobilizing biodiversity data has been well established through multiple workshops, whitepapers, and publications spanning several decades. However, the biodiversity informatics community has made very little real progress in establishing broadly adopted standards and coordinating data providers towards common and sensible practice. BioGUID.org is a platform for indexing and cross-linking identifiers of all kinds, to better facilitate establishing relationships among digital biodiversity objects.

The core items managed by BioGUID.org are “Identifier Domains” (sets of identifiers) and the Identifiers within those Domains. Each Identifier Domain may have one or more “Dereference Services” associated with it (e.g., http://dx.doi.org/ is a Dereference Service for the DOI Identifier Domain). Identifiers are cross-linked to each other by being anchored to the same “Identified Object”. For example, the fish genus Odontanthias is represented by identifiers in ZooBank, ITIS, and the Catalog of Fishes (among others), and all of these identifiers are anchored to the same Object within BioGUID. With a core set of web services, documentation and web tools (with more features on the way), BioGUID.org is designed to help tie Biodiversity data together.

Structure

The primary component of BioGUID.org is a database system that stores information on Identifiers, Identifier Domains (sets of identifiers), “Dereference Services” (web services that perform some sort of action on identifiers), and a set of support tables that together enable the indexing and cross-linking functions that BioGUID.org provides. To understand how it works, it’s important to understand the basic components.

Identifiers

BioGUID.org considers identifiers in a broad sense. It certainly includes everything that modern informatics practitioners consider to be “good” identifiers (such as Digital Object Identifiers [DOIs], Universally Unique Identifiers [UUIDs], HTTP-URIs, and other kinds of robust identifiers). But it also accommodates other kinds of identifiers, such as ISSN and ISBN numbers, specimen catalog numbers, certain kinds of codes (e.g. country codes or language codes), and almost any text-string that a person might think of as an Identifier.

Identifier Domains

Identifier Domains represent sets of identifiers that are assigned to objects related to biodiversity. In theory, each Identifier Domain assigns one identifier to one object. In the real world, the same object may have been issued more than one identifier by the same Identifier Domain, but this is usually in error (e.g., when the same specimen is cataloged twice with two different catalog numbers, or when a duplicate record is discovered in a database of taxon names or literature citations). Also, there should generally be no duplicate identifiers assigned within the same Identifier Domain; but again, in the real world, mistakes sometimes happen (e.g., when the same catalog number is accidentally issued to two separate specimens).

Identifier Domains may be very broad in scope. For example, Digital Object Identifiers, or DOIs, represent a single Identifier Domain because the identifiers are unique across all DOIs, and the intention is to issue one unique identifier for each object. Identifier Domains may also be very narrow in scope. For example, there is an Identifier Domain to represent the Language Codes assigned by the Biodiversity Heritage Library.

In some cases, it may be arbitrary how finely to parse sets of identifiers into discrete Identifier Domains. For example, Bishop Museum issues catalog numbers to specimens in its collection. Considering only the catalog numbers themselves, there would be a separate Identifier Domain for each of the major specimen collections. There is a fish specimen with the catalog number “1234”, and a bird specimen with the same number, and a plant specimen, and an insect, and so on. Each “1234” would be an identifier within a different Identifier Domain. However, if the Collection Code is considered part of the identifier along with the catalog number, then there may be only one Identifier Domain for all of the Bishop Museum specimen collections combined. In this context, the fish specimen would have the identifier “I-1234”, the bird would have “B-1234”, the plant would have “BISH 1234”, and so on. One could even take a further step back and establish a single Identifier Domain to represent ALL Darwin Core triplets (institutionCode + collectionCode + catalogNumber). This would be somewhat impractical to manage, however. Although there is some wiggle-room for defining when to establish a separate Identifier Domain for a particular series of Identifiers, for the most part it is a relatively straightforward and obvious process.
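The two ways of scoping the Bishop Museum catalog numbers can be illustrated with a small Python sketch. It is only an illustration, not BioGUID.org's actual data model: the Identifier type, the duplicates helper, and the domain labels are all hypothetical names chosen for this example. The point is that an identifier is only meaningful together with its Identifier Domain, so the same string can safely appear in two domains but should not appear twice within one.

```python
from collections import Counter
from typing import NamedTuple


class Identifier(NamedTuple):
    """An identifier paired with the Identifier Domain that issued it."""
    domain: str  # hypothetical domain label, not an actual BioGUID.org domain name
    value: str


def duplicates(identifiers):
    """Return the (domain, value) pairs that occur more than once."""
    counts = Counter(identifiers)
    return [ident for ident, n in counts.items() if n > 1]


# Option 1: one domain per collection; the bare catalog number is the identifier.
per_collection = [
    Identifier("Bishop Museum Fishes", "1234"),
    Identifier("Bishop Museum Birds", "1234"),
]

# Option 2: one domain for the whole museum; the collection code is part of the identifier.
combined = [
    Identifier("Bishop Museum Specimens", "I-1234"),
    Identifier("Bishop Museum Specimens", "B-1234"),
]

# A mistake: the same bare number placed twice in a single museum-wide domain.
clash = [
    Identifier("Bishop Museum Specimens", "1234"),
    Identifier("Bishop Museum Specimens", "1234"),
]

print(duplicates(per_collection))  # no duplicates: the domains differ
print(duplicates(combined))        # no duplicates: the values differ
print(duplicates(clash))           # one duplicated (domain, value) pair
```

Either scoping choice avoids collisions; what matters is that the choice is applied consistently within a domain.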

Dereference Services

One of the confusing and contentious issues concerning modern identifiers for biodiversity data objects revolves around the conflation of the “identification” of objects with information about how to retrieve information about them. In particular, the Linked Data (Linked Open Data, or LOD) community follows strict conventions that all identifiers must be represented as HTTP-URIs (that is, all LOD identifiers begin with the characters “http://”, followed by a domain name and the normal template for HTTP URIs). One of the greatest values of doing this is that such identifiers are self-resolving; that is, they not only allow for the unique identification of objects (every HTTP-URI is unique), but the identifier itself includes information about a particular internet protocol (HTTP) and a particular internet address (via the DNS system) where information about the object can be accessed. There are costs and benefits to this approach (which would take far more text than this web page can afford to adequately address). However, BioGUID.org is designed to accommodate the LOD community standards (HTTP-URIs can certainly be indexed as identifiers), while simultaneously recognizing the need to index the multitude of identifiers that are not represented as HTTP-URIs.

The way BioGUID.org achieves this “best of both worlds” approach is by managing the concept of “Dereference Services”. The easiest way to explain a Dereference Service in the context of identifiers is with the example of DOIs. To most people, a DOI (i.e., the DOI as an identifier) begins with the characters “10.”. Another convention is to prefix the DOI with the letters “doi:”, sometimes in upper-case and sometimes in lower-case, sometimes with a space between the colon and the “10.” and sometimes not. In either case, a “naked” DOI (one that begins with “10.”) is perfectly acceptable as an identifier, but by itself fails to meet the needs of the LOD community. To act on a naked DOI, one has to know what to do with it. By broad convention, a DOI becomes actionable when it is prefixed with the characters “http://dx.doi.org/”. Doing so enriches the DOI by converting it to an HTTP URI. Most people (including us at BioGUID.org) do not consider the text “http://dx.doi.org/” to be part of the identifier itself; rather, it represents a form of dereferencing metadata, or what we refer to as a “Dereference Service”. Effectively, the text “http://dx.doi.org/” is a prefix that allows the DOI identifier that follows it to be dereferenced.

Again, it would take more space than this page allows to describe this in detail, but the approach taken at BioGUID.org is to track Dereference Services as independent objects with their own properties, which can be combined with Identifiers (and in some cases additional suffix text after the identifier) to make otherwise “naked” identifiers both actionable and compatible with LOD requirements. Importantly, a single Dereference Service may be able to act on identifiers from many different Identifier Domains, and likewise identifiers within a single Identifier Domain may be serviced by multiple Dereference Services. BioGUID.org supports this many-to-many relationship between Dereference Services and Identifier Domains (and the Identifiers they act on).
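As a rough sketch of the idea, a Dereference Service can be modeled as a prefix (plus optional suffix text) that is combined with a naked identifier to produce an actionable HTTP URI. The class and function names below are hypothetical and are not the BioGUID.org API, and the DOI is a made-up placeholder. Only the “http://dx.doi.org/” prefix and the “doi:” labeling convention come from the description above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DereferenceService:
    """Hypothetical model: text placed before (and optionally after) an identifier."""
    prefix: str
    suffix: str = ""

    def dereference(self, identifier: str) -> str:
        """Combine the service with a naked identifier to form an actionable URI."""
        return f"{self.prefix}{identifier}{self.suffix}"


def strip_doi_label(text: str) -> str:
    """Remove an optional 'doi:' label (any case, optional spaces) from a DOI."""
    return re.sub(r"^\s*doi:\s*", "", text, flags=re.IGNORECASE).strip()


doi_service = DereferenceService(prefix="http://dx.doi.org/")

# "10.1234/example" is a placeholder, not a registered DOI.
for written in ("10.1234/example", "doi:10.1234/example", "DOI: 10.1234/example"):
    naked = strip_doi_label(written)
    print(doi_service.dereference(naked))  # http://dx.doi.org/10.1234/example
```

Keeping the service separate from the identifier means the same naked identifier can be paired with any number of services, which is one simple way to picture the many-to-many relationship.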

Identified Objects

Another aspect of BioGUID that is much less visible, but equally important (if not more so), is the concept of Identified “Objects”. These objects are generally conceptual in nature (e.g., a taxon name, or a literature citation, or a collecting event, or an organism occurrence); and even when they apply to a “physical” thing (like a specimen), it’s better to think of the identifier as representing a conceptual object. (Note: there is also a long discussion concerning the application of identifiers to digital objects, such as media files and individual database records; but that is yet another topic that, if addressed adequately, would consume all the space on this web page.)

If BioGUID.org did nothing other than index identifiers (and their associated Identifier Domains and Dereference Services), it would be performing a very valuable service. However, the true value of BioGUID.org lies in its ability to cross-link identifiers that have been assigned to the same object. For example, the bluefish (Pomatomus saltatrix) has been registered on ZooBank (under its original establishment by Linnaeus in 1766). This species also exists in the WoRMS database, and it also has a record in the Catalog of Fishes database, and in FishBase. Indeed, these various identifiers are all cross-linked to each other on the ZooBank site. However, the WoRMS record also includes a link to the Barcode of Life database and the Encyclopedia of Life database, as well as GenBank (i.e., the NCBI taxonomy). Each of these other websites contains additional links to other external databases. Some of these cross-links were generated semi-automatically; and others were made painstakingly by hand.

One of the main purposes of BioGUID.org is to serve as a global cross-linking service so that, rather than have every individual website separately attempt to establish these cross-links among related records, BioGUID.org provides a set of services and tools to consolidate all of these different records by anchoring identifiers to a common conceptual “object”. That way, by establishing just one link (e.g., between a ZooBank record and the corresponding WoRMS record), the identifier equivalencies of each can be inherited not only by each other, but by every single cross-linked data system. The effect is a potential exponential expansion of cross-linking identifiers and their associated metadata.
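The way a single new link lets two groups of identifiers inherit each other's equivalencies can be sketched with a small union-find structure. This is an illustration only and is not a description of how BioGUID.org is implemented; the CrossLinkIndex class is hypothetical, and the identifier values are invented placeholders rather than real record identifiers.

```python
class CrossLinkIndex:
    """Hypothetical index that groups identifiers anchored to the same object."""

    def __init__(self):
        self._parent = {}

    def _find(self, ident):
        """Return the representative of the group that contains ident."""
        self._parent.setdefault(ident, ident)
        root = ident
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited identifier straight at the root.
        while self._parent[ident] != root:
            nxt = self._parent[ident]
            self._parent[ident] = root
            ident = nxt
        return root

    def link(self, a, b):
        """Record that identifiers a and b refer to the same object."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def equivalents(self, ident):
        """Return every identifier anchored to the same object as ident."""
        root = self._find(ident)
        return sorted(i for i in list(self._parent) if self._find(i) == root)


# Placeholder (domain, value) pairs; the values are invented for illustration.
zoobank = ("ZooBank", "zb-0001")
cof = ("Catalog of Fishes", "cof-0001")
worms = ("WoRMS", "w-0001")
fishbase = ("FishBase", "fb-0001")

index = CrossLinkIndex()
index.link(zoobank, cof)     # links already known to one site
index.link(worms, fishbase)  # links already known to another site
index.link(zoobank, worms)   # one new link merges both groups

print(index.equivalents(fishbase))  # all four identifiers
```

After the single ZooBank-to-WoRMS link, looking up any one of the four identifiers returns all of them, which is the inheritance effect described above.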

Services

BioGUID.org is primarily a database system with a set of data services and a web-service layer that supports basic search and record creation functions. Although it has a simple website (which you are reading right now), this is primarily intended to serve as a repository for documentation, and to demonstrate the sorts of functionality that the web services are designed to fulfill. We plan to spend much more time in the near future expanding content and improving features and functionality. We will likely continue to make minor improvements to the website as well, but that is a lower priority. For more detailed documentation of the BioGUID.org web services and other information (such as the Data Model), please visit the API page.

Data Access and Licensing

All content available within the BioGUID.org site, including information on Identifier Domains, Dereference Services, Identifiers, and cross-linked objects, is available under the Creative Commons Zero (CC0) Public Domain Dedication.

The GBIF Ebbe Nielsen Challenge

The general concept behind this website has been floating around in my head for nearly a decade, and I registered the domain name over a year ago. However, the real impetus to finally build the underlying database and associated services was the announcement of the GBIF Ebbe Nielsen Challenge. In the weeks leading up to the submission deadline, the database was built from scratch, with a rich set of data-layer services and a few basic web services (and an even more basic web page). However, we are firmly committed to the continued expansion of BioGUID.org and its services, to demonstrate the power and potential of developing a universal identifier index and cross-linking system for biodiversity data.
