Trouble and solutions with museum specimen identifiers

(First published Aug 13th 2015, edited Oct 26th 2015)

After reading a few articles discussing the troubles with museum specimen identifiers (Guralnick et al. 2014, 2015), I decided to write down how we have managed identifiers at the Finnish Museum of Natural History, and what kind of solution we have come up with.

We are currently building a collection management system (CMS) ’Kotka’ in the Finnish Biodiversity Information Facility project, to be used in some Finnish natural history museums and botanic gardens. The development is done with agile methods: we have had a working CMS since 2011, which is continuously improved and expanded to new collections and domains (gardens, DNA labs). The use of identifiers has also gone through some changes, most notably from LSIDs to HTTP URIs in 2012.

Currently, every specimen is given a globally unique HTTP URI, e.g. http://id.luomus.fi/GV.123. The HTTP URI consists of several parts:

http://	protocol prefix
id.luomus.fi	domain name
GV	namespace identifier
123	object identifier
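
As a rough illustration of this structure, the parts can be pulled out of an identifier with a few lines of Python. This is only a sketch: the function name split_identifier is made up for this example, and it assumes that the first dot in the path always separates the namespace from the object identifier.

```python
from urllib.parse import urlparse


def split_identifier(uri):
    """Split a specimen URI into the four parts listed in the table above."""
    parsed = urlparse(uri)
    # The path looks like "/GV.123". Assumption: the first dot separates
    # the namespace identifier from the object identifier.
    local = parsed.path.lstrip("/")
    namespace, dot, object_id = local.partition(".")
    if parsed.scheme != "http" or not parsed.netloc:
        raise ValueError(f"not an HTTP URI: {uri!r}")
    if not (namespace and dot and object_id):
        raise ValueError(f"no namespace.object part in: {uri!r}")
    return {
        "protocol": parsed.scheme + "://",
        "domain": parsed.netloc,
        "namespace": namespace,
        "object_id": object_id,
    }


parts = split_identifier("http://id.luomus.fi/GV.123")
print(parts["namespace"], parts["object_id"])
```

Splitting on the first dot only means that an object identifier which itself contains a dot stays in one piece.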

In the beginning we used only our own domain name id.luomus.fi as a part of the identifiers, but as new organizations have come in, we have also started using the organization-neutral tun.fi domain for their new specimens.

Desirable properties of the identifiers

Short, so that they fit on a small insect label.

Dumb, in that they do not contain any information. This makes changes easier, since identifiers don’t have to be changed e.g. when a specimen is donated from one collection to another. Not all users agree with this (see below).

Human-readable, memorable and easy to type (compared e.g. with UUIDs).

Creating HTTP URIs

There are a few ways we create HTTP URI identifiers in our system:

1) Automatically. The CMS creates a new identifier when specimen data is recorded into it. This suits a process where specimens are added to the CMS one by one.

2) Manually by a collection manager, before recording the data to the CMS. This better suits a process where data is entered into Excel sheets and uploaded to the CMS in batches.

In this case, each person is assigned a unique 2- or 3-letter namespace identifier. They can then use consecutive numbers for the specimens under that namespace, and are responsible only for keeping their own numbers unique. For instance, Jere K. can create the identifier http://id.luomus.fi/GV.123 and Juho P. http://id.luomus.fi/GP.123 without conflict.
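
The per-person namespace idea can be sketched in Python as below. This is not how Kotka does it; the class name NamespaceCounter is hypothetical, and the rule that namespaces are uppercase ASCII letters is an assumption based on the examples GV and GP.

```python
import re


class NamespaceCounter:
    """Hands out consecutive identifiers under one person's namespace."""

    def __init__(self, namespace, domain="id.luomus.fi", start=1):
        # Assumption: a namespace is 2 or 3 uppercase ASCII letters (GV, GP).
        if not re.fullmatch(r"[A-Z]{2,3}", namespace):
            raise ValueError(f"invalid namespace: {namespace!r}")
        self.namespace = namespace
        self.domain = domain
        self._next_number = start

    def next_identifier(self):
        """Return the next unused identifier in this namespace."""
        uri = f"http://{self.domain}/{self.namespace}.{self._next_number}"
        self._next_number += 1
        return uri


# Two people can use the same numbers without conflict.
jere = NamespaceCounter("GV", start=123)
juho = NamespaceCounter("GP", start=123)
print(jere.next_identifier())
print(juho.next_identifier())
```

Each counter only has to keep its own numbers unique. Global uniqueness comes from the combination of domain and namespace.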

3) By using the old identifier as a part of the new one. This fits a case where specimens are already digitized and labelled with catalogue numbers.

There are a few reasons for using old identifiers:

The old identifier (which may have been used elsewhere) acts as shorthand for the new one, so that users can easily see that H0003706 and http://id.luomus.fi/HA.H0003706 are the same thing (this also reduces errors).

Mitigating resistance to change, by making new identifiers resemble the old ones (this has been surprisingly important for some).

If new identifiers are needed anyway, why not use the old ones to create them (instead of random numbers)?

But there are also drawbacks:

The same numbers have been used in multiple collections. For example, there may be several (even dozens of) specimens ’1’ in institution X, one in each subcollection. We make these unique by adding a collection-specific namespace identifier, so specimen number 1 could become http://tun.fi/SLE.1

The same numbers have mistakenly been used for several specimens in a single collection (e.g. when a stamping machine has jammed, or the old CMS has not validated them for uniqueness). These need new identifiers, which can be created e.g. by adding the letter A to the end of the identifier (http://tun.fi/SLE.0A) and writing that by hand on the specimen label.

Typing old identifiers during a mass-digitization process to create label stickers is slow and error-prone. In this case, we have created new, automatically generated identifiers (e.g. http://id.luomus.fi/EIG.123) and stored the old number as a synonym.
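
The first two drawbacks can be handled together when old numbers are turned into new identifiers. The Python sketch below is only an illustration: the function name assign_identifiers is made up, and it assumes that the first specimen keeps the bare number while later duplicates get the letters A, B, C and so on. In practice someone has to decide which physical specimen gets which letter.

```python
import string


def assign_identifiers(catalogue_numbers, namespace, domain="tun.fi"):
    """Build one unique URI per specimen from possibly duplicated old numbers."""
    used = set()
    uris = []
    for number in catalogue_numbers:
        candidate = number
        if candidate in used:
            # Duplicate old number: try a letter suffix until one is free.
            for letter in string.ascii_uppercase:
                candidate = number + letter
                if candidate not in used:
                    break
            else:
                raise ValueError(f"ran out of suffixes for {number!r}")
        used.add(candidate)
        # The collection-specific namespace keeps numbers from different
        # collections apart.
        uris.append(f"http://{domain}/{namespace}.{candidate}")
    return uris


# Old number 0 was stamped on two specimens of the same collection.
for uri in assign_identifiers(["0", "1", "0"], "SLE"):
    print(uri)
```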

Roadblocks

There are quite a few issues which have slowed down and complicated the adoption of HTTP URIs.

1) Many people think that HTTP URIs are first and foremost web addresses, and refer to them as such. Since they know that web addresses break and change, they fear that identifiers will too.

To counter this idea, I have referred to the identifiers as ”random character strings that just happen to resemble web addresses (and also function as such). How lucky!”

2) People fear that others will not accept or use them. Will an academic journal accept specimens being referred to with ”web addresses”, or will it require old-style catalogue numbers or DwC triplets?

In some cases, this is a valid fear. For example, BOLD doesn’t accept slashes in identifiers, but advises using DwC triplets for both voucher specimens and samples.

Moreover, there can be several conflicting requirements or recommendations for identifiers. The Canadian Centre for DNA Barcoding (CCDB), which is used to sequence samples that are sent to BOLD, doesn’t accept colons (or slashes) in sample identifiers, so DwC triplets are out of the question, as are HTTP URIs.

What to do in these cases? Label the specimens and samples with different secondary identifiers for each purpose?

We have made guidelines on how to convert identifiers for certain purposes. For example, http://id.luomus.fi/JA.123 would become MZH:JA.123 for BOLD Museum ID and MZH_JA.123 for CCDB Sample ID.
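
The conversion in this example is mechanical, so it could be written as a small Python helper along these lines. The function names are made up for this sketch, and the institution code MZH is taken from the example above and passed in as a default; other institutions would supply their own.

```python
from urllib.parse import urlparse


def local_part(uri):
    """Return the namespace.object part of a specimen URI, e.g. 'JA.123'."""
    local = urlparse(uri).path.lstrip("/")
    if not local or "/" in local or ":" in local:
        raise ValueError(f"unexpected identifier form: {uri!r}")
    return local


def to_bold_museum_id(uri, institution_code="MZH"):
    """Colon-separated form for the BOLD Museum ID field (no slashes)."""
    return f"{institution_code}:{local_part(uri)}"


def to_ccdb_sample_id(uri, institution_code="MZH"):
    """Underscore-separated form for the CCDB Sample ID (no colons or slashes)."""
    return f"{institution_code}_{local_part(uri)}"


uri = "http://id.luomus.fi/JA.123"
print(to_bold_museum_id(uri))
print(to_ccdb_sample_id(uri))
```

Doing the conversion in code rather than by hand keeps the secondary identifiers in one consistent format, so they can later be mapped back to the original URI.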

Several layers of requirements and recommendations also create misunderstandings, especially when combined with different terms for things (catalogue numbers, specimen identifiers, museum IDs, voucher codes…) and the fact that people come and go. Who remembers what John said about museum IDs in 2011 and Jane about catalogue numbers in 2013? This can create a horrible mess (see examples*).

3) People are used to using numbers and collection/institution abbreviations. Some are even proud of them and see that abbreviations on specimen labels bring prestige. ”We are one of the few who have one-letter abbreviations, so we should use it.”

4) People don’t see why global uniqueness and automated use of identifiers are important. ”A skillful researcher will know where to look for information about specimen ABC123.” (Just check the institution from registry X and go to its website to search with the identifier.)

The ideas of automation and of using large specimen datasets are unfamiliar to many. This can lead to careless use of the identifiers; e.g. they are delivered to external systems (such as BOLD or GenBank) in varying formats. Examples should be provided to show how specimen data can be used and mashed up on a large scale.

5) People feel that it’s easier to use the specimens if the identifier tells something about them (even if it could be looked up in a database):

Owner institution of the specimen
Consecutive numbers, which can be used to calculate how many specimens were acquired during a particular period

For these reasons one can create new identifiers by hand (using consecutive numbers) in Kotka, and joining institutions can choose their own domain name to be part of the identifier.

6) Not all digitized and databased specimens have identifiers, or the identifiers are not printed on the label. Re-labelling all specimens would be too much work. We have simply left these to be labelled later (e.g. when the specimen is loaned or referred to in a study).

Protocol prefix should be used

Some have questioned whether the http protocol prefix (http://) should be used, since we don’t know if (or when) it will be replaced with something new. I think it should be used because:

It makes the identifiers dereferenceable. Better to use something that works at least now than something that doesn’t work even now.
It states the type of the identifier and the logic of who is responsible for its global uniqueness (domain management is outsourced to an official entity).
There will very probably be standard processes to dereference HTTP URIs in the future even if http itself is abandoned, because they are used in so many domains and places (not only in biodiversity informatics, but everywhere). And if there aren’t, it’s because of some global disaster, which makes this the least of our problems.

Solution

There is no perfect or standardized way to handle this, but little by little a solution has evolved.

The original catalogue number is recorded in a separate field. Then a new HTTP URI is created automatically. Both fields are made searchable, so the specimen can be found with either one. At least the HTTP URI is printed on the labels, and possibly the original one also.
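
The two-field approach can be illustrated with a small in-memory index in Python. This is a sketch of the idea only, not of Kotka’s implementation: the class SpecimenIndex and the sample records are hypothetical. Because old catalogue numbers are not guaranteed to be unique, a search by catalogue number may return several specimens.

```python
from collections import defaultdict


class SpecimenIndex:
    """Finds specimens by HTTP URI or by original catalogue number."""

    def __init__(self):
        self._uris = set()
        # One old catalogue number may point to several specimens.
        self._by_catalogue_number = defaultdict(list)

    def add(self, uri, catalogue_number=None):
        if uri in self._uris:
            raise ValueError(f"URI already in use: {uri!r}")
        self._uris.add(uri)
        if catalogue_number is not None:
            self._by_catalogue_number[catalogue_number].append(uri)

    def find(self, query):
        """Return the URIs of all specimens matching either field."""
        if query in self._uris:
            return [query]
        return list(self._by_catalogue_number.get(query, []))


index = SpecimenIndex()
# Hypothetical records: two specimens that were both stamped with number 1.
index.add("http://id.luomus.fi/EIG.123", catalogue_number="1")
index.add("http://id.luomus.fi/EIG.124", catalogue_number="1")
print(index.find("1"))
print(index.find("http://id.luomus.fi/EIG.124"))
```

The URI always resolves to exactly one specimen, while the old number acts as a convenient but possibly ambiguous synonym.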

This way the problems with the old numbers are not passed on to the HTTP URIs. This also gives flexibility to researchers: they are free to use either one (or both) during the research and publication process.

The drawback is that curators are probably going to use the new identifier less, and stick to the old one. As a solution to this, the benefits of using the new HTTP URI should be made clear, for example by making tools that mash up and analyze the data in new and scientifically relevant ways. This requires effort not only from our institution, but from the whole biodiversity informatics community.

* Examples of duplicate identifiers

This is a made-up example of what kind of identifiers could have been used for one specimen. I have seen all of these kinds of identifiers being used or recommended.

123	catalogue number, hand-stamped on the label
ABC123	institution code (pre-printed on the label) and catalogue number
ABC:123	DwC triplet without collection part
ABC:diptera:123	DwC triplet with a made-up collection code
http://id.abc.org/XY.123	HTTP URI with namespace ’XY’ that makes identifiers unique between collections
XY.123	Short version of the HTTP URI (’qname’); should be avoided, but is used anyway
ABC_XY.123	Sample ID for CCDB and Museum ID for BOLD without colons and slashes
ABC_XY_123	Different version of the previous ID
ABC:XY.123	DwC triplet version of the URI

And also perhaps:

http://id.abc.org/XY.123A	HTTP URI after someone noticed that there are several specimens ’123’ in the collection
XY.123A	Short version of the previous id

…and so on.

Even more combinations are possible, if identifiers from the previous table are used to create new identifiers automatically. For example:

ABC:diptera:ABC_XY_123	DwC triplet of institution code, made-up collection code and a museum id

References

Guralnick R, Conlin T, Deck J, Stucky BJ, Cellinese N (2014) The Trouble with Triplets in Biodiversity Informatics: A Data-Driven Case against Current Identifier Practices. PLoS ONE 9(12): e114069. doi:10.1371/journal.pone.0114069

Guralnick RP, Cellinese N, Deck J, Pyle RL, Kunze J, Penev L, Walls R, Hagedorn G, Agosti D, Wieczorek J, Catapano T, Page RDM (2015) Community Next Steps for Making Globally Unique Identifiers Work for Biocollections Data. ZooKeys 494: 133–154. doi:10.3897/zookeys.494.9352