Change examples for term - occurrenceID #491

Open
ahahn-gbif opened this issue Aug 14, 2023 · 25 comments

Comments

@ahahn-gbif

ahahn-gbif commented Aug 14, 2023

Term documentation change suggestion: move away from triplet combination examples for occurrenceID

  • Submitter: ahahn@gbif.org
  • Efficacy Justification (why is this change necessary?): In an effort to move towards stable identifiers for occurrence records, we observe that a non-negligible portion of inadvertent ID changes in GBIF-ingested datasets happens because e.g. collection identifier schemes within an institution get consolidated to a common pattern across collections. This update then translates into the concatenated occurrenceID values, resulting in an id change. While we are still in a transitional phase where the former triplet of institutionCode, collectionCode, catalogNumber does a meaningful job as a record identifier, we would recommend not to promote this strategy for newly configured datasets, and rather suggest the use of identifiers that do not carry a meaning.
  • Demand Justification (if the change is semantic in nature, name at least two organizations that independently need this term): n/a
  • Stability Justification (what concerns are there that this might affect existing implementations?): n/a
  • Implications for dwciri: namespace (does this change affect a dwciri term version)?: n/a

Current Term definition: https://dwc.tdwg.org/list/#dwc_occurrenceID

Proposed attributes of the new term version (Please put actual changes to be implemented in bold and strikethrough):

  • Term name (in lowerCamelCase for properties, UpperCamelCase for classes):

  • Term label (English, not normative):

  • Organized in Class (e.g., Occurrence, Event, Location, Taxon):

  • Definition of the term (normative):

  • Usage comments (recommendations regarding content, etc., not normative):

  • Examples (not normative):
    http://arctos.database.museum/guid/MSB:Mamm:233627
    000866d2-c177-4648-a200-ead4007051b9
    urn:catalog:UWBM:Bird:89776

  • Refines (identifier of the broader term this term refines; normative):

  • Replaces (identifier of the existing term that would be deprecated and replaced by this term; normative):

  • ABCD 2.06 (XPATH of the equivalent term in ABCD or EFG; not normative):

@deepreef

I STRONGLY support this proposal (as those who know my opinions about this general topic could probably guess).

@matdillen

If we remove these two recommendations, do we state to which terms such identifiers should be mapped instead? I presume this would be materialEntityID or parentMaterialSampleID, but those terms are not accepted yet.

Or is the implication that persistent identifiers with any kind of meaning (including most URIs that are also URLs) should not be used for specimens at all?

@albenson-usgs

While I understand the rationale for this, in my opinion, having worked with many data providers across a spectrum of data collection methods, it's practically speaking impossible to implement unless someone (hand waving) is going to mint and keep track of these opaque identifiers for projects. Most of the people I work with are still operating in Excel spreadsheets - they absolutely do not have a good way to mint and keep track of opaque identifiers. If you ask them to mint opaque identifiers for their data what will happen is they will run this line of code (occurrence$occurrenceID <- sapply(occurrence$occurrenceID, function(x) UUIDgenerate(use.time = TRUE))) and you will get a brand new set of occurrenceIDs every time they republish the data. From my experience at least, if they are creating their occurrenceIDs using information in their data a person can walk that back to the actual record. Having opaque identifiers makes that impossible. If there is some solution to this that I am completely missing I would really like to learn about it.

@jdpye
Member

jdpye commented Aug 16, 2023

Following up on what Abby said, because I was struggling with how to say it and I think she represented it very well, a UUID is not a useful tool for people updating and extending datasets that are actively adding new records and need to distinguish old records from new ones.

An example I have is that a listening station detects a coded pinger that is associated to an animal attachment event. Well, the attachment event tells me species, the station deployment tells me place, and the pinger recording event on the instrument tells me time.

If one of these components changes due to a mistake in recording any piece of that process, when I republish the dataset, I need to locate the entry for the occurrence record I need to correct, using the IDs of the components that I know very well contributed to the creation of the occurrence record. If it's a UUID, I have potentially lost my ability to do that transparently and authoritatively.

From my experience, which I grant is specific compared to the whole of the database, guidance for this field that would be valuable to a user learning how to create these archives would be to use something that is guaranteed unique, related to the occurrence record's creation, and authoritative to an individual record, and if you don't have something obvious to do that with, use a UUID as a last resort.

@peterdesmet
Member

As I understand, the intent is to remove examples that carry meaning (which I understand). The only example that then remains is a UUID, which seems to give the impression (from the comments above) that that is the only valuable alternative.

In my opinion, integer identifiers that are maintained by the source (e.g. a database) and are immutable, are also valuable alternatives. I would therefore include the following as an additional example:

20622886648

Note: this example is derived from https://www.gbif.org/occurrence/3797662301, where the occurrenceID is the record identifier assigned by the source database (in this case Movebank).

@dbloom
Contributor

dbloom commented Aug 16, 2023

I was just responding to this when @peterdesmet's message posted which mirrors my own understanding. This is a request only to have certain examples removed (which I understand, too), not a request to stop the use of certain types of identifiers, such as Triplets.

I will, however, admit to a little confusion about the implications of removing certain types of identifiers from the list of examples. Is there an implication here that GBIF is planning to move away from Triplets and URLs in the occurrenceID field?

Certain collections systems, Arctos being a solid example, are not likely to stop using a URL any time in the foreseeable future. Similarly, as @albenson-usgs described, the Triplet might be the most stable option for many collections with limited staff and technical expertise, especially as we work to include more data from ecologists and biologists not affiliated with traditional museum collections directly. Like Abby, I work with datasets like these, and their publishers, regularly.

I am also curious about the implications here as a GBIF Trainer/Mentor. Historically, Triplets are a recommended option for publishers moving through the BID Programme as well as other GBIF-inspired trainings around the world. Data publishers tend to take the definitions and examples provided in the DwC Quick Reference Guide quite literally, so I could see some confusion arising among publishers with this change.

@albenson-usgs

I'd really love to discuss this topic in more detail than I think is possible in a GitHub thread (and apologies if I have taken this change request off topic a bit). I'd like to pose it as an unconference topic at the upcoming TDWG. Would folks be interested? Give me a thumbs up emoji if so.

@dbloom
Contributor

dbloom commented Aug 16, 2023

@albenson-usgs That could be a useful conversation. I am unable to attend TDWG this year, so I don't know how participation will work in that unconference setting - we'll have to see how that room is set up.

For purposes here, I would also be curious for a little more clarification from @ahahn-gbif on this matter. No doubt she and the GBIF Data Products folks see many more datasets than we do, but I'm curious to know a little more about the value of removing examples like these from the GBIF perspective. I'm not opposed to the change, I would just like to know more and, like @albenson-usgs, I'm curious what else might be out there that isn't a randomly generated UUID.

@qgroom
Member

qgroom commented Aug 16, 2023

Please don't let this thread get sidetracked on the various merits of different identifier systems.
@ahahn-gbif is not recommending a UUID; she is removing the implicit recommendation of using a triple.
Triples have long been discredited: they are sometimes not unique, but worse still, they totally fail when confronted by the myriad ways people construct them.

Therefore I suggest adding a URI as an example of a stable occurrenceID

e.g. http://www.botanicalcollections.be/specimen/BR5020224598676V

References about rubbish triples...
https://doi.org/10.1371/journal.pone.0114069
https://doi.org/10.37044/osf.io/93qf4

@albenson-usgs

Ok but the justification for this change states

In an effort to move towards stable identifiers for occurrence records, we observe that a non-negligible portion of inadvertent ID changes in GBIF-ingested datasets happens because e.g. collection identifier schemes within an institution get consolidated to a common pattern across collections. This update then translates into the concatenated occurrenceID values, resulting in an id change.

And what I'm saying is that I don't believe making this change will lead to occurrenceIDs that do not change. If that is the goal, then I do not believe this change will achieve it. People will just create a series of letters and numbers instead, and those will change when they update the dataset.

If what we want to see is stable identifiers for occurrence records, then I think we should work on that problem, which I don't think will be resolved by only showing UUIDs as the example.

@dbloom
Contributor

dbloom commented Aug 16, 2023

Yes. That is also where my confusion/concern lies. I don't see how the means leads to the desired end.

@qgroom
Member

qgroom commented Aug 16, 2023

Except that all the evidence has shown that triples do not work. So far I've not seen any evidence that URIs or DOIs are not stable.

@MattBlissett
Member

Recommended best practice is to use a persistent, globally unique identifier.

Perhaps expanding the notes would help, to explain why "persistent" and "globally unique" are useful.

Recommended best practice is to use a persistent, globally unique identifier.

A persistent identifier is one that will not change, allowing others to link to this record in perpetuity. Avoid including words or numbers that might change, such as the name or abbreviation of a museum department, or a current storage location.

A globally unique identifier works without any other information to identify the occurrence.

A UUID can meet both criteria, so long as the UUID is reliably stored, for example in a collections database.

(I'm not aware of any specific examples Andrea may have in mind.)

@albenson-usgs

albenson-usgs commented Aug 16, 2023

Ok here is an example where the occurrenceIDs completely changed between V1.0 and V1.1. Admittedly it is me as a naïve data manager just trying to do my best and being told that it must be a UUID. But I am not alone in this- this is what people will do and it will not lead to the result you are wanting. I think where I struggle with this requirement is how you anticipate it being carried out for small datasets that are managed in Excel. Can you please show me an example of such a dataset managed completely in Excel that has successfully implemented a DOI or URI scheme for their occurrenceIDs?

@qgroom
Copy link
Member

qgroom commented Aug 16, 2023

Human errors will always happen with whatever system is used. Excel is particularly clever at amplifying the chance of human errors.
The problem with triples is that in addition to human errors they are not inherently unique, stable or standardized.

I think we can suggest any of the common systems, including UUIDs, but not triples, because they are not fit for purpose.

@deepreef

deepreef commented Aug 16, 2023

Wow... great discussion! So, I was probably too anemic in my initial response to this proposal. Although I do happen to support UUIDs as the ideal identifier for our context, that was not the reason I strongly supported this proposal. My support was for the notion of encouraging opaque identifiers -- regardless of whether they are UUIDs or integers or randomly generated character strings or whatever. When I first started developing biodiversity data management systems (1980s), I was strongly opposed to what we used to call "surrogate primary keys". Instead, I preferred what we called "natural primary keys" -- that is, identifying some field or combination of fields in the database that (collectively) represented a unique combination of values for each record. My reasoning was that there was no need to maintain yet another field to uniquely identify each record, when unique identity was self-evident from the actual data-bearing fields themselves.

It didn't take very long for me to realize what a bad mistake that was. In the decades since then, I've not only embraced "surrogate primary keys" for databases, I've also recognized the importance of NOT using these surrogate primary keys as publicly accessible identifiers. Instead, they should be optimized for database purposes, such as integrity enforcement and performance for complex joins and stuff like that. That led me to a database architecture whereby a single data table in my database (called "PK") is responsible for locking in two permanent values: an integer and a UUID. The integer serves as the source for internal primary keys for every table in the database, and the UUID is married 1:1 with these integers and is used for representing an identifier for each record whenever the data are exposed externally.
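A minimal sketch of the "PK" table idea described above, written in Python with the standard-library sqlite3 module. This is an illustration under stated assumptions, not the author's actual schema: the table and column names (pk, occurrence, catalog_number) and the helper functions are hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")

# One table locks in two permanent values per record: an integer and a UUID.
conn.execute(
    """
    CREATE TABLE pk (
        id   INTEGER PRIMARY KEY,   -- internal key, used for joins
        uuid TEXT NOT NULL UNIQUE   -- external identifier, minted once
    )
    """
)
# A data table (hypothetical) that takes its primary key from pk.
conn.execute(
    """
    CREATE TABLE occurrence (
        id             INTEGER PRIMARY KEY REFERENCES pk(id),
        catalog_number TEXT
    )
    """
)


def new_occurrence(catalog_number):
    """Reserve an (integer, UUID) pair, then create the data row under that integer."""
    cur = conn.execute("INSERT INTO pk (uuid) VALUES (?)", (str(uuid.uuid4()),))
    pk_id = cur.lastrowid
    conn.execute(
        "INSERT INTO occurrence (id, catalog_number) VALUES (?, ?)",
        (pk_id, catalog_number),
    )
    return pk_id


def external_id(pk_id):
    """Look up the UUID that is exposed whenever the record leaves the database."""
    return conn.execute("SELECT uuid FROM pk WHERE id = ?", (pk_id,)).fetchone()[0]


rec = new_occurrence("233627")
before = external_id(rec)

# Correcting a data-bearing field does not touch the identifier.
conn.execute("UPDATE occurrence SET catalog_number = ? WHERE id = ?", ("233672", rec))
after = external_id(rec)
```

In this sketch the catalog number can be corrected freely because neither the integer nor the UUID is derived from it.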

Alas, despite two full-on workshops, multiple mini-conferences and symposia, several robust whitepapers, a couple of publications, and dozens of presentations at TDWG, our community still seems to be stuck when it comes to representing unique identifiers for data records in our exchanged data. Part of the problem, I think, is that our community got too hung up on the "resolvable" thing. Basically, to make writing code a little easier, a lot of people wanted to follow the LOD path and conflate the roles of "resolution" (=dereferencing) and "identification" (i.e., unique, stable, persistent). Some even advocated that every identifier should begin with the characters "http://" (which, of course, results in breaking every single identifier once an SSL certificate is implemented, but I'll leave that one alone for now...)

Coming back to the issue at hand: there is a strong desire by many to make our identifiers friendly to human eyeballs. This is one of the reasons everyone likes DOIs more than UUIDs. But the REAL advantage of DOIs is not that they're easy on the human eyeballs, but the robust network of identifier dereferencing that exists (e.g., Crossref, etc.). It's also one of the reasons why (unfortunately) the outcome of the aforementioned TDWG/GBIF workshops resulted in a recommendation of LSIDs (sigh).

Sorry for the rambling context above, but this all leads me to my key point: We should keep things like triplets and other non-opaque values (i.e. "natural primary keys") in our datasets to make it easier for humans to track things down. But if we're ever going to cross the threshold of data integration and reusability in our community, we really need to get serious about moving towards real identifiers for data records. And I think this proposal (encouraging opacity for occurrenceID values) is an important step in that direction (though it's certainly not the only step).

Even for people who maintain their data in Excel -- if they can manage a column in their spreadsheets for things like "catalogNumber", why can't they simply add one more column for "occurrenceID", and populate it with some arbitrary and unique and non-information-bearing value (UUID, integer, whatever), then never change that value? They can use a formula like the one @albenson-usgs gave to initially populate the value, but there's no need to re-run the formula every time the dataset is exported.
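The "populate once, then never change" approach can be sketched as below. This is a hedged Python illustration rather than the R one-liner quoted earlier in the thread; the function name fill_missing_occurrence_ids and the sample rows are hypothetical. The key assumption is that the filled-in column is saved back to the working spreadsheet, not only to the export.

```python
import uuid


def fill_missing_occurrence_ids(rows):
    """Mint a UUID only for rows whose occurrenceID is empty; leave existing values alone."""
    for row in rows:
        if not row.get("occurrenceID"):
            row["occurrenceID"] = str(uuid.uuid4())
    return rows


# First publication: no row has an identifier yet (sample data is made up).
rows = [
    {"catalogNumber": "A-001", "occurrenceID": ""},
    {"catalogNumber": "A-002", "occurrenceID": ""},
]
fill_missing_occurrence_ids(rows)
first_export = [r["occurrenceID"] for r in rows]

# Republication: one record was added, and the same function is run again.
rows.append({"catalogNumber": "A-003", "occurrenceID": ""})
fill_missing_occurrence_ids(rows)
second_export = [r["occurrenceID"] for r in rows]
```

Only the new row receives a fresh value on the second run. If the identifiers are not persisted between runs, every row looks empty again and the whole set is regenerated, which is the failure described earlier in the thread.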

Sorry for the rant -- it is NOT my intention to hijack this discussion to become a general debate about identifiers (which we've had many, many times before). But I felt the need to respond (indirectly) to some of the comments posted, and expand the explanation for my support for this proposal.

@Jegelewicz

Triples may not be unique, but I don't think you can get more unique than http://arctos.database.museum/guid/MSB:Mamm:233627? A web address must be globally unique or the internet wouldn't function. Just because we include what someone calls a "triple" in it should not make it unworthy of being a GUID. @dustymc

@deepreef

deepreef commented Aug 16, 2023

Triples may not be unique, but I don't think you can get more unique than http://arctos.database.museum/guid/MSB:Mamm:233627? A web address must be globally unique or the internet wouldn't function. Just because we include what someone calls a "triple" in it should not make it unworthy of being a GUID. @dustymc

Sure, you can certainly achieve uniqueness; but you break persistence the moment the collection decides to change the collectionCode to "Mam", or someone discovers a typo in the catalogNumber, or if you add an SSL certificate to the website and the identifier changes to "https://arctos.database.museum/guid/MSB:Mamm:233627".

Sure, I know that last one is easily brushed aside because web resolvers usually handle redirection appropriately. But the point of opacity in identifiers is more about persistence (stability) than it is about uniqueness.

@albenson-usgs

Ok here is an example of a dataset just freshly published. Code is here. For this one the project was observing sound in the ocean which also includes species observations from those sounds. The project leads have archived all the data and shared it publicly. They are interested to share the species observation portion but the data are not managed with that goal. A data manager separate from the project then took those observations from those sound recordings, aggregated them together (they were spread across multiple datasets), aggregated them so that only one occurrence per day is recorded and aligned them to Darwin Core. An occurrenceID is necessary. The person performing the work is not the holder of the data. I'm truly and honestly curious what those advocating for opaque identifiers would have the data manager do in this instance? Anything that's done will not be stored by the original project.

@dustymc

dustymc commented Aug 16, 2023

Yup. MSB:Mamm:233627 isn't a GUID, and if I had my way we'd stop using it even internally. (It works fine internally, but tends to leak and that leads to things like low-quality citations.)

http://arctos.database.museum/guid/MSB:Mamm:233627 is a GUID, and is as stable as the Curators care to make it, just like the material it represents. Containing a "triple" doesn't contaminate the identifier.

We run SSL and it still works.

8088573f-8aba-4f9c-90ab-fefa895b9532 is more likely to be unique than 1 - unless there's a unique index involved - but it can't really DO anything and won't generally lead humans to material. (Anyone with the keys to MSB can probably find https://arctos.database.museum/guid/MSB:Mamm:233627 without involving electrons.)

break persistence the moment the collection decides to change the collectionCode to "Mam",

That's not a collectionCode, so nope, it'll survive that change no problem. It might devolve to a UUID-like "probably unique but doesn't DO STUFF" identifier if the HTTP protocol goes away or something, but that wouldn't remove its identifier-ness.

DOIs might keep DOING STUFF through events that could remove functionality from URLs, but they're also more work to maintain. There are a few (~20K; example: https://doi.org/10.7299/X7C829FR) assigned to records in Arctos. There's no barrier to using them; I think the cost/benefit analysis just tends to come up lacking except in some specific situations.

someone discovers a typo in the catalogNumber,

That can be dealt with in the same sort of way that discovering a problem with the labeling of physical items can be dealt with, and doesn't do anything to the identifier. http://arctos.database.museum/guid/DGR:Mamm:10002772 is an example of a record that's been recataloged (not quite a typo, still a change).

more about persistence (stability) than it is about uniqueness.

Both are necessary if one wants to DO STUFF at scale.

@Jegelewicz

In the grand scheme of things who cares what the recommendations are? People will do whatever until there is an affordable and manageable path that allows them to do what is necessary.

@deepreef

Sure, as long as "http://arctos.database.museum/guid/MSB:Mamm:233627" is minted once, and remains as such always even if someone realizes the catalog number is actually 233672 for this record -- that's great. This, of course, requires that the identifier be stored as a literal value (not constructed automatically from the constituent parts at export time) -- and I wonder how often that's done. Even if it is done, in my experience there is an overwhelming temptation to want to "correct" the identifier to "http://arctos.database.museum/guid/MSB:Mamm:233672" (when the catalog number error is discovered). But as long as the data manager can resist that urge, then everything should be OK.
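The difference between an identifier stored as a literal and one rebuilt from its parts at export time can be sketched as follows. This is a Python illustration only: the record layout and the function constructed_id are hypothetical, and it is not a description of how Arctos stores or serves its GUIDs.

```python
BASE = "http://arctos.database.museum/guid/"


def constructed_id(record):
    """Rebuild the identifier from its parts, as an export script might do."""
    return f"{BASE}{record['institution']}:{record['collection']}:{record['catalogNumber']}"


# Hypothetical record layout; the identifier is minted once and stored as a literal.
record = {"institution": "MSB", "collection": "Mamm", "catalogNumber": "233627"}
record["occurrenceID"] = constructed_id(record)

# Later, the catalog number turns out to be a transposition and is corrected.
record["catalogNumber"] = "233672"

stored = record["occurrenceID"]   # unchanged by the correction
rebuilt = constructed_id(record)  # follows the correction, so the ID changes
```

Publishing `stored` keeps the identifier persistent after the correction, while publishing `rebuilt` silently issues a new identifier for the same record.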

As for conflating the functions of identity with "DOING STUFF", I'll refer to here. I've written elsewhere (extensively) about why the roles of identity and doing stuff should not be conflated; but that doesn't mean I'm right.

@ahahn-gbif
Author

Thanks for this great discussion! I am rather overwhelmed.

As more background was asked, just a few notes: for the last year, GBIF has been monitoring more closely which datasets change a significant portion of their occurrenceIDs in successive ingestion runs, holding ingestion on those, and communicating with publishers on options. It turns out that for some datasets, ids will always change and there is (supposedly) nothing to be done about it - this is more often the case for data aggregates and observation data, and maybe a different conversation to be had; others change through some kind of error, and publishers notified are happy to revert; and yet another group change because some systematic change in one of the text values causes all occurrenceIDs to change.
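As a rough illustration of the kind of check described above (comparing occurrenceIDs between successive versions of a dataset), here is a minimal Python sketch. It is an assumption about what such a comparison could look like, not GBIF's actual ingestion logic; the function name, the sample identifiers and the threshold are all hypothetical.

```python
def changed_fraction(previous_ids, current_ids):
    """Share of the previous version's occurrenceIDs that no longer appear in the current one."""
    previous = set(previous_ids)
    if not previous:
        return 0.0
    return len(previous - set(current_ids)) / len(previous)


THRESHOLD = 0.5  # hypothetical cut-off for "a significant portion"

v1 = ["a1", "a2", "a3", "a4"]
v2_stable = ["a1", "a2", "a3", "a4", "a5"]  # one record added, none lost
v2_regenerated = ["b1", "b2", "b3", "b4"]   # every identifier re-minted

hold_stable = changed_fraction(v1, v2_stable) > THRESHOLD
hold_regenerated = changed_fraction(v1, v2_regenerated) > THRESHOLD
```

Adding records does not count as a change in this sketch; only identifiers that disappear between versions do.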

My suggestion was not meant to promote UUIDs as the only option, sorry if it came across that way. It was rather to not promote triplet ids as much as we have done so far by removing the explicit triplet example (no. 3 more than no. 1) - and being lazy in not providing examples for alternative options.

Just to be clear, there is no intention on the side of GBIF to enforce a particular use of occurrenceIDs, or to phase out others. We recognize that people will do what is most practical in their daily work. Where the option exists, however, we would want to encourage going for both persistent and globally unique, so that we can move one more step away from identifiers that keep going out of scope. What exact shape that takes is not as important, provided it aims for real persistence alongside uniqueness.

@ymgan

ymgan commented Aug 26, 2024

Hi everyone!

@albenson-usgs @sformel-usgs @jdpye @emiliom and I would like to bring in our perspectives as data managers for this topic by taking a different tangent - the PROCESS of creating an Occurrence record and its occurrenceID. We work with field scientists; we clean and transform their raw data into Darwin Core tables. We felt that this part of the conversation was not being understood.

Our data providers do not usually manage an Occurrence table

Very often, our data providers DO NOT have an Occurrence table in the original data. Examples:

Screenshot 2024-08-26 at 18 01 37
Screenshot 2024-08-26 at 18 01 46

Very often an Occurrence is represented as a cell linked to multiple tables. Why a wide table? We asked our data providers. This is how field biologists think of data!

occurrenceID is needed to trace back to the original data!

This is exactly why we use transparent identifiers for occurrenceID!

Screenshot 2024-08-26 at 18 02 53
Screenshot 2024-08-26 at 18 03 03

Since our data provider does not manage an Occurrence table, they will be confused if a user finds their contact info in the dataset EML and emails them: "Hey, I think you might be missing a negative in your decimalLatitude in your record (occurrenceID: 4823f29b-2c6e-4a43-86b1-430e16a9c34e)". How can our data provider know which row and which cell to look at, unless the user sends the link to the record? Btw, not all aggregators display the verbatim data when they change the data (e.g. by replacing values).
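The trace-back that a transparent identifier allows can be sketched as below, in Python. Everything here is hypothetical and for illustration only: the wide table, its column names, the counts and the identifier pattern (station:date:taxon) are made up and are not taken from the screenshots or from any published dataset.

```python
# A made-up wide table: one row per sampling event, one column per taxon.
wide = [
    {"station": "ST01", "date": "2020-01-15", "Euphausia superba": 12, "Salpa thompsoni": 0},
    {"station": "ST02", "date": "2020-01-16", "Euphausia superba": 3, "Salpa thompsoni": 7},
]
KEY_COLUMNS = ("station", "date")


def to_occurrences(rows):
    """Turn each taxon cell into one Occurrence row with a transparent occurrenceID."""
    out = []
    for row in rows:
        for column, value in row.items():
            if column in KEY_COLUMNS:
                continue
            taxon_slug = column.replace(" ", "_")
            out.append({
                "occurrenceID": f"{row['station']}:{row['date']}:{taxon_slug}",
                "scientificName": column,
                "individualCount": value,
                "occurrenceStatus": "present" if value else "absent",
            })
    return out


def trace_back(occurrence_id, rows):
    """Return (row index, column name) of the source cell, or None if not found."""
    station, date, taxon_slug = occurrence_id.split(":")
    column = taxon_slug.replace("_", " ")
    for index, row in enumerate(rows):
        if row["station"] == station and row["date"] == date:
            return index, column
    return None


occurrences = to_occurrences(wide)
cell = trace_back("ST02:2020-01-16:Salpa_thompsoni", wide)
```

With an identifier built this way, a data provider who never sees the Occurrence table can still be pointed to the exact row and column in their own spreadsheet. The trade-off raised elsewhere in the thread still applies: if a station code or date is corrected, an identifier rebuilt from those parts changes with it.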

Data comes in different shapes and forms

Most of our data providers manage data in Excel spreadsheets, and the tables often lack a primary key (they do not need one). We also receive Word and PDF documents where the data consists of multiple tables, unstructured text and sometimes multimedia. We all have different levels of resources. Some data providers have databases; some do not have a CMS.

We may not have the resources to maintain opaque occurrenceIDs, which leads us to the next point.

The cost of changing our data provider's approach

I tried developing data templates for our data providers before they went on expedition, which forced them to record data in a tidier manner. It created friction in how they manage the data. When I received the data upon their return, they had added the columns they used to have and used only some of the columns that I created. The data became even more difficult to clean and transform. Subsequently, we were excluded from the next expedition meetings and I did not receive datasets from them anymore.

FAIR data is not a requirement (at least for most of the data providers of the Antarctic node) and also not a priority for many of our data providers (they are evaluated by the ranking of the journals, not the number of datasets). Many of our data providers are only required to make their data freely available. When there is friction, they will take the easier route, such as publishing datasets to Zenodo or PANGAEA.

What we are really saying is that promoting meaningless identifiers and REQUIRING them to be STABLE, when our data providers DO NOT even manage an Occurrence table, creates friction with data providers who lack the resources. This may come with a cost: the risk of datasets not being updated, or of providers not wanting to publish their datasets as Darwin Core Archives at all.

Transparent identifiers are still useful

We certainly do not want our data providers to stop updating their datasets or stop publishing their datasets to GBIF.

We also think that the stability of gbifID and the stability of occurrenceID are two separate issues. We think that acknowledging transparent identifiers is important because at this stage, transparent identifiers may be the best thing that we could come up with! We hope that we could acknowledge this with curiosity and empathy.

Please keep the transparent identifier example

Please keep the triplet examples in the list of examples. The current triplet of institution:collection:catalogNumber is still useful and valuable for data providers without a CMS.

We think that it will be helpful to have a set of principles to guide the user in creating or choosing the identifiers they need and maintain. Exactly where the principles should be, we are not sure, perhaps in the comments?

SYM25 in TDWG 2024

Finally, we submitted an abstract titled "What matters for an occurrenceID and what is an occurrenceID that matters?" in SYM25 Occurrences are Neither Specimens or Samples: Data modelling challenges and opportunities for information storage and exchange

Some of the screenshots above are taken from our slides. If you are interested, we will be delighted to share our experience with you in the talk!

Thank you so much!

@tucotuco
Member

Thank you all for this thorough and fundamental perspective. The "Darwin Core Triplet" example was included exactly to reduce obstacles to data sharing. It is good to have this additional experience from another community of practice.
