Revisiting comments on dwc:recordedBy, dwc:identifiedBy, dwc:georeferencedBy #450

Open
dshorthouse opened this issue Jun 5, 2023 · 2 comments

@dshorthouse
dshorthouse commented Jun 5, 2023

The comments on the above terms recommend the use of pipes ( | ) to separate the values in a list. I wish to raise an observation about this recommendation and urge that these comments be removed or replaced with something far less syntactically stringent. The root of my concern is whether or not purported items in a value for these terms can in fact be cleanly represented as a list. My argument here is that no, they cannot. Forcing them into a list of units introduces unintended bias.

Although none of these terms make any mention of verbatim (due in part to the philosophical rabbit holes that such a word has us tumble down), neither do they recommend that these terms convey the identity of people or organizations. Nonetheless, recommending that pipes be used to separate items in a string as though they were a list does in fact nudge us down that path. Rarely, if ever, would one see a pipe separating members of a team on a collecting label. Rarely, if ever, would such values be expressed in seemingly formal Western structures like the one provided in the comment, Oliver P. Pearson | Anita K. Pearson. The recommendation in the comments for these terms gives the impression that values are best computed; some form of local disambiguation activity is advisable. None of the definitions or recommendations suggest that shared content here be factual representations. As a result, collection managers who use a relational data management system may be less inclined to record what is written on a label because it's perceived as having little downstream value (the use of pipes suggests someone has a purpose for the contained parts and does not know how to deal with other separators), favouring instead the use of computed name(s) held elsewhere in their system. This is a mistake.

Values for dwc:recordedBy and dwc:identifiedBy should be absent any implicit statement about identity that artificial separator characters like pipes introduce. When presented with examples like, Dr. & Mrs. John Smith on a collector label, do we eschew the recommendation or do we construct it like, Dr. Smith | Mrs. John Smith or Dr. John Smith | Mrs. John Smith or Dr. John Smith | Mrs. ? Smith or John Smith | ? Smith or simply John Smith (it's all too common that the "Mrs." is entirely dropped)? Do we construct an awkward group if there is such an object type in one's collection management system? All of these are perfectly possible, but their implementations and expressions depend on one's familiarity with Western names. Likewise, it would appear exceedingly bizarre to some if ampersands were arbitrarily replaced by pipes for the purposes of publishing data as Dr. | Mrs. John Smith. Note that all are semantically different from the original form, which are likely to result in differing disambiguation routines should these occur outside the walls of the collection management system. Similarly, there are many examples of collector names written in native languages on labels whose separators might be 'e' or other. Introduction of pipes in their place might sway a collection manager to use a canonical, managerial, transliterated/translated form of these names. In short, pipes introduce unintended, cultural bias when it is likely that their purpose was to remove such biases. If it is truly identity we wish to convey, we have the terms dwc:recordedByID and dwc:identifiedByID for this very purpose. So...I do not know what clarity of purpose pipes serve in the exchange of occurrence records that contain these two terms.
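To make the Dr. & Mrs. John Smith case concrete, here is a minimal Python sketch of what a naive publishing routine might do if it simply swapped ampersands for pipes. The function name `ampersands_to_pipes` is hypothetical and is not part of any Darwin Core tooling; it only illustrates how the published value drifts from what is written on the label.

```python
def ampersands_to_pipes(label_text):
    """Naively replace '&' separators with the recommended ' | ' separator.

    Hypothetical helper for illustration only: it treats every '&' as a
    boundary between two complete names, which a label does not guarantee.
    """
    parts = [part.strip() for part in label_text.split("&")]
    return " | ".join(part for part in parts if part)


verbatim = "Dr. & Mrs. John Smith"
published = ampersands_to_pipes(verbatim)

print(published)              # Dr. | Mrs. John Smith
print(published == verbatim)  # False: the label text is no longer recoverable as written
```

The first "item" in the result is just a title with no name attached, so any downstream disambiguation routine that trusts the pipes starts from a different string than the one on the label.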

@matdillen

I don't think the pipes should be dropped from the documentation, although some clarification may be needed so people don't alter name strings for no good reason, as you described. The pipes are a tool to indicate multiple instances of recordedBy in the spreadsheet format of simple Darwin Core. A system can connect an occurrence to multiple people who were there to record it and who are modelled as separate person records themselves. The pipes then help collapse this one-to-many relationship in a (sort of) standardized way. Without the pipes recommendation, we could see various other ways of doing this, making parsing (even more) difficult than it already is.
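As a rough illustration of that collapsing step, assuming a system that already models each person as a separate record, a Python sketch could look like the following. The function names and the in-memory list are assumptions made for the example, not part of Darwin Core or any particular collection management system.

```python
PIPE = " | "  # separator recommended in the term comments


def collapse_recorded_by(names):
    """Join separately modelled person names into one recordedBy string."""
    return PIPE.join(name.strip() for name in names)


def split_recorded_by(value):
    """Split a pipe-delimited recordedBy string back into its parts."""
    return [part.strip() for part in value.split("|") if part.strip()]


# Names taken from the example in the term comments.
people = ["Oliver P. Pearson", "Anita K. Pearson"]

recorded_by = collapse_recorded_by(people)
print(recorded_by)                     # Oliver P. Pearson | Anita K. Pearson
print(split_recorded_by(recorded_by))  # ['Oliver P. Pearson', 'Anita K. Pearson']
```

The round trip only works this cleanly because the parts were discrete units before they were joined; it says nothing about strings transcribed from a label.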

As you say, this can produce ambiguity when trying to shoehorn verbatim transcribed data from a specimen label or a field notebook into this field as well. Here, things get even more messy than you suggest, because many systems do not allow for names to be stored in a verbatim, not delimited manner, or at least recommend that teams/individual names are interpreted into parts. As a result, names will come out explicitly not verbatim and teams may get scrambled as well. This may not be best practice, but it is practice and not easy to change. The pipes are still a good tool for those cases to at least mitigate some of the interoperability problems.

A verbatimRecordedBy and verbatimIdentifiedBy field may help by spreading out some of the ambiguity. But I'm not sure there is much demand for that. And, as you said, the exact meaning of verbatim is its own can of worms. Verbatim can still mean the text is interpreted as the names are spread out in a piece of prose text, or uncertainty is indicated with question marks, square brackets and dots, or that the text was interpreted at some point but the curators failed to connect it to an existing record in the collector table (i.e. the name string's identity is not known).

Currently, the recordedByID and identifiedByID should contain a globally unique identifier (per the documentation). A standardized name string is typically not globally unique (and does not resolve, making it less useful). I have seen quite some use of values that are not globally unique identifiers in GBIF (e.g. integers, abbreviations or name strings), but I don't think this should be recommended.

@dshorthouse
Author

> The pipes are a tool to indicate multiple instances of recordedBy in the spreadsheet format of simple Darwin Core.

Perhaps then the comments and recommendations for these terms should prompt us to consider the origins of the content & whether or not the use of pipes could risk misrepresenting or poorly communicating whatever provenance some downstream users of these terms hoped they might contain.

A born digital occurrence (eg observation) may have no such issue with the use of pipes if the origins of content that would be shared in dwc:recordedBy are already stored as digital units; choice of any separator is a trivial concatenation routine, unlikely to misrepresent the intentions of the participants. An occurrence born physical (eg specimen label) on the other hand may have content that could be partitioned into dwc:recordedBy as a digital representation, but further atomization of this introduces additional interpretation, which may degrade others' ability to make sense of its content if later concatenated. It does not necessarily matter whether this is in spreadsheet format or pulled from a relational collection management system.

As is the case with most Darwin Core terms, dwc:recordedBy, dwc:identifiedBy, and (to some extent) dwc:georeferencedBy are a compromise to facilitate the exchange of information. I can only assume that the compromise in these cases had been between the tensions of sharing provenance and expressing identity, or there had been no such tension, just few discussions, and we simply got on with a practical solution. Now that we have terms like dwc:recordedByID and dwc:identifiedByID, perhaps it's time to refocus on the intent and use-cases of those string-based terms and refine the comments and recommendations to reflect the new advancements. It would be a shame if producers of data transcribed from physical media had been led down a path (at some considerable expense if development effort & staff time were earmarked for this) when the outcome may be undesirable for the purposes of data exchange, disambiguation, and discovery of putative overlap (eg clustering of occurrences).

3 participants