Data loss on citeproc import #20

Open
bwiernik opened this issue Jan 19, 2020 · 16 comments

@bwiernik
Collaborator

bwiernik commented Jan 19, 2020

library(handlr)
cp_txt <- '[  {"id":"JolySilencetablemanners2008","accessed":{"date-parts":[[2019,10,27]]},"author":[{"family":"Joly","given":"Janneke F."},{"family":"Stapel","given":"Diederik A."},{"family":"Lindenberg","given":"Siegwart M."}],"container-title":"Personality and Social Psychology Bulletin","container-title-short":"Pers. Soc. Psychol. Bull.","DOI":"10.1177/0146167208318401","ISSN":"0146-1672, 1552-7433","issue":"8","issued":{"date-parts":[[2008,8]]},"language":"en","page":"1047-1056","references":"Retraction published 2012, <i>Personality and Social Psychology Bulletin, 38</i>[10], 1378, https://doi.org/10.1177/0146167212462821","source":"Crossref","title":"Silence and table manners: when environments activate norms","title-short":"Silence and table manners","type":"article-journal","URL":"http://journals.sagepub.com/doi/10.1177/0146167208318401","volume":"34"}]'
cp_parsed <- citeproc_reader(cp_txt)
# list the fields that survived the import
names(cp_parsed)

When reading the Citeproc/CSL JSON format, handlr currently discards any valid CSL variables that are not part of its internal Crosscite format. This seems quite suboptimal, because it means that handlr can only really work properly with Citeproc data for a small number of item types (pretty much just article-journal and webpage). For example, the genre and medium variables that are used to indicate the category for a report or thesis are discarded. The editor variable, used for books and book chapters, is another case. In the example data I provide above, the variable references is discarded.

If I were to generate a reference for this item using the American Psychological Association CSL style, it would be:
Joly, J. F., Stapel, D. A., & Lindenberg, S. M. (2008). Silence and table manners: When environments activate norms. Personality and Social Psychology Bulletin, 34(8), 1047–1056. https://doi.org/10.1177/0146167208318401 (Retraction published 2012, Personality and Social Psychology Bulletin, 38[10], 1378, https://doi.org/10.1177/0146167212462821)

However, if I import the item to handlr, export to CSL JSON again, and render the citation, it's:
Joly, J. F., Stapel, D. A., & Lindenberg, S. M. (2008). Silence and table manners: When environments activate norms. Personality and Social Psychology Bulletin, 34(8), 1047–1056. https://doi.org/10.1177/0146167208318401

The retraction information has been lost.
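
A quick way to see which variables go missing in a round trip like this is to compare field names before and after. This is only a sketch in base R: dropped_fields() is a hypothetical helper, not a handlr function, and roundtrip is a hand-built stand-in for the re-exported item rather than real handlr output.

```r
# Hypothetical helper (not part of handlr): names present in the original
# CSL item but absent from the item after import and re-export.
dropped_fields <- function(original, roundtrip) {
  setdiff(names(original), names(roundtrip))
}

# A few fields from the example item above, written as a plain R list.
original <- list(
  id = "JolySilencetablemanners2008",
  type = "article-journal",
  title = "Silence and table manners: when environments activate norms",
  references = "Retraction published 2012, <i>Personality and Social Psychology Bulletin, 38</i>[10], 1378, https://doi.org/10.1177/0146167212462821"
)

# Assumption: stand-in for the round-tripped item, built by hand by
# removing the field reported as lost in this issue.
roundtrip <- original[setdiff(names(original), "references")]

dropped_fields(original, roundtrip)
```

Running the same comparison on real import and export results would give a list of every variable that needs either a mapping or a place to be stored.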

Other variables, such as annote, genre, note, medium, collection-title, number, and illustrator, are also all discarded on import.
For item types and fields that don't have a Crosscite analogue, it seems like it would be wise to store these in the item data (e.g., as csl_note, csl_medium) and map them to other formats at translation time as needed.

@sckott
Contributor

sckott commented Jan 21, 2020

thanks for this @bwiernik - I definitely want to improve the citeproc reader/writer.

it seems like it would be wise to store these in the item data (e.g., as csl_note, csl_medium) and map them to other formats at translation time

can you explain what you mean here? i'm not sure I follow. what are csl_note and csl_medium?

@bwiernik
Collaborator Author

bwiernik commented Jan 21, 2020

By the way, I'm working on a package cslr that creates a class for citeproc-formatted data, similar to the BibEntry class in RefManageR, and provides import, management, sorting, and citation tools.

The list of CSL variables is given here: https://aurimasv.github.io/z2csl/typeMap.xml
My suggestion is that, if there are fields that don't fit into the CrossCite format, they should be stored. For example, currently handlr will discard medium from a citeproc JSON object if it is provided. Instead, I would recommend that these get stored, with the prefix csl_ to indicate they come from citeproc. So for example, if a citeproc file specifies something for medium, that could get stored in the field csl_medium in the handl object.

For example:

[
  {"id":"CuttsHappiness2017",
    "abstract":"[truncated]",
    "accessed":{
      "date-parts":[[2019,10,26]]
      },
    "dimensions":"PT00H04M16S",
    "director":[{"family":"Cutts","given":"Steve"}],
    "issued":{
      "date-parts":[[2017,11,24]]
    },
    "medium":"Video",
    "publisher":"Vimeo",
    "source":"Vimeo",
    "title":"Happiness",
    "type":"motion_picture",
    "URL":"https://vimeo.com/244405542"}
]

Here, accessed, dimensions, director, medium, source, and URL will get dropped. These should either be mapped to appropriate fields (e.g., URL to b_url, director to author with a field to indicate the creator type) or stored as CSL-specific fields (e.g., csl_dimensions, csl_medium, csl_source).
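
To make the csl_ idea concrete, here is a rough base R sketch of the "store rather than drop" behaviour. Everything in it is an assumption for illustration: keep_unmapped() is a hypothetical function, and mapped_vars is a made-up subset of variables treated as already mapped, not handlr's actual mapping.

```r
# Assumed, illustrative set of CSL variables that already have a home in the
# internal format; the real set would come from handlr's own mappings.
mapped_vars <- c("id", "type", "title", "issued", "publisher", "abstract")

# Hypothetical helper: keep mapped variables under their own names and store
# everything else under a csl_ prefix instead of discarding it.
keep_unmapped <- function(item, mapped = mapped_vars) {
  is_mapped <- names(item) %in% mapped
  extras <- item[!is_mapped]
  if (length(extras) > 0) {
    names(extras) <- paste0("csl_", names(extras))
  }
  c(item[is_mapped], extras)
}

# A few fields from the motion_picture example above, as a plain R list.
item <- list(
  id = "CuttsHappiness2017",
  type = "motion_picture",
  title = "Happiness",
  medium = "Video",
  dimensions = "PT00H04M16S",
  source = "Vimeo"
)

stored <- keep_unmapped(item)
names(stored)
```

The prefixed fields can then be ignored by writers that have no use for them and picked up again by the citeproc writer on export.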

A similar argument could be made for fields that are specific to biblatex, bibtex, or other formats and not represented in the Crosscite schema. In general, I think it would make sense to create a table that cross-references the fields for each data format (e.g., pairing biblatex_urldate with csl_accessed). This table could then be used when converting fields from one data format to another. This could provide greater conversion fidelity than relying on the limits of any particular data format.
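
As a rough illustration of how such a table might be used, here is a base R sketch. The three crosswalk rows are my own illustrative pairings and not a vetted mapping, translate_fields() is a hypothetical function, and the shorthand value is made up for the example.

```r
# Illustrative crosswalk: one row per concept, one column per format.
# The rows are assumptions for this sketch only.
crosswalk <- data.frame(
  csl      = c("accessed", "URL", "container-title"),
  biblatex = c("urldate", "url", "journaltitle"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rename fields using the crosswalk. Fields with no
# counterpart keep their data under a prefix naming the source format.
translate_fields <- function(item, from, to, table = crosswalk) {
  idx <- match(names(item), table[[from]])
  names(item) <- ifelse(
    is.na(idx),
    paste0(from, "_", names(item)),
    table[[to]][idx]
  )
  item
}

bib <- list(
  urldate = "2019-10-27",
  url = "http://journals.sagepub.com/doi/10.1177/0146167208318401",
  shorthand = "JSL08"  # made-up value; stands in for a field with no CSL counterpart
)

out <- translate_fields(bib, from = "biblatex", to = "csl")
names(out)
```

Keeping the mapping in a data frame rather than in per-format code means a new format only needs a new column.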

I am happy to help create such a table for the formats handlr currently supports.

@sckott
Contributor

sckott commented Jan 21, 2020

looks like you forgot to finish a thought:

So for example, if a citeproc

@bwiernik
Collaborator Author

Sorry, fixed that.

@sckott
Contributor

sckott commented Jan 23, 2020

thanks for the fix.

I think it would make sense to create a table that cross-references the fields for each data format

As you've probably seen, we do have some named lists, e.g., https://github.com/ropensci/handlr/blob/master/R/translations.R, as converters between formats. A table would be good though.

I agree about not dropping fields, and assigning them a csl_ prefix.

@sckott
Contributor

sckott commented Feb 25, 2020

@bwiernik Are you still interested in making that table?

@bwiernik
Collaborator Author

Yes, I'm hoping to get to it in the next week or two.

@sckott
Contributor

sckott commented Feb 26, 2020

Okay, thanks

@sckott
Contributor

sckott commented Mar 5, 2020

notes:

google spreadsheet started in https://docs.google.com/spreadsheets/d/1p1XaEtTBU_CmZba0P8nGpIlqAS2A8r4ZUs-WJarKUxo/edit#gid=0 - then move to the package when more stable

@bwiernik
Collaborator Author

bwiernik commented Mar 5, 2020

@sckott
Contributor

sckott commented Mar 5, 2020

thanks! Do you know where to get a complete list of JATS types?

@bwiernik
Collaborator Author

bwiernik commented Mar 5, 2020

There is the full list in the JATS spec https://groups.niso.org/apps/group_public/download.php/21030/ANSI-NISO-Z39.96-2019.pdf

@sckott
Contributor

sckott commented Mar 5, 2020

i don't see a full list in there for @publication-type. it only says on page 276:

Category of publication being cited (for example, “book”, “letter”, “review”, “journal”, “patent”, “report”, “standard”, “data”, “working-paper”).

@bwiernik
Collaborator Author

bwiernik commented Mar 5, 2020

Oh I see what you mean. Hmm. I’m not sure there is a formal list anywhere. Probably the best option would be to compile the converter programs, such as those listed in the Wiki article here https://en.wikipedia.org/wiki/Journal_Article_Tag_Suite, and see what conventions have emerged.

@sckott
Contributor

sckott commented Mar 6, 2020

Okay, thanks - not sure we need to include JATS, but if it's easy enough to do, it seems worth it

@sckott sckott added this to the v0.3 milestone Oct 13, 2020
@sckott
Contributor

sckott commented Oct 14, 2020

there's better support for citeproc now. I'm sure it could be better, but I need to submit a new version for other reasons, so I'm moving this to the next milestone. still need to finish the crosswalk spreadsheet linked above, covering all formats, and then implement it here

@sckott sckott modified the milestones: v0.3, v0.4 Oct 14, 2020