
CSL support #2082

Open
wants to merge 15 commits into
base: master

Omikhleia
Member

@Omikhleia Omikhleia commented Jun 29, 2024

Closes #2074

It already does nice things (see the screenshots in the referenced issue).

In order to support CSL (Citation Style Language), we need to:

  • Convert the BibTeX entries to CSL format.
  • Use a CSL engine to format the citations and bibliography references. It boils down to:
    • Support CSL locales
    • Support CSL styles
    • Implement the CSL processor/renderer (a "reasonable" subset at least)

Regarding the conversion of BibTeX entries, the mappings are not straightforward, but there is some prior art that we can check... None of the implementations I checked did exactly the same thing, so it's likely to be a bit messy...
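To illustrate why the mapping is not one-to-one, here is a minimal sketch in Python (not the Lua code from this PR). The tables are hypothetical and deliberately incomplete, and the function name `bibtex_to_csl` is an assumption made for the example. It shows one of the messy spots: BibTeX `number` means an issue for a journal article, but a volume number within a series for most other types.

```python
# Hypothetical, incomplete mapping tables: an illustration, not the PR's tables.
BIBTEX_TYPE_TO_CSL = {
    "article": "article-journal",
    "book": "book",
    "inproceedings": "paper-conference",
    "incollection": "chapter",
    "phdthesis": "thesis",
}

BIBTEX_FIELD_TO_CSL = {
    "title": "title",
    "journal": "container-title",
    "booktitle": "container-title",
    "publisher": "publisher",
    "address": "publisher-place",
    "pages": "page",
    "volume": "volume",
    "series": "collection-title",
    "doi": "DOI",
}


def bibtex_to_csl(entry_type, fields):
    """Convert one parsed BibTeX entry (type + dict of fields) to a CSL-like dict."""
    # Assumption: unknown BibTeX types fall back to the generic "document" type.
    item = {"type": BIBTEX_TYPE_TO_CSL.get(entry_type.lower(), "document")}
    for name, value in fields.items():
        csl_name = BIBTEX_FIELD_TO_CSL.get(name.lower())
        if csl_name is not None:
            item[csl_name] = value
    # "number" is ambiguous: issue of a journal, or number within a series.
    if "number" in fields:
        key = "issue" if item["type"] == "article-journal" else "collection-number"
        item[key] = fields["number"]
    # CSL dates are structured; only the year is kept in this sketch.
    if "year" in fields:
        item["issued"] = {"date-parts": [[fields["year"]]]}
    return item
```

Fields without a mapping (such as `note` here) are silently dropped in this sketch; a real converter would have to decide what to do with each of them.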

Regarding the CSL engine, there are various existing implementations.
Yet, I had a look at them, and I am not really convinced by their code quality, so I went ahead and implemented the CSL 1.0.2 specification from scratch. Because it's fun, and SILE has the guts to do it. And because I think I can.

Additionally, this would also close several other items.

Closes #2024 = The CSL locales take care of it.

Closes #2022 = The CSL styles have appropriate fallbacks (substitutes, conditionals, etc.)

Closes #2027 = The CSL styles and locales define how to format localized dates in the selected citation or bibliography style.

Closes #2026 = Some CSL styles sort entries by citation order ("citation-number"), so keeping track of cited entries was needed anyhow.
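As a rough illustration of that last point, tracking citation order can be as simple as assigning a number the first time a key is cited. This is a Python sketch under my own assumptions (the class name `CitationTracker` and its methods are hypothetical, not the PR's Lua API):

```python
class CitationTracker:
    """Assigns each key a "citation-number" in order of first citation."""

    def __init__(self):
        self._numbers = {}  # key -> citation-number (1-based)

    def cite(self, key):
        # A key gets its number the first time it is cited and keeps it afterwards.
        if key not in self._numbers:
            self._numbers[key] = len(self._numbers) + 1
        return self._numbers[key]

    def cited_keys(self):
        # Keys sorted by citation-number, i.e. in first-cited order.
        return sorted(self._numbers, key=self._numbers.get)
```

A style sorting on "citation-number" could then list the bibliography in the order given by `cited_keys()`.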

@Omikhleia Omikhleia requested a review from alerque as a code owner June 29, 2024 06:40
@Omikhleia Omikhleia marked this pull request as draft June 29, 2024 06:40
@Omikhleia Omikhleia force-pushed the bibliography-csl branch 3 times, most recently from 0f5d659 to 330cd9d Compare July 14, 2024 18:57
@Omikhleia
Member Author

2024/07/14 "Stage 0" milestone: Successfully processed 1355 references

  • with en-US locale and styles chicago-author-date, chicago-fullnote-bibliography and apa.
  • with fr-FR locale and styles chicago-author-date, chicago-author-date-fr, chicago-fullnote-bibliography-fr and apa

@Omikhleia
Member Author

Omikhleia commented Jul 20, 2024

2024/07/20 "Stage 1" milestone: Successfully processed 1508 references,

  • with entry sorting according to the CSL style.
  • tested with fr-FR locale and styles chicago-author-date-fr, chicago-fullnote-bibliography-fr
  • Some entries belong to numbered "series" (a.k.a. collection-title and collection-number in CSL)

@Omikhleia
Member Author

Omikhleia commented Jul 28, 2024

Leaving on vacation soon, so here are just some progress notes to myself, so that I remember:

  • Stage 2
    • implement subsequent-author-substitute (I had it done more or less this week-end on an experimental ground, but I'm unhappy with the code so I didn't push it... I prioritized working on my bib files, now over 2000 references, and couldn't finish that code properly today...)
    • implement "locators" in citations
  • Stage 3 = implement page-range-delimiter so page ranges would look decent...
  • Stage 4 = Understand how to properly handle demoting/non-demoting particles in names (I have some of these in my bibliography files, so I guess it's time to dig into the topic...)
  • Stage 5 = review package commands for multiple citations (but then, how to handle locators?) --> I'm gonna postpone this item, it needs some further discussion.

That's a minimal set. There would still be a few missing features from the CSL spec, but at least all Chicago-styles would be covered fairly decently, and a first milestone would be passed.

@Omikhleia
Member Author

Slowly back on track.
I rebased the branch, and added a commit with support for #2026 (see rationale in main description). I ran some tests with the "American Chemical Society" (ACS) style, which uses the "citation-number".
We are not there yet, but it's progress.
I also silently included ("in passing") a small refactor/fix for an issue I experienced with the Modern Language Association (MLA) style, which I used (with a few adaptations) for the 2600+ references in the book I made this summer, A bibliography of Tolkien studies in French & English -- But there's still some code to clean up and refactor from that work-in-progress ;)

@Omikhleia
Member Author

Omikhleia commented Sep 12, 2024

Let's refactor a bit and support locators. It's a refactor, since none of this is released yet.

Chicago style:

(image: rendered output)

This is demonstrated in \csl:cite[page=30-35]{FullInProceedings}; see also \csl:cite[fig=5, key=FullBook].
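The `page=30-35` locator is where the page range delimiter comes into play. As a minimal Python sketch (not the PR's Lua code, and ignoring CSL's separate page-range-format option), the hyphen or hyphens typed in the source can be replaced by whatever delimiter applies. The function name `format_page_range` and the en-dash default are assumptions for the example:

```python
import re

# One or more hyphens or en-dashes between two numbers, with optional spaces.
_RANGE = re.compile(r"(\d+)\s*[-\u2013]+\s*(\d+)")


def format_page_range(pages, delimiter="\u2013"):
    """Replace the separator of each numeric page range by the given delimiter."""
    return _RANGE.sub(lambda m: m.group(1) + delimiter + m.group(2), pages)
```

With the default delimiter, both `30-35` and the BibTeX-style `30--35` come out with a single en-dash, while a lone page number is left untouched.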

…phies

Honor the page-range-delimiter from the locale.
@Omikhleia
Member Author

I hate names with particles, definitely. 🤣 -- Doh, it was hard for my tired brain. One checkbox ticked.

@Omikhleia Omikhleia self-assigned this Sep 14, 2024
We can use the bibtex.style setting to help switching implementations
We can also ensure printbibliography works with legacy citations.
This will make deprecations and transition easier.
@Omikhleia
Member Author

Omikhleia commented Sep 17, 2024

The wild landscape of subsequent author substitutions

(Or the reason why something started weeks ago and announced in a previous comment takes so long!)

There are lots and lots of styles in the CSL repository. They don't all use subsequent-author-substitute, but when they do (approx. 297 styles), there are around 35 different patterns. This doesn't mean much on its own, so let's go into detail. It boils down to 7 different categories, possibly with some approximations and simplifications:

  • main case (210+): 1 to 7 en- or em- dashes (the majority of cases), possibly with trailing spaces or periods
  • empty = 21, mostly on styles with special display ("block" etc.)
  • 1 to 6 regular dashes = - (1), --- (26), more dashes (4)
  • 3 to 10 underscores = three (4), more (15); with or without trailing spaces or periods
  • plain words (e.g. "id.") = 6
  • spaces only = 2
  • dubious = 2 ("???")
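For what it's worth, the categories above can be expressed as a small classifier. This Python sketch is only a reading of the list, not the script used for the counts; the function name `classify_substitute` and the category labels are made up for the example:

```python
def classify_substitute(value):
    """Classify a subsequent-author-substitute string into a rough category."""
    if value == "":
        return "empty"
    if value.isspace():
        return "spaces"
    # Trailing spaces and periods are ignored for the classification.
    core = value.rstrip(" .")
    if not core:
        return "other"
    chars = set(core)
    if chars <= {"\u2013", "\u2014"}:  # en- and em-dashes
        return "dash"
    if chars == {"-"}:
        return "hyphen"
    if chars == {"_"}:
        return "underscore"
    if chars == {"?"}:
        return "dubious"
    return "word"
```

Running something like this over the `subsequent-author-substitute` values of the style repository is one way to reproduce a tally of this kind.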

This brings a few considerations.

  • The empty cases are mostly used with block or column-like display. I haven't supported these cases in this initial implementation, so let's not bother with them for now...
  • Series of dashes, underscores, en-/em-dashes are rather idiosyncratic, probably a legacy from the early days of typesetting machines! Anyhow, unless one is very lucky with their choice of font, it never looks great from a typographic point of view...
  • CSL "hanging-indent" is a mere boolean, unrelated to the length of the substitution pattern. I mention it just in passing, as in my own preferences with MLA and Chicago styles, I'd like some consistency between the length of the substitution pattern and the hanging indent (e.g. 3em for both), heh.

Leaving aside the block/column cases for now, full/strict compliance is still complex and not that great typographically. I'll therefore head towards a few (debatable) "shortcuts"...

Then, there's the subsequent-author-substitute-rule... If I counted correctly, out of the 297 styles with author substitution, 16 are in "partial-each" mode, 4 in "complete-each", 0 in "partial-first". Let's say we can leave these aside for now and assume the default "complete-all", which will cover a lot of standard cases.
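To make the default rule concrete: with "complete-all", the substitute replaces the whole name list when all names match those of the preceding entry. Here is a minimal Python sketch under simplifying assumptions (names are already rendered as strings, entries are already sorted, and the `"; "` joiner and the function name `apply_complete_all` are hypothetical):

```python
def apply_complete_all(entries, substitute):
    """Apply the "complete-all" rule to (names, rest) pairs in bibliography order.

    names is the list of rendered names of the first names element of an entry;
    rest stands for everything else in the entry.
    """
    out = []
    previous = None
    for names, rest in entries:
        if names and names == previous:
            # Same complete name list as the preceding entry: substitute it all.
            shown = substitute
        else:
            shown = "; ".join(names)
        out.append((shown, rest))
        # Compare against the original names, so runs of 3+ entries also work.
        previous = names
    return out
```

The other rules ("complete-each", "partial-each", "partial-first") would need a per-name comparison instead of the single list comparison used here.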

But there's still another trap: "Substitution is limited to the names of the first cs:names element rendered." In the vast majority of cases, the first names are the authors (or their own substitutes when unknown), that is, a single list. The specification, however, does not forbid multiple variables in a cs:names (e.g. "author editor"), but as far as I can tell at a quick glance, this is not used much in the wild, at least not for the main contributors as the first rendered field. The pattern does occur for secondary contributors etc., but of course that's not relevant for the substitution at stake. Nothing would prevent us from implementing the generic case, though it's kind of an extra refactor for a very uncertain benefit. So I'm heading again towards a "good enough" solution, for the sake of simplicity, which will work for "usual" styles such as Chicago and MLA.

The devil is in the details, as always. Yet, however many shortcuts are taken here, it's still superior to the "legacy" implementation we had. Simon's old 400-500 LoC experiment (682bbc5) has had a long life since 2015. Whether the present 2000-2500 LoC re-implementation will last as long, only time will tell.

So it's an 80-20 situation: 80% of the work is done, and the remaining 20% is the most complex to achieve. Whether it's really needed is hard to ascertain.
But it's a possible base for future improvements, maybe.
What I can say for sure is that I am glad to see how well it performs for my bibliography project, compared to the older solution -- so it's a huge improvement already.

Projects
Status: In Progress
3 participants