zbMATH: Updates and multiple fixes. #3052
base: master
- Updated the selectors/XPaths to match the current state of the site.
- Prefer selectors to XPath to simplify the code.
- Made the scrape()/doWeb() functions async.
- Changes to keyword/tag handling: the returned tags now contain MSC numbers, their readable labels, and the "Keywords" content.
- Strip the duplicated characters in MSC labels and abstracts that had been caused by inline MathML rendered by MathJax. In abstracts, the math content is replaced by its LaTeX annotation, surrounded by dollar signs ($ $), to mark the places where math text appeared.
- Prefer the cleaner permalinks in URL fields.
- Updated test cases.

Resolves zotero#3039
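The tag-handling change in the list above can be pictured with a small sketch. It is not the translator's code: the function name `buildTags`, the `{code, label}` input shape, the choice to emit numbers and labels as separate tags, and the keyword strings are all assumptions made for illustration. Only the two MSC entries are taken from the test case in this PR.

```javascript
// Hypothetical helper: combine MSC numbers, their readable labels, and
// keyword strings into one flat tag list.
function buildTags(mscEntries, keywords) {
	const tags = [];
	for (const { code, label } of mscEntries) {
		tags.push(code, label);
	}
	tags.push(...keywords);
	// Trim, drop empty strings, and remove exact duplicates,
	// keeping first-seen order.
	return [...new Set(tags.map(tag => tag.trim()).filter(Boolean))];
}

// MSC entries as in the test case; the keywords are made up.
const exampleTags = buildTags(
	[
		{ code: "05C05", label: "Trees" },
		{ code: "05C80", label: "Random graphs (graph-theoretic aspects)" }
	],
	["spanning trees", "Trees "]
);
console.log(exampleTags);
```

The duplicate "Trees " keyword collapses into the existing label tag because tags are trimmed before the `Set` is built; the comparison is case-sensitive.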
A couple of small things -- I'd want dstillman or AbeJellinek to chime in on the handling of LaTeX in fields
@@ -158,16 +216,16 @@ var testCases = [
"date": "2012",
"DOI": "10.1002/rsa.20472",
"ISSN": "1042-9832",
"abstractNote": "We prove that a given tree TT on n vertices with bounded maximum degree is contained asymptotically almost surely in the binomial random graph G(n,(1+ε)lognn)G\\left(n,\\frac {(1+\\varepsilon)\\log n}{n}\\right) provided that TT belongs to one of the following two classes: \n\n(1)TT has linearly many leaves; (2)TT has a path of linear length all of whose vertices have degree two in TT.",
"extra": "MSC2010: 05C05 = Trees\nMSC2010: 05C80 = Random graphs (graph-theoretic aspects)\nZbl: 1255.05045",
"abstractNote": "We prove that a given tree $T$ on n vertices with bounded maximum degree is contained asymptotically almost surely in the binomial random graph $G\\left(n,\\frac {(1+\\varepsilon)\\log n}{n}\\right)$ provided that $T$ belongs to one of the following two classes: \n\n(1)$T$ has linearly many leaves; (2)$T$ has a path of linear length all of whose vertices have degree two in $T$.",
I don't quite know what to do about this -- we don't actually support TeX in Zotero fields (other than the new notes), so this is a bit messy, but I'm also not sure what else we could do.
No, we don't, and yes this is a bit messy here. I'd like to hear more thoughts about this too.
Before this change, you can see that the MathJax-rendered elements became "TT" for a one-letter math symbol. It's even worse now: without the change, it would become "TTT" under newer MathJax. In addition, more complicated MathML text loses meaning when converted to text in the usual way. For instance, the fraction line is lost, so "log n over n" became lognn in the text.
In other words, without further processing, meaning could easily be destroyed, and silently. It's difficult to spot the change from "T" to "TTT" in a wall of text.
So I chose to preserve the LaTeX-y annotation as a substitute, and mark it as such using $ .. $. This at least signals to the reader that there used to be some rendered math here, and the LaTeX source is in principle a lossless substitute.
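To make the duplication concrete, here is a minimal sketch on a simplified, hypothetical node tree. It is not MathJax's real DOM and not the translator's `cleanupMath`; the node shapes and the names `mathNode`, `naiveText` and `textWithLatex` are assumptions for illustration. Each math node carries a rendered copy plus its TeX annotation, which is the pattern visible in the old test output.

```javascript
// Hypothetical stand-in for a rendered inline formula: the visible
// rendering followed by the TeX source annotation.
function mathNode(rendered, latex) {
	return {
		kind: "math",
		latex,
		children: [
			{ kind: "text", value: rendered }, // rendered copy
			{ kind: "text", value: latex } // TeX annotation
		]
	};
}

// Naive flattening: concatenate every descendant text node, which is
// roughly what reading an element's text content does.
function naiveText(node) {
	if (node.kind === "text") return node.value;
	return node.children.map(naiveText).join("");
}

// Flattening that replaces each math node by its TeX source in $ $.
function textWithLatex(node) {
	if (node.kind === "text") return node.value;
	if (node.kind === "math") return "$" + node.latex + "$";
	return node.children.map(textWithLatex).join("");
}

const abstractNode = {
	kind: "element",
	children: [
		{ kind: "text", value: "a given tree " },
		mathNode("T", "T"),
		{ kind: "text", value: " in " },
		mathNode("G(n,lognn)", "G\\left(n,\\frac{\\log n}{n}\\right)")
	]
};

console.log(naiveText(abstractNode)); // contains "TT" and the flattened "lognn"
console.log(textWithLatex(abstractNode)); // contains "$T$" and the marked TeX source
```

In the naive output the fraction is unreadable and the symbol is doubled; in the second output each formula appears once, with its structure intact.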
Yeah, I think this is a reasonable approach. And, honestly, we could probably support math in abstract fields pretty easily (just showing as $…$ in edit mode).
BTW, this is roughly consistent with the arXiv translator's output (e.g. see https://arxiv.org/abs/2306.07357). There, the abstract is handed to us by the OAI API, which is a verbatim copy of what the preprint author puts into that field.
// Clean up the MathJaX-rendered text in elements. Returns a clone of the node
// with the duplicate-causing elements removed and the LaTeX math text
// converted to text nodes (surrounded with $ $ if laTeXify = true).
function cleanupMath(element, laTeXify = true) {
I'd want to hear from @dstillman or @AbeJellinek what Zotero's view is on handling LaTeX/MathJaX in fields. It's currently not supported, so adding things like $$ doesn't do any good, but given the nature of the translator it might still make sense?
I think generally we just save the LaTeX as is, even though nothing in Zotero will render it. I hadn't thought about it very much but I think I agree with the approach here - try to use the rendered Unicode version of the LaTeX for short fields, keep the LaTeX in the abstract and similar.
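A rough sketch of that per-field policy, assuming a hypothetical intermediate form in which a field is a list of plain strings and math pieces that carry both a rendered text and a TeX source. The name `renderField` and the sample pieces are made up for illustration; only the `laTeXify` flag name echoes the `cleanupMath` signature quoted above.

```javascript
// Hypothetical helper: join text and math pieces into one field value.
// With laTeXify = true, math becomes $...$ TeX source; otherwise the
// rendered text is used.
function renderField(pieces, laTeXify) {
	return pieces
		.map((piece) => {
			if (typeof piece === "string") return piece;
			return laTeXify ? "$" + piece.latex + "$" : piece.rendered;
		})
		.join("");
}

// Made-up sample content.
const pieces = [
	"Trees in ",
	{ rendered: "G(n,p)", latex: "G(n,p)" }
];

const shortField = renderField(pieces, false); // e.g. a title or tag
const longField = renderField(pieces, true); // e.g. an abstract
console.log(shortField);
console.log(longField);
```

Keeping both forms on each math piece lets the caller decide per field, rather than committing to one representation when the page is parsed.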
Title with math probably needs cleaning, too. Examples:
- Eliminated a little bit of dead code.
- Used less cryptic syntax for the "guard" logic.
- Removed unnecessary use of the `Zotero.Utilities` namespace for top-level names.
The title may contain math text, and it's currently not very well understood by the BibTeX import translator. A more reliable way is to scrape the page.
Hmm, there's more to that. The title cleanup should be applied in And I'll reconsider the way to cleanup the
That doesn't matter. Zotero will automatically remove any characters that aren't valid in filenames.
I don't think fs compatibility would be a problem. The problem is that the characters As an example, for this article https://zbmath.org/7695752 Also for users less familiar with the rules of shell variable expansion,
Ah, got it.
- In titles, the math is typically brief inline text. By using rendered text instead of LaTeX source, the saved PDF's file names will contain fewer "special" characters. This improves interoperability and reduces the likelihood of certain user errors with "special" characters in file names. In addition, the look and feel will be closer to the normal expectation.
- In tags, which are already brief, the LaTeX math is a distraction.

Note that the MathML extraction routine is tested against most typical MathJax preferences (and with the Firefox "Native MML" extension). If Assistive MathML is turned off (default on, and strongly recommended), the result will be more accurate. In the extreme case of Assistive MathML off and SVG rendering on, the math text may disappear altogether.
Errrh, wrong commit message of 2f77a8a. It should've read "If Assistive MathML is turned off (default on, and strongly recommended), the result will be less accurate."
In this comment to the issue #3039 (comment), the user suggested that the I'm asking because I see that in items translated from arXiv, the
I don't love identifiers in callNumber fields in the first place. We are typically putting identifiers (as opposed to actual call numbers) into Extra. To make sense in Extra, they do need the prefix
This is what the code does atm. The identifier shows up as
There are a few more issues.
item.attachments = [{
	title: "Snapshot",
	document: doc
}];
It may be useful to save a fully rendered page as a single file, for the math, but I'm not sure if it's worth it.
When saving items from a multiple-result search page, try to be more consistent with the behavior of single-item saving, by converting the \( \) delimiters for MathJax source into $ $ in abstracts.
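As a minimal sketch of that delimiter conversion (the function name `inlineDelimitersToDollars` is hypothetical, and the translator's actual implementation may differ), a single regular-expression pass is enough for the inline case:

```javascript
// Replace each \( ... \) pair with $ ... $, leaving the TeX untouched.
// A replacer function is used instead of a replacement string because
// "$" has special meaning in String.prototype.replace patterns.
function inlineDelimitersToDollars(text) {
	return text.replace(/\\\((.+?)\\\)/gs, (match, tex) => "$" + tex + "$");
}

const converted = inlineDelimitersToDollars(
	"a tree \\(T\\) in \\(G\\left(n,p\\right)\\)."
);
console.log(converted);
```

The lazy `.+?` stops at the first `\)`, so `\left(` and `\right)` inside a formula are not mistaken for delimiters: the character before their parenthesis is a letter, not a backslash.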
Resolves #3039