zbMATH: Updates and multiple fixes. #3052

zoe-translates · 2023-06-15T10:08:43Z

Updated the selectors/XPaths to match the current state of the site.
Prefer selector to XPath to simplify code.
Made the scrape()/doWeb() functions async.
Changes to keyword/tag handling: the returned tags now contain MSC numbers, their readable labels, and the "Keywords" content.
Strip the duplicated characters in MSC labels and abstracts that had been caused by inline MathML rendered by MathJax. In abstracts, the math content is replaced by their LaTeX annotation, surrounded by the dollar signs ($ $), to mark the places where math text appeared.
Prefer the cleaner permalinks in URL fields.
Updated test cases.

Resolves #3039

adam3smith

A couple of small things -- I'd want dstillman or AbeJellinek to chime in on the handling of LaTeX in fields

adam3smith · 2023-06-15T12:24:38Z

zbMATH.js

@@ -158,16 +216,16 @@ var testCases = [
 				"date": "2012",
 				"DOI": "10.1002/rsa.20472",
 				"ISSN": "1042-9832",
-				"abstractNote": "We prove that a given tree TT on n vertices with bounded maximum degree is contained asymptotically almost surely in the binomial random graph G(n,(1+ε)lognn)G\\left(n,\\frac {(1+\\varepsilon)\\log n}{n}\\right) provided that TT belongs to one of the following two classes: \n\n(1)TT has linearly many leaves; (2)TT has a path of linear length all of whose vertices have degree two in TT.",
-				"extra": "MSC2010: 05C05 = Trees\nMSC2010: 05C80 = Random graphs (graph-theoretic aspects)\nZbl: 1255.05045",
+				"abstractNote": "We prove that a given tree $T$ on n vertices with bounded maximum degree is contained asymptotically almost surely in the binomial random graph $G\\left(n,\\frac {(1+\\varepsilon)\\log n}{n}\\right)$ provided that $T$ belongs to one of the following two classes: \n\n(1)$T$ has linearly many leaves; (2)$T$ has a path of linear length all of whose vertices have degree two in $T$.",


I don't quite know what to do about this -- we don't actually support TeX in Zotero fields (other than the new notes), so this is a bit messy, but I'm also not sure what else we could do.

No, we don't, and yes this is a bit messy here. I'd like to hear more thoughts about this too.

Before this change, the MathJax-rendered elements came out as "TT" for a one-letter math symbol. Without this change it would be even worse now: under newer MathJax the same symbol would become "TTT". In addition, more complicated MathML loses meaning when converted to text in the usual way. For instance, the fraction line is lost, so "log n over n" became "lognn" in the text.

In other words, without further processing, meaning can easily be destroyed, and silently so. The change from "T" to "TTT" is hard to spot in a wall of text.

So I chose to preserve the LaTeX annotation as a substitute and to mark it as such with $ .. $. This at least signals to the reader that there used to be rendered math here, and the LaTeX source is in principle a lossless substitute.

Yeah, I think this is a reasonable approach. And, honestly, we could probably support math in abstract fields pretty easily (just showing as $…$ in edit mode).

BTW, this is roughly consistent with the arXiv translator's output (e.g. see https://arxiv.org/abs/2306.07357). There, the abstract is handed to us by the OAI API, which is a verbatim copy of what the preprint author puts into that field.
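The duplication discussed in this thread can be reproduced in a minimal sketch. Everything below is an assumption made for illustration: the plain-object tree is a stand-in for the DOM, the `math-container` class name and the node layout are hypothetical, and `texAwareText()` is not the translator's `cleanupMath()`. The sketch only shows why concatenating every text node repeats the symbol, and how reading the TeX annotation avoids it.

```javascript
// Simplified stand-in for a DOM subtree: either { text } or { tag, attrs, children }.
// The layout is an assumed approximation of MathJax output, not zbMATH's real markup.
function text(value) {
	return { text: value };
}

function el(tag, attrs, children) {
	return { tag, attrs, children };
}

// Roughly what element.textContent does: concatenate every text node.
function naiveText(node) {
	if ("text" in node) return node.text;
	return node.children.map(naiveText).join("");
}

// Depth-first search for the TeX annotation inside a math container.
function findTeX(node) {
	if ("text" in node) return null;
	if (node.tag === "annotation" && node.attrs.encoding === "application/x-tex") {
		return naiveText(node);
	}
	for (const child of node.children) {
		const found = findTeX(child);
		if (found !== null) return found;
	}
	return null;
}

// Replace each math container with its TeX source wrapped in $ $.
function texAwareText(node) {
	if ("text" in node) return node.text;
	if (node.attrs.class === "math-container") { // hypothetical class name
		const tex = findTeX(node);
		if (tex !== null) return "$" + tex + "$";
	}
	return node.children.map(texAwareText).join("");
}

// One inline symbol: a visual rendering, assistive MathML, and the TeX annotation.
const mathT = el("span", { class: "math-container" }, [
	el("span", { class: "visual" }, [text("T")]),
	el("math", {}, [
		el("semantics", {}, [
			el("mi", {}, [text("T")]),
			el("annotation", { encoding: "application/x-tex" }, [text("T")])
		])
	])
]);
const sentence = el("p", {}, [text("a given tree "), mathT, text(" on n vertices")]);

console.log(naiveText(sentence));    // a given tree TTT on n vertices
console.log(texAwareText(sentence)); // a given tree $T$ on n vertices
```

In a real page the same idea would operate on a cloned node rather than on a hand-built tree, so that the live document is left untouched.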

adam3smith · 2023-06-15T12:27:15Z

zbMATH.js

+// Clean up the MathJaX-rendered text in elements. Returns a clone of the node
+// with the duplicate-causing elements removed and the LaTeX math text
+// converted to text nodes (surrounded with $ $ if laTeXify = true).
+function cleanupMath(element, laTeXify = true) {


I'd want to hear from @dstillman or @AbeJellinek what Zotero's view is on handling LaTeX/MathJaX in fields. It's currently not supported, so adding things like $$ doesn't do any good, but given the nature of the translator it might still make sense?

zoe-translates · 2023-06-15T14:14:32Z

Title with math probably needs cleaning, too. ~~I'll do this later.~~ Updated in f3cb2e3.

Examples:

https://zbmath.org/7694014
https://zbmath.org/7693571

zoe-translates · 2023-06-16T00:12:34Z

Hmm, there's more to that.

The title cleanup should be applied in getSearchResults() too, to prevent the "TTT" effect in the item selection dialogue.

And I'll reconsider how to clean up item.title. This will go into the full-text PDF filename. So we'd prefer less special characters such as $ or \. Since math in titles is usually reserved for simple inline elements, I'll probably choose the inner text of the rendered MathML over the TeX source.

dstillman · 2023-06-16T00:14:29Z

> This will go into the full-text PDF filename. So we'd prefer less special characters such as $ or \

That doesn't matter. Zotero will automatically remove any characters that aren't valid in filenames.

zoe-translates · 2023-06-16T02:12:27Z

I don't think filesystem compatibility would be a problem.

The problem is that characters such as $, being LaTeX control characters, would look out of place in a filename, and this may not be what the user expects.

As an example, for this article https://zbmath.org/7695752
The filename
BMO ε-regularity ... .pdf
would look more consistent with usual filenames than would
BMO $varepsilon$-regularity ... .pdf.

Also, for users less familiar with the rules of shell variable expansion, $ in filenames might cause havoc on the command line.

dstillman · 2023-06-16T03:00:51Z

Ah, got it.

- In titles, the math is typically brief inline text. By using rendered text instead of LaTeX source, the saved PDF's file names will contain less "special" characters. This improves interoperability, and reduces the likelihood of certain user errors with "special" characters in file names. In addition, the look and feel will be closer to the normal expectation.
- In tags, already brief, the LaTeX math is a distraction.

Note that the MathML extraction routine is tested against most typical MathJax preferences (and with the Firefox "Native MML" extension). If Assistive MathML is turned off (default on, and strongly recommended), the result will be more accurate. In the extreme case of Assistive MathML off and SVG rendering on, the math text may disappear altogether.

zoe-translates · 2023-06-16T13:19:59Z

Errrh, wrong commit message of 2f77a8a. It should've read

"If Assistive MathML is turned off (default on, and strongly recommended), the result will be less accurate."

zoe-translates · 2023-06-20T03:46:42Z

In a comment on issue #3039, the user suggested that the callNumber field be set to the Zbl ID without any Zbl or Zbl: prefix (because libraryCatalog is already set to zbMATH?). @adam3smith, @nonobsense, is this the correct understanding? And is this how it "should" be done?

I'm asking because I see that in items translated from arXiv, the archiveID field's value includes the arXiv: prefix.

adam3smith · 2023-06-20T16:33:08Z

I don't love identifiers in callNumber fields in the first place. We typically put identifiers (as opposed to actual call numbers) into Extra. To make sense in Extra, they do need the prefix.
arxiv: is part of the Archive ID because arXiv includes arXiv as part of the identifier. Generally, where identifiers aren't self-identifying the way, say, DOIs are, it makes sense to store them with a namespace-y prefix.

zoe-translates · 2023-06-20T23:12:16Z

> To make sense in Extra, they do need the prefix

This is what the code does atm. The identifier shows up as Zbl: [...] on a line in the extra.
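As a rough illustration of the "prefix on its own line in Extra" convention described here, the sketch below appends and reads back a labelled identifier. The helper names `appendExtraLine()` and `readExtraLine()` are hypothetical and are not part of the translator or of Zotero's API; only the `Zbl: 1255.05045` line format is taken from the test case shown earlier in this PR.

```javascript
// Append "Label: value" as its own line in an Extra string.
// Hypothetical helper for illustration, not a Zotero API.
function appendExtraLine(extra, label, value) {
	const line = `${label}: ${value}`;
	return extra ? `${extra}\n${line}` : line;
}

// Read an identifier back by its namespace-like prefix.
// Returns null when no line carries that prefix.
function readExtraLine(extra, label) {
	const prefix = `${label}: `;
	const hit = extra.split("\n").find(line => line.startsWith(prefix));
	return hit ? hit.slice(prefix.length) : null;
}

let extra = "";
extra = appendExtraLine(extra, "Zbl", "1255.05045");
console.log(extra);                       // Zbl: 1255.05045
console.log(readExtraLine(extra, "Zbl")); // 1255.05045
```

The prefix is what lets the identifier be found again among other free-text lines, which is the point made above about identifiers that are not self-identifying.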

AbeJellinek · 2023-07-13T19:11:36Z

zbMATH.js

+// Clean up the MathJaX-rendered text in elements. Returns a clone of the node
+// with the duplicate-causing elements removed and the LaTeX math text
+// converted to text nodes (surrounded with $ $ if laTeXify = true).
+function cleanupMath(element, laTeXify = true) {


I think generally we just save the LaTeX as is, even though nothing in Zotero will render it. I hadn't thought about it very much but I think I agree with the approach here - try to use the rendered Unicode version of the LaTeX for short fields, keep the LaTeX in the abstract and similar.

zoe-translates · 2023-07-14T10:25:06Z

There are a few more issues.

When we save multiples from search results, any MathJax source will not be rendered (we only get the static DOM). If we want to "laTeXify", that's not a big problem: we can simply do a text search and replacement, because $ $ are clear markers of MathJax source. But there's not going to be rendered text, even for a single Greek letter from e.g. \alpha. I'm adding a few more lines to get rid of $ $ when we "laTeXify".
Another problem is saving "snapshot". Do we need this?

		item.attachments = [{
			title: "Snapshot",
			document: doc
		}];

It may be useful to save a fully-rendered page as a single file, for the math, but I'm not sure if it's worth it.
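The "text search and replacement" on unrendered source mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `dollarizeMath()` is hypothetical, and the default `\(` and `\)` delimiters are a guess, since this thread does not spell out which delimiters the static page uses. Both are parameters so that other delimiters can be passed in.

```javascript
// Escape a literal string so it can be embedded in a RegExp.
function escapeRegExp(s) {
	return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Rewrite delimited math source as $...$.
// The default delimiters are an assumption, not taken from the translator.
function dollarizeMath(text, open = "\\(", close = "\\)") {
	const pattern = new RegExp(
		escapeRegExp(open) + "([\\s\\S]*?)" + escapeRegExp(close),
		"g"
	);
	// A replacer function is used so that "$" in the output is taken literally.
	return text.replace(pattern, (_match, tex) => "$" + tex.trim() + "$");
}

const raw = "a given tree \\(T\\) on n vertices with \\( \\frac{\\log n}{n} \\)";
console.log(dollarizeMath(raw));
// a given tree $T$ on n vertices with $\frac{\log n}{n}$
```

A purely textual pass like this cannot produce rendered Unicode (for example a Greek letter from `\alpha`), which matches the limitation noted above for items saved from search results.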

adam3smith requested changes Jun 15, 2023

zoe-translates added 2 commits June 15, 2023 22:22

zbMATH: [Minor] Readability fixes.

5c8c14e

- Eliminated a little bit of dead code.
- Used less cryptic syntax for the "guard" logic.
- Removed unnecessary use of `Zotero.Utilities` namespace for top-level names.

zbMATH: Use page scraping for title.

f3cb2e3

The title may contain math text, and it's currently not very well understood by the BibTeX import translator. A more reliable way is to scrape the page.

AbeJellinek requested changes Jul 13, 2023

zoe-translates added 2 commits July 14, 2023 17:34

zbMATH: [Minor] Fix a misleading comment.

c1b3255

zbMATH: Remove spurious :scope in selectors

0520df5

zbMATH: Try to be more consistent when MathJax source is not rendered

15910d4

When saving items from a multiple-result search page, try to be more consistent with the behavior of single-item saving, by converting the delimiters for MathJax source into $ $ in abstracts.
