LaTeX to Unicode formatter should not replace \% with % #8490

Closed
2 tasks done
JasonGross opened this issue Feb 8, 2022 · 13 comments
Labels
unicode unicode related issues

Comments

@JasonGross

JabRef version

5.5 (latest release)

Operating system

Windows

Details on version and operating system

Windows 10

Checked with the latest development build

  • I made a backup of my libraries before testing the latest development version.
  • I have tested the latest development version and the problem persists

Steps to reproduce the behaviour

  1. Create the following .bib file:
@Misc{test,
  abstract = {10\%},
}
  2. Open the file in JabRef, select the entry, click Quality -> Cleanup Entries, ensure that "Enable Field Formatters" is checked and "LaTeX to Unicode" is enabled for Abstract (see the image in the Appendix), and then click "Ok"
  3. Notice that the abstract is now abstract = {10%}

Since % is a comment character in LaTeX, this change is incorrect. More generally, escaped special characters in LaTeX should not be unescaped when converting to Unicode (or at least the general "convert to Unicode" should not have this behavior)

Appendix

image

@Siedlerchr
Member

Well, technically this is the correct behavior, it converts everything to Unicode. What you probably want is to use the LaTeXCleanup formatter as well. That respects those things
https://docs.jabref.org/finding-sorting-and-cleaning-entries/saveactions#latex-cleanup

@JasonGross
Author

Technically correct but practically wrong. LaTeXCleanup will fix the issue with % but will not escape $, right? I want a transformer that will transform LaTeX to Unicode-aware LaTeX, preferring Unicode characters when available. What use is "LaTeX to Unicode" if it generates text that breaks the .bib file?

@ThiloteE
Member

ThiloteE commented Apr 6, 2022

Thinking about this a little, the way forward might indeed be to transform the $ sign to \$ when using the LaTeX Cleanup action.
Background: $ opens math mode in LaTeX. One does not want to accidentally open math mode just because a $ sign was in the library.

The code that would need to be changed is here: https://github.com/JabRef/jabref/blob/main/src/main/java/org/jabref/logic/formatter/bibtexfields/LatexCleanupFormatter.java

@ThiloteE
Member

ThiloteE commented Apr 7, 2022

LaTeXtoUnicode:

The LatextoUnicode converter assumes the bibliographic data is formatted in Latex Syntax. In LaTeX syntax, writing the percentage sign requires a backslash in front (\%). A simple % would denote the start of a LaTeX comment.

Hence, the removal of a simple backslash \ is correct.

From this we can see that:

  • If the bibliographic data is already in Unicode format, using the LaTeXtoUnicode converter is not advised.
  • If the bibliographic data is in mixed LaTeX and Unicode format, using the LaTeXtoUnicode converter is not advised. Manual cleanup (or another cleanup action) might be necessary.

LaTeX Cleanup:

Furthermore, "LaTeX Cleanup" turns out to be a bit of a Frankenstein (https://docs.jabref.org/finding-sorting-and-cleaning-entries/saveactions#latex-cleanup). The name is misleading. It does not only clean up redundant LaTeX code or special characters. It actually mostly does the opposite: it makes bibliographies ready to be used with LaTeX (by removing characters, though).

I would recommend a name change, or at least a link to the documentation page for this command within JabRef, e.g. something like "Make LaTeX ready".

Examples:

  • On the one hand, the command makes the bibliographic data ready to be used with LaTeX, e.g. "Escape percent character (e.g. 50% ⇒ 50\%)".
  • On the other hand, this command removes LaTeX code, e.g. by removing redundant $ signs. "Redundant" means, for example, two $$ in a row. This makes the data ready for programs that require Unicode, if there was a lot of math-mode material in the bibliographic data before. Of course, it would also make bibliographies formatted in Unicode ready to be used with LaTeX.

Interestingly, I just did a test. Running the LaTeXcleanup command does NOT remove a singular $ sign!

Jason, maybe you still had your LaTeXtoUnicode cleanup running before or after you used the LaTeX Cleanup action? Maybe you actually had math-mode stuff in the library?

Fun fact: Searching on google scholar for % or $ yields 0 results.
Maybe not a good idea to put these special characters into the title of an entry :D

After having written all this, I still am of the opinion that the way forward would be to change the LaTeX Cleanup action OR the UnicodetoLaTeX action to add a backslash to $ sign. Maybe do both.

UnicodeToLaTeX:

Doing a similar test for UnicodeToLaTeX, for whatever reason, both the $ and the % sign do not get backslashed ... am I missing something?

@ThiloteE
Member

Interestingly, I just did a test. Running the LaTeXcleanup command does NOT remove a singular $ sign!

Since this is the case, I assume you should have no problems anymore.

Closing this.

If you still have problems, feel free to open again and report them.

@ThiloteE
Member

Technically correct but practically wrong. LaTeXCleanup will fix the issue with % but will not escape $, right?

@JasonGross The next release of JabRef will contain a separate cleanup action that escapes $ signs. Please use it with care. JabRef cannot know whether a dollar sign was meant to A) start math mode or B) simply render a $ sign. Results of this cleanup action need to be double-checked by the user, unless you want to challenge your "luck".
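
The ambiguity described above can be illustrated with a minimal Python sketch. This is not JabRef's implementation; the function name `escape_dollars` is hypothetical, and the sketch assumes that any `$` not directly preceded by a backslash counts as unescaped.

```python
import re

# A "$" that is not directly preceded by a backslash.
# Assumption: a doubled backslash before "$" (a LaTeX line break) is not handled.
UNESCAPED_DOLLAR = re.compile(r"(?<!\\)\$")


def escape_dollars(value: str) -> str:
    """Prefix every unescaped '$' with a backslash (hypothetical helper)."""
    return UNESCAPED_DOLLAR.sub(lambda match: "\\$", value)


if __name__ == "__main__":
    # A literal dollar sign is fixed as intended.
    print(escape_dollars("costs 5$"))
    # Math mode is escaped too, which is why the result needs a manual check.
    print(escape_dollars("$x^2$"))
```

The second call shows the risk: a field that used `$...$` for math is escaped as well, so the formula is no longer typeset as math.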

@JasonGross
Author

I am still interested in a cleanup action that converts LaTeX to mixed LaTeX and Unicode, i.e., the output should be valid LaTeX code and display the same, but anything that could be replaced by a non-special Unicode character is replaced. As I've said above, the current behavior of LaTeX to Unicode is useless because it generates invalid bibliographic files. Should I open a new issue for this, or reopen this one?

@ThiloteE
Member

ThiloteE commented Apr 22, 2022

I would propose trying to fix this via an integrity check (#8712).
You could convert from LaTeX to Unicode and then run the integrity check. Would that work for you?
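
As a rough illustration of what such an integrity check might report, here is a Python sketch. It is not the check from #8712; the function name and the chosen character set (`%`, `$`, `&`, `#`, `_`) are assumptions made for this example.

```python
import re

# Characters that LaTeX treats specially in running text.
# Assumption: this subset is chosen for illustration only.
UNESCAPED_SPECIAL = re.compile(r"(?<!\\)[%$&#_]")


def find_unescaped_specials(value: str) -> list[tuple[int, str]]:
    """Return (position, character) for each special character without a backslash."""
    return [(match.start(), match.group()) for match in UNESCAPED_SPECIAL.finditer(value)]


if __name__ == "__main__":
    for field in ("10%", r"10\%", "A & B cost 5$"):
        print(repr(field), find_unescaped_specials(field))
```

A check like this only flags candidates. It would also flag `$` used for math mode or `_` inside a URL, so a user still has to review each hit.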

@ThiloteE
Member

The problem is that somebody would need to do the mapping from LaTeX to "Unicode aware LaTeX" or, since we are at it, from Unicode to "LaTeX aware Unicode", which is a lot of work. The Comprehensive LaTeX Symbol List lists

18150 symbols and the corresponding LATEX commands that produce them. Some of these symbols are guaranteed to be available in every LATEX 2𝜀 system; others require fonts and packages that may not accompany a given distribution and that therefore need to be installed.

A conversion (e.g. via cleanup actions) is non-trivial.

@JasonGross
Author

I would propose trying to fix this via an integrity check (#8712). You could convert from LaTeX to Unicode and then run the integrity check. Would that work for you?

That would be great! However, even better would be a version of LaTeX to Unicode that lets the user explicitly deactivate any subset of the mapping that they'd like. The default exclusion list would just include special/control characters like % and \ .
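
A minimal Python sketch of this idea follows. The mapping table is a toy stand-in (the real converter covers far more commands), and the names `LATEX_TO_UNICODE`, `DEFAULT_EXCLUDE` and `latex_to_unicode` are hypothetical, not part of JabRef or latex2unicode.

```python
# Toy mapping for illustration only; a real converter knows thousands of commands.
LATEX_TO_UNICODE = {
    r"\textcopyright{}": "©",
    r'\"{a}': "ä",
    r"\%": "%",
    r"\$": "$",
    r"\&": "&",
}

# Assumed default: escaped special characters stay untouched.
DEFAULT_EXCLUDE = frozenset({r"\%", r"\$", r"\&", r"\#", r"\_", r"\{", r"\}"})


def latex_to_unicode(value: str, exclude: frozenset = DEFAULT_EXCLUDE) -> str:
    """Replace known LaTeX commands by Unicode, skipping everything in `exclude`."""
    for command, char in LATEX_TO_UNICODE.items():
        if command not in exclude:
            value = value.replace(command, char)
    return value


if __name__ == "__main__":
    text = r"10\% \textcopyright{} 2022"
    print(latex_to_unicode(text))                       # default exclusions
    print(latex_to_unicode(text, exclude=frozenset()))  # current "convert everything" behavior
```

Passing an empty exclusion set reproduces the behavior reported in this issue, while the default keeps `\%` intact and still converts `\textcopyright{}`.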

The Comprehensive LATEX Symbol List lists

18150 symbols and the corresponding LATEX commands that produce them. Some of these symbols are guaranteed to be available in every LATEX 2𝜀 system; others require fonts and packages that may not accompany a given distribution and that therefore need to be installed.

A conversion (e.g. via cleanup actions) is non-trivial.

This is a red herring. If the symbol is not available in the font, it doesn't matter whether it comes from a Unicode character or not. If the symbol is available via command and you're using a Unicode-aware TeX engine, I expect it to be available by Unicode character too.

@ThiloteE
Member

That would be great! However, even better would be a version of LaTeX to Unicode that lets the user explicitly deactivate any subset of the mapping that they'd like. The default exclusion list would just include special/control characters

Ok, I finally may understand why this might be useful. If you want to bring really old databases up to date and transform to unicode, but not for the sake of using the database to export to LibreOffice/OpenOffice or Microsoft Office (These would be fine with "pure" Unicode I think), but still would want to continue to export them to a (La)TeX engine (that can read unicode), you would only need to do ONE conversion (with some excluded terms), instead of TWO conversions + integrity check. You would not need to check all entries via "integrity check", because the terms you excluded were already working fine with LaTeX before the conversion.

Suggestion to change the name of this issue to: "Add cleanup action for "LaTeX to LaTeX aware Unicode"".

Have you tried what Christoph suggested, by the way? Using "LaTeX Cleanup"? Have you run into problems with it?

It does:

  • Escape percent character (e.g. 50% ⇒ 50\%)
  • Remove redundant $, {, and } (but not if the } is part of a command argument)
  • Move numbers, +, -, /, and brackets into equations
  • Move numbers followed by a space left of $ inside the equation (e.g. 0.35 $\mu$m)
  • Replace all @@ with $
  • Replace multiple spaces with a single space
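
Two of the steps listed above, percent escaping and collapsing multiple spaces, can be sketched in Python as follows. This is an illustration only and not the code of JabRef's LatexCleanupFormatter; the function name `cleanup_subset` is hypothetical, and the other listed steps are left out.

```python
import re

# A "%" that is not directly preceded by a backslash.
UNESCAPED_PERCENT = re.compile(r"(?<!\\)%")
# Two or more consecutive space characters.
MULTIPLE_SPACES = re.compile(r" {2,}")


def cleanup_subset(value: str) -> str:
    """Escape bare percent signs, then collapse runs of spaces into one space."""
    value = UNESCAPED_PERCENT.sub(lambda match: "\\%", value)
    return MULTIPLE_SPACES.sub(" ", value)


if __name__ == "__main__":
    once = cleanup_subset("50%  of  cases")
    twice = cleanup_subset(once)
    print(once)
    print(once == twice)  # running the cleanup again changes nothing
```

Because the pattern skips percent signs that already carry a backslash, the sketch can be applied repeatedly without producing `\\%`.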

JasonGross changed the title on Apr 22, 2022: LaTeX to Unicode formatter should not replace \% with % ⇒ Add cleanup action for "LaTeX to LaTeX aware Unicode"
@JasonGross
Author

Ok, I finally may understand why this might be useful. If you want to bring really old databases up to date and transform to unicode, but not for the sake of using the database to export to LibreOffice/OpenOffice or Microsoft Office (These would be fine with "pure" Unicode I think), but still would want to continue to export them to a (La)TeX engine

Yes! (Though more often it's "I copy-pasted from Google Scholar or some internet-provided .bib file" than "I had a really old database".)

Suggestion to change the name of this issue to: "Add cleanup action for "LaTeX to LaTeX aware Unicode"".

Name changed, please reopen issue.

Have you tried what Christoph suggested, by the way? Using "LaTeX Cleanup"? Have you run into problems with it?

I have not tried it yet. I'll try it the next time I'm manipulating databases.

ThiloteE reopened this on Apr 23, 2022
@ThiloteE
Member

@JasonGross let's rename this issue back to "LaTeX to Unicode formatter should not replace \% with %", then close it and open a new issue whose first post is well explained and understandable for people who have no background on these issues, listing:

  • problem
  • desired solution
  • example of what a future workflow would look like
  • list "special symbols" that would need to be excluded

E.g., you can copy paste following text:

Problem:

  • There is no cleanup action that allows converting (old) bibliographic data that is (still) formatted in LaTeX with non-Unicode characters to Unicode aware LaTeX formatting (newer LaTeX engines (e.g. LaTeX2e) can now read most Unicode characters).
  • Current workarounds include converting from LaTeX to Unicode and then back to LaTeX, while manually checking whether any characters were wrongly converted. This is inefficient and takes a long time.

Desired Solution:

  • Create cleanup action for "LaTeX to Unicode aware LaTeX".

Example workflow:

  1. Have the following entry (BEFORE using the cleanup action):

    @Article{Testkey,
      author   = {Testauthor},
      title    = {Bibliographic data that can be read by LaTeX engines},
      a = {Here is a backslashed percentage sign \% and it should be excluded from conversion},
      b = {Here is a \textcopyright{} and it should be converted to Unicode}, 
    }
    

    (Comment: \textcopyright{} can be converted to © by the inputenc package. When using the LaTeX to Unicode aware LaTeX cleanup action, the result of the conversion should also be ©)

  2. Use cleanup action "LaTeX to Unicode aware LaTeX"

  3. AFTER using the cleanup action, the following result should emerge:

    @Article{Testkey,
      author   = {Testauthor},
      title    = {Bibliographic data that can be read by LaTeX engines},
      a = {Here is a backslashed percentage sign \% and it should be excluded from conversion},
      b = {Here is a © and it should be converted to Unicode}, 
    }
    

"Special Symbols" that would need to be excluded from conversion:

  • The list should be similar to the symbols mentioned in Add integrity check for LaTeX special characters #8712.
  • At the very least Page 15, Table 1, which lists the escapable special characters in LaTeX.
  • Maybe also Page 15, Table 2 and Page 16, Table 3.
  • There might be a lot more, but I am not knowledgeable enough to list them here. If you know of any, just post them in this thread.

Additional Information

  • When working on this, The Comprehensive LATEX Symbol List will be of help. Especially chapters about "Unicode" (Page 272) and "Special Characters" (Page 15-16).
  • JabRef currently uses https://github.com/tomtung/latex2unicode; Maybe it can be adapted internally in JabRef (e.g. some pre-processing). Another solution would be to fork it or ask tomtung about creating a LaTeX2UnicodeAwareLaTeX converter.
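
The pre-processing idea mentioned above could look roughly like the following Python sketch: escaped special characters are swapped for placeholders, an arbitrary converter runs, and the placeholders are restored afterwards. All names are hypothetical, `toy_convert` is only a stand-in for a real converter such as latex2unicode, and the sketch assumes the converter passes the private-use placeholder characters and digits through unchanged.

```python
import re
from typing import Callable

# A backslash followed by one of the escapable special characters.
ESCAPED_SPECIAL = re.compile(r"\\[%$&#_{}]")
# Private-use code points used as placeholder delimiters (assumed absent from the input).
OPEN, CLOSE = "\uE000", "\uE001"


def convert_protected(value: str, convert: Callable[[str], str]) -> str:
    """Run `convert` while shielding escaped special characters from it."""
    saved: list[str] = []

    def stash(match: re.Match) -> str:
        saved.append(match.group())
        return f"{OPEN}{len(saved) - 1}{CLOSE}"

    masked = ESCAPED_SPECIAL.sub(stash, value)
    converted = convert(masked)
    # Put the original escape sequences back in place of the placeholders.
    return re.sub(f"{OPEN}(\\d+){CLOSE}", lambda m: saved[int(m.group(1))], converted)


def toy_convert(value: str) -> str:
    """Stand-in converter that unescapes everything it knows (illustration only)."""
    return value.replace(r"\textcopyright{}", "©").replace("\\%", "%")


if __name__ == "__main__":
    text = r"\% and \textcopyright{}"
    print(toy_convert(text))                     # unprotected conversion
    print(convert_protected(text, toy_convert))  # escaped specials survive
```

The appeal of this design is that the converter itself stays untouched; only the list of protected sequences has to be maintained.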

JasonGross changed the title on Apr 23, 2022: Add cleanup action for "LaTeX to LaTeX aware Unicode" ⇒ LaTeX to Unicode formatter should not replace \% with %
ThiloteE removed the label "status: waiting-for-feedback" (The submitter or other users need to provide more information about the issue) on May 12, 2022