Warning should be displayed when using illegal/unescaped characters in bibtex fields (e.g., %, #, &) #1188

Closed
ajbelle opened this issue Apr 13, 2016 · 18 comments
Labels
status: waiting-for-feedback The submitter or other users need to provide more information about the issue type: enhancement

Comments

@ajbelle

ajbelle commented Apr 13, 2016

JabRef version <3.2 and 2.1> on <Windows 7>

Steps to reproduce:

  1. Have &#8211 (which I believe is HTML for a dash) pulled into a title of a bib file created with Ver 3.2
  2. Revert back to JabRef Ver2.1 and attempt to save
  3. A message states that an unescaped # is not acceptable in BibTeX, and JabRef will not save until the non-BibTeX character is removed.

It may not be a fault in BibLaTeX which I have selected, but I report it in case it indicates that the BibTeX save parsing algorithm is no longer working/deleted in ver3.2 and later.

@koppor koppor changed the title Prarsing BibTeX Titles Parsing BibTeX titles Apr 13, 2016
@koppor
Member

koppor commented Apr 13, 2016

biber (or bibtex) processes the bib file without issues?

In this case, it is an issue in JabRef 2.1 (from August 9th, 2006), which we will not fix as we focus on JabRef 3.x. 😇

@stefan-kolb stefan-kolb added the status: waiting-for-feedback The submitter or other users need to provide more information about the issue label Apr 13, 2016
@ajbelle
Author

ajbelle commented Apr 13, 2016

To clarify @koppor, I used Ver 2.1 because that is a version I have left installed. While I was using it to check the search delay issue, the message appeared, requiring me to remove the # before saving. My thought was "WHY DOES AN OLDER VERSION PICK UP A BibTeX file ERROR" while the current one does not! The older version isn't the problem, but it is IDENTIFYING AN ERROR in the bib file that the current version accepts. I am not an expert, but I assume it may cause issues if you attempt to use the citation in a LaTeX paper. Lots of features are being removed, and I wondered if parsing the bib file for errors was one that should be retained. Maybe BibLaTeX handles non-BibTeX characters, so the check has been intentionally removed from the current versions. I hope someone who knows can identify if this is a problem in Ver 3.2 and JabRef_windows-x64_3_3dev--snapshot--2016-04-05--fast-search--e0380b7, or an intentional removal of code.
I have reworded the steps in the original post to make it clearer (I hope).
PS: I use LyX and have experienced major issues trying to typeset if there are Unicode characters in the wrong place in the bib file (from cutting and pasting directly from PDFs). LyX doesn't really tell you that this is what is crashing things, so it takes ages to figure out it is something in the bib file, and even longer to find it. Typically it is a hidden character or a ligature. I was hoping these were picked up by the parsing that this issue refers to.

@koppor
Member

koppor commented Apr 14, 2016

@ajbelle Please write me a personal email stating the features you miss. I personally fight for keeping all available features and support everyone wanting to add an issue. The only thing I currently see being removed and really affecting someone is #496. I know that there are other things being removed listed at https://github.com/JabRef/jabref/blob/master/CHANGELOG.md, but does anything affect you? Everything else seems to have been removed because of other issues being raised (affecting other users).

I made a minimal example:

\documentclass{scrartcl}

\usepackage{filecontents}
\begin{filecontents}{\jobname.bib}
@misc{A01,
  author = {Author, A.},
  year = {2001},
  title = {The &#8211 thing},
}
\end{filecontents}

\begin{document}

\cite[1]{A01}

\bibliography{\jobname}
\bibliographystyle{alpha}

\end{document}

The error is

 Misplaced alignment tab character &.
l.5 \newblock The &
                   #8211 thing, 2001.

Another example:

\documentclass{scrartcl}

\usepackage{filecontents}
\begin{filecontents}{\jobname.bib}
@misc{A01,
  author = {Author, A.},
  year = {2001},
  title = {The #8211 thing},
}
\end{filecontents}

\begin{document}

\cite[1]{A01}

\bibliography{\jobname}
\bibliographystyle{alpha}

\end{document}

Result:

! You can't use `macro parameter character #' in horizontal mode.
l.5 \newblock The #
                   8211 thing, 2001.
?

So, you are right: JabRef allows saving files that pdflatex cannot process.

@koppor koppor changed the title Parsing BibTeX titles Warning should be displayed when using illegal/unescaped characters in bibtex fields (e.g., %, #, &) Apr 14, 2016
@koppor koppor added type: enhancement and removed status: waiting-for-feedback The submitter or other users need to provide more information about the issue labels Apr 14, 2016
@koppor
Member

koppor commented Apr 14, 2016

Is this something for our integrity check? Should the integrity check be run "on save"?

@ajbelle
Author

ajbelle commented Apr 14, 2016

THX @koppor, I hoped someone clever could see if what I noticed was a problem. I am never sure it isn't my old brain.
I checked the CHANGELOG and could not see any issues I have not raised already. I think what the team is doing is great, and the speed increase is very noticeable with my large file.

3.0 - 2015-11-29
Changed Added more characters to HTML/Unicode converter

Could this change be related to the above issue, and possibly part of the solution, given that &#8211 is HTML for –?

It would be good if the integrity check could make sure all 'unacceptable characters' are eliminated/converted rigorously.
A young programmer next to me who uses JabRef imports everything through an ASCII text editor to prefilter, but that seems 'extreme' to me :-).

@oscargus
Contributor

Correct HTML is &#8211;, you may also use &ndash; (or &#x2013;). This character has been around for quite some time, so it shouldn't be related to that ChangeLog entry.
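As a quick illustration of the three spellings above, here is a minimal sketch using Python's standard `html.unescape` (Python is chosen only for illustration; it says nothing about how JabRef's own converter is implemented). All three forms should decode to the same en dash, U+2013.

```python
import html

# The three equivalent HTML spellings of the en dash mentioned above.
spellings = ["&#8211;", "&ndash;", "&#x2013;"]

# html.unescape is part of the Python standard library.
decoded = [html.unescape(s) for s in spellings]

for source, char in zip(spellings, decoded):
    print(f"{source!r} -> U+{ord(char):04X}")
```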

I'm surprised to see that JabRef now allows saving fields with a single # in it. This was definitely not allowed in earlier versions and I do not really know when this bug was introduced.

If you run the HTML to LaTeX converter it should be replaced with -- (or, maybe not, since it is not proper HTML).

Another interesting aspect is that if you happen to have two of these, say &#8211;&#8211;, it will be saved as {\&} # 8211;& #{8211;}, so & is correctly escaped and 8211;& will be considered a string. As you may realize, it is in the general case quite challenging for software to determine if you really would like that or not. (There are definitely use cases where one would like to have a similar structure for a field, and JabRef uses # in the fields to switch between strings and content; still, an odd number of # doesn't make sense.)
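The "odd number of #" observation can be sketched as a simple check. This is only an illustration under stated assumptions: the function name `has_odd_hash_count` is hypothetical, and the rule "a # preceded by a backslash counts as escaped" is a simplification (it does not handle an escaped backslash directly before #).

```python
import re

# A '#' that is not directly preceded by a backslash (simplified escape rule).
UNESCAPED_HASH = re.compile(r"(?<!\\)#")


def has_odd_hash_count(field_value: str) -> bool:
    """Return True if the field holds an odd number of unescaped '#' signs."""
    return len(UNESCAPED_HASH.findall(field_value)) % 2 == 1


print(has_odd_hash_count("The &#8211 thing"))   # one '#'
print(has_odd_hash_count("&#8211;&#8211;"))     # two '#', pairs up as a string
print(has_odd_hash_count(r"C\# in depth"))      # escaped '#'
```

Note that a field with two HTML entities passes this check, which is consistent with the doubled example above being saved as a string reference.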

@oscargus
Contributor

Regarding %, I am not convinced that it should be automatically escaped. Apart from #, which JabRef deals with explicitly, what is written in the field is LaTeX code. If you had written &#8211 in a text editor, you would have got exactly that behaviour. JabRef cannot check that you don't write \somesymbolthatdoesntexist. Hence, I'm even skeptical about the escaping of &, and I guess one can argue that it is only if you write two # that JabRef interferes.

With that said I still believe that JabRef should have functionality to automatically escape & etc, not just blindly on save. There may be cases where I actually want to have a & in my file. Say, if there is a tabular in the abstract. Note that the errors for @koppor are LaTeX errors, not bibtex errors, similar to whatever bad LaTeX code you might have written in your entries.

(Btw, the file saved with the &#8211;&#8211; cannot be opened again...)

@Siedlerchr
Member

Regarding LaTeX, all reserved chars should be escaped:
# $ % ^ & _ { } ~ \

@ajbelle
Author

ajbelle commented Apr 15, 2016

To clarify, as a user, I do not input HTML codes and ligatures on purpose. They get pulled in with the various imports, often without my awareness. THX for the HTML to LaTeX converter hint @oscargus.

The Ver 2.1 behaviour, which identified the offending character for manual intervention, was excellent, making it obvious to me as a user what the problem was and where to fix it without RTFM. Where it can't be made automatic, can JabRef highlight possible problem characters, allowing the user to decide? (A regex search entry, written by someone smarter than me, is all I would need.)

Not the issue discussed, but related in a user's mind as extra characters you have to deal with: issue #1153 means I have many junk characters in some imports, and it would be nice to reverse their utf8-ASCII conversion on a per-entry basis using |Quality>Cleanup entries>. This is fixing the symptoms, but I mention it due to the frequency of the utf8-ascii corruptions experienced during file exchanges, in case it could be coded at the same time as any changes deemed required are implemented. Not an expectation, just an idea.

@oscargus
Contributor

@Siedlerchr: No. Obvious from the example A {$\Sigma_{i}$} {DSP} generator which you do not want to save as A \{\$\\Sigma\_\{i\}\$\} \{DSP\} generator
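To make the objection concrete, here is a minimal sketch of what blindly escaping every reserved character would do to that title. The function `naive_escape` is hypothetical and deliberately simplistic: it just prefixes each reserved character with a backslash (which is not even the right LaTeX escape for ^, ~ or \), purely to reproduce the unwanted result.

```python
# The reserved characters listed earlier in the thread: # $ % ^ & _ { } ~ \
RESERVED = set("#$%^&_{}~\\")


def naive_escape(text: str) -> str:
    """Blindly prefix every reserved character with a backslash."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


title = r"A {$\Sigma_{i}$} {DSP} generator"
print(naive_escape(title))
# prints: A \{\$\\Sigma\_\{i\}\$\} \{DSP\} generator
```

The intentional math mode and the protective braces are destroyed, which is why escaping cannot be applied unconditionally.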

@ajbelle Are the imports JabRef searches or BibTeX files provided to you by colleagues and websites? For the first case, it would be good to know about them, as we should provide conversion automatically. For the second case, one can think of having general import converters. The idea in the later versions of JabRef, though, is to implement this as "save actions", i.e., conversions/clean-ups that are always applied on saving, so that HTML-encoded characters are never saved but converted to LaTeX sequences, when that is activated.

Clearly, JabRef could warn for characters that are unlikely to be what the user actually wants, as single #, unescaped & and % etc.
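A warning of this kind, as opposed to automatic escaping, could look roughly like the sketch below. It is only an illustration: `find_suspicious` is a hypothetical name, and the rule used (flag any &, % or # not directly preceded by a backslash) will also flag legitimate uses such as an & inside a tabular, which is exactly why it only reports positions and changes nothing.

```python
import re
from typing import List, Tuple

# '&', '%' or '#' not directly preceded by a backslash (simplified rule).
SUSPICIOUS = re.compile(r"(?<!\\)[&%#]")


def find_suspicious(field_value: str) -> List[Tuple[int, str]]:
    """Return (position, character) pairs the user may want to review."""
    return [(m.start(), m.group()) for m in SUSPICIOUS.finditer(field_value)]


for position, char in find_suspicious("The &#8211 thing"):
    print(f"possibly unintended {char!r} at position {position}")
```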

It is not, I believe, in the general case possible to figure out which two consecutive 8-bit characters actually should be combined to a 16-bit character. I see the point though and it would be nice if it did work...

@ajbelle
Author

ajbelle commented Apr 19, 2016

@oscargus Oh yes, you are correct about automatic reverse conversion to 16-bit. Could it be coded for a user selected character sequence as manual correction requires you to know what the symbol should be? There are usually only a few mashed characters. A suggested feature, requiring no reply as I doubt the team has time for such a specialist feature given the current philosophy.

@oscargus All of my offending imports are from downloaded .bib files, from reputable sources that should know better. I am amazed at the entry content and formatting served up as BibTeX and see JabRef as offering an authoritative implementation of the BibLaTeX standard. Specific to the 16-bit encoding issue, see #1153. I have tried to reset the encodings before import, but it doesn't always seem to work on my Windows box. A cut and paste does not suffer the encoding translation problem. I could be making a mistake, but I am sure everyone at some time encounters this issue.

@oscargus
Contributor

It should be quite feasible to, say, mark two characters, right-click and with some magic (Character = 256*char1 + char2 sort of) convert those two to a Unicode character.

In the latest master there is an integrity checker that checks for an odd number of unescaped #. It will, if nothing else, detect half or so of fields with HTML characters in it. It should be possible to detect sequences looking like HTML characters as well, but it may require a bit more thinking to find the correct regular expressions (&#[0-9]+; (numbers) &#x[0-9A-Fa-f]+; (hex numbers) &[A-Za-z0-9]+; (named entities)) might work though when thinking about it, so adding an integrity checker that warns for (most?) HTML characters should be quite OK. Actually only the last one is required.
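The three patterns above can be tried out in a small sketch. The names `HTML_ENTITY` and `find_html_entities` are hypothetical, and this is not JabRef's actual checker. One observation from trying it: the named-entity pattern on its own does not match `&#8211;`, because `#` is not in its character class, so the sketch combines all three alternatives. It also shows that the semicolon-less `&#8211` from the original report is not matched by any of them.

```python
import re

# Decimal, hexadecimal and named entities, each terminated by ';'.
HTML_ENTITY = re.compile(r"&(?:#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z0-9]+);")


def find_html_entities(field_value):
    """Return all substrings that look like HTML character entities."""
    return HTML_ENTITY.findall(field_value)


print(find_html_entities("a &#8211; b &ndash; c &#x2013; d"))
print(find_html_entities("The &#8211 thing"))        # no ';', so no match
print(re.findall(r"&[A-Za-z0-9]+;", "&#8211;"))      # named pattern alone
```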

Both these might come in a master build in the near future. I'll update here if it happens.

I'm surprised (but somehow not) to hear that bad .bib entries are produced by knowledgeable sources. If you happen to directly search from such a source in JabRef, just let us know and we'll at least add automatic conversion there.

@oscargus
Contributor

oscargus commented Apr 19, 2016

The current master has checkers for an odd number of #-signs in a field and for any HTML characters, which should help in many of the cases if nothing else.

I've also implemented the two 8-bit characters to one 16-bit character conversion, but when reading up it seems like it is only applicable to UTF-16. Do you have any example string that I can try on? Writing dd gives 摤 which is correct (d = 0x64 and 摤 = 0x6464 = "\u6464" ), so it works but I am not sure that one will ever get that type of incorrectly converted characters in practice... It seems like the UTF-8 encoded version of that character is 0xE6 0x91 0xA4, but then I do not completely follow the details...
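The arithmetic in the `dd` example can be reproduced with a short sketch (the helper `combine_bytes` is a hypothetical name, not the JabRef implementation). It suggests the formula corresponds to reading two bytes as one big-endian UTF-16 code unit, while the UTF-8 encoding of the same character is the three bytes quoted above.

```python
def combine_bytes(char1: str, char2: str) -> str:
    """Combine two 8-bit characters as 256*char1 + char2 into one character."""
    return chr(256 * ord(char1) + ord(char2))


combined = combine_bytes("d", "d")
print(hex(ord(combined)))                    # 0x6464
print(list(combined.encode("utf-16-be")))    # the two original byte values
print(list(combined.encode("utf-8")))        # three bytes in UTF-8
```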

@oscargus
Contributor

And for example string: just copy from the field editor in JabRef and paste here. That should work.

@ajbelle
Author

ajbelle commented Apr 20, 2016

@oscargus I think you have it understood (better than me anyway). Since you said copy and paste I enclose examples from my file. I am not sure if this is what you wanted.

My encoding is set to utf-8 at all times in all editors. I do have entries from Endnote file import (original sources unknown) into various versions of JabRef. Due to their size and deeper coverage (including maths) the Abstracts are normally the problem. Embedded # have been a common annoyance (JabRef wouldn't save), along with the unbalanced {} which JabRef picks up.

As previously mentioned ° came in as ° which is very annoying for me given I use it regularly as a special file marker. I just retested import of utf-8 to utf-8 and it worked perfectly! Previously nothing I did seemed to get it in unchanged. Maybe it is a setting save glitch.

Punctuation:
– mostly as – sometimes as �? (this is a dash not a hyphen)
’’ mostly as �? occasionally as ^a€�? (this is not the keyboard ")
“ mostly OK! “ occasionally as ^a€œ (this is not the keyboard ")
∼ came in as ∼

Maths symbols:
° came in as ° This is very annoying for me given I use it as a special file marker. I retested import of utf-8 to utf-8 and it worked perfectly! Previously nothing I did seemed to get it in unchanged.
6000 ≤ Re ≤ 40,000 came in as 6000 ≤ Re ≤ 40,000
1 mm to 3.65 mm came in as 1 mm to 3.65 mm (very strange, but common )
θ = 60 came in as θ = 60 (2nd example)
∝ came in as âˆ�
superscript − came in as −
when it gets complex it is a mess.
eg: formatted Nu¯∝Re0.75(H/d)−0.016 came in as Nu ¯ âˆ� R e 0.75 ( H / d ) − 0.016

OCR'd PDF problems due to the fi ligature: http://ilovetypography.com/2007/09/09/decline-and-fall-of-the-ligature/
finite-volume appears as ®nite-volume
specific appears as speci®c
Of course there are the garbage OCR misreads in abstracts that seem to have been pulled from the pdf file.

Basically any character beyond the 8bit encoding as the following indicate
Greek characters :
θ came in as θ
δ came in as δ
’ came in as ’
μ came in as μ
Ω came in as Ω
φ came in as �?†
ϕ came in as �?•
Λ came in as Λ
...

Accented characters
ö came in as ö
ä came in as ä
ü came in as ü
...

@tobiasdiez
Member

What is the status of this issue? Can it be closed?
At least the unescaped # are now detected during integrity check, right @oscargus?

What else is missing?

@simonharrer simonharrer added status: waiting-for-feedback The submitter or other users need to provide more information about the issue and removed status: devcall labels May 10, 2016
@koppor
Member

koppor commented May 10, 2016

What are the two first lines of the bib file? Could it be that the encoding information is wrong there?

@ajbelle
Author

ajbelle commented May 13, 2016

@koppor if your question is to me I always have JabRef UTF-8 and try ensuring the input file is UTF-8 (except when trying to figure out what is wrong).

The first line in my files is: % Encoding: UTF-8

The garbage content posted is a function of importing from many sources (including content obviously from PDF OCR) and is not necessarily a single translation fault. It is simply a list of the sort of garbage JabRef has to deal with, which could/will cause problems. The list is not supposed to represent the original problem I identified, which has been fixed per @tobiasdiez.

I have experienced situations where changing the encoding format to UTF-8 with Notepad++/Notepad2/CrimsonEditor... didn't seem to correct the import translation problem (but cut and paste on my Win7 box was fine!). I didn't have time to work out with certainty if the PFEs were actually changing the format, so I did not report it as an issue with JabRef. While I changed the file format with the PFE, I did not change the encoding line in the file, so I guess JabRef was being told it was CP1252 when in fact I had converted it to UTF-8. If true, that would finally explain why I could never make it import without character corruption, and maybe a clarification should be added to a help file somewhere. Sorry if I have missed it in the manual.

The bottom line from my experience is that all users at some stage will get this sort of character corruption and if JabRef could help in any way that would be great. Reversing the one-way translation isn't usually possible I understand.
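The suspected mismatch (UTF-8 bytes being decoded as CP1252) can be reproduced in a few lines. This is a sketch under that assumption only; the helper names `misread_as_cp1252` and `try_repair` are hypothetical. It also shows why reversal sometimes fails: some UTF-8 bytes have no CP1252 meaning, so information is lost at the moment of misreading.

```python
def misread_as_cp1252(text: str) -> str:
    """Encode as UTF-8, then decode the bytes as if they were CP1252."""
    return text.encode("utf-8").decode("cp1252", errors="replace")


def try_repair(garbled: str) -> str:
    """Reverse the misreading; raises if a byte was already lost."""
    return garbled.encode("cp1252").decode("utf-8")


print(misread_as_cp1252("°"))                  # two characters instead of one
print(try_repair(misread_as_cp1252("ö")))      # round trip succeeds

# U+201D (right double quotation mark) contains byte 0x9D in UTF-8,
# which Python's cp1252 codec leaves undefined, so it becomes U+FFFD.
lossy = misread_as_cp1252("\u201d")
print(lossy.endswith("\ufffd"))
```

Once the replacement character U+FFFD appears, the original byte is gone and `try_repair` can no longer recover the text.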

IMHO @oscargus understands the situation best, and I believe he thinks everything that can be implemented has been. I have therefore closed the issue, on the understanding no-one can see how to strengthen JabRef further. I am not sure if I am/was expected to close it.

@ajbelle ajbelle closed this as completed May 13, 2016
7 participants