Ucto fails on some UTF-8 characters in tei2folia generated FoLiA #93

Closed
martinreynaert opened this issue Feb 26, 2024 · 12 comments

@martinreynaert

Hi,

I ran Ucto on a batch of older, rather scientific English texts that were converted to FoLiA by the tool tei2folia, and found that it fails on a number of them.

Most seem to contain characters such as '½' and '¼', which seem to get lost, causing the tool to fail. One was a discussion of the annotation of floating point numbers and what should have been their annotation, e.g. '3̂14159' instead of '3.14159'.

I sent you a package by mail containing three of these files, along with Ucto's stderr files.

Hope this can be solved!

Thank you!

@proycon proycon added the bug label Feb 26, 2024
@proycon
Member

proycon commented Feb 26, 2024

There is something fishy going on here indeed! Here's a minimal version that replicates this problem with ¼ getting lost:

issue93.folia.xml.gz

$ ucto -Len -X issue93.folia.xml out.folia.xml
$ foliavalidator out.folia.xml    
VALIDATION ERROR on full parse by library (stage 2/3), in out.folia.xml
ParseError: FoLiA exception in handling of <div> @ line 42 (in parent <text> @ parent line 41) : [InconsistentText] Text for <Paragraph at 133695250065680 id=undefined.text.div.2.div.1.div.2.div.5.div.1.p.33 set=https://github.com/raw/proycon/folia/master/setdefinitions/tei2folia/paragraphs.foliaset.ttl class=p>, is inconsistent: EXPECTED (deep text after normalization) *****>
risen about of an Inch
****> BUT FOUND (strict text after normalization) ****>
risen about ¼ of an Inch
******* DEVIATION POINT: sen about <*HERE*>¼ of an In
(also checked against older rules prior to FoLiA v2.4.1)

kosloot added a commit that referenced this issue Feb 27, 2024
@kosloot
Contributor

kosloot commented Feb 27, 2024

¼ was internally detected as an 'unknown' character/token and subsequently ignored.
I fixed this and now it works as expected.
But it might be better to rethink the logic here. Strange characters should NEVER be ignored.
I assume it is best to treat them as a separate token of type UNKNOWN.

So another patch is imminent.
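
To illustrate the idea of keeping unrecognised characters as their own token instead of dropping them, here is a minimal Python sketch. It is not Ucto's implementation: the `classify` and `tokenize` functions and the coarse type names are hypothetical, and the rule "anything that is not a letter, decimal digit or punctuation is UNKNOWN" is an assumption made only for this example.

```python
import unicodedata


def classify(ch):
    """Return a coarse type for one character (hypothetical scheme, not Ucto's)."""
    if ch.isspace():
        return "SPACE"
    cat = unicodedata.category(ch)
    if cat.startswith("L"):
        return "WORD"
    if cat == "Nd":
        return "NUMBER"
    if cat.startswith("P"):
        return "PUNCTUATION"
    # Everything else (e.g. vulgar fractions, category "No") is kept as UNKNOWN.
    return "UNKNOWN"


def tokenize(text):
    """Split text into (token, type) pairs without ever discarding a character."""
    tokens = []
    prev_kind = None  # type of the immediately preceding character
    for ch in text:
        kind = classify(ch)
        if kind in ("WORD", "NUMBER") and kind == prev_kind:
            # Extend the current word or number.
            tokens[-1] = (tokens[-1][0] + ch, kind)
        elif kind != "SPACE":
            # Punctuation and UNKNOWN characters each become a separate token.
            tokens.append((ch, kind))
        prev_kind = kind
    return tokens


if __name__ == "__main__":
    for token, kind in tokenize("risen about \u00bc of an Inch"):
        print(kind, token)
```

Because every non-space character ends up in some token, joining the tokens again cannot lose the ¼, which is the property the validator was checking.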

@proycon
Member

proycon commented Feb 27, 2024

Agreed

@kosloot
Contributor

kosloot commented Feb 27, 2024

I extended the test with another problem involving UNKNOWN characters:
the sequence 3̂14159 (note the caret, which is a NON_SPACING_MARK).
There is no sensible way to tell Ucto that this is a valid NUMBER.
The best solution seems to be to produce 3 tokens: <3>, < ̂> and <14159>.
I implemented that.
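
The three-way split can be sketched in Python as follows. This is only an illustration, not the code that went into Ucto: `split_on_marks` is a hypothetical helper, and it assumes the caret in the input is U+0302 (COMBINING CIRCUMFLEX ACCENT), which has Unicode general category `Mn` (non-spacing mark).

```python
import unicodedata


def split_on_marks(token):
    """Split a token so that every non-spacing mark becomes a separate part."""
    parts = []
    current = ""
    for ch in token:
        if unicodedata.category(ch) == "Mn":
            # Close the part collected so far, then emit the mark on its own.
            if current:
                parts.append(current)
                current = ""
            parts.append(ch)
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


if __name__ == "__main__":
    # "3" + U+0302 + "14159", written with an escape to keep the mark visible.
    for part in split_on_marks("3\u030214159"):
        print([hex(ord(c)) for c in part])
```

Applied blindly, this would also break up ordinary decomposed letters such as "e" followed by U+0301, so a real tokenizer would presumably only apply such a rule where the mark cannot be attached to a valid word or number.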

@kosloot kosloot closed this as completed Mar 19, 2024
@martinreynaert
Author

Hi,

Sorry to be too late, but this was closed just a little prematurely...

I have just rerun Ucto on the first batch of my almost 600 files. These were just 26 exceptional ones. One fails with this error message:

$ cat Ockham-A-xxxx-PM_eng_Ockham_Work_of_Ninety_Days_A_Defense_of_Franciscan_Poverty_against_Pope_John_XXII.folia.stderr
ucto: inputfile = /data/Ockham-A-xxxx-PM_eng_Ockham_Work_of_Ninety_Days_A_Defense_of_Franciscan_Poverty_against_Pope_John_XXII-V1.txt
ucto: outputfile =
ucto: configured for languages: [eng]
ucto:Warning: Problematic character encountered: type=UNKNOWN value=0xfeff ( -0xffef--0xffbb--0xffbf-)
terminate called after throwing an instance of 'folia::ValueError'
what(): attempt to add an empty to word: Ockham-A-xxxx-PastMasters_eng_Ockham_Work_of_Ninety_Days_A_Defense_of_Franciscan_Poverty_against_Pope_John_XXII.p.4.s.57.w.8

I have not yet examined the input file. You can have a copy should you wish.

Thanks!

@martinreynaert
Author

This fixed this file's problem:

$ sed -e 's/\xef\xbb\xbf/ /g'

I mistakenly, it seems, thought the BOM only occurred at the beginning of a file.
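
For anyone who wants to check a batch of files before tokenizing, here is a small Python sketch that reports where BOMs occur and replaces them in the same way as the sed command above. The helper names `bom_offsets` and `strip_boms` are made up for this example; the only facts relied on are that the UTF-8 byte sequence EF BB BF decodes to U+FEFF and that Python's plain `utf-8` codec (unlike `utf-8-sig`) leaves it in the decoded text.

```python
import sys

BOM = "\ufeff"  # U+FEFF; encoded in UTF-8 as the bytes EF BB BF


def bom_offsets(text):
    """Return the character offsets of every U+FEFF in the text."""
    return [i for i, ch in enumerate(text) if ch == BOM]


def strip_boms(text, replacement=" "):
    """Replace every U+FEFF, mirroring sed 's/\\xef\\xbb\\xbf/ /g'."""
    return text.replace(BOM, replacement)


if __name__ == "__main__":
    # Usage: python strip_boms.py INPUT.txt > OUTPUT.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        content = handle.read()
    stray = [i for i in bom_offsets(content) if i != 0]
    print(f"{len(stray)} BOM(s) found beyond the start of the file", file=sys.stderr)
    sys.stdout.write(strip_boms(content))
```

Passing `replacement=""` drops the BOMs entirely instead of substituting a space; which of the two is safer depends on whether the BOM sits between two words in the input.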

@kosloot
Contributor

kosloot commented Mar 19, 2024

Ah, nice that you fixed it.

  1. BOMs shouldn't occur anywhere other than at the start of the file, but in practice they do.
  2. ucto should filter BOMs out but, probably because of the new UNKNOWN value, this is handled incorrectly.
    BOMs must be discarded without any mercy.
  3. I'm glad the error message was clear and helpful.

Still, point 2 needs more attention.

@kosloot kosloot reopened this Mar 19, 2024
@proycon
Member

proycon commented Mar 19, 2024

ucto should filter BOM's out but, probably with the new UNKNOWN value, this is handled wrong.
BOM's must be discarded without any mercy.

I'd say if there are extra (therefore invalid) BOMs, ucto can just fail with an error message and let the user clean up the input first.

@jeroenjeremy

@proycon That could be complicated in batch processing situations, don't you think?

@proycon
Member

proycon commented Mar 19, 2024

Right, that's a good point, especially now that @kosloot is working on #94.

@kosloot
Contributor

kosloot commented Mar 19, 2024

A fix is available in Git now.

@kosloot
Contributor

kosloot commented Mar 20, 2024

released v0.32.1

@kosloot kosloot closed this as completed Mar 20, 2024