Ucto fails on some UTF-8 characters in tei2folia generated FoLiA #93

martinreynaert · 2024-02-26T16:26:39Z

Hi,

I ran a batch of rather scientific, older English texts converted to FoLiA by the tei2folia tool, and found that Ucto fails on a number of them.

Most seem to contain characters such as '½' and '¼', which seem to get lost, causing the tool to fail. One was a discussion of the annotation of floating point numbers and what should have been their annotation, e.g. '3̂14159' instead of '3.14159'.

I sent you a package containing three of these files, along with Ucto's stderr files, by mail.

Hope this can be solved!

Thank you!

proycon · 2024-02-26T19:46:27Z

There is something fishy going on here indeed! Here's a minimal version that replicates this problem with ¼ getting lost:

issue93.folia.xml.gz

$ ucto -Len -X issue93.folia.xml out.folia.xml
$ foliavalidator out.folia.xml    
VALIDATION ERROR on full parse by library (stage 2/3), in out.folia.xml
ParseError: FoLiA exception in handling of <div> @ line 42 (in parent <text> @ parent line 41) : [InconsistentText] Text for <Paragraph at 133695250065680 id=undefined.text.div.2.div.1.div.2.div.5.div.1.p.33 set=https://github.com/raw/proycon/folia/master/setdefinitions/tei2folia/paragraphs.foliaset.ttl class=p>, is inconsistent: EXPECTED (deep text after normalization) *****>
risen about of an Inch
****> BUT FOUND (strict text after normalization) ****>
risen about ¼ of an Inch
******* DEVIATION POINT: sen about <*HERE*>¼ of an In
(also checked against older rules prior to FoLiA v2.4.1)

kosloot · 2024-02-27T10:48:09Z

¼ was internally detected as an 'unknown' character/token and subsequently ignored.
I fixed this and now it works as expected.
But it might be better to rethink the logic here. Strange characters should NEVER be ignored.
I assume it is best to treat them as a separate token of type UNKNOWN.

So another patch is imminent.
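
For illustration only, here is a minimal Python sketch of why dropping an unknown character makes the token text disagree with the paragraph text, which is what foliavalidator reported above. This is not Ucto's code: the `classify` and `tokenize` helpers and the `keep_unknown` switch are hypothetical, and the character classes are a crude approximation based on Unicode general categories.

```python
import unicodedata

def classify(ch):
    """Very rough character classes (hypothetical, not Ucto's rules)."""
    cat = unicodedata.category(ch)
    if cat.startswith("L"):
        return "WORD"
    if cat == "Nd":
        return "NUMBER"
    if cat.startswith("P"):
        return "PUNCTUATION"
    return "UNKNOWN"  # '¼' has category 'No' and ends up here

def tokenize(text, keep_unknown):
    """Split on whitespace, then group characters of the same class."""
    tokens = []
    for chunk in text.split():
        prev_kind = None
        for ch in chunk:
            kind = classify(ch)
            if kind == "UNKNOWN" and not keep_unknown:
                prev_kind = None  # the character is silently dropped
                continue
            if kind == prev_kind and kind in ("WORD", "NUMBER"):
                tokens[-1] = (kind, tokens[-1][1] + ch)
            else:
                tokens.append((kind, ch))
            prev_kind = kind
    return tokens

text = "risen about ¼ of an Inch"
dropped = " ".join(tok for _, tok in tokenize(text, keep_unknown=False))
kept = " ".join(tok for _, tok in tokenize(text, keep_unknown=True))
print(dropped == text)  # the text rebuilt from tokens no longer matches
print(kept == text)     # keeping an UNKNOWN token preserves the text
```

Emitting the character as its own UNKNOWN token keeps the text rebuilt from the tokens identical to the original paragraph text.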

proycon · 2024-02-27T11:43:11Z

Agreed

kosloot · 2024-02-27T12:18:46Z

I extended the test with another problem involving UNKNOWN characters:
the sequence 3̂14159 (note the caret, which is a NON_SPACING_MARK).
There is no sensible way to tell Ucto that this is a valid NUMBER.
The best solution seems to be to produce 3 tokens: <3>, < ̂> and <14159>.
I implemented that.
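
As a rough illustration of that three-way split (not the actual Ucto implementation), the Python sketch below isolates every character whose Unicode general category is `Mn` (Nonspacing_Mark) as its own token. The function name `split_on_nonspacing_marks` is hypothetical, and the input is assumed to be the digit 3 followed by U+0302 COMBINING CIRCUMFLEX ACCENT and the digits 14159.

```python
import unicodedata

def split_on_nonspacing_marks(text):
    """Split text so that each non-spacing mark (category Mn) is its own token."""
    tokens = []
    current = ""
    for ch in text:
        if unicodedata.category(ch) == "Mn":
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)  # the mark becomes a separate token
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

# '3' + U+0302 (combining circumflex) + '14159'
tokens = split_on_nonspacing_marks("3\u030214159")
print(len(tokens))  # 3
```

Note that such a naive rule would also split decomposed accented letters (e.g. 'e' followed by a combining acute accent), so a real tokenizer needs more context than this sketch uses.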

martinreynaert · 2024-03-19T15:25:44Z

Hi,

Sorry to be too late, but this was closed just a little prematurely...

I have just rerun Ucto on the first batch of my almost 600 files. These were just 26 exceptional ones. One fails, the error message is this:

$ cat Ockham-A-xxxx-PM_eng_Ockham_Work_of_Ninety_Days_A_Defense_of_Franciscan_Poverty_against_Pope_John_XXII.folia.stderr
ucto: inputfile = /data/Ockham-A-xxxx-PM_eng_Ockham_Work_of_Ninety_Days_A_Defense_of_Franciscan_Poverty_against_Pope_John_XXII-V1.txt
ucto: outputfile =
ucto: configured for languages: [eng]
ucto:Warning: Problematic character encountered: type=UNKNOWN value=0xfeff ( -0xffef--0xffbb--0xffbf-)
terminate called after throwing an instance of 'folia::ValueError'
what(): attempt to add an empty to word: Ockham-A-xxxx-PastMasters_eng_Ockham_Work_of_Ninety_Days_A_Defense_of_Franciscan_Poverty_against_Pope_John_XXII.p.4.s.57.w.8

I have not yet examined the input file. You can have a copy should you wish.

Thanks!

martinreynaert · 2024-03-19T16:23:10Z

This fixed this file's problem:

$ sed -e 's/\xef\xbb\xbf/ /g'

I mistakenly, it seems, thought the BOM only occurred at the beginning of a file.
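
For batch use, the same clean-up can be sketched in Python. This is only an illustration under the assumption that the input is UTF-8; the helper names `count_stray_boms` and `strip_boms` are made up for this example. Like the sed command above, it replaces every U+FEFF (bytes EF BB BF in UTF-8) with a space, including one at the start of the file.

```python
BOM = "\ufeff"  # encoded as b"\xef\xbb\xbf" in UTF-8

def count_stray_boms(text):
    """Count BOM characters that are not at the very start of the text."""
    return text[1:].count(BOM)

def strip_boms(text, replacement=" "):
    """Replace every BOM, wherever it occurs, with the replacement string."""
    return text.replace(BOM, replacement)

# Sample input with a leading BOM and a stray one in the middle of a line.
raw = b"\xef\xbb\xbfFirst line\nsecond \xef\xbb\xbfline\n"
text = raw.decode("utf-8")  # plain 'utf-8' keeps BOMs as U+FEFF characters

print(count_stray_boms(text))    # 1
print(BOM in strip_boms(text))   # False
```

Counting the stray BOMs first makes it easy to log which files in a batch were affected before cleaning them.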

kosloot · 2024-03-19T16:50:47Z

Ah, nice that you fixed it.

BOMs shouldn't occur anywhere other than at the start of the file, but in practice they do.
ucto should filter BOMs out, but this is probably handled wrongly now that there is the new UNKNOWN value.
BOMs must be discarded without any mercy.
I'm glad the error message was clear and helpful.

Still, point 2. needs more attention

proycon · 2024-03-19T17:09:19Z

> ucto should filter BOMs out, but this is probably handled wrongly now that there is the new UNKNOWN value.
> BOMs must be discarded without any mercy.

I'd say if there are extra (therefore invalid) BOMs, ucto can just fail with an error message and let the user clean up the input first.

jeroenjeremy · 2024-03-19T17:11:27Z

@proycon That could be complicated in batch processing situations, don't you think?

proycon · 2024-03-19T17:13:43Z

Right, that's a good point, especially now @kosloot is working on #94

kosloot · 2024-03-19T19:43:32Z

A fix is available in Git now.

kosloot · 2024-03-20T09:16:56Z

released v0.32.1

proycon added the bug label Feb 26, 2024

kosloot added a commit that referenced this issue Feb 27, 2024

fix for #93

6d3e7f0

kosloot closed this as completed Mar 19, 2024

kosloot reopened this Mar 19, 2024

kosloot closed this as completed Mar 20, 2024
