Ucto fails on some UTF-8 characters in tei2folia generated FoLiA #93
Comments
There is something fishy going on here indeed! Here's a minimal version that replicates this problem with ¼ getting lost:
|
¼ was internally detected as an 'unknown' character/token and subsequently ignored. So another patch is imminent. |
Agreed |
I extended the test with another problem involving UNKNOWN characters. |
Hi, sorry to be late, but this was closed just a little prematurely... I have just rerun Ucto on the first batch of my almost 600 files. These were just 26 exceptional ones. One fails; the error message is this: $ cat Ockham-A-xxxx-PM_eng_Ockham_Work_of_Ninety_Days_A_Defense_of_Franciscan_Poverty_against_Pope_John_XXII.folia.stderr I have not yet examined the input file. You can have a copy should you wish. Thanks! |
This fixed this file's problem: $ sed -e 's/\xef\xbb\xbf/ /g' I mistakenly, it seems, thought the BOM only occurred at the beginning of a file. |
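For batch runs, the same cleanup as the sed one-liner above can be sketched in Python. This is only an illustration, not part of ucto or tei2folia: the helper names `find_boms` and `replace_boms` are made up here, and it assumes the input is UTF-8 so that the BOM appears as the byte sequence EF BB BF. Like the sed command, it replaces every occurrence with a space, including one at the very start of the file.

```python
from typing import List

# UTF-8 encoding of U+FEFF (the byte order mark).
BOM_UTF8 = b"\xef\xbb\xbf"


def find_boms(data: bytes) -> List[int]:
    """Return the byte offset of every UTF-8 BOM sequence in data.

    Hypothetical helper: useful for reporting which files contain
    BOMs beyond offset 0 before deciding how to clean them up.
    """
    offsets = []
    pos = data.find(BOM_UTF8)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(BOM_UTF8, pos + len(BOM_UTF8))
    return offsets


def replace_boms(data: bytes, replacement: bytes = b" ") -> bytes:
    """Replace every UTF-8 BOM sequence, mirroring s/\\xef\\xbb\\xbf/ /g.

    Hypothetical helper: works on raw bytes, so nothing else in the
    input is decoded or altered.
    """
    return data.replace(BOM_UTF8, replacement)


if __name__ == "__main__":
    sample = BOM_UTF8 + b"one" + BOM_UTF8 + b"two"
    print(find_boms(sample))     # offsets of the leading and the embedded BOM
    print(replace_boms(sample))  # both BOMs replaced by a space
```

Running `find_boms` first makes it possible to log which inputs had stray BOMs, which may be preferable to silently rewriting them.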
Ah, nice that you fixed it.
Still, point 2 needs more attention |
I'd say if there are extra (therefore invalid) BOMs, ucto can just fail with an error message and let the user clean up the input first. |
@proycon That could be complicated in batch processing situations, don't you think? |
A fix is available in Git now. |
released v0.32.1 |
Hi,
I ran a batch of rather scientific, older English texts converted to FoLiA by the tei2folia tool and find that Ucto fails on a number of them.
Most seem to have some version of e.g. '½' and '¼', which seem to get lost, causing the tool to fail. One was a discussion of the notation of floating point numbers and what their notation should have been, e.g. '3̂14159' instead of '3.14159'.
I sent you a package containing three of these files, with Ucto's stderr files, by mail.
Hope this can be solved!
Thank you!