Non-BMP Unicode code points break XMI files #338

Open
benadelm opened this issue Aug 27, 2020 · 3 comments
@benadelm

When the attached UTF-8 text file (Unicode-Test.txt) is imported into CorefAnnotator and then saved, the attached XMI file is generated (Unicode-Test-xmi.txt, originally Unicode-Test.xmi, but GitHub does not allow me to upload .xmi files), which in turn cannot be opened again:

org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 2672; Character reference "&#55357" is an invalid XML character.

(The same error occurs when trying to load that file in a different program with Java’s SAX parser for XML.)
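The parser error can be reproduced outside CorefAnnotator with a few lines. This is a minimal sketch, assuming the JDK's default SAX parser; the class name, the method name and the tiny `<a s="..."/>` document are made up for illustration and are not taken from the XMI file.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of CorefAnnotator or UIMA.
class SurrogateReferenceCheck {

    /** Returns true if the default SAX parser accepts the XML text, false if it is not well-formed. */
    static boolean parses(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // DefaultHandler rethrows fatal errors such as invalid character references.
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Two separate references to the surrogate halves, as in the saved XMI file.
        System.out.println(parses("<a s=\"&#55357;&#56834;\"/>"));
        // One reference to the full code point U+1F602.
        System.out.println(parses("<a s=\"&#128514;\"/>"));
    }
}
```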

The text file contains a single Unicode character: 😂 U+1F602 FACE WITH TEARS OF JOY

This character is displayed correctly in the editor window after importing the text file; just saving it does not seem to work. Judging from the column number given in the error message, the problem lies in the sofaString of the following sofa:

<cas:Sofa xmi:id="12" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="&#55357;&#56834;"/>

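Since U+1F602 is a code point outside the Basic Multilingual Plane (BMP), Java’s internal String representation (UTF-16) needs two chars to represent it: the surrogate pair \uD83D \uDE02. It looks like those two chars are escaped individually, as &#55357; and &#56834;. That is not well-formed XML: surrogate code points (U+D800 to U+DFFF) are excluded from the characters XML 1.0 allows, so a character reference must not point at one.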
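To illustrate the difference between escaping per char and escaping per code point, here is a small sketch. Both methods are hypothetical and only meant to show the effect; I do not know how the serializer behind CorefAnnotator actually escapes. They also ignore markup characters such as `&`, `<` and `"`, which a real serializer would have to escape as well.

```java
// Hypothetical illustration, not the code used by CorefAnnotator or UIMA.
class CodePointEscaper {

    /** The suspected faulty behaviour: every non-ASCII UTF-16 char becomes its own reference. */
    static String escapeByChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append("&#").append((int) c).append(';');
            }
        }
        return sb.toString();
    }

    /** Iterates over code points, so a surrogate pair yields a single reference. */
    static String escapeByCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                sb.append((char) cp);
            } else {
                sb.append("&#").append(cp).append(';');
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String face = "\uD83D\uDE02"; // U+1F602 as a surrogate pair
        System.out.println(face.length());                        // number of chars
        System.out.println(face.codePointCount(0, face.length())); // number of code points
        System.out.println(escapeByChar(face));
        System.out.println(escapeByCodePoint(face));
    }
}
```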

When using Java’s javax.xml.transform.Transformer to create an XML file for an org.w3c.dom.Document where the value of an attribute is set to U+1F602 (that is, to "\uD83D\uDE02"), that attribute value becomes "&#128514;", so I think the above sofa should look like this:

<cas:Sofa xmi:id="12" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="&#128514;"/>

Occurred in this release of CorefAnnotator (1.14.3, according to the stack trace below) with Java 13; the javax.xml.transform.Transformer test program produced the above-mentioned output both when run with Java 13 and when run with Java 8.
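A test along these lines might look like the following. It is a sketch, not the exact program used above: the element name `sofa`, the class name and the choice to omit the XML declaration are assumptions made for brevity. Which form the character takes in the output depends on the Transformer implementation that JAXP picks up, so the code only prints the result.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical reconstruction of a Transformer test, for illustration only.
class TransformerAttributeTest {

    /** Builds a one-element DOM with the given attribute value and serializes it to a string. */
    static String serialize(String attributeValue) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("sofa");
            root.setAttribute("sofaString", attributeValue);
            doc.appendChild(root);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+1F602 given as the surrogate pair Java uses internally.
        System.out.println(serialize("\uD83D\uDE02"));
    }
}
```

If the output contains two references in the surrogate range instead of one reference or the literal character, the Transformer in use shows the same problem as the saved XMI file.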

Full stack trace of the exception:

java.io.IOException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 2672; Character reference "&#55357" is an invalid XML character.
        at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
        at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
        at javax.swing.SwingWorker.get(SwingWorker.java:613) ~[?:?]
        at de.unistuttgart.ims.coref.annotator.worker.JCasLoader.done(JCasLoader.java:147) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at javax.swing.SwingWorker$5.run(SwingWorker.java:750) ~[?:?]
        at javax.swing.SwingWorker$DoSubmitAccumulativeRunnable.run(SwingWorker.java:847) ~[?:?]
        at sun.swing.AccumulativeRunnable.run(AccumulativeRunnable.java:112) ~[?:?]
        at javax.swing.SwingWorker$DoSubmitAccumulativeRunnable.actionPerformed(SwingWorker.java:857) ~[?:?]
        at javax.swing.Timer.fireActionPerformed(Timer.java:317) ~[?:?]
        at javax.swing.Timer$DoPostEvent.run(Timer.java:249) ~[?:?]
        at java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:313) ~[?:?]
        at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:770) ~[?:?]
        at java.awt.EventQueue$4.run(EventQueue.java:721) ~[?:?]
        at java.awt.EventQueue$4.run(EventQueue.java:715) ~[?:?]
        at java.security.AccessController.doPrivileged(AccessController.java:391) [?:?]
        at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85) [?:?]
        at java.awt.EventQueue.dispatchEvent(EventQueue.java:740) [?:?]
        at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:203) [?:?]
        at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:124) [?:?]
        at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:113) [?:?]
        at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:109) [?:?]
        at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101) [?:?]
        at java.awt.EventDispatchThread.run(EventDispatchThread.java:90) [?:?]
Caused by: java.io.IOException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 2672; Character reference "&#55357" is an invalid XML character.
        at de.unistuttgart.ims.coref.annotator.plugins.DefaultImportPlugin.getJCas(DefaultImportPlugin.java:87) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at de.unistuttgart.ims.coref.annotator.worker.JCasLoader.readFile(JCasLoader.java:104) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at de.unistuttgart.ims.coref.annotator.worker.JCasLoader.doInBackground(JCasLoader.java:139) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at de.unistuttgart.ims.coref.annotator.worker.JCasLoader.doInBackground(JCasLoader.java:33) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at javax.swing.SwingWorker$1.call(SwingWorker.java:304) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at javax.swing.SwingWorker.run(SwingWorker.java:343) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:830) ~[?:?]
Caused by: org.xml.sax.SAXParseException: Character reference "&#55357" is an invalid XML character.
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at org.apache.uima.cas.impl.XmiCasDeserializer.deserialize(XmiCasDeserializer.java:2066) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at org.apache.uima.cas.impl.XmiCasDeserializer.deserialize(XmiCasDeserializer.java:1983) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at de.unistuttgart.ims.coref.annotator.plugins.DefaultImportPlugin.getJCas(DefaultImportPlugin.java:84) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at de.unistuttgart.ims.coref.annotator.worker.JCasLoader.readFile(JCasLoader.java:104) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at de.unistuttgart.ims.coref.annotator.worker.JCasLoader.doInBackground(JCasLoader.java:139) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at de.unistuttgart.ims.coref.annotator.worker.JCasLoader.doInBackground(JCasLoader.java:33) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at javax.swing.SwingWorker$1.call(SwingWorker.java:304) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at javax.swing.SwingWorker.run(SwingWorker.java:343) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:830) ~[?:?]

Unicode-Test.txt
Unicode-Test-xmi.txt

@benadelm
Author

Some googling suggests that many people experience similar problems due to this bug in Xalan. Does your code use Xalan (maybe indirectly through UIMA)?
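One way to check which XSLT implementation is picked up at runtime is to ask JAXP directly. This is only a diagnostic sketch: it assumes the XMI serialization goes through the standard TransformerFactory lookup, which nothing in this issue confirms, and it has to run with the same classpath as CorefAnnotator to be meaningful. The class and method names are made up.

```java
import java.security.CodeSource;
import javax.xml.transform.TransformerFactory;

// Hypothetical diagnostic, not part of CorefAnnotator.
class WhichTransformer {

    /** Name of the TransformerFactory implementation that JAXP resolves on this classpath. */
    static String factoryClassName() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    /** Where that class was loaded from; null code source means it came from the JDK itself. */
    static String factoryLocation() {
        CodeSource src = TransformerFactory.newInstance().getClass()
                .getProtectionDomain().getCodeSource();
        return src == null ? "(JDK built-in)" : String.valueOf(src.getLocation());
    }

    public static void main(String[] args) {
        System.out.println(factoryClassName());
        System.out.println(factoryLocation());
    }
}
```

A class name starting with `org.apache.xalan` would point to a bundled Xalan; the stack trace above already shows that Xerces (`org.apache.xerces`) is bundled in the full jar.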

@nilsreiter
Owner

Could be. Can you check whether the problem also occurs in the current beta version of 2.0.0? I've updated the UIMA dependencies.

@benadelm
Author

benadelm commented Oct 6, 2020

No, still the same exception.

@nilsreiter nilsreiter added the bug Something isn't working label Mar 20, 2021
@nilsreiter nilsreiter self-assigned this Mar 20, 2021