Issue with Whitespaces for German #4
I tested various text samples in Spanish, English and Croatian. The issue is reproducible only on some German text samples. The only logical cause of the error is the underlying UDPipe model for German. I am afraid there is not much I can do about that given
Have you by any chance encountered the same issue for any other languages?
I don't think this is a UDPipe issue. Tokenisation works perfectly with plain UDPipe. Let me show this with the R wrapper to UDPipe instead:
If you want to reconstruct the original text, you need to take the misc information into account. In UDPipe's CoNLL-U output, that field is where SpaceAfter=No is recorded for tokens that are not followed by a space.
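To illustrate the point about the misc field, here is a minimal Python sketch of how the original text could be rebuilt from tokens. It assumes each token is available as a `(form, misc)` pair in CoNLL-U style; the function name `detokenize` and the sample tokens are made up for illustration and are not part of UDPipe or spacy-udpipe. Only `SpaceAfter=No` is handled here; other whitespace annotations are ignored.

```python
def detokenize(tokens):
    """Rebuild text from (form, misc) pairs in CoNLL-U style.

    Hypothetical helper for illustration only.
    """
    parts = []
    for form, misc in tokens:
        parts.append(form)
        # An empty MISC column is written as "_" in CoNLL-U.
        features = misc.split("|") if misc and misc != "_" else []
        # A token is followed by a space unless MISC says SpaceAfter=No.
        if "SpaceAfter=No" not in features:
            parts.append(" ")
    # Drop the space emitted after the very last token.
    return "".join(parts).rstrip()


# Hand-written sample tokens (assumed, not actual UDPipe output).
sample = [
    ("Juliana", "_"),
    ("kommt", "_"),
    ("aus", "_"),
    ("Paris", "SpaceAfter=No"),
    (".", "_"),
]
print(detokenize(sample))  # Juliana kommt aus Paris.
```

Joining the token forms with a plain space instead would yield `Paris .`, which is the symptom reported in this issue.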
Thank you for investigating this issue.
Both spacy-stanfordnlp and spacy-udpipe construct a
As you mentioned in your first response, the issue strangely happens for some German texts only. In the following code snippet, the first two samples of the
The output on my Ubuntu 16.04.6 machine with Python 3.6.9 is:
I thought there might be a formatting issue in the text, but I cannot find anything.
Hello,
there seems to be an issue with the trailing whitespace of tokens for some languages, e.g. German.
Output:
Juliana kommt aus Paris . Das ist die Hauptstadt von Frankreich . In diesem Sommer macht sie einen Sprachkurs in Freiburg . Das ist eine Universitätsstadt in dem Süden von Deutschland . Es gefällt ihr hier sehr gut . Morgens um neun beginnt der Unterricht , um vierzehn Uhr ist er zu Ende . In ihrer Klasse sind außer Juliana noch 14 weitere Schüler , acht Mädchen und sechs Jungen . Sie kommen alle aus Frankreich , aber nicht aus Paris .
As one can see, each sentence-final period is separated from the last token of its sentence by an extra whitespace. The same holds for other punctuation marks, whereas spaCy would detect whether a token is actually followed by whitespace.
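For reference, the expected behaviour can be sketched in plain Python: align each token with the original string and record whether a whitespace character actually follows it. The helper name `trailing_space_flags` is hypothetical, and the sketch assumes every token occurs verbatim in the text, in order.

```python
def trailing_space_flags(text, words):
    """Return one boolean per word: is it followed by whitespace in text?

    Hypothetical helper; assumes the words appear verbatim and in order.
    """
    flags = []
    pos = 0
    for word in words:
        # str.index raises ValueError if the word cannot be found.
        start = text.index(word, pos)
        pos = start + len(word)
        flags.append(pos < len(text) and text[pos].isspace())
    return flags


text = "Sie kommen aus Paris."
words = ["Sie", "kommen", "aus", "Paris", "."]
flags = trailing_space_flags(text, words)
print(flags)  # [True, True, True, False, False]

# Rebuilding the text from words and flags gives back the original string.
rebuilt = "".join(w + (" " if f else "") for w, f in zip(words, flags))
print(rebuilt)  # Sie kommen aus Paris.
```

Flags of this kind are what spaCy's `Doc` constructor accepts through its `spaces` argument, so a wrapper that supplies them should not produce the extra whitespace before punctuation.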
(Source of sample text: https://lingua.com/german/reading/)