German ordinal numbers lead to over splitting #28

Open
nickluger opened this issue Apr 13, 2022 · 8 comments

@nickluger

nickluger commented Apr 13, 2022

This is a German text containing ordinal numbers. (The original text passed to syntok does not contain \n; the line breaks were added here only for readability.)

Ich habe am 3. Juni Geburtstag. 
Jonas ist Fan vom 1. FC Köln, und du? 
Meine Eltern haben 6 Kinder. 
Dies ist nun der 17. Versuch. 
Friedrich II. war der Sohn von Heinrich VI

is split into the following parts:

"Ich habe am 3.",
"Juni Geburtstag.",
"Jonas ist Fan vom 1.",
"FC Köln, und du?",
"Meine Eltern haben 6 Kinder.",
"Dies ist nun der 17.",
"Versuch.",
"Friedrich II. war der Sohn von Heinrich VI",

I understand that this is very difficult to get right in German, where an uppercase word can follow the ordinal number.

Dates like 3. Juni might be detectable, though. Interestingly, the last part is not split at Friedrich II., unlike all the other examples.

Apart from that, syntok seems to be a sublime sentence splitter for German; thank you for this. 🙏

@fnl
Owner

fnl commented Apr 19, 2022

Hi Nick, thanks for the kind words, and glad you find syntok useful.

Yes, agreed: for the German month-based date cases, setting up a few rules should be easy. I will try to do that ASAP.

For other ordinals, if we find enough hard data to support a sensible rule (like a number followed by a terminal and a sequence of upper-case letters, e.g. "1. FC"), such rules could be added, in theory.
But for that, I would like to see some statistics showing that we are not hurting overall performance, especially for languages other than German.

You have probably figured this out by now, but the last example you showed does not over-split simply because the terminal is followed by a lower-case letter.
However, in German, that particular use of ordinals typically only applies to the ordinals used in the names of noble people.
Normally, an ordinal is followed by a (proper) noun, not preceded by one, in which case the letter after the terminal is upper case, leading to the bad outcome ("17. Versuch") :(
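
To make the kind of rule I mean a bit more concrete, here is a rough sketch (a hypothetical helper, not how syntok is implemented): a bare number before the terminal plus two upper-case letters right after it would mark a no-split candidate.

    import re

    def is_ordinal_before_acronym(prev_token: str, next_token: str) -> bool:
        """Candidate no-split heuristic, e.g. prev_token="1", next_token="FC"."""
        return bool(re.fullmatch(r"\d+", prev_token)) and bool(
            re.match(r"[A-ZÄÖÜ]{2}", next_token)
        )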

@nickluger
Author

Hey Florian,

thank you for your comprehensive explanation.

Agreed on performance; I was trying an ML-based tool for this too, which got one or two of these right but was much slower. Yes, dates would be low-hanging fruit and would handle many cases. Also, two consecutive upper-case letters mostly indicate a proper noun in all targeted languages, but it depends on what happens more often; in German it's rather difficult to construct a (false-positive) sentence that ends with a number while the next one starts with a proper noun.

"Das ist nun Sieg Nr 3. FC-Köln-Fans sind außer sich." is possible, but sounds a bit made-up 😄

For our use case it's not super important to get everything 100% right, as we're feeding an ML tool with raw masses of sentences anyway, and a tiny number of wrongly split ones will not cause us any headaches.

Thanks!

@fnl
Owner

fnl commented Apr 20, 2022

I queried the English Wikipedia with the following regex: /[^0-9A-Za-z][0-9]\. [A-Z][A-Z]+/
That immediately surfaces the following cases that are indeed proper sentence-terminal usages of this pattern and should be split:

  1. "... channel 7. KBS also ..."
  2. "... and HIV-2. HIV-1 is the virus ..."
  3. "... with SSH-1. SSH-2 features ..."
  4. "... higher than 2. CAP of depth 2 ..."

Overall in English, with this pattern, I can only find the "1. FC" case that should not be split and, more importantly, a number of cases that should be split.
Then I tried this pattern on the German Wikipedia, and found the following ordinal expressions that should not be split:

  1. "1. PD"
  2. "1. FFC"
  3. "7. US Armee"
  4. "4. ZK-Plenum"
  5. "4. ATP"
  6. "1. FDJ"
  7. etc. etc.

Therefore, it might be worth elevating the specific expression "1. FC" to a special no-split rule, as well as handling the day-of-month, dot, month-name case for German month names.
Preventing segmentation on the general number-dot-uppercase pattern, on the other hand, would likely lead to false negatives, even though in German (only?!) this pattern is pretty much always a no-split.

Any other thoughts or ideas? Any good ways to prove a different viewpoint?
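
For anyone who wants to replay this locally, the same pattern can be checked with Python's re module; the sample strings below are shortened versions of the snippets above.

    import re

    pattern = re.compile(r"[^0-9A-Za-z][0-9]\. [A-Z][A-Z]+")

    samples = [
        " channel 7. KBS also ",  # English: should be split
        " vom 1. FC Köln, ",      # German: should not be split
        " die 7. US Armee ",      # German: should not be split
    ]
    for sample in samples:
        print(bool(pattern.search(sample)), repr(sample))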

@fnl
Owner

fnl commented Apr 20, 2022

Maybe one would have to add a simple language-detection algorithm to properly solve this case while still being open to any language that uses the Latin alphabet?

@nickluger
Author

Cool, I didn't know one could regex-search Wikipedia. The first sentences could easily appear in German too, though:

  • "... Kanal 7. KBS ist auch"
  • "... und HIV-2. HIV-1 ist der Virus"
  • "... funktioniert besser mit SSH-1. SSH-2 ist eher geeignet....".

Therefore, I think the suggested two-consecutive-uppercase-letters rule would cause false negatives in any language that allows sentences to start with a subject that is a proper noun and carries no article.

The number + dot + month-name pattern, though, appears quite often in most Latin-alphabet languages and deserves special treatment. I have to admit I'm not (currently) proficient enough in Python to write a PR myself.
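
Roughly, I imagine the month rule looking something like the sketch below (a hypothetical helper with a German-only month list; the actual fix inside syntok may look quite different):

    import re

    GERMAN_MONTHS = {
        "Januar", "Februar", "März", "April", "Mai", "Juni",
        "Juli", "August", "September", "Oktober", "November", "Dezember",
    }

    def is_day_before_month(prev_token: str, next_token: str) -> bool:
        """No-split candidate for dates such as "3." followed by "Juni"."""
        return bool(re.fullmatch(r"[0-3]?\d", prev_token)) and next_token in GERMAN_MONTHS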

@fnl
Owner

fnl commented Apr 23, 2022

No worries, I can do those changes. However, my plate is quite full right now, with multiple issues pending, so it might take a week or two until I have a fix for this out there. Hope that's not a problem for you!

@nickluger
Author

Of course not, I'm grateful this library exists at all!

@zerogerc

Hi, I just want to mention that I've also stumbled upon this issue.

I have a concern about a "simple language detection" algorithm, as it can be quite tricky to detect a language; e.g., the langdetect library doesn't work properly on short sentences. Moreover, there are cases where one language is embedded in another.

I would prefer to pass the language as a parameter to the sentence segmentation, as I already know the language of the sentences I want to split.
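
Something along these lines is what I have in mind; the language parameter is hypothetical (as far as I can tell, segmenter.process takes no such argument), and a real fix would have to apply the language-specific rules inside the segmenter rather than around it:

    import syntok.segmenter as segmenter

    def split_sentences(text, language="en"):
        # Sketch only: the language is validated but not otherwise consulted,
        # because the no-split rules would have to live inside syntok itself;
        # this just runs the stock segmentation and yields one string per sentence.
        if language not in ("de", "en"):
            raise ValueError(f"unsupported language: {language}")
        for paragraph in segmenter.process(text):
            for sentence in paragraph:
                yield "".join(t.spacing + t.value for t in sentence).strip()

    sentences = list(split_sentences("Dies ist nun der 17. Versuch.", language="de"))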
