Detect multiple languages in mixed-language text #38

pemistahl · 2020-05-25T14:00:38Z

Currently, for a given input string, only the most likely language is returned. However, if the input contains contiguous sections in multiple languages, it would be desirable to detect all of them and return an ordered sequence of items, where each item consists of a start index, an end index and the detected language.

Input:
He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"

Output:

[
  {"start": 0, "end": 27, "language": ENGLISH}, 
  {"start": 28, "end": 69, "language": GERMAN}
]
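
To make the proposed output shape concrete, here is a minimal Python sketch. The `DetectionResult` class, its field names and the `section_text` helper are hypothetical and not part of Lingua's API. The sketch assumes, as the indices in the example suggest, that `end` is inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionResult:
    """Hypothetical result item: one contiguous section and its language."""
    start: int     # index of the first character of the section
    end: int       # index of the last character (inclusive, as in the example)
    language: str  # language name; a real API would more likely use an enum


def section_text(text: str, result: DetectionResult) -> str:
    """Return the substring covered by the inclusive [start, end] range."""
    return text[result.start:result.end + 1]


text = 'He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"'
results = [
    DetectionResult(0, 27, "ENGLISH"),
    DetectionResult(28, 69, "GERMAN"),
]
for r in results:
    print(r.language, repr(section_text(text, r)))
```

With inclusive end indices, the two items cover the whole 70-character input without gaps or overlap.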


Marcono1234 · 2022-08-29T19:38:18Z

I think the following approach could work for Lingua specifically. Some of the points have footnotes describing further considerations. Note that this is not a scientific approach; there might be better and more performant solutions.

1. Split the text into sections where language switches might occur¹. Candidate boundaries include:
   - Unicode script changes²³
   - Quotation marks⁴⁵
   - Colon (:)
   - Line and page breaks⁵
   - ...
2. For each section, determine the set of languages by rules:
   1. Try to detect the language with LanguageDetector.detectLanguageWithRules
   2. Otherwise, try to detect the possible languages with LanguageDetector.filterLanguagesByRules
3. Merge adjacent sections whose language set has size 1 and which have the same language
4. For each section, determine the confidence values:
   - For sections where the previous steps detected only a single language by rules, the confidence value can be set to 1.0
   - Because accuracy is poor for short texts, merge short texts with subsequent ones if the languages detected by rules permit this (i.e. there must be overlap)⁶
5. Merge adjacent sections whose most common languages are also quite common in the respective other section⁷
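
As a rough illustration of the splitting step and the first merge step, here is a minimal Python sketch. It is not the code from the fork: the function names, the boundary character sets and the choice to attach too-short pieces to the preceding section are all assumptions made for this example. Script-change detection is omitted, and rule-based detection is not implemented; the language sets are simply passed in.

```python
from typing import FrozenSet, List, Tuple

# Illustrative, non-exhaustive boundary characters (assumption). Apostrophe-like
# quotes (' and U+2019) are left out on purpose to avoid splitting inside words.
QUOTES = set('"\u201C\u201D\u201E\u00AB\u00BB')
BREAKS = set(":\n\r\u2028\u2029")

Section = Tuple[int, int, FrozenSet[str]]  # (start, end exclusive, languages)


def count_letters(piece: str) -> int:
    return sum(ch.isalpha() for ch in piece)


def split_sections(text: str, min_letters: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) ranges, end exclusive, that cover the whole text."""
    cuts: List[int] = []
    in_quote = False
    for i, ch in enumerate(text):
        if ch in QUOTES:
            # Cut before an opening quote and after a closing quote.
            cuts.append(i + 1 if in_quote else i)
            in_quote = not in_quote
        elif ch in BREAKS:
            cuts.append(i + 1)

    sections: List[List[int]] = []
    start = 0
    for cut in cuts + [len(text)]:
        if cut <= start:
            continue
        if count_letters(text[start:cut]) >= min_letters:
            sections.append([start, cut])
            start = cut
        elif sections:
            # Too few letters: attach the piece to the previous section.
            sections[-1][1] = cut
            start = cut
        # Otherwise keep `start` so the short piece joins the next one.
    if start < len(text):
        sections.append([start, len(text)])
    return [(s, e) for s, e in sections]


def merge_same_language(sections: List[Section]) -> List[Section]:
    """Merge neighbours that each have exactly one language, and the same one."""
    merged: List[Section] = []
    for start, end, langs in sections:
        if merged and len(langs) == 1 and merged[-1][2] == langs:
            merged[-1] = (merged[-1][0], end, langs)
        else:
            merged.append((start, end, langs))
    return merged


example = 'He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"'
print(split_sections(example))
```

For the example sentence from the issue description this yields two ranges, one ending after the colon and the following space and one covering the quoted German sentence. Ambiguous sections (more than one candidate language) are deliberately left unmerged here, since they are handled by the later confidence-based steps.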

I have implemented this in my fork (file MultiLanguageDetection.kt) and it seems to provide fairly reasonable results, though proper nouns sometimes throw it off. If you build with gradlew jarWithDependencies and then start the JAR from the console, you can also try this out with a GUI (it might not follow Swing best practices though). Please let me know what you think, and which parts of it or of the general approach outlined above could be improved. I would also be interested in how you would have approached this problem. I think it would also be possible to port this to Lingua without many changes (in case you are interested).

¹ Might need to impose a minimum length (e.g. 3 letters) to avoid splitting off sections that are too small, which can cause issues later on.
² Requires special casing for languages which use more than one script, e.g. Japanese.
³ Not sure if detecting script changes is always desired. For example, should a proper noun in Latin script (e.g. "GitHub") within a Chinese text really be considered a separate text section?
⁴ Might need special casing for quotation marks which are also used as apostrophes (' and U+2019), otherwise this can cause issues for detecting the start and end of a section. For example, only consider those characters as quotation marks when they are not enclosed by letters, or ignore them completely.
⁵ Checking only the char categories (e.g. CharCategory.INITIAL_QUOTE_PUNCTUATION or CharCategory.LINE_SEPARATOR) does not seem to suffice because they do not contain all relevant characters. Therefore the characters have to be hardcoded, e.g. based on https://en.wikipedia.org/wiki/Newline#Unicode and https://en.wikipedia.org/wiki/Quotation_mark#Unicode_code_point_table
⁶ Might also have to look ahead to the next section. If that section is long enough for reliable language detection, check whether the current section belongs to the previous one or to the next one, to avoid merging it erroneously with the previous one.
⁷ The confidence value threshold can be determined based on the number of letters in the section. For short texts, languages with lower confidence values (e.g. starting at 0.6) should be considered, whereas for longer texts only languages with high confidence values (close to 1.0) should be considered.
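
The length-dependent threshold from the last footnote, together with the final merge step, could look roughly like the following Python sketch. Everything here is an assumption for illustration: the linear ramp, the upper bound of 0.95, the cut-off of 50 letters, and the function names. Only the lower bound of 0.6 comes from the footnote. Confidence values are passed in as plain non-empty dictionaries rather than computed by Lingua.

```python
from typing import Dict, Set


def confidence_threshold(letter_count: int, low: float = 0.6,
                         high: float = 0.95, long_text: int = 50) -> float:
    """Raise the threshold linearly from `low` (very short text) to `high`
    (texts with at least `long_text` letters). All defaults are assumptions."""
    fraction = min(max(letter_count, 0), long_text) / long_text
    return low + (high - low) * fraction


def candidate_languages(confidences: Dict[str, float], letter_count: int) -> Set[str]:
    """Languages whose confidence reaches the length-dependent threshold."""
    threshold = confidence_threshold(letter_count)
    return {lang for lang, value in confidences.items() if value >= threshold}


def can_merge(conf_a: Dict[str, float], letters_a: int,
              conf_b: Dict[str, float], letters_b: int) -> bool:
    """One reading of the last merge step: the top language of each section
    must also be a candidate language of the other section."""
    top_a = max(conf_a, key=conf_a.get)
    top_b = max(conf_b, key=conf_b.get)
    return (top_a in candidate_languages(conf_b, letters_b)
            and top_b in candidate_languages(conf_a, letters_a))


short_section = {"GERMAN": 0.70, "ENGLISH": 0.65}  # e.g. a 4-letter section
long_section = {"GERMAN": 0.98, "ENGLISH": 0.30}   # e.g. a 60-letter section
print(can_merge(short_section, 4, long_section, 60))
```

In this example the short section is ambiguous on its own, but its top language is also the clear winner in the long neighbour, so the two would be merged. A short section whose top language is absent from the neighbour's candidates would stay separate.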

kargaranamir · 2024-02-11T15:37:15Z

Is it solved?
If yes, I would like to see the approach.
If not, I have a simple method that I can propose.

pemistahl · 2024-02-13T08:56:44Z

@kargaranamir I've implemented an algorithm for my other implementations of Lingua (Go, Rust, Python) already. I haven't found the time yet to implement it here. So yes, it's generally solved but not yet implemented.

kargaranamir · 2024-02-13T12:18:56Z

Thanks for the reply @pemistahl.

I just checked the Python version. In the example, LanguageDetectorBuilder is configured with three languages, and the detector then predicts among them. Does it still work if I run it with all languages supported by Lingua, and even if I pass monolingual sentences?

pemistahl · 2024-02-14T10:13:20Z

@kargaranamir This feature is still experimental. The more languages you add to the mix, the less accurate the result will be. If you can restrict the set of possible languages beforehand, do so, as it will produce better results in most cases.

pemistahl · 2024-10-01T18:15:52Z

Closed in favor of #214. The Rust implementation already contains this feature.

pemistahl added the enhancement New feature or request label May 25, 2020

pemistahl added this to the Lingua 1.1.0 milestone May 25, 2020

pemistahl mentioned this issue Jun 4, 2020

a priori reward for languages #45

Closed

pemistahl added new feature and removed enhancement New feature or request labels Jun 16, 2020

pemistahl modified the milestones: Lingua 1.1.0, Lingua 1.2.0 Apr 24, 2021

Marcono1234 mentioned this issue Jun 19, 2021

Most frequent alphabet detection causes inaccuracies when multiple alphabets have same occurrence count #105

Closed

pemistahl changed the title ~~Detect multiple languages in textual input~~ Detect multiple languages in mixed-language text Jan 22, 2022

pemistahl modified the milestones: Lingua 1.2.0, Lingua 1.3.0 May 27, 2022

pemistahl modified the milestones: Lingua 1.3.0, Lingua 1.4.0 Nov 5, 2022

pemistahl modified the milestones: Lingua 1.4.0, Lingua 1.3.0 Oct 25, 2023

pemistahl removed this from the Lingua 1.3.0 milestone Oct 1, 2024

pemistahl closed this as completed Oct 1, 2024
