Detect multiple languages in mixed-language text #38

Closed
pemistahl opened this issue May 25, 2020 · 6 comments
Comments

@pemistahl
Owner

pemistahl commented May 25, 2020

Currently, only the most likely language is returned for a given input string. However, if the input contains contiguous sections written in different languages, it would be desirable to detect all of them and return an ordered sequence of items, where each item consists of a start index, an end index and the detected language.

Input:
He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"

Output:

[
  {"start": 0, "end": 27, "language": ENGLISH}, 
  {"start": 28, "end": 69, "language": GERMAN}
]
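
A minimal sketch of how such a result item could be modelled. The class name `DetectedSpan` and the `text` helper are hypothetical and not part of Lingua. Reading the end index as inclusive is an assumption based on the example above, where the spans 0 to 27 and 28 to 69 cover the 70-character input without overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectedSpan:
    """One contiguous section of the input and its detected language (hypothetical type)."""
    start: int     # index of the first character of the section
    end: int       # index of the last character (assumed inclusive)
    language: str  # language name; a real API would more likely use an enum

    def text(self, source: str) -> str:
        # Python slices exclude the upper bound, hence end + 1.
        return source[self.start:self.end + 1]


source = 'He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"'
spans = [
    DetectedSpan(0, 27, "ENGLISH"),
    DetectedSpan(28, 69, "GERMAN"),
]

for span in spans:
    print(span.language, repr(span.text(source)))
```

With inclusive end indices, adjacent spans never share a character, so concatenating the span texts in order reproduces the input.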
pemistahl added the enhancement label on May 25, 2020
pemistahl added this to the Lingua 1.1.0 milestone on May 25, 2020
pemistahl added the new feature label and removed the enhancement label on Jun 16, 2020
pemistahl changed the milestone from Lingua 1.1.0 to Lingua 1.2.0 on Apr 24, 2021
pemistahl changed the title from "Detect multiple languages in textual input" to "Detect multiple languages in mixed-language text" on Jan 22, 2022
pemistahl changed the milestone from Lingua 1.2.0 to Lingua 1.3.0 on May 27, 2022
@Marcono1234
Contributor

Marcono1234 commented Aug 29, 2022

I think the following approach could work specifically for Lingua. Some of the points have footnotes describing further considerations. Note that this is not a scientific approach; there might be better and more performant solutions.

  1. Split the text into sections where language switches might occur [1]. This includes:
    • Unicode script changes [2][3]
    • Quotation marks [4][5]
    • Colon (:)
    • Line and page breaks [5]
    • ...
  2. For each section, determine the set of languages by rules
    1. Try to detect the language with LanguageDetector.detectLanguageWithRules
    2. Otherwise, try to detect the possible languages with LanguageDetector.filterLanguagesByRules
  3. Merge adjacent sections whose language sets have size 1 and contain the same language
  4. For each section, determine the confidence values
    • For sections where the previous steps detected only a single language by rules, the confidence value can be set to 1.0
    • Because accuracy is poor for short texts, merge short texts with subsequent ones if the languages detected by rules permit this (i.e. there must be overlap) [6]
  5. Merge adjacent sections whose most common languages are also quite common in the respective other section [7]
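
Steps 1 and 3 can be sketched roughly as follows. This is an illustration only, not the code from the fork: the `Section` type, the boundary character set, the minimum length of 3 letters and the script heuristic (first word of the Unicode character name) are all assumptions made for this example. Sections with too few letters are attached to the preceding section. The rule-based detection of step 2 is left out, so the language sets are assumed to be filled in by the caller.

```python
import unicodedata
from dataclasses import dataclass
from typing import FrozenSet, List

# Hypothetical, deliberately incomplete boundary set: quotation marks, colon, line breaks.
BOUNDARY_CHARS = frozenset('":\n\r\u00ab\u00bb\u201c\u201d\u201e')
MIN_LETTERS = 3  # assumed minimum section size


@dataclass
class Section:
    start: int  # inclusive
    end: int    # exclusive
    languages: FrozenSet[str] = frozenset()


def rough_script(ch: str) -> str:
    """Crude script proxy: the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]


def split_sections(text: str) -> List[Section]:
    """Step 1: cut at boundary characters and at changes of the (rough) script."""
    cuts = [0]
    prev_script = None
    for i, ch in enumerate(text):
        if ch in BOUNDARY_CHARS:
            cuts.append(i)
        elif ch.isalpha():
            script = rough_script(ch)
            if prev_script is not None and script != prev_script:
                cuts.append(i)
            prev_script = script
    cuts.append(len(text))

    sections: List[Section] = []
    start = 0
    for cut in cuts[1:]:
        if cut == start:
            continue
        letters = sum(c.isalpha() for c in text[start:cut])
        if letters >= MIN_LETTERS:
            sections.append(Section(start, cut))
            start = cut
        elif sections:
            # Too short to stand alone: attach it to the preceding section.
            sections[-1].end = cut
            start = cut
        # Otherwise there is no preceding section yet; the piece joins the next one.
    if start < len(text):
        sections.append(Section(start, len(text)))
    return sections


def merge_same_language(sections: List[Section]) -> List[Section]:
    """Step 3: merge neighbours that each have exactly one, identical language."""
    merged: List[Section] = []
    for sec in sections:
        prev = merged[-1] if merged else None
        if prev is not None and len(sec.languages) == 1 and prev.languages == sec.languages:
            prev.end = sec.end
        else:
            merged.append(Section(sec.start, sec.end, sec.languages))
    return merged


text = 'He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"'
sections = split_sections(text)
print([(s.start, s.end) for s in sections])  # [(0, 28), (28, 70)]
```

Because the script heuristic treats Hiragana, Katakana and CJK ideographs as different scripts, it would split Japanese text at every script change, which is the special case mentioned in footnote 2.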

I have implemented this in my fork (file MultiLanguageDetection.kt) and it seems to provide fairly reasonable results, though proper nouns sometimes throw it off. When you build with gradlew jarWithDependencies and then start the JAR from the console, you can also try this out with a GUI (it might not follow Swing best practices, though). Please let me know what you think, and which parts of the implementation or of the general approach outlined above could be improved. I would also be interested in how you would have approached this problem. I think it would also be possible to port this to Lingua without many changes (in case you are interested).

Footnotes

  1. Might need to impose a minimum length (e.g. 3 letters) to avoid creating sections that are too small, which can cause issues later on.

  2. Requires special casing for languages which use more than one script, e.g. Japanese

  3. Not sure if detecting script changes is always desired. For example, should a proper noun in Latin script (e.g. "GitHub") within a Chinese text really be considered a separate text section?

  4. Might need special casing for quotation marks which are also used as apostrophes (' and U+2019); otherwise this can cause issues when detecting section start and end. For example, only consider those characters as quotation marks when they are not enclosed by letters, or ignore them completely.

  5. Checking only the char categories (e.g. CharCategory.INITIAL_QUOTE_PUNCTUATION or CharCategory.LINE_SEPARATOR) does not seem to suffice because they do not contain all relevant characters. Therefore the characters have to be hardcoded, e.g. based on https://en.wikipedia.org/wiki/Newline#Unicode and https://en.wikipedia.org/wiki/Quotation_mark#Unicode_code_point_table

  6. Might also have to look forward to the next section. In case that section is long enough for reliable language detection, check whether current section rather belongs to previous one or next one, to avoid merging it erroneously with previous one.

  7. The confidence value threshold can be determined based on the number of letters in the section. For short texts, languages with lower confidence values (e.g. starting at 0.6) should be considered, whereas for longer texts, only languages with high confidence values (close to 1.0) should be considered.
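
Footnotes 4 and 7 can be illustrated with two small helpers. Both are sketches under assumptions: the function names are hypothetical, and the length bounds (20 and 120 letters), the upper threshold of 0.95 and the linear interpolation are made up for this example. Only the lower value of 0.6 is taken from the footnote.

```python
APOSTROPHE_LIKE = {"'", "\u2019"}


def is_quoting_apostrophe(text: str, i: int) -> bool:
    """Footnote 4: treat ' and U+2019 as quotation marks only when not enclosed by letters."""
    if text[i] not in APOSTROPHE_LIKE:
        return False
    before = text[i - 1] if i > 0 else ""
    after = text[i + 1] if i + 1 < len(text) else ""
    # "".isalpha() is False, so the start and end of the text count as non-letters.
    return not (before.isalpha() and after.isalpha())


def confidence_threshold(
    letter_count: int,
    short_len: int = 20,   # assumed: at or below this, use the low threshold
    long_len: int = 120,   # assumed: at or above this, use the high threshold
    low: float = 0.6,
    high: float = 0.95,    # assumed stand-in for "close to 1.0"
) -> float:
    """Footnote 7: lower threshold for short sections, higher for long ones."""
    if letter_count <= short_len:
        return low
    if letter_count >= long_len:
        return high
    fraction = (letter_count - short_len) / (long_len - short_len)
    return low + fraction * (high - low)


print(is_quoting_apostrophe("don't", 3))    # False: enclosed by letters
print(is_quoting_apostrophe("'hello'", 0))  # True: opening quotation mark
print(confidence_threshold(10), confidence_threshold(500))  # 0.6 0.95
```

The apostrophe check still misclassifies plural possessives such as "dogs' toys", where the apostrophe is followed by a space. Ignoring these characters completely, as the footnote also suggests, avoids that case.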

pemistahl changed the milestone from Lingua 1.3.0 to Lingua 1.4.0 on Nov 5, 2022
pemistahl changed the milestone from Lingua 1.4.0 to Lingua 1.3.0 on Oct 25, 2023
@kargaranamir

kargaranamir commented Feb 11, 2024

Is it solved?
If yes, I would like to see the approach.
If not, I have a simple method that I can propose.

@pemistahl
Owner Author

@kargaranamir I've implemented an algorithm for my other implementations of Lingua (Go, Rust, Python) already. I haven't found the time yet to implement it here. So yes, it's generally solved but not yet implemented.

@kargaranamir

Thanks for the reply @pemistahl.

I just checked the Python version. In the example, LanguageDetectorBuilder is configured with three languages, and the detector then predicts only among them. Does it still work if I run it on all languages supported by Lingua, and even if I pass monolingual sentences?

@pemistahl
Owner Author

@kargaranamir This feature is still experimental. The more languages you add to the mix, the less accurate the result will be. If you can restrict the number of possible languages beforehand, do so, as it will produce better results in most cases.
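
For readers who want to compare both configurations in the Python version, here is a small sketch. The wrapper function `detect_spans` is hypothetical. The calls inside it (`LanguageDetectorBuilder.from_languages`, `from_all_languages`, `detect_multiple_languages_of` and the `start_index`, `end_index` and `language` attributes of each result) follow the lingua-language-detector package as I understand it, so check them against the version you have installed.

```python
def detect_spans(text, languages=None):
    """Return (language name, substring) pairs for a possibly mixed-language text.

    `languages` is an optional list of lingua `Language` members. When it is
    omitted, the detector is built with all supported languages, which the
    comment above expects to be less accurate.
    """
    # Imported lazily so that this module loads even without the dependency.
    from lingua import LanguageDetectorBuilder

    if languages:
        builder = LanguageDetectorBuilder.from_languages(*languages)
    else:
        builder = LanguageDetectorBuilder.from_all_languages()
    detector = builder.build()

    return [
        (result.language.name, text[result.start_index:result.end_index])
        for result in detector.detect_multiple_languages_of(text)
    ]
```

Calling `detect_spans(sentence, [Language.ENGLISH, Language.GERMAN])` and `detect_spans(sentence)` on the same input makes it easy to see how the restricted and the unrestricted detector differ on your own data.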

pemistahl removed this from the Lingua 1.3.0 milestone on Oct 1, 2024
@pemistahl
Owner Author

Closed in favor of #214. The Rust implementation already contains this feature.
