Creating Language Profiles

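Fabian Kessler edited this page Oct 8, 2016 · 1 revision

```java
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileBuilder;
import com.optimaize.langdetect.profiles.LanguageProfileWriter;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObject;
import com.optimaize.langdetect.text.TextObjectFactory;

// create the text object factory:
TextObjectFactory textObjectFactory = CommonTextObjectFactories.forIndexingCleanText();

// load your training text:
TextObject inputText = textObjectFactory.create()
        .append("this is my")
        .append("training text");

// create the profile:
LanguageProfile languageProfile = new LanguageProfileBuilder("en")
        .ngramExtractor(NgramExtractors.standard())
        .minimalFrequency(5) // adjust this to your training text
        .addText(inputText)
        .build();

// store it to disk if you like:
new LanguageProfileWriter().writeToDirectory(languageProfile, "c:/foo/bar");
```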

For the profile name, use the ISO 639-1 language code if there is one, otherwise the ISO 639-3 code.
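The naming rule can be sketched as a tiny helper. This is only an illustration: `ProfileNames` and `choose` are hypothetical names, not part of the library, and the caller is assumed to look up both codes themselves.

```java
import java.util.Locale;

/** Hypothetical helper (not part of the library): picks the name for a new profile. */
final class ProfileNames {
    private ProfileNames() {}

    /**
     * Returns the ISO 639-1 code when the language has one, otherwise the ISO 639-3 code.
     * Both codes are supplied by the caller; nothing is looked up or validated here.
     */
    static String choose(String iso6391OrNull, String iso6393) {
        if (iso6391OrNull != null && !iso6391OrNull.trim().isEmpty()) {
            return iso6391OrNull.trim().toLowerCase(Locale.ROOT);
        }
        if (iso6393 == null || iso6393.trim().isEmpty()) {
            throw new IllegalArgumentException("an ISO 639-3 code is required when there is no ISO 639-1 code");
        }
        return iso6393.trim().toLowerCase(Locale.ROOT);
    }
}
```

For example, `ProfileNames.choose("de", "deu")` yields `de` for German, while `ProfileNames.choose(null, "ceb")` yields `ceb` for Cebuano, which has no ISO 639-1 code.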

The training text should be fairly clean. It is a good idea to remove passages written in other languages, such as English phrases or Latin-script content in a Cyrillic text. Some people also remove proper nouns such as (international) place names when there are too many of them. How far you go is up to you. As a general rule, the cleaner the text, the better its profile. If you scrape text from Wikipedia, use only the main content, without the left-side navigation and similar page elements.
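As one possible cleaning step, the sketch below strips Latin-script runs from a text that is meant to be in another script, such as Cyrillic. It uses only the standard `java.util.regex` package. `TrainingTextCleaner` and `removeLatinScript` are hypothetical names, and a real corpus will likely need further, language-specific cleaning.

```java
import java.util.regex.Pattern;

/** Hypothetical cleaning helper (not part of the library). */
final class TrainingTextCleaner {
    // One or more consecutive Latin-script letters, e.g. English words inside Cyrillic text.
    private static final Pattern LATIN_RUN = Pattern.compile("\\p{IsLatin}+");
    // Two or more whitespace characters, left behind after the removal above.
    private static final Pattern EXTRA_WHITESPACE = Pattern.compile("\\s{2,}");

    private TrainingTextCleaner() {}

    /** Replaces Latin-script runs with a space, then collapses whitespace and trims. */
    static String removeLatinScript(String text) {
        String withoutLatin = LATIN_RUN.matcher(text).replaceAll(" ");
        return EXTRA_WHITESPACE.matcher(withoutLatin).replaceAll(" ").trim();
    }
}
```

The cleaned string can then be passed to `append(...)` on the text object. Note that this approach only suits languages that are not themselves written in Latin script.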

The profile size should be similar to that of the existing profiles for practical reasons. When the likelihood of an identified language is computed, the index size is taken into account, so a language with a larger profile does not have a higher probability of being chosen.
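To see why a larger profile is not favoured, consider dividing each n-gram count by the total count in its profile. The sketch below only illustrates that idea and is not the library's actual formula. `RelativeFrequencies` and `normalize` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only (not the library's implementation): counts relative to profile size. */
final class RelativeFrequencies {
    private RelativeFrequencies() {}

    /** Divides every n-gram count by the sum of all counts in the same profile. */
    static Map<String, Double> normalize(Map<String, Integer> counts) {
        long total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        Map<String, Double> relative = new HashMap<>();
        if (total == 0) {
            return relative; // empty profile: nothing to normalize
        }
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            relative.put(entry.getKey(), entry.getValue() / (double) total);
        }
        return relative;
    }
}
```

Under this scheme, a profile with counts `th=30, he=70` and one with `th=3000, he=7000` give the same relative frequency for `th`, so the hundredfold difference in training size does not by itself change the score.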

Please contribute your new language profile to this project. The file can be added to the languages folder and then referenced in the BuiltInLanguages class. Alternatively, open a ticket and provide a download link.

It is also a good idea to put the original text, along with the code used to modify (clean) it, into a new project on GitHub. This lets others improve on your work, or even use the training text in other, non-Java software.

If you are stuck, ask on the bug tracker. Other users have recently added language profiles, so someone may be able and willing to help you out.
