Creates a database containing inflections and associated lemmata

Vuizur/czech-inflections-lemmatizer

Czech lemmatizer and inflection finder

This project uses the data from https://ufal.mff.cuni.cz/morfflex to generate an SQLite database that can then be used to:

  • Find the lemma of a word
  • Find all inflections of a word

Warning: The generated database is larger than 10 GB. Lemmatization based on part of speech is not currently supported, but pull requests are welcome.

Install

Download the file from https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3186/czech-morfflex-2.0.tsv.xz and unpack it. Then edit the path constants in load_lemma_file.py so that they point to the correct locations, and run the script.
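For orientation, here is a minimal sketch of what such a loading step could look like. It is not the repository's load_lemma_file.py. It assumes that each line of the TSV holds a lemma, a morphological tag, and a word form, in that order, so check the column order against the MorfFlex documentation before relying on it. The function name `load_morfflex`, the table name `inflections`, and the column names are hypothetical. Unlike the steps above, it reads the .xz file directly instead of unpacking it first.

```python
import csv
import lzma
import sqlite3


def load_morfflex(tsv_xz_path, db_path, batch_size=100_000):
    """Stream an .xz-compressed TSV into an SQLite table of (form, lemma) pairs.

    Assumption: columns are lemma, tag, form (in that order).
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS inflections (form TEXT, lemma TEXT)")
    insert = "INSERT INTO inflections (form, lemma) VALUES (?, ?)"
    batch = []
    # lzma.open in text mode decompresses on the fly, so the unpacked
    # file does not have to be written to disk first.
    with lzma.open(tsv_xz_path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 3:
                continue  # skip blank or malformed lines
            lemma, _tag, form = row[0], row[1], row[2]
            batch.append((form, lemma))
            # Insert in batches to keep memory use bounded on a very large file.
            if len(batch) >= batch_size:
                conn.executemany(insert, batch)
                batch.clear()
    if batch:
        conn.executemany(insert, batch)
    # Build the indexes after the bulk insert so that both lookup
    # directions (form to lemma, lemma to forms) avoid full table scans.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_form ON inflections (form)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_lemma ON inflections (lemma)")
    conn.commit()
    conn.close()
```

Creating the indexes only after all rows are inserted is usually faster than maintaining them during the bulk load, which matters for a database of this size.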

Once the database has been built, you can query it as shown below.

Usage

# Assumes the Lemmatizer class from this repository has already been imported.
lemm = Lemmatizer("lemma_inflection.db")
print(lemm.find_inflections("červenat"))  # ['nečervenána', 'červenány', 'nečervenány', 'červenán', ...]
print(lemm.find_lemma("lesa"))  # ['les']
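To illustrate how the two lookups could map onto SQL queries, here is a hedged sketch. It is not the project's actual Lemmatizer. It assumes a hypothetical table named `inflections` with the text columns `form` and `lemma`, and the class name `SketchLemmatizer` is invented for this example.

```python
import sqlite3


class SketchLemmatizer:
    """Hypothetical stand-in for the project's Lemmatizer, for illustration only."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)

    def find_lemma(self, form):
        # A form may belong to several lemmata, so a list is returned.
        rows = self.conn.execute(
            "SELECT DISTINCT lemma FROM inflections WHERE form = ? ORDER BY lemma",
            (form,),
        )
        return [lemma for (lemma,) in rows]

    def find_inflections(self, lemma):
        # Collect every distinct word form recorded for the given lemma.
        rows = self.conn.execute(
            "SELECT DISTINCT form FROM inflections WHERE lemma = ? ORDER BY form",
            (lemma,),
        )
        return [form for (form,) in rows]
```

Because lemmatization is not disambiguated by part of speech, a lookup of this kind returns every lemma recorded for a form and leaves the choice among them to the caller.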
