Czech lemmatizer and inflection finder

This project uses the data from https://ufal.mff.cuni.cz/morfflex to generate an SQLite database that can then be used to:

Find the lemma of a word
Find all inflections of a word
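To illustrate how both lookups can be served by a single SQLite table, here is a minimal sketch. The table name `inflections`, its columns, and the helper functions `lookup_lemma` and `lookup_inflections` are assumptions made for illustration; the schema this project actually generates may differ.

```python
import sqlite3

# Hypothetical schema: one row per (lemma, inflected form) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inflections (lemma TEXT NOT NULL, form TEXT NOT NULL)")
# Indexes on both columns keep lookups fast in either direction.
conn.execute("CREATE INDEX idx_form ON inflections (form)")
conn.execute("CREATE INDEX idx_lemma ON inflections (lemma)")
conn.executemany(
    "INSERT INTO inflections (lemma, form) VALUES (?, ?)",
    [("les", "les"), ("les", "lesa"), ("les", "lesem"), ("les", "lesy")],
)


def lookup_lemma(word):
    """Return every lemma that has `word` as one of its forms."""
    rows = conn.execute(
        "SELECT DISTINCT lemma FROM inflections WHERE form = ? ORDER BY lemma",
        (word,),
    )
    return [row[0] for row in rows]


def lookup_inflections(word):
    """Return all forms that share a lemma with `word`."""
    rows = conn.execute(
        "SELECT DISTINCT form FROM inflections "
        "WHERE lemma IN (SELECT lemma FROM inflections WHERE form = ?) "
        "ORDER BY form",
        (word,),
    )
    return [row[0] for row in rows]


print(lookup_lemma("lesa"))
print(lookup_inflections("lesa"))
```

Because no part of speech is stored in this sketch, a form that belongs to several lemmas would simply return all of them.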

Warning: The generated database is larger than 10 GB. Lemmatization based on part of speech is not currently supported, but pull requests are welcome.

Install

Download the file from https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3186/czech-morfflex-2.0.tsv.xz, unpack it, edit the path constants in load_lemma_file.py so that they point to the correct locations, and then run that script.
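For orientation, the loading step could look roughly like the sketch below. It is not the code in load_lemma_file.py. The function `load_tsv`, the table layout, and the assumed column order (lemma, tag, form) are all illustrative assumptions, and the two sample rows use placeholder tags rather than real MorfFlex data.

```python
import io
import lzma
import sqlite3


def load_tsv(stream, conn, batch_size=10_000):
    """Insert (lemma, form) pairs from tab-separated lines; return the row count."""
    conn.execute("CREATE TABLE IF NOT EXISTS inflections (lemma TEXT, form TEXT)")
    insert = "INSERT INTO inflections (lemma, form) VALUES (?, ?)"
    batch, total = [], 0
    for line in stream:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        # Assumed column order: lemma, tag, form. The tag is ignored here.
        lemma, form = parts[0], parts[2]
        batch.append((lemma, form))
        if len(batch) >= batch_size:
            # Insert in batches so memory use stays bounded on a large file.
            conn.executemany(insert, batch)
            total += len(batch)
            batch.clear()
    if batch:
        conn.executemany(insert, batch)
        total += len(batch)
    conn.commit()
    return total


# Tiny in-memory demo with made-up rows standing in for the real .tsv.xz file.
sample = "les\tTAG1\tles\nles\tTAG2\tlesa\n"
compressed = io.BytesIO(lzma.compress(sample.encode("utf-8")))
conn = sqlite3.connect(":memory:")
with lzma.open(compressed, "rt", encoding="utf-8") as stream:
    inserted = load_tsv(stream, conn)
print(inserted)
```

Python's `lzma.open` also accepts a file path, so a script along these lines could read the .xz archive directly and stream it line by line.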

Usage

Once the database has been created, you can use it like this:

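# Assumed import path; adjust it to the module that defines Lemmatizer.
from czech_inflections_lemmatizer import Lemmatizer

lemm = Lemmatizer("lemma_inflection.db")  # path to the generated SQLite database
print(lemm.find_inflections("červenat"))  # ['nečervenána', 'červenány', 'nečervenány', 'červenán', ...]
print(lemm.find_lemma("lesa"))  # ['les']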
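The two calls can be combined, for example to expand a search term into every form of its lemma. The sketch below assumes only that `find_lemma` and `find_inflections` return lists of strings, as the output above suggests. `FakeLemmatizer` and `expand_query` are hypothetical names, and the stand-in class exists only so the example runs without the 10 GB database.

```python
class FakeLemmatizer:
    """In-memory stand-in exposing the same two method names as above."""

    def __init__(self, table):
        self._table = table  # maps lemma -> list of its forms

    def find_lemma(self, word):
        return sorted(lemma for lemma, forms in self._table.items() if word in forms)

    def find_inflections(self, word):
        return list(self._table.get(word, []))


def expand_query(lemm, word):
    """Collect the word, its lemma(s), and all forms of each lemma."""
    forms = {word}
    for lemma in lemm.find_lemma(word):
        forms.add(lemma)
        forms.update(lemm.find_inflections(lemma))
    return sorted(forms)


# "ženu" is both a form of the noun "žena" and of the verb "hnát".
lemm = FakeLemmatizer({
    "hnát": ["hnát", "ženu", "ženeš", "žene"],
    "žena": ["žena", "ženy", "ženu"],
})
print(expand_query(lemm, "ženu"))
```

Without part-of-speech information, an ambiguous form such as this one expands to the forms of every matching lemma, so callers may want to filter the result themselves.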
