N-gram Language Models

This is the first assignment of the NLP course. The task is to train and evaluate language models on an English corpus. We used the English side of a Greek-English parallel corpus (Europarl). We download the corpus and split it into a train and a test set, so the training sentences are in the file “europarl-v7.el-en.en.train” and the test sentences in “europarl-v7.el-en.en.test”.

Special care is taken when the user is not running the code for the first time, giving the option to reprocess the dataset and/or retrain the language models. If you need access to the already trained language models, please follow this link.

Results

We estimate the cross-entropy and perplexity of our models on part of the padded test set (100 sentences), treating it as a single sequence. The function perplexity() computes entropy and perplexity for two cases:

  1. Including probabilities of the form P(start|...) (or P(start1|...) or P(start2|...)) and P(end|...) in the computation of perplexity.
  2. Not including probabilities of the form P(start|...) (or P(start1|...) or P(start2|...)) in the computation of perplexity, but including probabilities of the form P(end|...).
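The two cases above could be sketched like this for a trigram model. This is a hedged illustration, not the repository's perplexity(): the pad token names, the `logprob(w, ctx)` scoring callable, and base-2 logs are all assumptions; the concatenation into one sequence and the start-token toggle follow the README.

```python
import math

START1, START2, END = "<start1>", "<start2>", "<end>"

def perplexity(sentences, logprob, include_starts=True):
    """Cross-entropy and perplexity over padded sentences as one sequence.

    `logprob(w, ctx)` is a hypothetical callable returning log2 P(w | ctx).
    The README's two cases differ only in whether the P(start|...) terms
    enter the sum; P(end|...) is always included.
    """
    seq = []
    for sent in sentences:
        seq += [START1, START2] + list(sent) + [END]  # pad each sentence, then concatenate
    total, count = 0.0, 0
    for i in range(2, len(seq)):
        w = seq[i]
        if not include_starts and w in (START1, START2):
            continue  # case 2: skip P(start|...) terms
        total += logprob(w, (seq[i - 2], seq[i - 1]))
        count += 1
    entropy = -total / count          # average negative log2-likelihood
    return entropy, 2.0 ** entropy    # perplexity = 2^entropy
```

With a uniform model over 8 words (every log-probability equal to log2(1/8)), both cases give entropy 3 and perplexity 8, which is a quick sanity check for an implementation like this.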

In simple linear interpolation, we combine n-grams of different orders by linearly interpolating all the models. Here, we combine the unigram, bigram and trigram maximum-likelihood estimates using linear interpolation and check whether the combined model performs better. The best l1, l2, l3 parameters in perplexity_interpolated() were found after some trials on a validation set of 100 sentences (l1 = 2/10, l2 = 8/10, l3 = 1/10).
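The interpolated estimate could look like the sketch below. The dictionary layout and the mapping of l1/l2/l3 to the unigram/bigram/trigram terms are assumptions (the README does not state which weight goes with which order); the weight values are the ones quoted above.

```python
def interpolated_prob(w, w1, w2, uni, bi, tri, l1=0.2, l2=0.8, l3=0.1):
    """P(w | w1 w2) as a weighted mix of trigram, bigram and unigram MLEs.

    `uni`, `bi`, `tri` are assumed to map n-gram tuples (or a word, for
    unigrams) to maximum-likelihood probabilities; unseen n-grams get 0.
    """
    return (l3 * tri.get((w1, w2, w), 0.0)   # trigram term
            + l2 * bi.get((w2, w), 0.0)      # bigram term
            + l1 * uni.get(w, 0.0))          # unigram term
```

For example, with uni = {"a": 0.5}, bi = {("x", "a"): 0.25}, tri = {("y", "x", "a"): 0.1}, the call interpolated_prob("a", "y", "x", uni, bi, tri) mixes 0.1·0.1 + 0.8·0.25 + 0.2·0.5.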

Comparing with the Table 1 results, we observe that the interpolated models achieve much better (lower) perplexity.

Acknowledgement

The Natural Language Processing course is part of the MSc in Computer Science of the Department of Informatics, Athens University of Economics and Business. The course covers algorithms, models, and systems that allow computers to process natural language text and/or speech.
