(re)Implementation of Learning Multi-level Dependencies for Robust Word Recognition

eatsleepraverepeat/reMUDE

(re)MUDE

Summary

The original paper introduces a robust word recognition framework that captures multi-level sequential dependencies in noised sentences. Practical applications of such a framework include the challenging task of Grammatical Error Correction and improving the robustness of modern NLP setups.

Model architecture:

[mude-arc: model architecture diagram]

Why

Despite a clearly written paper, the released original code lacks structure, guidance and reproducibility. Community members have also found critical bugs in the original implementation. Here is my attempt to organize the bits and pieces in an intuitive way.

Details

It is a fully PyTorch-based implementation, using torch.nn.TransformerEncoder, torch.utils.data.Dataset, torch.utils.data.DataLoader and the blazing Ignite, for my convenience and yours.
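For orientation, here is a minimal sketch of the multi-level idea, assuming a character-level TransformerEncoder that yields one vector per word and a word-level bidirectional GRU on top; all names and sizes are illustrative, this is not the actual reMUDE code from src/:

import torch
import torch.nn as nn

# Illustrative sketch, not the reMUDE modules: a character-level Transformer
# encoder builds one vector per word; a word-level bidirectional GRU then
# models dependencies between words.
class CharWordEncoder(nn.Module):
    def __init__(self, char_vocab_size, d_model=128, nhead=4, num_layers=2, hidden=256):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab_size, d_model, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
        self.char_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.word_rnn = nn.GRU(d_model, hidden, batch_first=True, bidirectional=True)

    def forward(self, chars):
        # chars: (batch, words, chars) tensor of character ids, 0 == padding
        b, w, c = chars.shape
        x = self.char_emb(chars.view(b * w, c))      # (b*w, c, d_model)
        x = self.char_encoder(x.transpose(0, 1))     # (c, b*w, d_model), sequence-first
        word_repr = x[0].view(b, w, -1)              # vector at the first (c0-like) position
        out, _ = self.word_rnn(word_repr)            # (b, w, 2 * hidden)
        return out

model = CharWordEncoder(char_vocab_size=50)
out = model(torch.randint(1, 50, (2, 6, 12)))        # two sentences of six 12-char words
print(out.shape)                                     # torch.Size([2, 6, 512])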

Data

The authors said:

Lastly, as this work primarily focuses on English, it would be very meaningful to experiment the proposed framework on other languages.

So I took it seriously and trained/evaluated experimental runs on a small corpus of Russian news texts. The preprocessed train, valid and test splits, together with the vocabulary, are placed in data.
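As an illustration of how such splits might be consumed (the actual vectorizer lives in src/ and may differ; all names here are my own), a sentence can be turned into a padded tensor of character ids with a prepended c0 symbol:

import torch

# Illustrative vectorizer sketch; see src/ for the real one.
# Index 0 is padding, and c0 is a separate symbol (see Notes, item 2).
class CharVectorizer:
    def __init__(self, alphabet, pad="<pad>", c0="<c0>", unk="<unk>"):
        self.c0, self.unk = c0, unk
        self.itos = [pad, c0, unk] + sorted(set(alphabet))
        self.stoi = {ch: i for i, ch in enumerate(self.itos)}

    def encode(self, sentence, max_word_len=20):
        words = sentence.split()
        ids = torch.zeros(len(words), max_word_len, dtype=torch.long)  # 0 == <pad>
        for i, word in enumerate(words):
            symbols = [self.c0] + list(word)[: max_word_len - 1]
            for j, ch in enumerate(symbols):
                ids[i, j] = self.stoi.get(ch, self.stoi[self.unk])
        return ids  # (num_words, max_word_len)

vectorizer = CharVectorizer("абвгдежзийклмнопрстуфхцчшщъыьэюяё")
print(vectorizer.encode("стоимость нефти").shape)  # torch.Size([2, 20])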

Training and evaluation

All hyperparameter values in these runs are copied from the original code. Experiment runs are evaluated per noise type in terms of Word Recognition Accuracy on the test split.

The results table is shown below.

       PER    DEL    INS    SUB    W-PER  W-DEL  W-INS  W-SUB
WRA    0.998  0.976  0.987  0.974  0.998  0.956  0.987  0.965
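Here Word Recognition Accuracy is the share of words reconstructed exactly; a rough sketch of such a metric (my own code, not the Ignite metric used in the repo) could look like this:

# Rough sketch of Word Recognition Accuracy: the fraction of non-padding
# words whose predicted id matches the gold id exactly.
def word_recognition_accuracy(predicted, target, pad_id=0):
    # predicted, target: lists of lists of word ids, one inner list per sentence
    correct, total = 0, 0
    for pred_sent, gold_sent in zip(predicted, target):
        for p, g in zip(pred_sent, gold_sent):
            if g == pad_id:
                continue
            total += 1
            correct += int(p == g)
    return correct / max(total, 1)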

Checkpoints are not included in the repo because of their size. It's not that bad: you can train your own copy of MUDE easily. Also, I'm not a big fan of the huge number of arguments in the usual train.py script, so their number is kept overwhelmingly small here, sorry for that. To train on a chosen noise type:

python3 train.py -n "NOISE-TYPE"

PER checkpoint in action:

noised input (a Russian news article with within-word character permutations):

На всчрете с Птуиным Сичен сакзал, что в 2008 гдоу стотмосиь нфети в рбулях составялла поркдяа 1100 руб., сечйас — 1200 руб., при эотм траиф "Транснефти" на прочакку нтфеи в это же вермя ворыс с 822 до 2,1 тыс. руб. за тнноу на 100 км. Совеинтк перзидента "Трафсненти" Иогрь Димен овтетил на заявлеине гнавлого исполнитеньлого директроа "Росенфти" Иогря Счеина, который на вртсече с пзеридентом Воадимирлм Питуным во виорнтк, 12 мая, попсорил птчои вовде снзиить тафиры трубопрдвооной мпнооолии, поскокьлу, по его расаетчм, рхсаоды на тррнспоат счйеас сосватляют 32% от смоитости нтфеи. В резутьтале расхдоы на транспорт счйеас соютавляст 32% от стоитосми нтфеи, а это "чувствитлеьно", заюлкчил гвала "Росфенти".

model output (numbers are replaced by the NUM token during preprocessing):

На встрече с Путиным Сечин сказал, что в NUM году стоимость нефти в рублях составляла порядка NUM руб., сейчас — NUM руб., при этом тариф "Транснефти" на парковку нефти в это же время вырос с NUM до NUM, NUM тыс. руб. за тонну на NUM км. Советник президента "Транснефти" Игорь Днем ответил на заявление главного исполнительного директора "Роснефти" Игоря Сечина, который на встрече с президентом Владимиром Путиным во вторник, NUM мая, попросил почти вдвое снизить тарифы противовоздушной монополии, поскольку, по его расчетам, расходы на транспорт сейчас составляют NUM% от стоимости нефти. В результате расходы на транспорт сейчас составляют NUM% от стоимости нефти, а это "чувствовал", заключил глава "Роснефти".

To reproduce (given a trained PER checkpoint):

python3 correction-example.py

Project structure

data/ # contains the dataset and vocab
src/  # contains the MUDE model, dataset and vectorizer

train.py                # train model to solve for selected noise type
control.py              # visualize predictions on test set
correction-example.py   # to reproduce example above

Notes

  1. No sliding windows. Variable-length input with padding, packed with pack_padded_sequence for processing by the top recurrent unit (see the first sketch after this list);
  2. The c0 token used here to compute the representation of a character sequence is a separate symbol;
  3. All runs use a fixed β (the contribution of the seq2seq loss) and do not change its value during training, as opposed to the original idea of gradually reducing it;
  4. The SUB (substitution) type of noise is implemented as replacing a character with one randomly selected from its adjacent keys on the QWERTY layout (see the second sketch after this list);
  5. Most of the word prediction errors occur because of the vocabulary size problem. Such predictions usually have low scores. Is there any chance to build such a system with BPE/subword-unit vocabulary compression?
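A minimal sketch of the padding-plus-packing approach from note 1, with illustrative names and shapes:

import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Sketch of note 1: pad sentences to a common number of words, pack them so
# the top recurrent unit never sees padding, then unpad the outputs.
word_repr = torch.randn(3, 7, 128)      # (batch, max_words, features), already padded
lengths = torch.tensor([7, 5, 2])       # real number of words per sentence
rnn = torch.nn.GRU(128, 256, batch_first=True, bidirectional=True)

packed = pack_padded_sequence(word_repr, lengths, batch_first=True, enforce_sorted=False)
packed_out, _ = rnn(packed)
out, _ = pad_packed_sequence(packed_out, batch_first=True)   # (3, 7, 512)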
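And a sketch of the SUB noise from note 4, with a deliberately truncated QWERTY adjacency map (the repo builds its own, presumably complete, one):

import random

# Sketch of note 4: substitute one character with a random QWERTY neighbour.
# The adjacency map below is truncated for brevity.
QWERTY_NEIGHBOURS = {
    "q": "wa", "w": "qes", "e": "wrd", "r": "etf", "t": "ryg",
    "a": "qsz", "s": "awedxz", "d": "serfcx", "f": "drtgvc",
}

def sub_noise(word):
    if not word:
        return word
    i = random.randrange(len(word))
    neighbours = QWERTY_NEIGHBOURS.get(word[i], word[i])
    return word[:i] + random.choice(neighbours) + word[i + 1:]

print(sub_noise("test"))  # e.g. "tesr"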