
How to train your Bicleaner

(For Bicleaner v0.14 and above)

Intro

In this article we'll develop an example to illustrate the recommended way to train Bicleaner from scratch. Of course you can follow your own way, but let us unveil our secrets on Bicleaner training (trust us, we have done this a zillion times before).

If you still have questions or need clarification after reading this guide, please don't hesitate to open a new issue.

Let's assume you'd like to train a Bicleaner for English-Icelandic (en-is).

What you will need

  • A probabilistic dictionary for English->Icelandic (is_word en_word prob)
  • A probabilistic dictionary for Icelandic->English (en_word is_word prob)
  • Word frequencies file for English (freq word)
  • Word frequencies file for Icelandic (freq word)
  • A porn-annotated monolingual dataset (__label__negative/__label__positive sentences) for English (or Icelandic) (optional)
  • A training corpus (ideally, around 100K very clean en-is parallel sentences)

What you will get

  • An English-Icelandic classifier
  • A character model for English
  • A character model for Icelandic
  • A monolingual model of porn for English (or Icelandic)
  • A yaml file with metadata

If you already have all the ingredients (training corpus and dictionaries) beforehand, you won't need to do anything else before running the training command. If not, don't worry: below we'll show you how to get them all.

Data preparation

Starting point: a parallel corpus

Good news: You can build everything needed to train Bicleaner from a single parallel corpus.

  • If you don't have a corpus large enough to build probabilistic dictionaries (a few million lines), you can download smaller corpora from Opus and cat them to get a larger one.

  • If you have TMXs, you can convert them to plain text by using tmxt:

python3.7 tmxt/tmxt.py --codelist en,is smallcorpus.en-is.tmx smallcorpus.en-is.txt
  • If your corpus happens to be pre-tokenized (this sometimes happens when downloading from Opus), you need to detokenize it:
cut -f1 smallcorpus.en-is.txt > smallcorpus.en-is.en
cut -f2 smallcorpus.en-is.txt > smallcorpus.en-is.is
moses/tokenizer/detokenizer.perl -l en < smallcorpus.en-is.en > smallcorpus.en-is.detok.en
moses/tokenizer/detokenizer.perl -l is < smallcorpus.en-is.is > smallcorpus.en-is.detok.is
paste smallcorpus.en-is.detok.en smallcorpus.en-is.detok.is > smallcorpus.en-is
  • If you do not have enough sentences in your source or target languages, you can try translating from another language by using Apertium. For example, if you want to translate an English-Swedish corpus for English-Icelandic:
cut -f1 corpus.en-sv > corpus.en-sv.en
cut -f2 corpus.en-sv > corpus.en-sv.sv
cat corpus.en-sv.sv | apertium-destxt -i | apertium -f none -u swe-isl | apertium-retxt > corpus.en-is.is
paste corpus.en-sv.en corpus.en-is.is > corpus.en-is

Probabilistic dictionaries

For this, you need a parallel corpus of several million sentences. You want this corpus to have a broad vocabulary, so it is a good option to mix smaller corpora from different domains until you get around 10 million parallel sentences. Then, tokenize and lowercase the corpus:

cat bigcorpus.en-is | cut -f1 > bigcorpus.en-is.en
cat bigcorpus.en-is | cut -f2 > bigcorpus.en-is.is

moses/tokenizer/tokenizer.perl -l en -no-escape < bigcorpus.en-is.en > bigcorpus.en-is.tok.en
moses/tokenizer/tokenizer.perl -l is -no-escape < bigcorpus.en-is.is > bigcorpus.en-is.tok.is

sed 's/[[:upper:]]*/\L&/g' < bigcorpus.en-is.tok.en > bigcorpus.en-is.tok.low.en
sed 's/[[:upper:]]*/\L&/g' < bigcorpus.en-is.tok.is > bigcorpus.en-is.tok.low.is

mv bigcorpus.en-is.tok.low.en bigcorpus.en-is.clean.en
mv bigcorpus.en-is.tok.low.is bigcorpus.en-is.clean.is

And then, build the probabilistic dictionaries:

mosesdecoder/scripts/training/train-model.perl --alignment grow-diag-final-and --root-dir /your/working/directory  --corpus bigcorpus.en-is.clean -e en  -f is --mgiza -mgiza-cpus=16 --parallel --first-step 1 --last-step 4 --external-bin-dir /your/path/here/mgiza/mgizapp/bin/

Your probabilistic dictionaries should contain entries like these:

  • lex.e2f: Probability of an English word translating into a given Icelandic word. In this example, rediscover can be translated as enduruppgötva or verðskuldið with the same probability (0.5)
...
enduruppgötva rediscover 0.5000000
verðskuldið rediscover 0.5000000
...
  • lex.f2e: Probability of an Icelandic word translating into a given English word. In this example, rediscover can be the translation of enduruppgötva with a probability of 0.33, or the translation of verðskuldið with a probability of 0.125.
...
rediscover enduruppgötva 0.3333333
rediscover verðskuldið 0.1250000
...

At this point, you could just gzip your dictionaries:

gzip lex.e2f -c > dict-en.gz
gzip lex.f2e -c > dict-is.gz

and you'll have the two required probabilistic dictionaries, fully compatible with Bicleaner. However, we recommend pruning them to remove very uncommon dictionary entries (for example, those whose probability is more than 10 times lower than the highest one):

python3.7 bicleaner/utils/dict_pruner.py lex.e2f dict-en.gz -n 10 -g 
python3.7 bicleaner/utils/dict_pruner.py lex.f2e dict-is.gz -n 10 -g 

Please note that both target and source words in probabilistic bilingual dictionaries must be single words.
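If you are unsure whether your dictionaries meet this requirement, a quick sanity check (just a sketch over the gzipped dictionaries built above) is to look for lines that do not have exactly three fields (word, word, probability):

zcat dict-en.gz | awk 'NF != 3' | head
zcat dict-is.gz | awk 'NF != 3' | head

Any output means there are multi-word or otherwise malformed entries.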

Note also that the tokenization method you use for building probabilistic dictionaries must be the same one you use when running Bicleaner. By default, Bicleaner uses Sacremoses with the escape=False option, which is equivalent to the Moses tokenizer with the -no-escape flag shown in the example above. If any other tokenization method is used, it must be explicitly indicated to Bicleaner by using the -S SOURCE_TOKENIZER_COMMAND and -T TARGET_TOKENIZER_COMMAND flags.
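For example, if you preferred to keep using the Moses tokenizer from the dictionary-building step, you could pass the same commands explicitly (a sketch based on the classification command shown later in this guide; adjust the paths to your own checkout):

python3.7 bicleaner/bicleaner/bicleaner_classifier_full.py testcorpus.en-is testcorpus.en-is.classified en-is.yaml --scol 1 --tcol 2 \
    -S "moses/tokenizer/tokenizer.perl -l en -no-escape" \
    -T "moses/tokenizer/tokenizer.perl -l is -no-escape"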

There's still room for improving the probabilistic dictionaries by using the method proposed by @jgcb00.

Word frequency files

Two word frequency files are needed for training, one for English and another one for Icelandic. The format of these files is:

freq1 word1
freq2 word2
...

To build them, you just need to count the number of times each word appears in a corpus. For this, you need a big monolingual corpus for both the source and the target language. The simplest way is to reuse the bigcorpus.en-is from the previous step:

$ cut -f1 bigcorpus.en-is \
    | sacremoses -l en tokenize -x \
    | awk '{print tolower($0)}' \
    | tr ' ' '\n' \
    | LC_ALL=C sort | uniq -c \
    | LC_ALL=C sort -nr \
    | grep -v "^[[:space:]]*1[[:space:]]" \
    | gzip > wordfreq-en.gz
$ cut -f2 bigcorpus.en-is \
    | sacremoses -l is tokenize -x \
    | awk '{print tolower($0)}' \
    | tr ' ' '\n' \
    | LC_ALL=C sort | uniq -c \
    | LC_ALL=C sort -nr \
    | grep -v "^[[:space:]]*1[[:space:]]" \
    | gzip > wordfreq-is.gz

Remember to tokenize with the same method you use in the rest of the process!

Porn-annotated monolingual dataset

An optional feature of Bicleaner since version 0.14 is filtering out sentences containing porn. If you don't want to remove this kind of sentence, this dataset is not required.

In order to train this feature, a porn-annotated dataset of around 200K sentences must be provided. Each sentence must start with __label__negative or __label__positive, following the FastText convention, and should be lowercased and tokenized.
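For illustration, the resulting file should contain lines that look like this (both sentences here are just placeholders):

__label__negative this is a perfectly normal , lowercased and tokenized sentence .
__label__positive a lowercased , tokenized sentence containing porn vocabulary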

More elaborate strategies can be chosen but, for a naive approach, sentences containing "porny" words can be selected from the English side of your corpus simply by using grep (around 200K sentences is enough for training):

cat bigcorpus.en | grep -i -e "pornword1" -e "pornword2" ... -e "pornwordn"  \
                 | awk '{if (toupper($0) != tolower($0)) print tolower($0);}'  > positive.lower.en.txt

In the same fashion, "safe" negative examples can be extracted using inverse grep:

cat bigcorpus.en | grep -iv "pornword1" | grep -iv "pornword2" | ... | grep -iv "pornwordn" \
                 | awk '{if (toupper($0) != tolower($0)) print tolower($0);}'  > negative.lower.en.txt

(a small awk filter is added to discard sentences that do not contain any alphabetic characters)

Once you have obtained the positive and negative porn sentences, they need to be tokenized, and the label added:

cat positive.lower.en.txt | sacremoses -l en tokenize -x  \
                          | LC_ALL=C sort | uniq  \
                          | awk '{print "__label__positive "$0}'| shuf -n 200000 > positive.dedup.lower.tok.en.txt
cat negative.lower.en.txt | sacremoses -l en tokenize -x  \
                          | LC_ALL=C sort | uniq  \
                          | awk '{print "__label__negative "$0}' | shuf -n 200000 > negative.dedup.lower.tok.en.txt

Finally, they just need to be joined in a single file:

cat positive.dedup.lower.tok.en.txt negative.dedup.lower.tok.en.txt | shuf >  porn-annotated.txt.en

Training corpus

If you have a super clean parallel corpus, containing around 100K parallel sentences, you can skip this part. If not, you can build a cleaner corpus from a not-so-clean parallel corpus by using Bifixer and the Bicleaner Hardrules.

First, apply Bifixer:

python3.7 bifixer/bifixer/bifixer.py --scol 1 --tcol 2 --ignore_duplicates corpus.en-is corpus.en-is.bifixed en is

Then, apply the hardrules:

python3.7 bicleaner/bicleaner/bicleaner_hardrules.py corpus.en-is.bifixed corpus.en-is.annotated -s en -t is --scol 1 --tcol 2 --annotated_output --disable_lm_filter --disable_porn_removal

If any of your source or target languages is easily mistaken for other similar languages (for example, Norwegian and Danish, or Galician and Portuguese), you may need to use the --disable_lang_ident flag when running Hardrules. You can detect if this is happening by running:

cat corpus.en-is.annotated | awk -F'\t' '{print $4}' | sort | uniq -c | sort -nr

If language-related annotations are frequent (c_different_language, c_reliable_long_language(right, targetlang) and/or c_reliable_long_language(left, sourcelang)), you are probably experiencing this issue (and you really want to use the --disable_lang_ident flag for training as well).

Once you have an annotated version of your corpus, you can get the cleaner parallel sentences and use these as a training corpus (100K sentences is a good number):

cat corpus.en-is.annotated  | grep "keep$" |  shuf -n 100000 | cut -f1,2 > trainingcorpus.en-is

Train Bicleaner

The most commonly used command (and the one you probably want to use) is the following:

python3.7 bicleaner/bicleaner_train.py \
    trainingcorpus.en-is \
    --normalize_by_length \
    -s en -t is \
    -d dict-en.gz -D dict-is.gz \
    -b 1000 -c en-is.classifier \
    -f wordfreq-en.gz -F wordfreq-is.gz \
    -m en-is.yaml \
    --lm_training_file_sl lmtrain.en-is.en --lm_training_file_tl lmtrain.en-is.is \
    --lm_file_sl model.en-is.en  --lm_file_tl model.en-is.is \
    --porn_removal_train porn-annotated.txt.en  --porn_removal_file porn-model.en

Remember to check all the available options in the Readme and choose those that are most useful for you.

Tip: If you plan to distribute your 'language pack', you'll want to add the --relative_paths flag to use relative paths instead of absolute ones in the yaml file.

Bicleaning a corpus

At this point, you probably want to try your freshly trained Bicleaner to clean an actual corpus. Just run:

python3.7 bicleaner/bicleaner/bicleaner_classifier_full.py testcorpus.en-is testcorpus.en-is.classified en-is.yaml --scol 1 --tcol 2

After running Bicleaner, you'll have a new file (testcorpus.en-is.classified) containing the same content as the input file (testcorpus.en-is) plus an extra column. This new column contains the score given by the classifier to each pair of parallel sentences. If the score is 0, the pair was discarded by the Hardrules filter or the language model. If the score is above 0, the pair made it to the classifier, and the closer the score is to 1, the better the pair. For most languages (and distributed language packs), we consider a sentence pair very likely to be good when its score is above 0.5.
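For instance, assuming your test corpus only had the two text columns (so the score ends up in the last column), you could keep the pairs scoring 0.5 or higher with a simple awk filter (the output file name is just an example):

awk -F'\t' '$NF >= 0.5' testcorpus.en-is.classified | cut -f1,2 > testcorpus.en-is.filtered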

Software

Bicleaner

Bicleaner is a Python tool that aims at detecting noisy sentence pairs in a parallel corpus. It indicates the likelihood of a pair of sentences being mutual translations (with a value near 1) or not (with a value near 0). Sentence pairs considered very noisy are scored with 0.

Installation

Bicleaner works with Python 3.6+ and can be installed with pip:

python3.7 -m pip install bicleaner

You can also install it from GitHub:

git clone https://github.com/bitextor/bicleaner
cd bicleaner
python3.7 -m pip install -r requirements.txt

It also requires KenLM with support for 7-gram language models:

git clone https://github.com/kpu/kenlm
cd kenlm
python3.7 -m pip install . --install-option="--max_order 7"
mkdir -p build && cd build
cmake .. -DKENLM_MAX_ORDER=7 -DCMAKE_INSTALL_PREFIX:PATH=/your/prefix/path
make -j all install

Mgiza

Mgiza is a word alignment tool that we use to build probabilistic dictionaries.

Installation

git clone https://github.com/moses-smt/mgiza.git
cd mgiza/mgizapp
cmake .
make
make install
export PATH=$PATH:/your/path/here/mgiza/mgizapp/bin

Moses

Moses is a statistical machine translation system. We use it for tokenization and (together with Mgiza) for probabilistic dictionary building.

Installation

git clone https://github.com/moses-smt/mosesdecoder.git
cd mosesdecoder
./bjam -j32
cp /your/path/here/mgiza/experimental/alignment-enabled/MGIZA/scripts/merge_alignment.py /your/path/here/mgiza/mgizapp/bin/

KenLM

We use KenLM to build the character language models needed in Bicleaner.

Installation

git clone https://github.com/kpu/kenlm
cd kenlm
python3.7 -m pip install . --install-option="--max_order 7"
mkdir -p build && cd build
cmake .. -DKENLM_MAX_ORDER=7 -DCMAKE_INSTALL_PREFIX:PATH=/your/python/env/path/here/
make -j all install

tmxt

tmxt is a tool that extracts plain text parallel corpora from TMX files.

Installation

git clone http://github.com/sortiz/tmxt
python3.7 -m pip install -r tmxt/requirements.txt

Two tools are available: tmxplore.py, which determines the language codes available inside a TMX file, and tmxt.py, which transforms the TMX into a tab-separated text file.
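A typical workflow (the tmxplore.py invocation is a sketch; check the repository README for its exact arguments) would be:

python3.7 tmxt/tmxplore.py smallcorpus.en-is.tmx
python3.7 tmxt/tmxt.py --codelist en,is smallcorpus.en-is.tmx smallcorpus.en-is.txt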

Apertium

Apertium is a platform for developing rule-based machine translation systems. It can be useful to translate to a given language when you do not have enough parallel text.

Installation

In Ubuntu and other Debian-like operating systems:

wget http://apertium.projectjj.com/apt/install-nightly.sh
sudo bash install-nightly.sh
sudo apt-get update
sudo apt-get install apertium-LANGUAGE-PAIR

(choose the appropriate apertium-LANGUAGE-PAIR from the list shown by apt search apertium)
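For example, to look for pairs involving Icelandic (Apertium pairs may use two- or three-letter language codes, so double-check the search output):

apt search apertium 2>/dev/null | grep -i "isl"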

For other systems, please read Apertium documentation.

Bifixer

Bifixer is a tool that fixes bitexts and tags near-duplicates for removal. It's useful to fix errors in our training corpus.

Installation

git clone https://github.com/bitextor/bifixer.git
python3.7 -m pip install -r bifixer/requirements.txt