bfsujason/aligner-eval

Evaluation of Automatic Sentence Aligners

This repository contains the datasets and Python scripts used to evaluate automatic sentence aligners.

You can install AlignerEval and conduct the evaluation directly in a Google Colab notebook.

Systems Evaluated

Table 1 shows the sentence aligners being evaluated, where S and T denote the source and target texts respectively, Td is the translation of the source text produced with bilingual lexicons, and Tm is the machine translation of the source text. Se and Te are vector representations of the source and target texts computed with sentence-embedding models such as LASER and Sentence-Transformers.

Table 1. Systems evaluated and implementation details
System Type Input
Gale-Church Length-based S <=> T
Hunalign Dictionary-based Td <=> T
Bleualign MT-based Tm <=> T
Bleurtalign MT-based Tm <=> T
Vecalign Embedding-based Se <=> Te
Bertalign Embedding-based Se <=> Te

Evaluation Corpora

Both literary and non-literary corpora are used to evaluate the performance of available sentence aligners.

We use the script stats.py to compute the corpus statistics:

python utils/stats.py -i data/mac -o ./stats

Literary Corpora

Table 2. Summary of Literary Corpora
Corpus srcLang tgtLang #srcSents #tgtSents #srcTokens #tgtTokens #1-1 (%)
MAC-Test zh en 4,799 5,573 73,635 105,407 2,628 (59.8)
Bible en zh 30,000 42,687 714,048 524,340 15,665 (56.6)
MAC

The MAC corpus is a manually aligned corpus of Chinese-English literary texts. The sampling scheme for the corpus can be found in the metadata. Please refer to the GitHub repository for more details about corpus compilation.

The gold alignments are created manually using Intertext and then converted to source and target texts using the script intertext2txt.py.
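As an illustration, extracting the links from an Intertext file might look like the following sketch. This is hypothetical code, not the actual intertext2txt.py; it assumes the typical Intertext format, in which each `<link>` element carries a `type` attribute such as 1-1 and an `xtargets` attribute holding semicolon-separated source and target sentence IDs:

```python
# Minimal sketch of reading alignment links from an Intertext file.
# Assumes <link type="1-1" xtargets="src_ids;tgt_ids"> elements, where
# each side lists space-separated sentence IDs (an empty side = null link).
import xml.etree.ElementTree as ET

def parse_links(xml_text):
    """Return a list of (src_ids, tgt_ids, link_type) tuples."""
    root = ET.fromstring(xml_text)
    links = []
    for link in root.iter("link"):
        src, tgt = link.get("xtargets").split(";")
        links.append((src.split(), tgt.split(), link.get("type")))
    return links

sample = """<linkGrp>
  <link type="1-1" xtargets="s1;t1"/>
  <link type="2-1" xtargets="s2 s3;t2"/>
  <link type="0-1" xtargets=";t3"/>
</linkGrp>"""

for src, tgt, link_type in parse_links(sample):
    print(link_type, src, tgt)
```

From the parsed links, writing the source and target sides out in document order yields the sentence-per-line text files used by the aligners.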

The source and target directories contain the sentence-split and tokenized source texts, target texts, and machine translations of the source texts: Hunalign requires tokenized sentences for dictionary lookup, while Bleualign uses the MT output to compute BLEU similarity scores between source and target sentences.

We use the Moses sentence splitter and Stanford CoreNLP for English sentence splitting and tokenization, while pyltp and jieba are used to split and tokenize Chinese sentences. The machine translations of the source texts are generated by Google Translate.

Bible

The Bible corpus, consisting of 30,000 English and 42,687 Chinese sentences, is selected from the public multilingual Bible corpus. This corpus is mainly used to compare the running time of various aligners.

The directory makeup is similar to that of the MAC corpus, except that the gold alignments for the Bible corpus are generated automatically from the original verse-aligned Bible corpus.

To compare the sentence-based alignments returned by the aligners with the verse-based gold alignments, we record the verse ID of each sentence in the files src.verse and zh.verse; these IDs are used to merge consecutive sentences in the automatic alignments when they belong to the same verse.
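A simplified sketch of this verse-based merging (illustrative only, not the actual script; it assumes each alignment link is a pair of sentence-index lists and that the .verse files have been read into index-to-verse mappings):

```python
# Merge consecutive alignment links whose sentences share a verse ID.
# src_verse / tgt_verse map a sentence index to its verse ID.

def merge_by_verse(alignments, src_verse, tgt_verse):
    """alignments: list of (src_idx_list, tgt_idx_list) links."""
    merged = []
    for src, tgt in alignments:
        if merged:
            prev_src, prev_tgt = merged[-1]
            # Merge if either side continues the verse of the previous link.
            same_src = src and prev_src and src_verse[src[0]] == src_verse[prev_src[-1]]
            same_tgt = tgt and prev_tgt and tgt_verse[tgt[0]] == tgt_verse[prev_tgt[-1]]
            if same_src or same_tgt:
                merged[-1] = (prev_src + src, prev_tgt + tgt)
                continue
        merged.append((list(src), list(tgt)))
    return merged

# Two source sentences of verse "v1" split by the aligner are merged back.
src_verse = {0: "v1", 1: "v1", 2: "v2"}
tgt_verse = {0: "v1", 1: "v2"}
auto = [([0], [0]), ([1], []), ([2], [1])]
print(merge_by_verse(auto, src_verse, tgt_verse))
# → [([0, 1], [0]), ([2], [1])]
```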

Non-literary Corpus

Table 3. Summary of Non-literary Corpus
Corpus srcLang tgtLang #srcSents #tgtSents #srcTokens #tgtTokens #1-1 (%)
Academic Texts en zh 1,126 1,111 26,022 24,401 965 (90.8)
Political Texts zh en 1,037 1,346 23,929 34,075 770 (75.6)
Magazine Articles en zh 1,027 1,128 18,323 18,838 891 (88.3)

The non-literary corpus is made up of three sub-corpora: academic texts, political texts and magazine articles (see Table 3).

For all of the corpora above, the original bitexts were first split into sentences and then checked or aligned manually with the alignment tool Intertext. Please refer to the metadata of the academic texts, political texts and magazine articles for the specific titles of the selected source and target texts.

Experiments on the MAC corpus

The following experiments show evaluation results on MAC corpus with LaBSE embeddings. For experiments with LASER embeddings, please refer to the Google Colab notebook.

Installation

# Install faiss-gpu.
pip install faiss-gpu

# Install sentence-transformers.
pip install sentence-transformers

Embedding

# Generate source sentence embeddings.
python utils/overlap.py \
  -i data/mac/src \
  -o data/mac/src/overlap \
  -n 8

python utils/embed.py \
  -i data/mac/src/overlap \
  -o data/mac/src/overlap.labse.emb
# Generate target sentence embeddings.
python utils/overlap.py \
  -i data/mac/tgt \
  -o data/mac/tgt/overlap \
  -n 8

python utils/embed.py \
  -i data/mac/tgt/overlap \
  -o data/mac/tgt/overlap.labse.emb

Evaluation

# GaleChurch: length-based aligner.
python bin/gale_align.py \
  -s data/mac/src \
  -t data/mac/tgt \
  -o data/mac/auto

python utils/eval.py \
  -t data/mac/auto \
  -g data/mac/gold
# Hunalign: Dictionary-based aligner.
python bin/hunalign/hunalign.py \
  -s data/mac/src \
  -t data/mac/tgt \
  -o data/mac/auto \
  -m data/mac/meta_data.tsv

python utils/eval.py \
  -t data/mac/auto \
  -g data/mac/gold
# Bleualign: MT-based aligner using BLEU metric.
python bin/bleualign/bleualign.py \
  -s data/mac/src \
  -t data/mac/tgt \
  -o data/mac/auto \
  -m data/mac/meta_data.tsv

python utils/eval.py \
  -t data/mac/auto \
  -g data/mac/gold
# Bleurtalign: MT-based aligner using BLEURT metric.
python bin/bleualign/bleualign.py \
  -s data/mac/src \
  -t data/mac/tgt \
  -o data/mac/auto \
  -m data/mac/meta_data.tsv \
  --bleurt /content/bleurt/BLEURT-20

python utils/eval.py \
  -t data/mac/auto \
  -g data/mac/gold
# Vecalign with LaBSE embeddings.
python bin/vecalign/vecalign.py \
  -s data/mac/src \
  -t data/mac/tgt \
  -o data/mac/auto \
  -m labse -a 8 -v

python utils/eval.py \
  -t data/mac/auto \
  -g data/mac/gold
# Bertalign with LaBSE embeddings and modified cosine metric.
python bin/bertalign/bert_align.py \
  -s data/mac/src \
  -t data/mac/tgt \
  -o data/mac/auto \
  -m labse --max_align=8 --margin

python utils/eval.py \
  -t data/mac/auto \
  -g data/mac/gold

Results

Systems Precision Recall F1
Gale-Church 0.442 0.470 0.455
Hunalign 0.566 0.656 0.607
Bleualign 0.711 0.644 0.676
Bleurtalign 0.786 0.799 0.792
Vecalign 0.860 0.886 0.873
Bertalign 0.906 0.912 0.909
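The scores above are the standard link-level metrics: precision is the fraction of automatic links that also appear in the gold alignment, recall is the fraction of gold links recovered, and F1 is their harmonic mean. A minimal sketch of the computation (illustrative only; the actual eval.py may differ, e.g. in how it treats null links):

```python
# Link-level precision/recall/F1 for sentence alignment.
# Each link is a pair of tuples of sentence indices, e.g. ((1, 2), (1,)).

def prf(auto_links, gold_links):
    auto, gold = set(auto_links), set(gold_links)
    correct = len(auto & gold)  # links predicted exactly right
    p = correct / len(auto) if auto else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [((0,), (0,)), ((1, 2), (1,)), ((3,), (2,))]
auto = [((0,), (0,)), ((1,), (1,)), ((2,), ()), ((3,), (2,))]
p, r, f1 = prf(auto, gold)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Note that a partially correct link (here, pairing source sentence 1 with target sentence 1 instead of the gold 2-1 link) counts as wrong under this exact-match criterion.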

Visualization

You can run the Python script demo_vis.py to visualize Bertalign's two-step algorithm using Matplotlib:

python bin/bertalign/demo_vis.py \
   -s data/demo/demo.zh \
   -t data/demo/demo.en \
   --max_align=8 --margin

(Figure: demo_vis — visualization of Bertalign's two-step alignment)

In the first-pass alignment, Bertalign finds the 1-1 links that serve as approximate anchor points. The second-pass alignment limits the search path to the regions between these anchor points and extracts all the valid alignments with 1-to-many, many-to-1 or many-to-many relations between the source and target sentences.
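The way first-pass anchors constrain the second pass can be sketched as follows (a hypothetical simplification: the real algorithm scores candidate alignments with embedding similarity via dynamic programming, which is omitted here):

```python
# Given first-pass 1-1 anchors (src_idx, tgt_idx), the second pass only
# needs to search the rectangular blocks between consecutive anchors,
# rather than the full n_src x n_tgt alignment space.

def search_blocks(anchors, n_src, n_tgt):
    """Yield (src_range, tgt_range) blocks bounded by the anchor points."""
    points = [(-1, -1)] + sorted(anchors) + [(n_src, n_tgt)]
    for (s0, t0), (s1, t1) in zip(points, points[1:]):
        # Sentences strictly between two anchors form one search block;
        # the anchors themselves are already aligned.
        yield range(s0 + 1, s1), range(t0 + 1, t1)

# 8 source and 9 target sentences, with anchors (2, 1) and (5, 6).
for src_rng, tgt_rng in search_blocks([(2, 1), (5, 6)], 8, 9):
    print(list(src_rng), list(tgt_rng))
```

Because each block is much smaller than the full alignment matrix, the second pass stays tractable even for long documents.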

Experiments on the Bible corpus

The experiment settings are similar to those for the MAC corpus. Please see the Google Colab notebook for more information.

Results

We do not include Bleualign and Bleurtalign on the Bible corpus because they run out of memory when the document size increases to 25,000 sentences.

Systems Precision Recall F1
Gale-Church 0.561 0.574 0.567
Hunalign 0.804 0.832 0.818
Vecalign 0.957 0.958 0.957
Bertalign 0.974 0.973 0.973

Experiments on the non-literary corpus

The experiment settings are similar to those for the MAC corpus. Please see the Google Colab notebook for more information.

Results

Systems Precision Recall F1
Gale-Church 0.852 0.852 0.852
Hunalign 0.884 0.917 0.900
Bleualign 0.923 0.900 0.911
Bleurtalign 0.955 0.957 0.956
Vecalign 0.979 0.980 0.979
Bertalign 0.987 0.987 0.987
