bfsujason/aligner-eval

Evaluation of Automatic Sentence Aligners

This repository contains the datasets and Python scripts used to evaluate automatic sentence aligners.

You can install AlignerEval and conduct the evaluation directly in a Google Colab notebook.

Systems Evaluated

Table 1 shows the sentence aligners being evaluated, where S and T denote the source and target texts respectively, Td is the translation of the source text produced with bilingual lexicons, and Tm is the machine translation of the source text. Se and Te are vector representations of the source and target texts computed with sentence-embedding models such as LASER and Sentence-Transformers.

Table 1. Systems evaluated and implementation details
System Type Input
Gale-Church Length-based S <=> T
Hunalign Dictionary-based Td <=> T
Bleualign MT-based Tm <=> T
Bleurtalign MT-based Tm <=> T
Vecalign Embedding-based Se <=> Te
Bertalign Embedding-based Se <=> Te

Evaluation Corpora

Both literary and non-literary corpora are used to evaluate the performance of available sentence aligners.

We use the script stats.py to compute the corpus statistics:

python utils/stats.py -i data/mac -o ./stats

Literary Corpora

Table 2. Summary of Literary Corpora
Corpus srcLang tgtLang #srcSents #tgtSents #srcTokens #tgtTokens #1-1 (%)
MAC-Test zh en 4,799 5,573 73,635 105,407 2,628 (59.8)
Bible en zh 30,000 42,687 714,048 524,340 15,665 (56.6)
MAC

The MAC corpus is a manually aligned corpus of Chinese-English literary texts. The sampling scheme for the corpus can be found in the metadata. Please refer to the GitHub repository for more details about corpus compilation.

The gold alignments are created manually using Intertext and then converted to source and target texts using the script intertext2txt.py.
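As an illustration, extracting the links from an Intertext file might look like the following sketch. This is hypothetical code, not the actual intertext2txt.py; it assumes the typical Intertext format, in which each `<link>` element carries a `type` attribute such as 1-1 and an `xtargets` attribute holding semicolon-separated source and target sentence IDs:

```python
# Minimal sketch of reading alignment links from an Intertext file.
# Assumes <link type="1-1" xtargets="src_ids;tgt_ids"> elements, where
# each side lists space-separated sentence IDs (an empty side = null link).
import xml.etree.ElementTree as ET

def parse_links(xml_text):
    """Return a list of (src_ids, tgt_ids, link_type) tuples."""
    root = ET.fromstring(xml_text)
    links = []
    for link in root.iter("link"):
        src, tgt = link.get("xtargets").split(";")
        links.append((src.split(), tgt.split(), link.get("type")))
    return links

sample = """<linkGrp>
  <link type="1-1" xtargets="s1;t1"/>
  <link type="2-1" xtargets="s2 s3;t2"/>
  <link type="0-1" xtargets=";t3"/>
</linkGrp>"""

for src, tgt, link_type in parse_links(sample):
    print(link_type, src, tgt)
```

From the parsed links, writing the source and target sides out in document order yields the sentence-per-line text files used by the aligners.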

The source and target directories contain the sentence-split and tokenized source texts, target texts, and machine translations of the source texts: Hunalign requires tokenized sentences for dictionary lookup, while Bleualign uses the MT output to compute BLEU similarity scores between source and target sentences.

We use the Moses sentence splitter and Stanford CoreNLP for English sentence splitting and tokenization, while pyltp and jieba are used to split and tokenize Chinese sentences. The machine translations of the source texts are generated by Google Translate.

Bible

The Bible corpus, consisting of 30,000 English and 42,687 Chinese sentences, is selected from the public multilingual Bible corpus. This corpus is mainly used to compare the running time of various aligners.

The directory makeup is similar to that of the MAC corpus, except that the gold alignments for the Bible corpus are generated automatically from the original verse-aligned Bible corpus.

To compare the sentence-based alignments returned by the aligners with the verse-based gold alignments, we record the verse ID of each sentence in the files src.verse and zh.verse; these IDs are used to merge consecutive sentences in the automatic alignments when they belong to the same verse.
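A simplified sketch of this verse-based merging (illustrative only, not the actual script; it assumes each alignment link is a pair of sentence-index lists and that the .verse files have been read into index-to-verse mappings):

```python
# Merge consecutive alignment links whose sentences share a verse ID.
# src_verse / tgt_verse map a sentence index to its verse ID.

def merge_by_verse(alignments, src_verse, tgt_verse):
    """alignments: list of (src_idx_list, tgt_idx_list) links."""
    merged = []
    for src, tgt in alignments:
        if merged:
            prev_src, prev_tgt = merged[-1]
            # Merge if either side continues the verse of the previous link.
            same_src = src and prev_src and src_verse[src[0]] == src_verse[prev_src[-1]]
            same_tgt = tgt and prev_tgt and tgt_verse[tgt[0]] == tgt_verse[prev_tgt[-1]]
            if same_src or same_tgt:
                merged[-1] = (prev_src + src, prev_tgt + tgt)
                continue
        merged.append((list(src), list(tgt)))
    return merged

# Two source sentences of verse "v1" split by the aligner are merged back.
src_verse = {0: "v1", 1: "v1", 2: "v2"}
tgt_verse = {0: "v1", 1: "v2"}
auto = [([0], [0]), ([1], []), ([2], [1])]
print(merge_by_verse(auto, src_verse, tgt_verse))
# → [([0, 1], [0]), ([2], [1])]
```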

Non-literary Corpus

Table 3. Summary of Non-literary Corpus
Corpus srcLang tgtLang #srcSents #tgtSents #srcTokens #tgtTokens #1-1 (%)
Academic Texts en zh 1,126 1,111 26,022 24,401 965 (90.8)
Political Texts zh en 1,037 1,346 23,929 34,075 770 (75.6)
Magazine Articles en zh 1,027 1,128 18,323 18,838 891 (88.3)

The non-literary corpus is made up of three sub-corpora: academic texts, political texts and magazine articles (see Table 3).

For all of the corpora above, the original bitexts were first split into sentences and then checked or aligned manually with the alignment tool Intertext. Please refer to the metadata of the academic texts, political texts and magazine articles for the specific titles of the selected source and target texts.

Experiments on the MAC corpus

The following experiments show evaluation results on MAC corpus with LaBSE embeddings. For experiments with LASER embeddings, please refer to the Google Colab notebook.

Installation

# Install faiss-gpu.
pip install faiss-gpu

# Install sentence-transformers.
pip install sentence-transformers

Embedding

# Generate source sentence embeddings.
python utils/overlap.py \
  -i data/mac/src \
  -o data/mac/src/overlap \
  -n 8

python utils/embed.py \
  -i data/mac/src/overlap \
  -o data/mac/src/overlap.labse.emb
# Generate target sentence embeddings.
python utils/overlap.py \
  -i data/mac/tgt \
  -o data/mac/tgt/overlap \
  -n 8

python utils/embed.py \
  -i data/mac/tgt/overlap \
  -o data/mac/tgt/overlap.labse.emb

Evaluation

# GaleChurch: length-based aligner.
python bin/gale_align.py \
  -s data/mac/src \
  -t data/mac/tgt \
  -o data/mac/auto

python utils/eval.py \
  -t data/mac/auto \
  -g data/mac/gold
# Hunalign: Dictionary-based aligner.
python bin/hunalign/hunalign.py \
  -s data/mac/src \
  -t data/mac/tgt \
  -o data/mac/auto \
  -m data/mac/meta_data.tsv

python utils/eval.py \
  -t data/mac/auto \
  -g data/mac/gold
# Bleualign: MT-based aligner using BLEU metric.
python bin/bleualign/bleualign.py \
  -s data/mac/src \
  -t data/mac/tgt \
  -o data/mac/auto \
  -m data/mac/meta_data.tsv

python utils/eval.py \
  -t data/mac/auto \
  -g data/mac/gold
# Bleurtalign: MT-based aligner using BLEURT metric.
python bin/bleualign/bleualign.py \
  -s data/mac/src \
  -t data/mac/tgt \
  -o data/mac/auto \
  -m data/mac/meta_data.tsv \
  --bleurt /content/bleurt/BLEURT-20

python utils/eval.py \
  -t data/mac/auto \
  -g data/mac/gold
# Vecalign with LaBSE embeddings.
python bin/vecalign/vecalign.py \
  -s data/mac/src \
  -t data/mac/tgt \
  -o data/mac/auto \
  -m labse -a 8 -v

python utils/eval.py \
  -t data/mac/auto \
  -g data/mac/gold
# Bertalign with LaBSE embeddings and modified cosine metric.
python bin/bertalign/bert_align.py \
  -s data/mac/src \
  -t data/mac/tgt \
  -o data/mac/auto \
  -m labse --max_align=8 --margin

python utils/eval.py \
  -t data/mac/auto \
  -g data/mac/gold

Results

Systems Precision Recall F1
Gale-Church 0.442 0.470 0.455
Hunalign 0.566 0.656 0.607
Bleualign 0.711 0.644 0.676
Bleurtalign 0.786 0.799 0.792
Vecalign 0.860 0.886 0.873
Bertalign 0.906 0.912 0.909
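The scores above are the standard link-level metrics: precision is the fraction of automatic links that also appear in the gold alignment, recall is the fraction of gold links recovered, and F1 is their harmonic mean. A minimal sketch of the computation (illustrative only; the actual eval.py may differ, e.g. in how it treats null links):

```python
# Link-level precision/recall/F1 for sentence alignment.
# Each link is a pair of tuples of sentence indices, e.g. ((1, 2), (1,)).

def prf(auto_links, gold_links):
    auto, gold = set(auto_links), set(gold_links)
    correct = len(auto & gold)  # links predicted exactly right
    p = correct / len(auto) if auto else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [((0,), (0,)), ((1, 2), (1,)), ((3,), (2,))]
auto = [((0,), (0,)), ((1,), (1,)), ((2,), ()), ((3,), (2,))]
p, r, f1 = prf(auto, gold)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Note that a partially correct link (here, pairing source sentence 1 with target sentence 1 instead of the gold 2-1 link) counts as wrong under this exact-match criterion.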

Visualization

You can run the Python script demo_vis.py to visualize Bertalign's two-step algorithm using Matplotlib:

python bin/bertalign/demo_vis.py \
   -s data/demo/demo.zh \
   -t data/demo/demo.en \
   --max_align=8 --margin

(Figure: demo_vis — visualization of Bertalign's two-step alignment)

In the first-pass alignment, Bertalign finds the 1-1 links that serve as approximate anchor points. The second-pass alignment limits the search path to the regions between these anchor points and extracts all the valid alignments with 1-to-many, many-to-1 or many-to-many relations between the source and target sentences.
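The way first-pass anchors constrain the second pass can be sketched as follows (a hypothetical simplification: the real algorithm scores candidate alignments with embedding similarity via dynamic programming, which is omitted here):

```python
# Given first-pass 1-1 anchors (src_idx, tgt_idx), the second pass only
# needs to search the rectangular blocks between consecutive anchors,
# rather than the full n_src x n_tgt alignment space.

def search_blocks(anchors, n_src, n_tgt):
    """Yield (src_range, tgt_range) blocks bounded by the anchor points."""
    points = [(-1, -1)] + sorted(anchors) + [(n_src, n_tgt)]
    for (s0, t0), (s1, t1) in zip(points, points[1:]):
        # Sentences strictly between two anchors form one search block;
        # the anchors themselves are already aligned.
        yield range(s0 + 1, s1), range(t0 + 1, t1)

# 8 source and 9 target sentences, with anchors (2, 1) and (5, 6).
for src_rng, tgt_rng in search_blocks([(2, 1), (5, 6)], 8, 9):
    print(list(src_rng), list(tgt_rng))
```

Because each block is much smaller than the full alignment matrix, the second pass stays tractable even for long documents.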

Experiments on the Bible corpus

The experiment settings are similar to those for the MAC corpus. Please see the Google Colab notebook for more information.

Results

We do not include Bleualign and Bleurtalign on the Bible corpus because they run out of memory when the document size increases to 25,000 sentences.

Systems Precision Recall F1
Gale-Church 0.561 0.574 0.567
Hunalign 0.804 0.832 0.818
Vecalign 0.957 0.958 0.957
Bertalign 0.974 0.973 0.973

Experiments on the non-literary corpus

The experiment settings are similar to those for the MAC corpus. Please see the Google Colab notebook for more information.

Results

Systems Precision Recall F1
Gale-Church 0.852 0.852 0.852
Hunalign 0.884 0.917 0.900
Bleualign 0.923 0.900 0.911
Bleurtalign 0.955 0.957 0.956
Vecalign 0.979 0.980 0.979
Bertalign 0.987 0.987 0.987
