Unable to reproduce the PhoMT results with the HuggingFace Model #2

Closed
justinphan3110 opened this issue Oct 10, 2022 · 5 comments

justinphan3110 commented Oct 10, 2022

Hi, I'm trying to reproduce the En2Vi result reported in the paper on the PhoMT test set.
I used the generation settings shown in the example:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = 'vinai/vinai-translate-en2vi'

tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="en_XX")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
.....
# Decoding with top-k/top-p sampling
outputs = model.generate(
    input_ids=batch['input_ids'].to('cuda'),
    max_length=max_target_length,
    do_sample=True,
    top_k=100,
    top_p=0.8,
    decoder_start_token_id=tokenizer.lang_code_to_id["vi_VN"],
    num_return_sequences=1,
)

Yet the test-set score I got from the HuggingFace model was around 42.2 (the result reported in the paper is 44.29).

Do you plan to release the evaluation code/pipeline to reproduce the results reported in the paper?

datquocnguyen (Collaborator) commented:

Are you using sacreBLEU?

justinphan3110 (Author) commented:

I'm using the sacreBLEU metric from HuggingFace. Is it different from the sacreBLEU you used in the paper? If so, could you share the command line you used with sacreBLEU?

datquocnguyen (Collaborator) commented Oct 10, 2022

Our training and inference stages (example below) were originally performed with fairseq. We then computed the detokenized, case-sensitive BLEU score using SacreBLEU (with the signature “BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.5.1”).
The HuggingFace models are variants converted from our original fairseq checkpoints, so at the moment I am not sure what causes the score difference between the two libraries.

SOURCE_LANG=vi_VN
TARGET_LANG=en_XX
LANGS=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN

fairseq-generate $DATA_DIR \
   --path $MODEL_DIR/checkpoint_best.pt \
   --task translation_from_pretrained_bart \
   --gen-subset valid \
   -t $TARGET_LANG -s $SOURCE_LANG \
   --bpe 'sentencepiece' --sentencepiece-model $MODEL_DIR/sentence.bpe.model \
   --sacrebleu --remove-bpe 'sentencepiece' \
   --batch-size 32 --langs $LANGS > vi_en

cp $SOURCE_DATA_DIR/val_tourism_finance.en_XX vi_en.ref
#cp $SOURCE_DATA_DIR/test_tourism.en_XX vi_en.ref

# Extract the hypotheses from the fairseq-generate output
cat vi_en | grep -P "^H" | sort -V | cut -f 3- | sed 's/\[en_XX\]//g' > vi_en.hyp
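
The final scoring step isn't shown above; here is a minimal sketch using the sacrebleu Python API, assuming the vi_en.hyp and vi_en.ref files produced by the pipeline (the library defaults match the tok.13a, mixed-case, exp-smoothing signature quoted earlier):

import sacrebleu

# Hypotheses and references written by the pipeline above.
with open("vi_en.hyp") as f:
    hyps = [line.strip() for line in f]
with open("vi_en.ref") as f:
    refs = [line.strip() for line in f]

# corpus_bleu takes a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(bleu.score)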

datquocnguyen (Collaborator) commented Oct 30, 2023

@justinphan3110 I just had a bit of time to redo the evaluation. Using the simple script below, you'd obtain a sacreBLEU score of 44.2.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer_en2vi = AutoTokenizer.from_pretrained(
    "vinai/vinai-translate-en2vi", src_lang="en_XX"
)
model_en2vi = AutoModelForSeq2SeqLM.from_pretrained("vinai/vinai-translate-en2vi")
device_en2vi = torch.device("cuda")
model_en2vi.to(device_en2vi)

def translate_en2vi(en_texts: list) -> list:
    # Takes a batch (list) of English sentences; returns their Vietnamese translations.
    inputs = tokenizer_en2vi(en_texts, padding=True, return_tensors="pt").to(
        device_en2vi
    )
    output_ids = model_en2vi.generate(
        **inputs,
        decoder_start_token_id=tokenizer_en2vi.lang_code_to_id["vi_VN"],
        num_return_sequences=1,
        num_beams=5,
        early_stopping=True
    )
    vi_texts = tokenizer_en2vi.batch_decode(output_ids, skip_special_tokens=True)
    return vi_texts

with open("PhoMT-detokenization-test/test.en", "r") as input_file:
    lines = [line.strip() for line in input_file.readlines()]
    index = 0
    writer = open("PhoMT-detokenization-test/test.vi_generated.v1", "w")
    # Translate in batches of 8 sentences.
    while index < len(lines):
        texts = lines[index : index + 8]
        outputs = translate_en2vi(texts)
        print(outputs)
        for output in outputs:
            writer.write(output.strip() + "\n")
        index = index + 8
    writer.close()
    
import evaluate
references = [[line.strip()] for line in open("PhoMT-detokenization-test/test.vi", "r").readlines()]
predictions = [
    line.strip() for line in open("PhoMT-detokenization-test/test.vi_generated.v1", "r").readlines()
]
sacrebleu = evaluate.load("sacrebleu")
results = sacrebleu.compute(predictions=predictions, references=references)
print(results)
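
Note the decoding difference from the snippet in the opening post: this script uses deterministic beam search (num_beams=5) rather than top-k/top-p sampling, which presumably accounts for the gap between 42.2 and 44.2.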

datquocnguyen (Collaborator) commented:

Evaluation for VietAI/envit5-translation:

import torch
import evaluate  # used for the sacreBLEU scoring below
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("VietAI/envit5-translation")
model = AutoModelForSeq2SeqLM.from_pretrained("VietAI/envit5-translation")
device = torch.device("cuda")
model.to(device)


def translate(texts: list) -> list:
    # Takes a batch (list) of sentences prefixed with "en: " or "vi: ";
    # returns translations prefixed with the matching target-language tag.
    inputs = tokenizer(texts, padding=True, return_tensors="pt").to(device)
    output_ids = model.generate(
        **inputs,
        num_return_sequences=1,
        num_beams=5,
        early_stopping=True,
        max_length=512
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)


with open("PhoMT-detokenization-test/test.vi", "r") as input_file:
    lines = ["vi: " + line.strip() for line in input_file.readlines()]
    index = 0
    writer = open("PhoMT-detokenization-test/test.en_generated.vietai", "w")
    while index < len(lines):
        texts = lines[index : index + 8]
        outputs = translate(texts)
        print(outputs)
        for output in outputs:
            # Strip the leading "en: " tag (4 characters) before writing.
            writer.write(output[4:].strip() + "\n")

        index = index + 8

    writer.close()

with open("PhoMT-detokenization-test/test.en", "r") as input_file:
    lines = ["en: " + line.strip() for line in input_file.readlines()]
    index = 0
    writer = open("PhoMT-detokenization-test/test.vi_generated.vietai", "w")
    while index < len(lines):
        texts = lines[index : index + 8]
        outputs = translate(texts)
        print(outputs)
        for output in outputs:
            # Strip the leading "vi: " tag before writing.
            writer.write(output[4:].strip() + "\n")

        index = index + 8

    writer.close()
    
# Score vi -> en.
references = [
    [line.strip()]
    for line in open("PhoMT-detokenization-test/test.en", "r").readlines()
]
predictions = [
    line.strip()
    for line in open(
        "PhoMT-detokenization-test/test.en_generated.vietai", "r"
    ).readlines()
]
sacrebleu = evaluate.load("sacrebleu")
results = sacrebleu.compute(predictions=predictions, references=references)
print(results)

# Score en -> vi.
references = [
    [line.strip()]
    for line in open("PhoMT-detokenization-test/test.vi", "r").readlines()
]
predictions = [
    line.strip()
    for line in open(
        "PhoMT-detokenization-test/test.vi_generated.vietai", "r"
    ).readlines()
]
sacrebleu = evaluate.load("sacrebleu")
results = sacrebleu.compute(predictions=predictions, references=references)
print(results)
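
As a design note: unlike the mBART-based vinai-translate models, which select the target language via decoder_start_token_id, envit5 is a T5-style text-to-text model, so the translation direction is controlled entirely by the "en: "/"vi: " text prefixes, and each output carries the matching target tag, hence the output[4:] slicing above.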
