
# SimulSeamless

ACL Anthology

Code for the paper: "SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech Translation" published at IWSLT 2024.

## 📎 Requirements

To run the agent, please make sure that SimulEval v1.1.0 and HuggingFace Transformers are installed.

For 💬 Inference using docker, use SimulEval commit f1f5b9a69a47496630aa43605f1bd46e5484a2f4.
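If you are setting up from scratch, a minimal sketch follows (the clone location and the v1.1.0 tag name are assumptions; adapt to your environment):

```bash
# Install SimulEval from source and the HuggingFace libraries
git clone https://github.com/facebookresearch/SimulEval.git
cd SimulEval
git checkout v1.1.0  # or f1f5b9a69a47496630aa43605f1bd46e5484a2f4 for the docker setup
pip install -e .
pip install transformers
```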

## 🤖 Inference using your environment

Please set --source and --target as described in the Fairseq Simultaneous Translation repository: ${LIST_OF_AUDIO} is the list of audio paths, and ${TGT_FILE} contains the segment-wise references in the target language.
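For illustration, both are plain-text files with one entry per line, aligned with each other (the paths and sentences below are invented):

```bash
# Illustrative contents; paths and references are made up
$ cat ${LIST_OF_AUDIO}
/data/iwslt2024/sample_0001.wav
/data/iwslt2024/sample_0002.wav

$ cat ${TGT_FILE}
Das ist ein Beispielsatz.
Hier ist ein weiterer Satz.
```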

Set ${TGT_LANG} to the 3-character target language code. The list of supported language codes is available here. No language code has to be specified for the source language.

Depending on the target language, set ${LATENCY_UNIT} to either word (e.g., for German) or char (e.g., for Japanese), and ${BLEU_TOKENIZER} to either 13a (the standard sacreBLEU tokenizer, used, for example, to evaluate German) or char (to evaluate character-level languages such as Chinese or Japanese).

The simultaneous inference of SimulSeamless is based on AlignAtt, thus the f parameter (${FRAME}) and the layer from which to extract the attention scores (${LAYER}) have to be set accordingly.

### Instructions to replicate the IWSLT 2024 results

To replicate the results at 2 seconds of latency (measured by AL) on the test sets of the IWSLT 2024 Simultaneous track, use the following values:

- en-de: ${TGT_LANG}=deu, ${FRAME}=6, ${LAYER}=3, ${SEG_SIZE}=1000
- en-ja: ${TGT_LANG}=jpn, ${FRAME}=1, ${LAYER}=0, ${SEG_SIZE}=400
- en-zh: ${TGT_LANG}=cmn, ${FRAME}=1, ${LAYER}=3, ${SEG_SIZE}=800
- cs-en: ${TGT_LANG}=eng, ${FRAME}=9, ${LAYER}=3, ${SEG_SIZE}=1000

❗️ Note that ${FRAME} can be adjusted to achieve lower or higher latency.
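For example, the en-de configuration above can be exported as environment variables before launching the command below (word and 13a follow the guidance for German given earlier):

```bash
# en-de values from the list above; lower/raise FRAME to trade quality for latency
export TGT_LANG=deu
export FRAME=6
export LAYER=3
export SEG_SIZE=1000
export LATENCY_UNIT=word    # word-level latency for German
export BLEU_TOKENIZER=13a   # standard sacreBLEU tokenizer for German
```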

SimulSeamless can then be run with:

```bash
simuleval \
    --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_alignatt_seamlessm4t.AlignAttSeamlessS2T \
    --source ${LIST_OF_AUDIO} \
    --target ${TGT_FILE} \
    --data-bin ${DATA_ROOT} \
    --model-size medium --target-language ${TGT_LANG} \
    --extract-attn-from-layer ${LAYER} --num-beams 5 \
    --frame-num ${FRAME} \
    --source-segment-size ${SEG_SIZE} \
    --quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware \
    --eval-latency-unit ${LATENCY_UNIT} --sacrebleu-tokenizer ${BLEU_TOKENIZER} \
    --output ${OUT_DIR} \
    --device cuda:0
```

If not already stored in your system, the SeamlessM4T model will be downloaded automatically when running the script. The output will be saved in ${OUT_DIR}.
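Since the checkpoint is fetched through HuggingFace Transformers, it lands in the standard HuggingFace cache; if disk space is a concern, the cache location can be redirected (the path below is a placeholder):

```bash
# Redirect the HuggingFace cache before running simuleval (placeholder path)
export HF_HOME=/path/to/hf_cache
```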

We suggest running the inference on a GPU to speed up the process, but the system can run on any device (e.g., CPU, by passing --device cpu) supported by SimulEval and HuggingFace.

## 💬 Inference using docker

To run SimulSeamless using docker, as required by the IWSLT 2024 Simultaneous track, follow the steps below:

1. Download the docker file simulseamless.tar
2. Load the docker image:
   ```bash
   docker load -i simulseamless.tar
   ```
3. Start the SimulEval standalone with GPU enabled:
   ```bash
   docker run -e TGTLANG=${TGT_LANG} -e FRAME=${FRAME} -e LAYER=${LAYER} \
       -e BLEU_TOKENIZER=${BLEU_TOKENIZER} -e LATENCY_UNIT=${LATENCY_UNIT} \
       -e DEV=cuda:0 --gpus all --shm-size 32G \
       -p 2024:2024 simulseamless:latest
   ```
4. Start the remote evaluation with:
   ```bash
   simuleval \
       --remote-eval --remote-port 2024 \
       --source ${LIST_OF_AUDIO} --target ${TGT_FILE} \
       --source-type speech --target-type text \
       --source-segment-size ${SEG_SIZE} \
       --eval-latency-unit ${LATENCY_UNIT} --sacrebleu-tokenizer ${BLEU_TOKENIZER} \
       --output ${OUT_DIR}
   ```
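Before starting the remote evaluation, you can optionally verify that the container is up and listening; these are generic docker commands, not part of the original instructions:

```bash
# Check that the SimulSeamless container is running and inspect its logs
docker ps --filter ancestor=simulseamless:latest
docker logs -f $(docker ps -q --filter ancestor=simulseamless:latest)
```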

To set ${TGT_LANG}, ${FRAME}, ${LAYER}, ${BLEU_TOKENIZER}, ${LATENCY_UNIT}, ${LIST_OF_AUDIO}, ${TGT_FILE}, ${SEG_SIZE}, and ${OUT_DIR}, refer to 🤖 Inference using your environment.

### Instructions to recreate the docker image

To recreate the docker image, follow the steps below.

1. Download SimulEval and this repository.
2. Create a Dockerfile with the following content:
   ```dockerfile
   FROM python:3.9
   RUN pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
   ADD /SimulEval /SimulEval
   WORKDIR /SimulEval
   RUN pip install -e .
   WORKDIR ../
   ADD /fbk-fairseq /fbk-fairseq
   WORKDIR /fbk-fairseq
   RUN pip install -e .
   RUN pip install -r speech_requirements.txt
   WORKDIR ../
   RUN pip install sentencepiece
   RUN pip install transformers

   ENTRYPOINT simuleval --standalone --remote-port 2024 \
           --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_alignatt_seamlessm4t.AlignAttSeamlessS2T \
           --model-size medium --num-beams 5 --user-dir fbk-fairseq/examples \
           --target-language $TGTLANG --frame-num $FRAME --extract-attn-from-layer $LAYER --device $DEV \
           --sacrebleu-tokenizer ${BLEU_TOKENIZER} --eval-latency-unit ${LATENCY_UNIT}
   ```
3. Build the docker image:
   ```bash
   docker build -t simulseamless .
   ```
4. Save the docker image:
   ```bash
   docker save -o simulseamless.tar simulseamless:latest
   ```
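As a final sanity check (not required by the track), the archive can be reloaded to confirm it was saved correctly:

```bash
# Reload the saved archive and confirm the image is listed
docker load -i simulseamless.tar
docker images simulseamless
```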

## 📍Citation

```bibtex
@inproceedings{papi-et-al-2024-simulseamless,
    title = "SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech Translation",
    author = {Papi, Sara and Gaido, Marco and Negri, Matteo and Bentivogli, Luisa},
    booktitle = "Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT)",
    year = "2024",
    address = "Bangkok, Thailand",
}
```