# How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender Translation (CLiC-it 2023)
Instructions to reproduce the paper "How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender Translation".
Download all the corpora listed in our paper and preprocess them as explained here.
The models of the paper have been trained with the following scripts.
All the scripts below assume 4 GPUs with at least 16GB of VRAM each. On different hardware, you may need to adjust `--max-tokens` (e.g., lower it if you have less VRAM) and `--update-freq` so that the product `num_gpus * max_tokens * update_freq` remains the same.
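As an illustrative check of that invariant (the numbers below are taken from the training commands in this README; the smaller-GPU setting is a hypothetical example), halving the per-GPU `--max-tokens` can be compensated by doubling `--update-freq`:

```python
def effective_batch(num_gpus, max_tokens, update_freq):
    # Effective number of tokens consumed per optimizer step.
    return num_gpus * max_tokens * update_freq

reference = effective_batch(4, 10000, 8)    # setting used in the scripts below
smaller_gpu = effective_batch(4, 5000, 16)  # e.g., GPUs with less VRAM
assert reference == smaller_gpu == 320000
```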
To train multi-gender models, you first need to edit the YAML config file generated by the preprocessing script, so as to have:
```yaml
audio_root: $YOUR_AUDIO_ROOT_DIR
bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: $YOUR_TGTLANG_SENTENCEPIECE_MODEL
bpe_tokenizer_src:
  bpe: sentencepiece
  sentencepiece_model: $YOUR_ENGLISH_SENTENCEPIECE_MODEL
input_channels: 1
input_feat_per_channel: 80
sampling_alpha: 1.0
prepend_tgt_lang_tag: True
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
  time_wrap_W: 0
transforms:
  '*':
  - utterance_cmvn
  _train:
  - utterance_cmvn
  - specaugment
vocab_filename: $YOUR_TGTLANG_SENTENCEPIECE_TOKENS_TXT
vocab_filename_src: $YOUR_ENGLISH_SENTENCEPIECE_TOKENS_TXT
```
which we name `config_st_mix_multigender.yaml` hereinafter. Mind the `prepend_tgt_lang_tag: True`.
Your SentencePiece models should contain tags for the two genders as the special tokens `<lang:He>` and `<lang:She>`. In addition, the TSV you have obtained from the preprocessing of your data must be enriched with a `tgt_lang` column containing either `He` or `She` according to the gender of the speaker (in the following, we assume the TSV is named `train_st_src_gender_multilang.tsv`).
To know the gender of each speaker, please refer to MuST-Speakers.
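A minimal sketch of that enrichment step, assuming a tab-separated file with a `speaker` column and a hypothetical speaker-to-gender mapping (in practice derived from MuST-Speakers):

```python
import csv
import io

# Hypothetical mapping from speaker IDs to gender tags.
speaker_gender = {"spk_001": "He", "spk_002": "She"}

def add_tgt_lang(tsv_text, gender_map):
    """Return the TSV text with an extra tgt_lang column (He/She per speaker)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        row["tgt_lang"] = gender_map[row["speaker"]]
        rows.append(row)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames + ["tgt_lang"], delimiter="\t"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

tsv = "id\tspeaker\ttgt_text\nutt1\tspk_001\thello\nutt2\tspk_002\tworld\n"
enriched = add_tgt_lang(tsv, speaker_gender)
```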
Then, train multi-gender models with the following command:
```bash
python train.py ${DATA_ROOT} \
    --train-subset train_st_src_gender_multilang \
    --valid-subset dev_with_gender_lang \
    --save-dir ${ST_SAVE_DIR} \
    --num-workers 5 --max-update 50000 \
    --max-tokens 10000 --adam-betas '(0.9, 0.98)' \
    --user-dir examples/speech_to_text \
    --task speech_to_text_aux_classification --config-yaml config_st_mix_multigender.yaml \
    --ignore-prefix-size 1 \
    --criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --arch conformer \
    --ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
    --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
    --warmup-updates 25000 \
    --clip-norm 10.0 \
    --seed 1 --update-freq 8 \
    --skip-invalid-size-inputs-valid-test \
    --log-format simple >> ${ST_SAVE_DIR}/train.log 2> ${ST_SAVE_DIR}/train.err
```
To obtain a multi-gender model that is fine-tuned from the base ST one, add `--allow-extra-tokens --finetune-from-model $BASE_ST_MODEL_CHECKPOINT` to the training command above, change the learning rate to `5e-4`, and the `--lr-scheduler` to `fixed`.
To train the multi-gender models with the auxiliary gender classification task, first add the following lines to the YAML config file, so as to obtain `config_st_mix_multigender_with_aux.yaml`:

```yaml
aux_classes:
- He
- She
```

Then, you need to duplicate the `tgt_lang` column in the TSV files, naming the new column `auxiliary_target`.
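A minimal sketch of that duplication, assuming the TSV already contains the `tgt_lang` column:

```python
def duplicate_tgt_lang(lines):
    """Append an auxiliary_target column that mirrors tgt_lang."""
    header = lines[0].split("\t")
    idx = header.index("tgt_lang")
    out = ["\t".join(header + ["auxiliary_target"])]
    for line in lines[1:]:
        fields = line.split("\t")
        out.append("\t".join(fields + [fields[idx]]))
    return out

rows = ["id\ttgt_text\ttgt_lang", "utt1\thello\tHe", "utt2\tworld\tShe"]
enriched = duplicate_tgt_lang(rows)
```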
The training can be executed with the following script:
```bash
python train.py ${DATA_ROOT} \
    --train-subset train_st_src_gender_multilang \
    --valid-subset dev_with_gender_lang \
    --save-dir ${ST_SAVE_DIR} \
    --num-workers 5 --max-update 50000 --keep-last-epochs 10 \
    --max-tokens 10000 --adam-betas '(0.9, 0.98)' \
    --user-dir examples/speech_to_text \
    --task speech_to_text_aux_classification --config-yaml config_st_mix_multigender_with_aux.yaml \
    --ignore-prefix-size 1 \
    --criterion ctc_multi_loss --underlying-criterion cross_entropy_multi_task --label-smoothing 0.1 \
    --arch multitask_conformer --reverted-classifier --auxiliary-loss-weight 0.5 --reverted-lambda 0.5 \
    --ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
    --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
    --warmup-updates 25000 \
    --clip-norm 10.0 \
    --seed 1 --update-freq 8 \
    --skip-invalid-size-inputs-valid-test \
    --log-format simple >> ${ST_SAVE_DIR}/train.log 2> ${ST_SAVE_DIR}/train.err
```
To obtain the weighted variant, add `--auxiliary-loss-class-weights 0.8 1.4` to the command above.
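One common recipe for such per-class weights is inverse class frequency normalized to mean 1; the sketch below uses hypothetical class counts, and the `0.8 1.4` values above are taken as given (the paper may have chosen them differently):

```python
def inverse_frequency_weights(counts):
    """Class weights inversely proportional to frequency, normalized to mean 1."""
    total = sum(counts.values())
    raw = {c: total / n for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

# Hypothetical counts: male-labeled utterances outnumber female-labeled ones,
# so the minority class ("She") receives the larger weight.
weights = inverse_frequency_weights({"He": 7000, "She": 4000})
```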
To fine-tune from a pre-trained multi-gender model, the procedure is the same as above, but the script is the following:
```bash
python train.py ${DATA_ROOT} \
    --train-subset train_st_src_gender_multilang \
    --valid-subset dev_with_gender_lang \
    --save-dir ${ST_SAVE_DIR} \
    --num-workers 5 --max-update 50000 \
    --max-tokens 10000 --adam-betas '(0.9, 0.98)' \
    --user-dir examples/speech_to_text \
    --task speech_to_text_aux_classification --config-yaml config_st_mix_multigender.yaml \
    --ignore-prefix-size 1 \
    --criterion ctc_multi_loss --underlying-criterion cross_entropy_multi_task --label-smoothing 0.1 \
    --arch multitask_conformer --reverted-classifier --auxiliary-loss-weight 0.5 --reverted-lambda 10 \
    --ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg --auxiliary-loss-class-weights 0.8 1.4 \
    --allow-extra-tokens --allow-partial-loading --finetune-from-model $PATH_TO_PRETRAINED_MULTIGENDER_MODEL \
    --optimizer adam --lr 5e-4 --lr-scheduler fixed \
    --clip-norm 10.0 \
    --seed 1 --update-freq 8 \
    --skip-invalid-size-inputs-valid-test \
    --log-format simple >> ${ST_SAVE_DIR}/train.log 2> ${ST_SAVE_DIR}/train.err
```
Similarly, the weighted variant is obtained by adding `--auxiliary-loss-class-weights 0.8 1.4` to the command above.
To enable the audio manipulation that converts speakers' vocal traits into the opposite gender, edit the `config_st_mix_multigender.yaml` file adding:

```yaml
opposite_pitch:
  gender_tsv: /home/ubuntu/disk2/corpora/MuST-Speakers_v1.1/MuST-Speakers_v1.1.tsv
  sampling_rate: 16000
  p_male: $PROB_MANIP
  p_female: $PROB_MANIP
raw_transforms:
  _train:
  - opposite_pitch
waveform_sample_rate: 16000
is_input_waveform: True
```

where `$PROB_MANIP` was set to 0.5 and 0.8 in the experiments reported in the paper.
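An illustrative sketch of the sampling logic implied by `p_male`/`p_female` (the actual pitch-conversion transform lives in the paper's codebase; this only shows the per-utterance coin flip, with hypothetical function and variable names):

```python
import random

def should_manipulate(speaker_gender, p_male, p_female, rng):
    """Per-utterance decision: convert vocal traits to the opposite gender?"""
    p = p_male if speaker_gender == "He" else p_female
    return rng.random() < p

rng = random.Random(0)
# With p_male = 0.5, roughly half of the male utterances get manipulated.
flips = [should_manipulate("He", 0.5, 0.5, rng) for _ in range(10000)]
```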
Evaluation of the system outputs has been performed with SacreBLEU v2.0 and the MuST-SHE Gender Accuracy Script v1.1.
If you use this work, please cite:
```bibtex
@inproceedings{gaido-et-al-multigender,
  title = {{How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender Translation}},
  author = {Gaido, Marco and Fucci, Dennis and Negri, Matteo and Bentivogli, Luisa},
  year = {2023},
  booktitle = {Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)},
  address = {Venice, Italy}
}
```