# How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender Translation (CLiC-it 2023)
Instructions to reproduce the paper "How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender Translation".
Download all the corpora listed in our paper and preprocess them as explained here.
The models of the paper have been trained with the following scripts.
All the scripts below assume 4 GPUs with at least 16GB of VRAM each. On different hardware, you may need to adjust `--max-tokens` (e.g., lower it if you have less VRAM) and `--update-freq` so that the product `num_gpus * max_tokens * update_freq` remains the same.
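As an illustrative check of that invariant (the numbers below are taken from the training commands in this README; the smaller-GPU setting is a hypothetical example), halving the per-GPU `--max-tokens` can be compensated by doubling `--update-freq`:

```python
def effective_batch(num_gpus, max_tokens, update_freq):
    # Effective number of tokens consumed per optimizer step.
    return num_gpus * max_tokens * update_freq

reference = effective_batch(4, 10000, 8)    # setting used in the scripts below
smaller_gpu = effective_batch(4, 5000, 16)  # e.g., GPUs with less VRAM
assert reference == smaller_gpu == 320000
```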
To train multi-gender models, you first need to edit the YAML config file generated by the preprocessing script, so as to have:
```yaml
audio_root: $YOUR_AUDIO_ROOT_DIR
bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: $YOUR_TGTLANG_SENTENCEPIECE_MODEL
bpe_tokenizer_src:
  bpe: sentencepiece
  sentencepiece_model: $YOUR_ENGLISH_SENTENCEPIECE_MODEL
input_channels: 1
input_feat_per_channel: 80
sampling_alpha: 1.0
prepend_tgt_lang_tag: True
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
  time_wrap_W: 0
transforms:
  '*':
  - utterance_cmvn
  _train:
  - utterance_cmvn
  - specaugment
vocab_filename: $YOUR_TGTLANG_SENTENCEPIECE_TOKENS_TXT
vocab_filename_src: $YOUR_ENGLISH_SENTENCEPIECE_TOKENS_TXT
```
which we name `config_st_mix_multigender.yaml` hereinafter. Mind the `prepend_tgt_lang_tag: True`.
Your SentencePiece models should contain tags for the two genders as the special tokens `<lang:He>` and `<lang:She>`. In addition, the TSV you have obtained from the preprocessing of your data must be enriched with a `tgt_lang` column containing either `He` or `She` according to the gender of the speaker (in the following, we assume the TSV is named `train_st_src_gender_multilang.tsv`).
To know the gender of each speaker, please refer to MuST-Speakers.
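A minimal sketch of that enrichment step, assuming a tab-separated file with a `speaker` column and a hypothetical speaker-to-gender mapping (in practice derived from MuST-Speakers):

```python
import csv
import io

# Hypothetical mapping from speaker IDs to gender tags.
speaker_gender = {"spk_001": "He", "spk_002": "She"}

def add_tgt_lang(tsv_text, gender_map):
    """Return the TSV text with an extra tgt_lang column (He/She per speaker)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        row["tgt_lang"] = gender_map[row["speaker"]]
        rows.append(row)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames + ["tgt_lang"], delimiter="\t"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

tsv = "id\tspeaker\ttgt_text\nutt1\tspk_001\thello\nutt2\tspk_002\tworld\n"
enriched = add_tgt_lang(tsv, speaker_gender)
```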
Then, train multi-gender models with the following command:
```bash
python train.py ${DATA_ROOT} \
    --train-subset train_st_src_gender_multilang \
    --valid-subset dev_with_gender_lang \
    --save-dir ${ST_SAVE_DIR} \
    --num-workers 5 --max-update 50000 \
    --max-tokens 10000 --adam-betas '(0.9, 0.98)' \
    --user-dir examples/speech_to_text \
    --task speech_to_text_aux_classification --config-yaml config_st_mix_multigender.yaml \
    --ignore-prefix-size 1 \
    --criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --arch conformer \
    --ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
    --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
    --warmup-updates 25000 \
    --clip-norm 10.0 \
    --seed 1 --update-freq 8 \
    --skip-invalid-size-inputs-valid-test \
    --log-format simple >> ${ST_SAVE_DIR}/train.log 2> ${ST_SAVE_DIR}/train.err
```
To obtain a multi-gender model that is fine-tuned from the base ST one, add `--allow-extra-tokens --finetune-from-model $BASE_ST_MODEL_CHECKPOINT` to the training command above, change the learning rate to `5e-4`, and the `--lr-scheduler` to `fixed`.
To train the multi-gender models with the auxiliary gender classification task, first add the following lines to the YAML config file, so as to obtain `config_st_mix_multigender_with_aux.yaml`:

```yaml
aux_classes:
- He
- She
```

Then, you need to duplicate the `tgt_lang` column in the TSV files, naming the new column `auxiliary_target`.
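A minimal sketch of that duplication, assuming the TSV already contains the `tgt_lang` column:

```python
def duplicate_tgt_lang(lines):
    """Append an auxiliary_target column that mirrors tgt_lang."""
    header = lines[0].split("\t")
    idx = header.index("tgt_lang")
    out = ["\t".join(header + ["auxiliary_target"])]
    for line in lines[1:]:
        fields = line.split("\t")
        out.append("\t".join(fields + [fields[idx]]))
    return out

rows = ["id\ttgt_text\ttgt_lang", "utt1\thello\tHe", "utt2\tworld\tShe"]
enriched = duplicate_tgt_lang(rows)
```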
The training can be executed with the following script:
```bash
python train.py ${DATA_ROOT} \
    --train-subset train_st_src_gender_multilang \
    --valid-subset dev_with_gender_lang \
    --save-dir ${ST_SAVE_DIR} \
    --num-workers 5 --max-update 50000 --keep-last-epochs 10 \
    --max-tokens 10000 --adam-betas '(0.9, 0.98)' \
    --user-dir examples/speech_to_text \
    --task speech_to_text_aux_classification --config-yaml config_st_mix_multigender_with_aux.yaml \
    --ignore-prefix-size 1 \
    --criterion ctc_multi_loss --underlying-criterion cross_entropy_multi_task --label-smoothing 0.1 \
    --arch multitask_conformer --reverted-classifier --auxiliary-loss-weight 0.5 --reverted-lambda 0.5 \
    --ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
    --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
    --warmup-updates 25000 \
    --clip-norm 10.0 \
    --seed 1 --update-freq 8 \
    --skip-invalid-size-inputs-valid-test \
    --log-format simple >> ${ST_SAVE_DIR}/train.log 2> ${ST_SAVE_DIR}/train.err
```
To obtain the weighted variant, add `--auxiliary-loss-class-weights 0.8 1.4` to the command above.
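One common recipe for such per-class weights is inverse class frequency normalized to mean 1; the sketch below uses hypothetical class counts, and the `0.8 1.4` values above are taken as given (the paper may have chosen them differently):

```python
def inverse_frequency_weights(counts):
    """Class weights inversely proportional to frequency, normalized to mean 1."""
    total = sum(counts.values())
    raw = {c: total / n for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

# Hypothetical counts: male-labeled utterances outnumber female-labeled ones,
# so the minority class ("She") receives the larger weight.
weights = inverse_frequency_weights({"He": 7000, "She": 4000})
```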
To fine-tune from a pre-trained multi-gender model, the procedure is the same as above, but the script is the following:
```bash
python train.py ${DATA_ROOT} \
    --train-subset train_st_src_gender_multilang \
    --valid-subset dev_with_gender_lang \
    --save-dir ${ST_SAVE_DIR} \
    --num-workers 5 --max-update 50000 \
    --max-tokens 10000 --adam-betas '(0.9, 0.98)' \
    --user-dir examples/speech_to_text \
    --task speech_to_text_aux_classification --config-yaml config_st_mix_multigender.yaml \
    --ignore-prefix-size 1 \
    --criterion ctc_multi_loss --underlying-criterion cross_entropy_multi_task --label-smoothing 0.1 \
    --arch multitask_conformer --reverted-classifier --auxiliary-loss-weight 0.5 --reverted-lambda 10 \
    --ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg --auxiliary-loss-class-weights 0.8 1.4 \
    --allow-extra-tokens --allow-partial-loading --finetune-from-model $PATH_TO_PRETRAINED_MULTIGENDER_MODEL \
    --optimizer adam --lr 5e-4 --lr-scheduler fixed \
    --clip-norm 10.0 \
    --seed 1 --update-freq 8 \
    --skip-invalid-size-inputs-valid-test \
    --log-format simple >> ${ST_SAVE_DIR}/train.log 2> ${ST_SAVE_DIR}/train.err
```
Similarly, the weighted variant is obtained by adding `--auxiliary-loss-class-weights 0.8 1.4` to the command above.
To enable the audio manipulation that converts speakers' vocal traits into the opposite gender, edit the `config_st_mix_multigender.yaml` file adding:

```yaml
opposite_pitch:
  gender_tsv: /home/ubuntu/disk2/corpora/MuST-Speakers_v1.1/MuST-Speakers_v1.1.tsv
  sampling_rate: 16000
  p_male: $PROB_MANIP
  p_female: $PROB_MANIP
raw_transforms:
  _train:
  - opposite_pitch
waveform_sample_rate: 16000
is_input_waveform: True
```

where `$PROB_MANIP` was set to 0.5 and 0.8 in the experiments reported in the paper.
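An illustrative sketch of the sampling logic implied by `p_male`/`p_female` (the actual pitch-conversion transform lives in the paper's codebase; this only shows the per-utterance coin flip, with hypothetical function and variable names):

```python
import random

def should_manipulate(speaker_gender, p_male, p_female, rng):
    """Per-utterance decision: convert vocal traits to the opposite gender?"""
    p = p_male if speaker_gender == "He" else p_female
    return rng.random() < p

rng = random.Random(0)
# With p_male = 0.5, roughly half of the male utterances get manipulated.
flips = [should_manipulate("He", 0.5, 0.5, rng) for _ in range(10000)]
```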
Evaluation of the system outputs has been performed with SacreBLEU v2.0 and the MuST-SHE Gender Accuracy Script v1.1.
If you use this work, please cite:
```bibtex
@inproceedings{gaido-et-al-multigender,
  title = {{How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender Translation}},
  author = {Gaido, Marco and Fucci, Dennis and Negri, Matteo and Bentivogli, Luisa},
  year = {2023},
  booktitle = {Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)},
  address = {Venice, Italy}
}
```