In our recent paper, we propose Llama-VITS for enhanced TTS synthesis with semantic awareness extracted from a large-scale language model. This repository is the PyTorch implementation of Llama-VITS. Please visit our demo page or the demo GitHub repository for audio samples.


Implemented Features:

Models with Weights:

  • Llama-VITS
  • BERT-VITS
  • ORI-VITS

Evaluation Metrics:

  • ESMOS
  • UTMOS
  • MCD
  • ASR (CER, WER)

Datasets:

  • full LJSpeech
  • 1-hour LJSpeech
  • EmoV_DB_bea_sem

Pre-requisites

  1. Clone this repository
    git clone git@github.com:xincanfeng/vitsGPT.git
  2. Install requirements.
    cd vitsGPT
    pip install pdm
    pdm install 
    1. You may need to install espeak first:
      sudo apt-get update
      sudo apt-get install espeak
  3. Download datasets
    1. Download and extract the LJSpeech dataset from its official page, then either rename the data folder or create soft links (using absolute paths) to your data to make it easier to access:
      ln -s /path/to/LJSpeech-1.1/wavs vitsGPT/DUMMY1
      ln -s /path/to/LJSpeech-1.1/wavs vitsGPT/vits/DUMMY1
      ln -s /path/to/LJSpeech-1.1/wavs vitsGPT/vits/ori_vits/DUMMY1
      ln -s /path/to/LJSpeech-1.1/wavs vitsGPT/vits/emo_vits/DUMMY1
      ln -s /path/to/LJSpeech-1.1/wavs vitsGPT/vits/sem_vits/DUMMY1
    2. Download and extract our EmoV_DB_bea_sem dataset from here, then rename it or create soft links to the dataset folder:
      ln -s /path/to/EmoV_DB_bea_sem/wavs_filtered vitsGPT/DUMMY5
      ln -s /path/to/EmoV_DB_bea_sem/wavs_filtered vitsGPT/vits/DUMMY5
      ln -s /path/to/EmoV_DB_bea_sem/wavs_filtered vitsGPT/vits/ori_vits/DUMMY5
      ln -s /path/to/EmoV_DB_bea_sem/wavs_filtered vitsGPT/vits/emo_vits/DUMMY5
      ln -s /path/to/EmoV_DB_bea_sem/wavs_filtered vitsGPT/vits/sem_vits/DUMMY5
    3. You can also download EmoV_DB and filter it yourself by referring to our preprocess_EmoV_DB_bea_filter.py. We also provide code for other necessary data preprocessing in the datasets folder. Note that the gt_test_wav folder includes all test audios that we have processed to the same sampling rate as those generated by the {method}_VITS methods. You can also process this yourself if you use other datasets.
    4. We do not provide the 1-hour LJSpeech dataset explicitly: once the full LJSpeech is downloaded, you can train on 1-hour LJSpeech by directly using its corresponding filtered filelist in our filelists folder, or you can randomly filter a subset yourself from the full LJSpeech (see the sketch after this list for one way to sample such a subset).
  4. Download the filelists, which contain semantic embeddings extracted from Llama and various BERT models.
    The vitsGPT/vits/filelists folder contains the exact training information for every dataset, along with the corresponding semantic embeddings used in our experiments.
  5. Create more soft links to facilitate access to common configurations among different {method}_VITS methods.
    1. Create soft links to the vitsGPT/vits/configs for each {method}_VITS method.
      ln -s vitsGPT/vits/configs vitsGPT/vits/ori_vits/
      ln -s vitsGPT/vits/configs vitsGPT/vits/emo_vits/
      ln -s vitsGPT/vits/configs vitsGPT/vits/sem_vits/
    2. Create soft links to the filelists/ for each {method}_VITS method.
      ln -s vitsGPT/vits/filelists vitsGPT/vits/ori_vits/
      ln -s vitsGPT/vits/filelists vitsGPT/vits/emo_vits/
      ln -s vitsGPT/vits/filelists vitsGPT/vits/sem_vits/
  6. Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
    # Cython-version Monotonic Alignment Search
    cd monotonic_align
    python setup.py build_ext --inplace
    
    # Preprocessing (g2p) for your own datasets. 
    # python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt 
    Please refer to preprocess_own_data.sh for the configurations for different datasets.
    Note that we have provided preprocessed phonemes for LJSpeech, 1-hour LJSpeech, and EmoV_DB_bea_sem in filelists named {dataset}_audio_text_{train/val/test/all}_filelist.txt.cleaned.
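
For the 1-hour subset mentioned above, the following is a minimal sketch (not the exact script we used) of how one could sample roughly one hour of audio from the full LJSpeech training filelist. It assumes the standard VITS filelist format wav_path|text and that the wav paths resolve on disk (e.g., via the DUMMY1 soft link); the output file name here is only a placeholder.

    # Minimal sketch: sample roughly one hour of LJSpeech from the full training filelist.
    import random
    import soundfile as sf

    random.seed(0)
    full_filelist = "filelists/ljs_audio_text_train_filelist.txt"       # provided filelist
    subset_filelist = "filelists/ljs_audio_text_train_1h_filelist.txt"  # placeholder output name

    with open(full_filelist, encoding="utf-8") as f:
        lines = f.readlines()
    random.shuffle(lines)

    kept, total_sec = [], 0.0
    for line in lines:
        wav_path = line.split("|")[0]
        info = sf.info(wav_path)                       # read duration from the wav header
        total_sec += info.frames / info.samplerate
        kept.append(line)
        if total_sec >= 3600:                          # stop once we reach ~1 hour
            break

    with open(subset_filelist, "w", encoding="utf-8") as f:
        f.writelines(kept)
    print(f"kept {len(kept)} utterances ({total_sec / 3600:.2f} hours)")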

Extracting Semantic Embeddings

Note that we have provided all extracted semantic embeddings from Llama and various BERT models in filelists named {dataset}_audio_{token}_{dimension}.pt. If you want to process your own data, we also provide the code to extract semantic embeddings from Llama or various BERT models, as described below.
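
As a quick sanity check before training, you can load one of the provided .pt files and inspect its contents. The sketch below is illustrative only: the file name is a placeholder following the naming pattern above, and we assume for illustration that torch.load returns a mapping from utterance key to tensor; print the loaded object to confirm its actual structure.

    # Illustrative sketch: inspect one of the provided semantic-embedding files.
    import torch

    emb_path = "vits/filelists/ljs_audio_text_512.pt"   # placeholder following {dataset}_audio_{token}_{dimension}.pt
    embeddings = torch.load(emb_path, map_location="cpu")

    print(type(embeddings))
    if isinstance(embeddings, dict):
        key, value = next(iter(embeddings.items()))
        # a global token is expected to be a single vector, a sequential token a matrix
        print(key, value.shape)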

Extracting Semantic Embeddings From Llama

  1. Use the Llama implementation in our repository, which includes code to extract the semantic embeddings from the final hidden layer. You can always refer to the Llama repository if you have further related questions.

  2. First, in the vitsGPT/llama directory, run:

    cd vitsGPT/llama
    pip install -e .
  3. Then, download the Llama weights and tokenizer from the Meta website and accept their license.

  4. Once your request is approved, you will receive a signed URL over email. Make sure you have wget and md5sum installed, then run the download.sh script (./download.sh), passing the provided URL when prompted to start the download.

    • Make sure to grant execution permissions to the download.sh script
    • During this process, you will be prompted to enter the URL from the email.
    • Do not use the “Copy Link” option but rather make sure to manually copy the link from the email.
    • Keep in mind that the links expire after 24 hours and a certain amount of downloads. If you start seeing errors such as 403: Forbidden, you can always re-request a link.
  5. Once the models you want have been downloaded, you can run them locally. Below is an example command:

    torchrun --nproc_per_node 1 example_chat_completion.py \
        --ckpt_dir llama-2-7b-chat/ \
        --tokenizer_path tokenizer.model \
        --max_seq_len 512 --max_batch_size 6
  6. You can refer to inference.sh for more examples of how we run Llama inference. Use the inference_ave.sh, inference_last.sh, inference_pca.sh, inference_mat_phone.sh, inference_mat_text.sh, inference_sentence.sh, and inference_word.sh scripts to infer and extract the corresponding semantic embeddings described in our paper.

    As you can see from the inference_{token}.sh scripts, example_{llama-model}_{method}_{token}.py in the llama/examples/{dataset}_examples folder tells Llama how to extract the different semantic embeddings, which input transcripts to read, and where to write the output. So remember to check the corresponding example_{llama-model}_{method}_{token}.py file for the configuration of the input_file, output_file, and audiopath variables that you want to process.
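
The extraction itself is done by the example_{llama-model}_{method}_{token}.py scripts above. Purely as an illustration of the idea (not the repository's code), the sketch below uses Hugging Face transformers to take the final hidden layer of a Llama model and derive either an averaged global embedding or a token-level sequential matrix; the model name and sample text are assumptions.

    # Illustrative sketch (not the repository's extraction code): final-hidden-layer
    # embeddings from a Llama model via Hugging Face transformers.
    import torch
    from transformers import AutoModel, AutoTokenizer

    model_name = "meta-llama/Llama-2-7b-hf"      # assumed checkpoint; requires access to the weights
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
    model.eval()

    text = "Printing, in the only sense with which we are at present concerned."
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # [1, seq_len, hidden_dim]

    global_emb = hidden.mean(dim=1).squeeze(0)       # "ave"-style global vector
    sequential_emb = hidden.squeeze(0)               # "mat"-style matrix, one row per token
    print(global_emb.shape, sequential_emb.shape)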

Extracting Semantic Embeddings From various BERT models

You can configure get_embedding.sh to extract BERT embeddings. When configuring, don't forget to set the correct filelist_dir in the corresponding get_embedding_{token}.py files.
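
As an illustration of the idea behind get_embedding_{token}.py (not the exact script), the sketch below extracts a global sentence embedding from the [CLS] position and a sequential token-level matrix from a BERT model via Hugging Face transformers; the model name and sample text are assumptions.

    # Illustrative sketch: global ([CLS]) and sequential (token-level) BERT embeddings.
    import torch
    from transformers import AutoModel, AutoTokenizer

    bert_name = "bert-base-uncased"                  # assumed model; the paper compares several BERT variants
    tokenizer = AutoTokenizer.from_pretrained(bert_name)
    model = AutoModel.from_pretrained(bert_name).eval()

    text = "in being comparatively modern."
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # [1, seq_len, 768]

    cls_emb = hidden[:, 0, :].squeeze(0)             # global embedding from the [CLS] token
    token_mat = hidden.squeeze(0)                    # sequential embedding matrix
    print(cls_emb.shape, token_mat.shape)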

Training

You can train the VITS model with or without semantic tokens using the scripts below.
Note that we also provide some of our pretrained models.

Training VITS with no semantic tokens

cd ori_vits
python train.py -c configs/ljs_base.json -m ljs_base

Please refer to train.sh for specific configurations of different datasets.

Training VITS with global semantic tokens

cd emo_vits
python emo_train.py -c configs/ljs_sem_ave.json -m ljs_emo_add_ave

Please refer to emo_train.sh for specific configurations of different datasets and global tokens.

Training VITS with sequential semantic tokens

cd sem_vits
python sem_train.py -c configs/ljs_sem_mat_text.json -m ljs_sem_mat_text

Please refer to sem_train.sh for specific configurations of different datasets and sequential tokens.

(In case you are interested in the naming details: "mat" in the sequential tokens' file names means "matrix", because, compared to a global token, which is mathematically represented by a single vector, a sequential token is represented by a matrix for each sentence transcript.)

Inferencing

See inference.ipynb for a simple example of how to run inference on any text.
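
For reference, here is a minimal sketch of that flow for the model without semantic tokens, following the original VITS inference recipe; the config, checkpoint path, and input text are placeholders, and the emo_/sem_ variants additionally feed the extracted semantic embedding to the model, so use the notebooks below for those.

    # Minimal sketch: text-to-speech inference with the no-semantic-token model,
    # following the original VITS recipe. Run from vits/ori_vits so the local
    # modules (utils, models, text, commons) resolve; paths are placeholders.
    import torch
    import commons
    import utils
    from models import SynthesizerTrn
    from text import text_to_sequence
    from text.symbols import symbols

    hps = utils.get_hparams_from_file("configs/ljs_base.json")
    net_g = SynthesizerTrn(
        len(symbols),
        hps.data.filter_length // 2 + 1,
        hps.train.segment_size // hps.data.hop_length,
        **hps.model).cuda().eval()
    utils.load_checkpoint("logs/ljs_base/G_100000.pth", net_g, None)

    text = "Speech synthesis with semantic awareness."
    seq = text_to_sequence(text, hps.data.text_cleaners)
    if hps.data.add_blank:
        seq = commons.intersperse(seq, 0)            # interleave blanks as during training
    x = torch.LongTensor(seq).unsqueeze(0).cuda()
    x_lengths = torch.LongTensor([x.size(1)]).cuda()

    with torch.no_grad():
        audio = net_g.infer(x, x_lengths, noise_scale=0.667, noise_scale_w=0.8,
                            length_scale=1)[0][0, 0].cpu().numpy()
    # audio is a float waveform at hps.data.sampling_rate; save it with e.g. scipy.io.wavfile.write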

Configure the model weights (with or without extracted semantic tokens) in the notebooks below according to the specific model. You can then run inference on the test data transcripts, which generates a folder named after the checkpoint, e.g., G_100000, containing a folder named source_model_test_wav that saves all the generated audio in the corresponding checkpoint directory. Specifically,
Use infer_test.ipynb for inference with no semantic tokens on the test data transcripts.
Use emo_infer_test.ipynb for inference with global semantic tokens on the test data transcripts.
Use sem_infer_test.ipynb for inference with sequential semantic tokens on the test data transcripts.

Note that, in the source_model_test_wav folder, the saved audio samples are named in generation order instead of by their corresponding transcript keys, for convenience.

Evaluation

Evaluate MCD and ASR (CER, WER) using ESPnet, and UTMOS using SpeechMOS

  1. Clone and install ESPnet according to its repository.
  2. Copy eval.sh to espnet/egs2/libritts/tts1/eval.sh and configure it.
  3. Install Whisper for calculating ASR (CER, WER):
    pip install git+https://github.com/openai/whisper.git
  4. Use run_eval_ljs.sh and run_eval_emovdb.sh for evaluation on LJSpeech and EmoV_DB (or their subsets), respectively. As you can see from run_eval_{dataset}.sh, not only eval.sh but also eval_1_make_kaldi_style_files.py and other scripts in eval_datasets are used to process and evaluate the inferenced audio. Specifically,
    1. Run eval_1_make_kaldi_style_files.py to rename the generated audio samples in the source_model_test_wav folder according to their transcript keys and to generate the related scp files.

      python3 vits/eval_datasets/eval_{dataset}/eval_1_make_kaldi_style_files.py ${method} ${model} ${step}
    2. Run eval_2_unify_and_eval.sh to downsample both the model-generated audio and the ground-truth audio to ensure they have the same sampling rate.

      . vits/eval_datasets/eval_{dataset}/eval_2_unify_and_eval.sh ${method} ${model} ${step}
    3. Run eval.sh to evaluate MCD, ASR, and F0 using the ESPnet framework. (You can also run this step after step 4.)

      CUDA_VISIBLE_DEVICES=0 . espnet/egs2/libritts/tts1/eval.sh ${method} ${model} ${step} 

      Because this step may take some time, it is recommended to run this process in the background using:

      CUDA_VISIBLE_DEVICES=0 nohup espnet/egs2/libritts/tts1/eval.sh ${method} ${model} ${step} > eval.log 2>&1 & 
    4. Run eval_3_mos.py to evaluate UTMOS using the SpeechMOS framework (a minimal sketch of the underlying SpeechMOS call is shown after this list).

      CUDA_VISIBLE_DEVICES=0 python3 vits/eval_datasets/eval_{dataset}/eval_3_mos.py ${method} ${model} ${step}
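
For reference, the sketch below shows the SpeechMOS call that underlies this step; the wav directory is a placeholder, and eval_3_mos.py is what actually drives it over the generated test set.

    # Minimal sketch: UTMOS prediction with SpeechMOS over a folder of wav files.
    import glob
    import torch
    import torchaudio

    predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True)

    scores = []
    for wav_path in glob.glob("path/to/generated_test_wavs/*.wav"):  # placeholder directory
        wave, sr = torchaudio.load(wav_path)     # wave: [channels, samples]
        with torch.no_grad():
            score = predictor(wave, sr)          # predicted MOS for this utterance
        scores.append(score.item())

    print(f"UTMOS over {len(scores)} files: {sum(scores) / len(scores):.3f}")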

Eval ESMOS using Amazon Mechanical Turk (AMT)

We created randomly paired examples to collect ESMOS scores using AMT. You can refer to human_evaluation to see how we prepared this evaluation.

Citation

If our work is useful to you, please cite our paper "Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness":

@misc{feng2024llamavits,
      title={Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness}, 
      author={Xincan Feng and Akifumi Yoshimoto},
      year={2024},
      eprint={2404.06714},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
