TacoBERTron

TP-GST-BERT-Tacotron2

TP-GST-BERT Tacotron2 is a voice synthesis model based on Tacotron2 GST that can predict the style embedding from text alone, using a BERT sentence embedding. It is an implementation of the model proposed by the SberDevices team, extended by me with a faster TP-GST module. The model was trained on a Russian-language dataset.

The model contains:

  • Tacotron2 encoder + decoder
  • Global Style Tokens (GST) module
  • 3 text-predicting style embedding models
  • BERT model
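
At inference time the text-predicted style embedding replaces the reference-audio path of GST, so no audio prompt is needed. A minimal PyTorch sketch of that idea (all names and dimensions below are illustrative, not this repository's actual identifiers):

```python
import torch
import torch.nn as nn

class TextPredictedGST(nn.Module):
    """Toy TP-GST head: predicts GST combination weights from a
    BERT sentence embedding (dimensions are placeholders)."""

    def __init__(self, bert_dim=768, num_tokens=10, token_dim=256):
        super().__init__()
        # Learned global style tokens, as in GST-Tacotron
        self.style_tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        # Map the sentence embedding to one weight per style token
        self.to_weights = nn.Linear(bert_dim, num_tokens)

    def forward(self, sentence_embedding):
        # sentence_embedding: (batch, bert_dim)
        weights = torch.softmax(self.to_weights(sentence_embedding), dim=-1)
        # Style embedding = weighted sum of the tokens: (batch, token_dim)
        return weights @ self.style_tokens
```

The resulting style embedding is then combined with the Tacotron2 encoder outputs, just as a reference-derived GST embedding would be.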

Pre-requisites

  1. NVIDIA GPU + CUDA + cuDNN

Setup

  1. Clone this repo: git clone https://github.com/lightbooster/TP-GST-BERT-Tacotron2.git
  2. Enter the repo: cd TP-GST-BERT-Tacotron2
  3. Initialize the submodule: git submodule init; git git submodule update
  4. Install PyTorch
  5. Install Apex
  6. Install the Python requirements (pip install -r requirements.txt) or build the Docker image
      NOTE: a detailed setup walkthrough is provided in the notebook demo.ipynb; a quick environment check is sketched below
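
After installation, a quick sanity check that PyTorch sees the GPU and that Apex imports (a minimal sketch, not part of the repository):

```python
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

try:
    # Apex provides the mixed-precision utilities used by fp16_run=True
    from apex import amp  # noqa: F401
    print("Apex: OK")
except ImportError:
    print("Apex: missing (needed for fp16_run=True)")
```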

Prepare BERT

  1. Download a BERT checkpoint (I used RuBERT from deeppavlov.ai)
  2. Move the BERT checkpoint, config, and vocabulary into the /bert folder, or set the corresponding paths in hparams.py
  3. Adjust the BERT hyperparameters in hparams.py if needed
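
For orientation, one common way to turn a BERT checkpoint into a sentence embedding is mean pooling of the last hidden states with the transformers library. This is only an illustration, assuming a transformers-compatible checkpoint; the actual wiring is defined by hparams.py in this repository:

```python
import torch
from transformers import BertModel, BertTokenizer

# "bert/" is a placeholder for the folder holding checkpoint, config, vocab
tokenizer = BertTokenizer.from_pretrained("bert/")
bert = BertModel.from_pretrained("bert/").eval()

inputs = tokenizer("Пример предложения.", return_tensors="pt")  # Russian input
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state   # (1, seq_len, 768)
sentence_embedding = hidden.mean(dim=1)         # (1, 768), mean-pooled
```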

Training

  1. Update the filelists inside the filelists folder to point to your data (the expected line format is shown after this list)
  2. python train.py --output_directory=outdir --log_directory=logdir
  3. (OPTIONAL) tensorboard --logdir=outdir/logdir
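
This repository is laid out like NVIDIA's Tacotron2, where each filelist line is typically an audio path and its transcript separated by a pipe; the paths and text below are placeholders:

```
/data/ruslan/wavs/0001.wav|Текст первого примера.
/data/ruslan/wavs/0002.wav|Текст второго примера.
```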

Training using a pre-trained model

Training from a pre-trained model can lead to faster convergence.
By default, the speaker embedding layer is ignored.

  1. Download my pre-trained model checkpoint, trained on a Russian-language dataset. NOTE: the checkpoint does not contain the BERT model weights; use a separate checkpoint for them.
  2. python train.py --output_directory=outdir --log_directory=logdir -c {PATH_TO_CHECKPOINT} --warm_start
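
For reference, in NVIDIA's Tacotron2 (which this flag follows) --warm_start loads the checkpoint while skipping the layers named in ignore_layers in hparams.py; a simplified sketch of that logic:

```python
import torch

def warm_start_model(checkpoint_path, model, ignore_layers):
    """Load pretrained weights; ignored layers keep their fresh init."""
    state = torch.load(checkpoint_path, map_location="cpu")["state_dict"]
    if ignore_layers:
        state = {k: v for k, v in state.items() if k not in ignore_layers}
        full = model.state_dict()
        full.update(state)   # merge: pretrained weights where available
        state = full
    model.load_state_dict(state)
    return model
```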

Multi-GPU (distributed) and Automatic Mixed Precision Training

  1. python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True

Training and Inference demo

M-AILABS data preprocessing, training configuration, and inference demos are provided in the notebook demo.ipynb
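
As a taste of what the notebook covers, here is a condensed version of the usual Tacotron2 + WaveGlow inference flow. Module names follow NVIDIA's Tacotron2 layout and the checkpoint paths are placeholders; the exact calls (including how the BERT embedding is fed in) are in demo.ipynb:

```python
import torch
from hparams import create_hparams   # assumed layout, as in NVIDIA's repo
from model import Tacotron2
from text import text_to_sequence

hparams = create_hparams()
model = Tacotron2(hparams).cuda().eval()
model.load_state_dict(torch.load("tacotron2_ckpt.pt")["state_dict"])

waveglow = torch.load("waveglow.pt")["model"].cuda().eval()

# TP-GST predicts the style embedding from this same text,
# so no reference audio is needed at inference time.
seq = text_to_sequence("Привет, мир!", ["basic_cleaners"])
seq = torch.LongTensor(seq).cuda().unsqueeze(0)
with torch.no_grad():
    _, mel_postnet, _, _ = model.inference(seq)
    audio = waveglow.infer(mel_postnet)
```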

Related repos

WaveGlow: a faster-than-real-time flow-based generative network for speech synthesis.
