
Fine-Tuning with a small dataset #296

Closed · OscarVanL opened this issue Oct 11, 2020 · 127 comments
Labels: question ❓ Further information is requested

@OscarVanL
Contributor

OscarVanL commented Oct 11, 2020

Hello!

I'm trying to evaluate ways to achieve TTS for individuals who have lost their ability to speak. The idea is to let them regain speech via TTS, but using the voice they had prior to losing their voice. This could happen from various causes, such as cancer of the larynx, motor neurone disease, etc.

These patients have recorded voice banks, a small dataset of phrases recorded prior to losing their ability to speak.

Conceptually, I wanted to take a pre-trained model and fine-tune it with the individual's voice bank data.

I'd love some guidance.

There are a few constraints:

  1. The patient-specific data bank is not a large dataset; it's approximately 100 recorded phrases.
  2. Latency must be low; we hope for real-time TTS. Some approaches use a pre-trained model followed by vocoders, and in our experience this has been too slow, with latencies of about 5 seconds.
  3. The trained model must work in an Android app (I see there is already an Android example, which has been helpful).

I'd love your guidance on the steps required to achieve this, and any recommendations on which choices would give good results...

  • Which model architectures will tolerate tuning with a small dataset?
  • The patients have British accents, whereas most pre-trained models have American accents. Will this be a problem?

Do you have any tutorials or examples that show how to achieve a customised voice via fine-tuning?

@dathudeptrai dathudeptrai self-assigned this Oct 12, 2020
@dathudeptrai dathudeptrai added the question ❓ Further information is requested label Oct 12, 2020
@dathudeptrai
Collaborator

@OscarVanL Hi, great idea :D. Here is some guidance for customizing a voice via fine-tuning:

  • Regarding latency, FastSpeech2 + MB-MelGAN is enough in this case; it can run in real-time on mobile devices with good generated voice quality.
  • You can use an LJSpeech pretrained model and fine-tune it on your patient-specific data. Since your dataset is small (100 recorded phrases), many words will be missing, so you only need to fine-tune the speaker-embedding layers and add some FC layers at the end of the FastSpeech2 model (you can also fine-tune the PostNet in FastSpeech2) to let the model transfer from the American accent to the British accent. I will make a PR to let the model train only some layers rather than all layers :D.
  • Regarding MB-MelGAN, you can train it on a larger dataset with many speakers to obtain a universal vocoder, so you can use this universal version with your FastSpeech2 without fine-tuning.

@ZDisket can you share some of your experience fine-tuning a voice from female -> male on your small dataset :D.

@ZDisket
Collaborator

ZDisket commented Oct 12, 2020

@OscarVanL @dathudeptrai
FastSpeech2 is definitely the right architecture; it's very tolerant of small datasets (my guess is because it doesn't have to learn to align them). I've had success fine-tuning on as little as 80 seconds of audio, although that was female -> female; there shouldn't be a problem with male voices either, which I've also had success with.
I've had little success fine-tuning MB-MelGAN, though, as there is always a lot of loss or background noise (which is why I integrated RNNoise into my frontend), so a universal vocoder is the way to go.

@OscarVanL
Contributor Author

Wow, thank you both for the detailed replies. That's really helpful!

@dathudeptrai Thank you for offering to make a PR to help train selected layers.

@ZDisket It's great to hear your success even with a limited dataset. Fortunately we have much more than 80 seconds of audio even in the worst cases.

Could you explain the idea of a universal vocoder to me? How is it possible to get a customised voice using a universal vocoder without fine tuning?

This is all very new to me, but very exciting.

@ZDisket
Collaborator

ZDisket commented Oct 12, 2020

@OscarVanL Conventional text-to-speech works with a text2mel model, which converts text to mel spectrograms, and a vocoder, which turns spectrograms into audio. Training a vocoder on many, many different voices can produce a "universal vocoder" that adapts to almost any speaker. I know the owner of vo.codes uses a (MelGAN) universal vocoder. You'll still have to fine-tune the text2mel model, though.
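
For anyone following along, here is a minimal sketch of that two-stage pipeline using this repo's inference API. The pretrained-model names are assumptions taken from the current README, so treat this as an illustration rather than a drop-in script:

import tensorflow as tf
from tensorflow_tts.inference import AutoProcessor, TFAutoModel

# Assumed hub IDs from the README; swap in your own fine-tuned checkpoints.
processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")

# Stage 1 (text2mel): text -> token IDs -> mel spectrogram.
input_ids = processor.text_to_sequence("Hello, this is a test.")
_, mel_outputs, _, _, _ = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)

# Stage 2 (vocoder): mel spectrogram -> waveform.
audio = mb_melgan.inference(mel_outputs)[0, :, 0]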

@OscarVanL
Contributor Author

Thank you for the explanation.

So my understanding is that I will have to train a FastSpeech2 text2mel model to create patient-specific mel spectrograms. This will involve me taking a LJSpeech pretrained model, then fine-tuning as described by @dathudeptrai with patient voice data.

After this, are there pre-trained MelGAN Universal Vocoders available to download that have already been trained on many voices, or is this something I would need to do myself?

Finally, are Universal Vocoders tied to a specific text2mel architecture (Tacotron, FastSpeech, etc), or can a Universal Vocoder take any mel spectrogram generated by any text2mel architecture?

@ZDisket
Collaborator

ZDisket commented Oct 12, 2020

@OscarVanL

After this, are there pre-trained MelGAN Universal Vocoders available to download that have already been trained on many voices, or is this something I would need to do myself?

There are three MelGANs: regular MelGAN (lowest quality), ditto + STFT loss (somewhat better), and Multi-Band (best quality and faster inference); you can hear the differences on the demo page. There's also ParallelWaveGAN, but it's too slow on CPU to consider.

As for pretrained models, there are none trained natively with this repo on large multispeaker datasets (I have two trained on about 200 speakers, one 32KHz and the other 48KHz, but they don't work well outside of those speakers), but there are notebooks to convert trained models from kan-bayashi's repo: https://github.com/kan-bayashi/ParallelWaveGAN (which has a lot) to this one's format. I forgot where they were, so you'll have to ask @dathudeptrai.

Finally, are Universal Vocoders tied to a specific text2mel architecture (Tacotron, FastSpeech, etc), or can a Universal Vocoder take any mel spectrogram generated by any text2mel architecture?

A mel spectrogram is a mel spectrogram no matter where it comes from, so yes, as long as the text2mel and vocoder's data is processed the same (same normalization method, mel frequency range, etc).
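
If it helps, here is a rough way to sanity-check that, assuming the text2mel and vocoder preprocessing settings live in YAML configs; the paths and field names below are assumptions, so adjust them to your actual files:

import yaml

# Hypothetical config paths and key names; check them against your own YAML files.
with open("preprocess/libritts_preprocess.yaml") as f:
    text2mel_cfg = yaml.safe_load(f)
with open("examples/multiband_melgan/conf/multiband_melgan.v1.yaml") as f:
    vocoder_cfg = yaml.safe_load(f)

must_match = ["sampling_rate", "hop_size", "fft_size", "num_mels", "fmin", "fmax"]
for key in must_match:
    a, b = text2mel_cfg.get(key), vocoder_cfg.get(key)
    print(f"{key}: text2mel={a} vocoder={b} {'OK' if a == b else 'MISMATCH'}")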

@OscarVanL
Contributor Author

Thank you once again for helping with my noob questions! I'll definitely check out that resource with trained models.

@Zak-SA

Zak-SA commented Oct 13, 2020

That's an interesting subject.
Is there any example of how to fine-tune MB-MelGAN from a pretrained model? The README only says "Just load pretrained model and training from scratch with other languages". Can you explain more?
Thanks

@dathudeptrai
Collaborator

@OscarVanL I just made a PR for custom trainable layers here (#299).

@Zak-SA You can try to train a universal vocoder, or load the weights from the pretrained model list and then train as normal (follow the README).

@OscarVanL
Contributor Author

OscarVanL commented Oct 13, 2020

Amazing, thank you to both of you for going above and beyond to help!

A few more questions, as I didn't see any documentation on preparing the dataset; I'm looking to prepare some data for fine-tuning.

Do I need to strip punctuation from the text? E.g.: ()`';"-

Are there any other similar cases I should consider when preparing the transcriptions?

Does the audio filetype matter? I have 44100Hz Signed 16-bit PCM WAVs. (Edit: These files produced no errors during preprocessing/normalisation, but they should be mono, not stereo)

@OscarVanL
Contributor Author

OscarVanL commented Oct 14, 2020

Some early observations going through the steps in examples/mfa_extraction/README.md and examples/fastspeech2_libritts/README.md with my own dataset...

  • Your dataset should be in mono, or else during one of these steps the script will fail.

  • Your dataset should not use dashes in the name. My dataset was named as audio-1.wav, audio-2.wav. In fix_mismatch.py this will cause the script to fail.

  • The sampling rate will automatically be down-sampled from 44100Hz to the required 24000Hz.

  • 16-bit PCMs are fine.

  • Audio clips should not exceed 15 seconds in duration, or you will run out of memory when training the model.
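
To put those observations into practice, here is a minimal preparation sketch using librosa and soundfile; the folder names and the 15-second cut-off are assumptions based on the notes above, not an official script:

import os
import librosa
import soundfile as sf

SRC, DST = "raw_wavs", "dataset_wavs"   # hypothetical input/output folders
TARGET_SR, MAX_SECONDS = 24000, 15.0

os.makedirs(DST, exist_ok=True)
for name in os.listdir(SRC):
    if not name.endswith(".wav"):
        continue
    # Load as mono and resample (e.g. 44100 Hz -> 24000 Hz) in one step.
    audio, sr = librosa.load(os.path.join(SRC, name), sr=TARGET_SR, mono=True)
    if len(audio) / sr > MAX_SECONDS:
        print(f"skipping {name}: longer than {MAX_SECONDS}s")
        continue
    # Avoid dashes in filenames (audio-1.wav -> audio_1.wav) and write 16-bit PCM.
    sf.write(os.path.join(DST, name.replace("-", "_")), audio, TARGET_SR, subtype="PCM_16")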

@OscarVanL
Contributor Author

OscarVanL commented Oct 16, 2020

Hi,

I've begun fine-tuning with the guidance given by @dathudeptrai :)

I've taken the LJSpeech pretrained model "fastspeech2.v1" to fine-tune.

I took the fastspeech2.v1.yaml config (designed for the LJSpeech dataset) and made only one change: I set var_train_expr: embeddings from the PR @dathudeptrai made. I was unsure what other hyperparameters to change.

Here you can see the TensorBoard results for training the embedding layers...
[TensorBoard screenshot]

Using the fastspeech2_inference notebook, followed by the multiband_melgan_inference notebook with the libritts_24k.h5 universal vocoder, I got these results...

At 5000 steps: audio, spectrogram

At 15000 steps: audio, spectrogram

At 80000 steps: audio, spectrogram

Obviously, this sounds bad because I have only trained embedding layers.

I would now like to add some FC layers at the end, as you suggested, but am not sure how I do this.

Based on my tensorboard results, how many steps do you think I should tune the embedding layers before I stop and begin to train the FC layers?

Do you advise making any changes to the hyperparameters in fastspeech2.v1.yaml?

@dathudeptrai
Collaborator

@OscarVanL can you try to train the whole network (var_train_expr: null) and report the TensorBoard here? Then I can give you the right way to go :D.

@OscarVanL
Contributor Author

@dathudeptrai Here's my tensorboard with 120k steps with var_train_expr: null.
[TensorBoard screenshot]

@dathudeptrai
Collaborator

dathudeptrai commented Oct 17, 2020

OK, I can see what the problem is with your dataset :D. I want you to try training the model with var_train_expr: "speaker|embeddings|f0_predictor|energy_predictor|duration_predictor|f0_embeddings|energy_embeddings|mel_before|postnet|decoder/layer_._0", which means you should fine-tune:

  1. All speaker-specific layers (because your dataset has a different speaker than the pre-trained model).
  2. The phoneme embeddings (note that your pre-trained model uses characters rather than phonemes, so you should train the phoneme embeddings from scratch).
  3. The F0/energy/duration predictors, because these are speaker characteristics.
  4. mel_before/postnet and the first layer of the decoder, which should also be retrained.

I don't know whether this will work because the pretrained model you are using is character-based; if you can find a phoneme-based pretrained model, you won't need to fine-tune the phoneme embeddings. @ZDisket do you have any phoneme FS2 pretrained model?
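
Conceptually, an option like var_train_expr filters the trainable variables by their names. A rough sketch of the idea, not necessarily the exact implementation in PR #299:

import re

def select_trainable(model, var_train_expr):
    # Keep only variables whose names match the expression,
    # e.g. "speaker|embeddings|f0_predictor|...".
    pattern = re.compile(var_train_expr)
    return [v for v in model.trainable_variables if pattern.search(v.name)]

# Hypothetical use inside a custom training step:
# train_vars = select_trainable(fastspeech2, config["var_train_expr"])
# gradients = tape.gradient(loss, train_vars)
# optimizer.apply_gradients(zip(gradients, train_vars))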

@OscarVanL
Contributor Author

Ok I will try that now 👍 Thanks!

@dathudeptrai
Collaborator

Ok I will try that now Thanks!

After that, maybe you should try a hop_size of 240 for 24k audio and try again. MFA uses a 10 ms frame shift to calculate durations, so the hop_size should be 240 to match the durations extracted from MFA exactly; if we use 300 or 256 then we have to round the durations, and that rounding is not precise :D.
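
Just to spell out where the 240 comes from (simple arithmetic, nothing repo-specific):

# A 10 ms MFA frame shift at 24000 Hz is exactly 240 samples per hop.
sample_rate = 24000
frame_shift = 0.010  # seconds
print(sample_rate * frame_shift)   # 240.0 -> hop_size = 240
# With hop_size 256 or 300 the frame shift is ~10.67 ms / 12.5 ms,
# so the MFA durations would have to be rounded to fit.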

@OscarVanL
Contributor Author

OscarVanL commented Oct 17, 2020

I wanted to ask a question about mfa duration...

My recordings are 44100Hz. For txt_grid_parser --sample_rate, do I use 44100 or 24000? The later preprocessing stage downsamples to 24000, but txt_grid_parser is run before downsampling.

@dathudeptrai
Collaborator

I think it is 44100, but we may need to ask @machineko.

@machineko
Contributor

Either method should work; you just need to change the sample rate used for the calculation in preprocessing later. (Downsampling first should work better, but the results shouldn't be noticeably different, as a small difference in durations shouldn't affect FS2 according to the paper.)

@dathudeptrai
Collaborator

1 vote for downsampling first :))) @OscarVanL

@OscarVanL
Contributor Author

I agree. I think downsampling first will avoid any confusion or mistakes.

@ZDisket
Collaborator

ZDisket commented Oct 17, 2020

@dathudeptrai I have two phoneme LJSpeeches, 22KHz and (upsampled) 24KHz with LibriTTS preprocessing settings like in kan-bayashi repo. But the phoneme IDs might differ

@OscarVanL
Contributor Author

OK, I have downsampled to 24000Hz, redone all of the mfa extraction, preprocessing, normalisation, and changed hop_size to 240. I am training the layers you suggested. I will update you with a new tensorboard tomorrow :) Thank you for all your comments.

@OscarVanL
Contributor Author

@dathudeptrai Here's my TensorBoard for that last attempt.
[TensorBoard screenshot]

@dathudeptrai
Collaborator

dathudeptrai commented Oct 19, 2020

The model overfits too much. In this case, I think you should pretrain your model on the LibriTTS dataset; then you won't need to retrain the embedding layers. It seems that your validation data contains many words/phonemes the model has not seen in the training data (you can check this), which is why the validation loss increases while the training loss decreases.
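
A rough way to check that claim, assuming you have the training and validation transcriptions as plain-text files (the paths here are hypothetical):

# Compare the word inventory of the training and validation splits.
def vocab(path):
    with open(path, encoding="utf-8") as f:
        return {token for line in f for token in line.strip().lower().split()}

train_vocab = vocab("dump/train_transcripts.txt")   # hypothetical paths
valid_vocab = vocab("dump/valid_transcripts.txt")

unseen = valid_vocab - train_vocab
print(f"{len(unseen)} of {len(valid_vocab)} validation tokens never appear in training")
print(sorted(unseen)[:50])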

@OscarVanL
Contributor Author

Yes, that would probably help. I will have a look at which British speaker corpora are available.

I see the "M-AILABS Queen's English corpus"; however, the Queen's English likely does not represent how normal British people actually speak 😆

@GavinStein1

Hi there, quick question about your speech inference.

Where do you pass in the speaker_id? I am going through the colab and fastspeech2 notebooks and I can't see reference to it anywhere...

@OscarVanL
Contributor Author

OscarVanL commented Nov 9, 2020

@GavinStein1
Inference looks like this:

mel_before, mel_outputs, duration_outputs, _, _ = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32)
)

See the speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32) part; that number is changed to the speaker you desire.

After processing your dataset (assuming you use LibriTTS), you will see libritts_mapper.json; inside it there is a mapping from the speaker's folder name to the speaker ID you need to pass into the inference.
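
For example, a quick lookup might look like this; the path and the "speakers_map" key are assumptions about the mapper file produced during preprocessing, so check your own dump:

import json

# Hypothetical path to the mapper written by tensorflow-tts-preprocess.
with open("dump/libritts_mapper.json") as f:
    mapper = json.load(f)

speakers_map = mapper.get("speakers_map", {})   # e.g. {"200": 0, "1355": 1, ...}
print(speakers_map)
speaker_id = speakers_map["200"]   # speaker folder name -> integer ID for inference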

@GavinStein1

I see what you are referring to; however, when processing LibriTTS, this libritts_mapper.json was not generated/edited. It is still the same as the file I cloned from this repo (i.e. only the same 20 speakers). I followed the LibriTTS setup and preprocessing steps. Am I missing something?

@OscarVanL
Contributor Author

OscarVanL commented Nov 10, 2020

Maybe you're looking in the wrong folder; the one in TensorFlowTTS/tensorflow_tts/processor/pretrained/ is for the pretrained models in this repo.

Instead, when you run the preprocessing stages, a new libritts_mapper.json file should appear in your dump folder (the one you set as tensorflow-tts-preprocess --outdir). This is specific to the dataset you're training with.

@GavinStein1

Now I feel embarrassed for not seeing that earlier... Thank you

@OscarVanL
Contributor Author

Happens to the best of us :)

@OscarVanL
Contributor Author

OscarVanL commented Nov 10, 2020

@OscarVanL It seems to me, looking at the layer naming, that the speaker embedding and speaker FC are the layers you need to retrain or fine-tune. Your current fine-tuning seems to involve many layers and parameters; it should be easy to overfit with that small amount of data.

@ronggong I tried tuning only these layers, but now the model does not converge. After 3 hours there is no improvement and the model still sounds American. (Red is the tuning job, grey is the model I am tuning from, at 110k steps.)
[TensorBoard screenshot]

I think I will retrain the LJSpeech model with a mixture of LibriTTS and some British speaker dataset (as machineko said), hopefully, this will help it generalise to my British speaker better.

@OscarVanL
Contributor Author

Add a few British-speaking people to the dataset :P

@machineko's suggestion to use more British speakers was a great one, because it led me to find the LibriVox accents table and British Readers on LibriVox. As LibriTTS is based on LibriVox, I found nearly all of these speakers within the train-clean-100, train-clean-360, and train-other-500 LibriTTS subsets and created a new British speakers dataset with 74 speakers and 17 hours of data 😁

Training on just these speakers did not give a great FS2 model; I think 17 hours may be too little, so I'll add in a 50:50 split of British and American LibriTTS speakers to match the good results I had with 34 hours of speech. Hopefully, this will make my speaker sound much better.

@OscarVanL
Contributor Author

I am really happy with the British models I am getting from the dataset; I feel like I am getting some models I am really satisfied with now! 🤟

Fine-tuning the vocoder definitely helped reduce the buzzing but didn't eliminate it; at this point, I am happy with the results. It took 3 days 9 hours to reach 1M steps though 😴

Now to fine-tune it all over again with my British speakers 😅

@vocajon

vocajon commented Nov 13, 2020

I am really happy with the British models I am getting from the dataset; I feel like I am getting some models I am really satisfied with now! 🤟

@OscarVanL Well done! Great work.

Are you in a position to share or publish your base British dataset, or trained model please?

@OscarVanL
Contributor Author

@vocajon I probably should not share my model because of academic integrity (this is for a university project), but I was just writing a blog post about compiling the British speaker corpus I used. It is based entirely on speakers taken from the LibriTTS dataset, so in theory it should be open source, but I must check this :) I will let you know once it's published 😄

@OscarVanL
Contributor Author

OscarVanL commented Nov 13, 2020

@vocajon

Here's the blog post.

I created a repo with my LibriTTS British dataset. I only used the libritts-english subset, because I did not need Welsh/Scottish/Irish accents.
https://github.com/OscarVanL/LibriTTS-British-Accents

Edit: It looks like GitHub LFS is unsuitable for this purpose as it imposes bandwidth limits. I will have to look for alternatives.

@vocajon

vocajon commented Nov 13, 2020

Great, thank you very much. Just curious: the purpose of your work is for patients before they lose their voices, right? What led you to exclude the Welsh/Scottish/Irish accents? Do you think it is unlikely any will appear as a patient in England? Or do you think it is better to maintain 4 models (if enough data can be found for the others) and tune using the closest model?

@OscarVanL
Contributor Author

At the moment it's only a proof-of-concept phase.
The LibriTTS dataset only has 2 Scottish, 2 Welsh, and 3 Irish speakers; with this little data, adding these speakers will probably not allow any model to effectively clone a voice in any of these accents. To support these accents, a larger dataset would be required, so there's no value in adding them to the dataset at present.

In regards to 1 vs 4 models, I suppose it would be a matter of experimenting to see what works. I have been training with a 50:50 split of English and American accents and it still allows me to clone a British voice well, so who knows, maybe a single mixed-accent model would work if there was enough of each accent in the training data.

@GavinStein1

Hi @OscarVanL,

Can I ask what model you used for pretraining? And did you end up training all layers or just some specific ones? And what processor did you end up using for inference after your PR?

@OscarVanL
Contributor Author

@GavinStein1
I trained starting with the pretrained fastspeech2.v1 model here.

I trained all layers.

I used this processor:

    processor = AutoProcessor.from_pretrained(
        pretrained_path="./tensorflow_tts/processor/pretrained/libritts_mapper.json"
    )

You could also use the libritts_mapper.json generated in the dataset dump, but for the processor it's only necessary to tell it which phoneme IDs to use. In this case, that is the same in both files.
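
For context, here is a minimal usage sketch of that processor; the sample text is arbitrary, and the exact text_to_sequence call may vary slightly between processors, so treat it as an illustration:

from tensorflow_tts.inference import AutoProcessor

processor = AutoProcessor.from_pretrained(
    pretrained_path="./tensorflow_tts/processor/pretrained/libritts_mapper.json"
)

# The processor maps input text to the phoneme IDs that FastSpeech2 expects.
input_ids = processor.text_to_sequence("The quick brown fox jumps over the lazy dog.")
print(input_ids)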

@GavinStein1

GavinStein1 commented Nov 17, 2020

So if the processor is used only for the phoneme IDs, how do you know which speaker ID is yours when it is mixed in with the LibriTTS speakers?

Also, how many speakers / hours of speech did you find worked best for you?

Edit: When I use that fastspeech2.v1 model as a pretrained model, I cannot load weights for the following layers due to a mismatch in the number of weights:

  • embeddings
  • decoder
  • f0_predictor
  • energy_predictor
  • duration_predictor

Did you get this issue? And did you just ignore it, as you wanted to retrain those layers anyway?

@OscarVanL
Contributor Author

OscarVanL commented Nov 17, 2020

Sorry if my reply was confusing; I meant the processor only uses the phoneme IDs from the mapper JSON (as the processor is used for mapping text to phoneme IDs).

You will of course need to also check your mapping for the correct speaker ID at inference :)

I used 120 speakers, with 17 hours of British speakers and 17 hours of American speakers.

My results were best when I picked the top 100 speakers by duration in the train-clean-360 subset. (I ended up using 120 speakers because the British speakers dataset had on average less speech)

I got these errors loading the layers too; I also asked about this problem but got no reply. I just ignored the error and the model trained fine, but maybe you should ask one of the maintainers about this.

@dathudeptrai
Collaborator

dathudeptrai commented Nov 17, 2020

you can load the weights like this:

model.load_weights(path, by_name=True, skip_mismatch=True). The f0/energy/duration predictors must be retrained.

@OscarVanL
Contributor Author

I'm closing this as all the help I received helped me train a good model! 😄

To give a tl;dr of this thread, I found the best way to tune with a small speaker dataset is to merge it with a larger multi-speaker dataset (I used LibriTTS) and pass in the speaker ID at inference for the speaker I wished to clone.

Some other tricks that helped:

  • Match the hop_size to the model you're fine-tuning, and the fft_size to the vocoder you're using.

  • Make sure your own data is in the right format, to match the dataset you are merging it with. In the case of LibriTTS, I used 24kHz, mono, 16-bit PCM, not exceeding 15 seconds in duration.

  • Dataset quality is king. This was the thing that gave the greatest improvements.

  • I had success with ~35h of speech from ~100 speakers, but that's not to say this is optimal.

  • Remove speakers that have too little speech (<15 mins). One idea is to select only the speakers with the most speech from train-clean-100, train-clean-360 and/or train-other-500.

  • Try to match your dataset to the type of voice you're cloning. For example, I was cloning a British speaker, but merged this with a dataset of Americans. This gave poor results. When I included my British speaker corpus this made a big improvement.

  • Don't bother changing which layers to train with var_train_expr. Other areas like focussing on the dataset were more fruitful.

@Megh-Thakkar

Hi, thanks for this extremely useful thread. I am very new to TTS and want to train on a small dataset as discussed above. I had a few clarifications (they might be very basic/naive).

  1. Start with the FastSpeech2 model trained on LJSpeech (v1, here: https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/fastspeech2#pretrained-models-and-audio-samples).
  2. The dataset consists of around 40-50 speakers from the LibriTTS dataset plus the speaker whose voice is to be cloned.
  3. Run MFA extraction and processing from the examples/mfa_extraction folder.
  4. Run the other preprocessing steps for LibriTTS.

I wanted to know how you combine FastSpeech2 with MB-MelGAN. What is the default vocoder used in the examples, and how do I change it?

Can you share a pipeline script if possible @OscarVanL?

Thanks a lot.

@CharlieBickerton

@OscarVanL thanks for this great thread!

You mentioned the aim of your project was to deploy onto low-end hardware; did you end up doing this? If so, what method did you use, and how many MB was the model in the end?
