About multi-speaker datasets and tacotron2... #644

Hi!
I would like to ask a few questions about multi-speaker datasets. This thread gives good insights into what is needed for transfer learning with a short duration per speaker in a multi-speaker dataset using FastSpeech.
My questions are about Tacotron2.
Thanks!

Comments
@ZDisket do you have any ideas?
@dathudeptrai @samuel-lunii I haven't had much success training multispeaker Tacotron2: the attention mechanism becomes unable to learn alignment once the model is multispeaker, even on datasets that are big, clean, and train successfully on FastSpeech2, so I would suggest using FastSpeech2 for multispeaker instead. Maybe we could fix the issue of fixed durations in the outputs by replacing the regular duration predictor in FastSpeech2 with the stochastic duration predictor from VITS, which produces variation even for the same input. I'll try to answer every question.
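As a rough illustration of that idea, here is a minimal sketch of a sampled (rather than deterministic) duration predictor in TensorFlow. It is not the flow-based predictor from VITS, just a toy Gaussian version to show how sampling gives different durations for the same input; the class name, dimensions, and training scheme are all made up for the example.

```python
import tensorflow as tf

class ToyStochasticDurationPredictor(tf.keras.layers.Layer):
    """Toy stand-in for a stochastic duration predictor: predict a mean and
    log-variance of the log-duration per token and sample at inference, so
    the same text yields slightly different rhythm on each synthesis.
    (VITS uses a normalizing-flow predictor; this is only an illustration.)"""

    def __init__(self, hidden_dim=256, **kwargs):
        super().__init__(**kwargs)
        self.hidden = tf.keras.layers.Dense(hidden_dim, activation="relu")
        self.mean = tf.keras.layers.Dense(1)
        self.log_var = tf.keras.layers.Dense(1)

    def call(self, encoder_outputs, training=False):
        h = self.hidden(encoder_outputs)      # [batch, text_len, hidden_dim]
        mu = self.mean(h)                     # mean of log-duration
        log_var = self.log_var(h)
        if training:
            # Train with e.g. a Gaussian negative log-likelihood against
            # ground-truth log-durations instead of a plain MSE.
            return mu, log_var
        eps = tf.random.normal(tf.shape(mu))
        log_dur = mu + tf.exp(0.5 * log_var) * eps   # sample at inference
        return tf.maximum(tf.math.round(tf.exp(log_dur)), 1.0)
```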
I would say about 15 to 20 hours total.
In my limited experiments, the speakers that perform well have at least 30 minutes of data each.
You should include the speakers with little data in the big model from the start: when you try to fine-tune, unless the number of speakers is the same, the speaker embedding layer will be dropped due to the shape incompatibility, and that will mess up the new model.
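A hypothetical helper showing the usual workaround when you do fine-tune anyway: copy only the weights whose shapes still match, so the speaker embedding table (whose first dimension is the number of speakers) is skipped rather than breaking the load. Variable names and matching logic are only a sketch and will differ per model.

```python
import tensorflow as tf

def load_compatible_weights(new_model, pretrained_model):
    """Copy weights from a pretrained multi-speaker model into a new model,
    skipping any variable whose shape no longer matches, e.g. the speaker
    embedding table when the number of speakers has changed."""
    pretrained = {v.name: v for v in pretrained_model.weights}
    for var in new_model.weights:
        src = pretrained.get(var.name)
        if src is not None and src.shape == var.shape:
            var.assign(src)
        else:
            print(f"skipping {var.name} (missing or shape mismatch)")
```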
@ZDisket
So I guess your answers are about FastSpeech2? I actually want to use Tacotron2 for expressive speech synthesis. As mentioned in #628, I think it could be worth adapting the Tacotron2 model in a way similar to the prosody paper (also used in the GST paper), where the speaker embedding is broadcast to match the input text sequence length and then concatenated to the encoder output. I did get correct results doing this with keithito's repo. I will let you know if I have any success doing it with this one :) By the way, @dathudeptrai @ZDisket
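For reference, a minimal sketch of that broadcast-and-concatenate step, assuming TensorFlow tensors; the function name and shapes are only illustrative and not taken from this repo or keithito's.

```python
import tensorflow as tf

def concat_speaker_embedding(encoder_outputs, speaker_embedding):
    """Broadcast a per-utterance speaker embedding over the text length
    and concatenate it to the encoder outputs.

    encoder_outputs:   [batch, text_len, enc_dim]
    speaker_embedding: [batch, spk_dim]
    returns:           [batch, text_len, enc_dim + spk_dim]
    """
    spk = tf.expand_dims(speaker_embedding, 1)                 # [batch, 1, spk_dim]
    spk = tf.tile(spk, [1, tf.shape(encoder_outputs)[1], 1])   # repeat over time
    return tf.concat([encoder_outputs, spk], axis=-1)
```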
Add it before the encoder -> speaker-dependent encoder. You can imagine it as ResNet. You should use speaker embeddings for both the encoder and decoder phase :D
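A small sketch of that residual-style conditioning, assuming a Keras-style layer; the class name and projection sizes are invented for illustration.

```python
import tensorflow as tf

class SpeakerConditioning(tf.keras.layers.Layer):
    """Project the speaker embedding to each feature size and add it at every
    timestep, like a residual connection, for both encoder and decoder."""

    def __init__(self, enc_dim, dec_dim, **kwargs):
        super().__init__(**kwargs)
        self.enc_proj = tf.keras.layers.Dense(enc_dim)
        self.dec_proj = tf.keras.layers.Dense(dec_dim)

    def call(self, encoder_outputs, decoder_inputs, speaker_embedding):
        # [batch, dim] -> [batch, 1, dim], broadcast over the time axis on add.
        enc_bias = self.enc_proj(speaker_embedding)[:, tf.newaxis, :]
        dec_bias = self.dec_proj(speaker_embedding)[:, tf.newaxis, :]
        return encoder_outputs + enc_bias, decoder_inputs + dec_bias
```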
Define "expressive". FastSpeech2 is good at expressiveness, it's just the deterministic duration predictions that are the problem. I've had success adding an emotion embedding (a copy of the speaker embedding) for training with a multispeaker multi-emotion dataset that is categorized by those. |
@ZDisket
Someone I know used an automatic text sentiment detection model for labeling and used its output as the emotion embedding.
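Presumably something along these lines, where a text sentiment/emotion classifier replaces the learned emotion lookup; `sentiment_model` is a placeholder for whatever classifier is used, assumed to return one probability vector per utterance.

```python
import tensorflow as tf

def emotion_embedding_from_text(transcripts, sentiment_model, projection):
    """Run a sentiment/emotion classifier over the transcripts and project
    its output to the embedding size used for conditioning."""
    probs = sentiment_model(transcripts)   # e.g. [batch, num_emotions]
    return projection(probs)               # [batch, embed_dim]

# Projection from classifier outputs to the conditioning dimension.
projection = tf.keras.layers.Dense(256, use_bias=False)
```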
@ZDisket
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.