
About multi-speaker datasets and tacotron2... #644

Closed
samuel-lunii opened this issue Aug 11, 2021 · 9 comments
Assignees: dathudeptrai
Labels: question ❓ Further information is requested, wontfix

Comments

@samuel-lunii
Contributor

Hi !
I would like to ask a few questions about multi-speaker datasets. This thread gives good insights into what is needed for transfer learning with short durations per speaker in a multi-speaker dataset using FastSpeech.

My questions are about Tacotron2.

  • What is the minimal total duration needed to train a good multi-speaker model from scratch?
  • What is the minimal duration needed for each speaker in the dataset?
  • Would it be good practice to use large durations (e.g. 1 or 2 speakers with duration > 20 hours) in combination with shorter ones (e.g. several speakers with duration ~ 1 hour) to train a single multi-speaker model? Or is it better to perform transfer learning with short durations from a model trained on large durations?

Thanks !

@dathudeptrai dathudeptrai self-assigned this Aug 12, 2021
@dathudeptrai dathudeptrai added the question ❓ Further information is requested label Aug 12, 2021
@dathudeptrai
Collaborator

@ZDisket do you have any ideas?

@ZDisket
Collaborator

ZDisket commented Aug 12, 2021

@dathudeptrai @samuel-lunii I haven't had much success training multispeaker Tacotron2: the attention mechanism becomes unable to learn alignment in the multispeaker setting, even on datasets that are big and clean and train successfully on FastSpeech2, so I would suggest using FastSpeech2 for multispeaker instead. Maybe we could fix the problem of fixed output durations by replacing the regular duration predictor in FastSpeech2 with the stochastic duration predictor from VITS, which produces variation even for the same input. I'll try to answer every question.

What is the minimal total duration needed to train a good multi-speaker model from scratch?

I would say about 15 to 20 hours total

What is the minimal duration needed for each speaker in the dataset?

In my limited experiments, those that perform well have at least 30 minutes

Would it be good practice to use large durations (e.g. 1 or 2 speakers with duration > 20 hours) in combination with shorter ones (e.g. several speakers with duration ~ 1 hour) to train a single multi-speaker model? Or is it better to perform transfer learning with short durations from a model trained on large durations?

You should include the speakers with little data in the big model from the start: when you try to fine-tune, unless the new dataset has the same number of speakers, the speaker embedding layer will be dropped due to incompatibility, and that will mess up the new model.
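
On the stochastic-duration-predictor idea above, here is a minimal sketch of the difference, assuming plain Keras layers. The class names are illustrative, and the stochastic version is a Gaussian-sampling toy rather than the flow-based predictor VITS actually uses; it only shows why repeated inputs can yield varied durations while the FastSpeech2-style predictor cannot.

```python
import tensorflow as tf

class DeterministicDurationPredictor(tf.keras.layers.Layer):
    """FastSpeech2-style: same input -> same predicted durations every call."""

    def __init__(self):
        super().__init__()
        self.proj = tf.keras.layers.Dense(1)

    def call(self, encoder_outputs):            # [batch, time, hidden]
        return self.proj(encoder_outputs)       # [batch, time, 1]


class ToyStochasticDurationPredictor(tf.keras.layers.Layer):
    """Toy stand-in: sample durations, so repeated inputs give varied outputs."""

    def __init__(self):
        super().__init__()
        self.mean = tf.keras.layers.Dense(1)
        self.log_std = tf.keras.layers.Dense(1)

    def call(self, encoder_outputs):
        mu = self.mean(encoder_outputs)
        log_std = self.log_std(encoder_outputs)
        eps = tf.random.normal(tf.shape(mu))
        # Same input -> a different sample (hence different durations) each call.
        return mu + tf.exp(log_std) * eps
```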

@samuel-lunii
Contributor Author

samuel-lunii commented Aug 12, 2021

@ZDisket
Thanks for your answers.

I haven't had much success training multispeaker Tacotron2: the attention mechanism becomes unable to learn alignment in the multispeaker setting, even on datasets that are big and clean and train successfully on FastSpeech2, so I would suggest using FastSpeech2 for multispeaker instead.

So I guess your answers are about FastSpeech2? I actually want to use Tacotron2 for expressive speech synthesis.

As mentioned in #628, I think it could be worth adapting the Tacotron2 model in a way similar to the prosody paper (also used in the GST paper), where the speaker embedding is broadcast to match the input text sequence length and then concatenated to the encoder output. I did get decent results doing this with keithito's repo. I will let you know if I meet any success by doing it with this one :)
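
For reference, a minimal sketch of that broadcast-and-concatenate conditioning, assuming a plain Keras embedding table; the layer and variable names are illustrative, not code from this repo or keithito's:

```python
import tensorflow as tf

class SpeakerConcatConditioning(tf.keras.layers.Layer):
    """Broadcast a speaker embedding over time and concatenate it to the encoder output."""

    def __init__(self, n_speakers, speaker_dim):
        super().__init__()
        self.speaker_embed = tf.keras.layers.Embedding(n_speakers, speaker_dim)

    def call(self, encoder_outputs, speaker_ids):
        # encoder_outputs: [batch, text_len, enc_dim], speaker_ids: [batch]
        spk = self.speaker_embed(speaker_ids)                    # [batch, speaker_dim]
        text_len = tf.shape(encoder_outputs)[1]
        spk = tf.tile(spk[:, tf.newaxis, :], [1, text_len, 1])   # broadcast over time
        return tf.concat([encoder_outputs, spk], axis=-1)        # [batch, text_len, enc_dim + speaker_dim]
```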

By the way, @dathudeptrai @ZDisket
I see that the speaker embeddings are used before and after the encoder. Is there a particular reason to do so? Have you tried using one or the other independently?

@dathudeptrai
Collaborator

dathudeptrai commented Aug 12, 2021

@samuel-lunii

I see that the speaker embeddings are used before and after the encoder. Is there a particular reason to do so? Have you tried using one or the other independently?

Add it before the encoder → speaker-dependent encoder
Add it after the encoder → speaker-dependent decoder

You can think of it like a ResNet skip connection. You should use the speaker embedding in both the encoder and the decoder phase :D
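
A rough sketch of that "add it in both places" idea, assuming simple additive conditioning; the class name and the encoder interface are placeholders, not the model code in this repo:

```python
import tensorflow as tf

class SpeakerConditionedEncoder(tf.keras.layers.Layer):
    def __init__(self, encoder, n_speakers, hidden_dim):
        super().__init__()
        # `encoder` is any layer mapping [batch, text_len, hidden_dim] -> same shape.
        self.encoder = encoder
        self.speaker_embed = tf.keras.layers.Embedding(n_speakers, hidden_dim)

    def call(self, char_embeddings, speaker_ids):
        # The same speaker vector is injected twice, like a skip connection.
        spk = self.speaker_embed(speaker_ids)[:, tf.newaxis, :]   # [batch, 1, hidden_dim]
        enc_in = char_embeddings + spk    # "add before" -> speaker-dependent encoder
        enc_out = self.encoder(enc_in)
        dec_in = enc_out + spk            # "add after"  -> speaker-dependent decoder input
        return dec_in
```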

@ZDisket
Collaborator

ZDisket commented Aug 13, 2021

@samuel-lunii

I actually want to use Tacotron2 for expressive speech synthesis.

Define "expressive". FastSpeech2 is good at expressiveness, it's just the deterministic duration predictions that are the problem. I've had success adding an emotion embedding (a copy of the speaker embedding) for training with a multispeaker multi-emotion dataset that is categorized by those.

@samuel-lunii
Contributor Author

samuel-lunii commented Aug 13, 2021

@ZDisket
Ok, good to know, thanks :) I will definitely have a go at multi-speaker FastSpeech2.
However, I would like to use a "GST-like" architecture in order to avoid emotion labelling...

@ZDisket
Collaborator

ZDisket commented Aug 16, 2021

@samuel-lunii

to avoid emotion labelling...

Someone I know used an automatic text-sentiment detection model and fed its output in as the emotion embedding.
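
A hedged sketch of that idea, assuming some external text-sentiment classifier that returns per-utterance class probabilities; the classifier and the projection size are placeholders, not part of this repo:

```python
import tensorflow as tf

# Project soft sentiment scores into the same space as an emotion embedding,
# so no manual emotion labels are needed. `sentiment_probs` would come from any
# off-the-shelf sentiment model run on the utterance text.
sentiment_projection = tf.keras.layers.Dense(256, use_bias=False)

def emotion_embedding_from_sentiment(sentiment_probs):
    # sentiment_probs: [batch, n_sentiment_classes]
    return sentiment_projection(sentiment_probs)   # [batch, 256]
```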

@samuel-lunii
Contributor Author

@ZDisket
Yes, I was planning on doing something similar too :) I will follow your advice and try all this with FS2.

@stale

stale bot commented Oct 17, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the wontfix label Oct 17, 2021
@stale stale bot closed this as completed Oct 24, 2021