About multi-speaker datasets and tacotron2... #644

Hi!
I would like to ask a few questions about multi-speaker datasets. This thread gives good insights into what is needed for transfer learning with a short duration per speaker in a multi-speaker dataset using FastSpeech.
My questions are about Tacotron2.
Thanks!

Comments
@ZDisket do you have any ideas?
@dathudeptrai @samuel-lunii I haven't had much success training multispeaker Tacotron2: the attention mechanism becomes unable to learn alignment once the model is multispeaker, even on datasets that are big, clean, and train successfully on FastSpeech2, so I would suggest using FastSpeech2 for multispeaker instead. Maybe we could fix the issue of fixed durations in the outputs by replacing the regular duration predictor in FastSpeech2 with the stochastic duration predictor from VITS, which produces variation even for the same input. I'll try to answer every question.
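As a rough illustration of that idea, here is a minimal sketch of a sampled (rather than deterministic) duration predictor in TensorFlow. It is not the flow-based predictor from VITS, just a toy Gaussian version to show how sampling gives different durations for the same input; the class name, dimensions, and training scheme are all made up for the example.

```python
import tensorflow as tf

class ToyStochasticDurationPredictor(tf.keras.layers.Layer):
    """Toy stand-in for a stochastic duration predictor: predict a mean and
    log-variance of the log-duration per token and sample at inference, so
    the same text yields slightly different rhythm on each synthesis.
    (VITS uses a normalizing-flow predictor; this is only an illustration.)"""

    def __init__(self, hidden_dim=256, **kwargs):
        super().__init__(**kwargs)
        self.hidden = tf.keras.layers.Dense(hidden_dim, activation="relu")
        self.mean = tf.keras.layers.Dense(1)
        self.log_var = tf.keras.layers.Dense(1)

    def call(self, encoder_outputs, training=False):
        h = self.hidden(encoder_outputs)      # [batch, text_len, hidden_dim]
        mu = self.mean(h)                     # mean of log-duration
        log_var = self.log_var(h)
        if training:
            # Train with e.g. a Gaussian negative log-likelihood against
            # ground-truth log-durations instead of a plain MSE.
            return mu, log_var
        eps = tf.random.normal(tf.shape(mu))
        log_dur = mu + tf.exp(0.5 * log_var) * eps   # sample at inference
        return tf.maximum(tf.math.round(tf.exp(log_dur)), 1.0)
```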
I would say about 15 to 20 hours total.
In my limited experiments, the speakers that perform well have at least 30 minutes of data each.
You should include the speakers with little data in the big model from the start: when you try to fine-tune, unless the number of speakers is the same, the speaker embedding layer will be dropped due to the shape incompatibility, and that will mess up the new model.
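A hypothetical helper showing the usual workaround when you do fine-tune anyway: copy only the weights whose shapes still match, so the speaker embedding table (whose first dimension is the number of speakers) is skipped rather than breaking the load. Variable names and matching logic are only a sketch and will differ per model.

```python
import tensorflow as tf

def load_compatible_weights(new_model, pretrained_model):
    """Copy weights from a pretrained multi-speaker model into a new model,
    skipping any variable whose shape no longer matches, e.g. the speaker
    embedding table when the number of speakers has changed."""
    pretrained = {v.name: v for v in pretrained_model.weights}
    for var in new_model.weights:
        src = pretrained.get(var.name)
        if src is not None and src.shape == var.shape:
            var.assign(src)
        else:
            print(f"skipping {var.name} (missing or shape mismatch)")
```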
@ZDisket
So I guess your answers are about FastSpeech2? I actually want to use Tacotron2 for expressive speech synthesis. As mentioned in #628, I think it could be worth adapting the Tacotron2 model in a way similar to the prosody paper (also used in the GST paper), where the speaker embedding is broadcast to match the input text sequence length and then concatenated to the encoder output. I did get correct results doing this with keithito's repo. I will let you know if I have any success doing it with this one :) By the way, @dathudeptrai @ZDisket
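For reference, a minimal sketch of that broadcast-and-concatenate step, assuming TensorFlow tensors; the function name and shapes are only illustrative and not taken from this repo or keithito's.

```python
import tensorflow as tf

def concat_speaker_embedding(encoder_outputs, speaker_embedding):
    """Broadcast a per-utterance speaker embedding over the text length
    and concatenate it to the encoder outputs.

    encoder_outputs:   [batch, text_len, enc_dim]
    speaker_embedding: [batch, spk_dim]
    returns:           [batch, text_len, enc_dim + spk_dim]
    """
    spk = tf.expand_dims(speaker_embedding, 1)                 # [batch, 1, spk_dim]
    spk = tf.tile(spk, [1, tf.shape(encoder_outputs)[1], 1])   # repeat over time
    return tf.concat([encoder_outputs, spk], axis=-1)
```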
Add it before the encoder -> speaker-dependent encoder. You can imagine it as ResNet. You should use speaker embeddings for both the encoder and decoder phase :D
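A small sketch of that residual-style conditioning, assuming a Keras-style layer; the class name and projection sizes are invented for illustration.

```python
import tensorflow as tf

class SpeakerConditioning(tf.keras.layers.Layer):
    """Project the speaker embedding to each feature size and add it at every
    timestep, like a residual connection, for both encoder and decoder."""

    def __init__(self, enc_dim, dec_dim, **kwargs):
        super().__init__(**kwargs)
        self.enc_proj = tf.keras.layers.Dense(enc_dim)
        self.dec_proj = tf.keras.layers.Dense(dec_dim)

    def call(self, encoder_outputs, decoder_inputs, speaker_embedding):
        # [batch, dim] -> [batch, 1, dim], broadcast over the time axis on add.
        enc_bias = self.enc_proj(speaker_embedding)[:, tf.newaxis, :]
        dec_bias = self.dec_proj(speaker_embedding)[:, tf.newaxis, :]
        return encoder_outputs + enc_bias, decoder_inputs + dec_bias
```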
Define "expressive". FastSpeech2 is good at expressiveness, it's just the deterministic duration predictions that are the problem. I've had success adding an emotion embedding (a copy of the speaker embedding) for training with a multispeaker multi-emotion dataset that is categorized by those. |
@ZDisket
Someone I know used an automatic text sentiment detection model for labeling and used its output as the emotion embedding.
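Presumably something along these lines, where a text sentiment/emotion classifier replaces the learned emotion lookup; `sentiment_model` is a placeholder for whatever classifier is used, assumed to return one probability vector per utterance.

```python
import tensorflow as tf

def emotion_embedding_from_text(transcripts, sentiment_model, projection):
    """Run a sentiment/emotion classifier over the transcripts and project
    its output to the embedding size used for conditioning."""
    probs = sentiment_model(transcripts)   # e.g. [batch, num_emotions]
    return projection(probs)               # [batch, embed_dim]

# Projection from classifier outputs to the conditioning dimension.
projection = tf.keras.layers.Dense(256, use_bias=False)
```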
@ZDisket
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.