First of all, great repo and great discussions!
As many people have reported here, the overall loss curve when training SoundStream flattens quite early (after 1-2k steps), and I am seeing the same behavior on the MagnaTagATune dataset.
After training for 20k steps (batch size = 64, gradient accumulation = 8, 2-second clips of 24 kHz audio), the results are very similar to what they were at step 2000, judging by my own listening tests and several objective audio metrics from torchmetrics.
I am very curious whether anyone has managed to train something that sounds alright, even after 1M steps. I've seen some discussions here about a missing loss balancer (among other proposed fixes), and I just hope that the implementation checks out.
Don't get me wrong, this is very, very good work. I am just curious whether there are things in the original SoundStream paper that haven't been resolved in the implementation here.
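
To make the step-2000 vs. step-20k comparison easier to reproduce, here is a minimal sketch of one objective metric, scale-invariant SDR, in plain NumPy. The choice of SI-SDR is an assumption for illustration, not necessarily one of the metrics used above. The function name `si_sdr` and the `recon_step_*` arrays are hypothetical: the arrays are synthetic stand-ins for reconstructions from two checkpoints, not real model outputs.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals of equal length."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the DC offset so constant shifts do not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))


# Synthetic stand-ins (assumption): a 2-second, 24 kHz tone plus noise
# plays the role of the decoded audio from each checkpoint.
rng = np.random.default_rng(0)
t = np.arange(48000) / 24000.0
reference = np.sin(2 * np.pi * 440 * t)
recon_step_2k = reference + 0.1 * rng.standard_normal(t.shape)
recon_step_20k = reference + 0.1 * rng.standard_normal(t.shape)

# A gain close to 0 dB means the later checkpoint is not measurably better.
gain = si_sdr(recon_step_20k, reference) - si_sdr(recon_step_2k, reference)
print(f"SI-SDR gain from step 2k to step 20k: {gain:.2f} dB")
```

Averaging this gain over a fixed held-out set at each checkpoint would show whether reconstruction quality has actually plateaued or only the training loss has.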