How about your training time per epoch ? #14

Open
DHms2020 opened this issue Apr 21, 2022 · 1 comment

@DHms2020

I reproduced the data preprocessing and then trained the model with the electra-large-discriminator PLM and the msde local_and_nonlocal strategy.
I found it takes around 50 min per epoch on a Tesla V100 32G with the same hyper-parameters as in the paper.
Besides, I made some modifications to use DDP with 4 GPUs, but the time only dropped to around 40 min per epoch.
Is your training time similar?
I want to run some experiments with the LGESQL base model, but the time consumption is .....[SAD]
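For reference, a minimal sketch of the kind of DDP setup mentioned above, assuming a single node launched with torchrun --nproc_per_node=4; the linear model and the elided training loop are placeholders, not code from this repository:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK for each of the 4 worker processes
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # stand-in module; in practice this would be the LGESQL model
    model = torch.nn.Linear(16, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # ... build a DataLoader with a DistributedSampler and run the usual
    # training loop here; each process only sees its own shard of the data ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```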

@rhythmcao
Collaborator

Sorry for the late reply. I just checked my experiment logs. It takes roughly 1200 seconds (about 20 minutes) per epoch when training with a large-series PLM, using the running script run/run_train_lgesql_plm.sh. Unfortunately, it is a little slower in your experiment. However, 200 epochs are not a necessity; actually, 100 epochs are enough for comparable performance. We trained the large-series PLMs for more epochs just for more stable performance, according to this work.

If you just want to verify your ideas, why not experiment with Glove embeddings or a base-series PLM? That can be much faster. We did not spend much time on hyper-parameter tuning with large-series PLMs, and most ablation studies were conducted with Glove embeddings. As for how to set the grad_accumulate hyper-parameter (the actual mini-batch size for each forward pass is batch_size/grad_accumulate), you can run the script and check the CUDA memory usage on your GPU device.
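As a rough illustration of the batch_size/grad_accumulate relationship (a self-contained sketch with a toy linear model, not the repository's actual training loop):

```python
import torch
import torch.nn as nn

# Toy values: batch_size=32 with grad_accumulate=4 means each forward pass
# only holds 32 / 4 = 8 examples in GPU memory, but the optimizer step is
# still equivalent to a full batch of 32 because gradients accumulate.
batch_size, grad_accumulate = 32, 4
micro_batch = batch_size // grad_accumulate

model = nn.Linear(16, 1)                          # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
data, target = torch.randn(batch_size, 16), torch.randn(batch_size, 1)

optimizer.zero_grad()
for i in range(grad_accumulate):
    chunk = slice(i * micro_batch, (i + 1) * micro_batch)
    loss = nn.functional.mse_loss(model(data[chunk]), target[chunk])
    (loss / grad_accumulate).backward()           # scale so the summed gradient matches a full batch
optimizer.step()                                  # one parameter update per batch_size examples
optimizer.zero_grad()
```

Larger grad_accumulate trades extra forward/backward passes for lower peak memory, which is why checking CUDA memory usage is the practical way to pick it for your device.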

Attention: there is a mistake about the learning rate in the original paper. The learning rate should be 1e-4 for large-series PLMs and 2e-4 for base-series PLMs (not on the e-5 scale).
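For concreteness, only the exponent changes from -5 to -4, e.g. in a plain PyTorch optimizer (a hypothetical snippet, not the repository's configuration code):

```python
import torch

params = [torch.nn.Parameter(torch.zeros(8))]  # placeholder parameters

# Corrected values from the note above:
lr_large = 1e-4   # large-series PLMs, e.g. electra-large-discriminator
lr_base = 2e-4    # base-series PLMs

optimizer = torch.optim.AdamW(params, lr=lr_large)
```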
