Unable to continue training from checkpoint. #138
Looks like we are indeed only saving the weights. Not sure if that means we cannot continue training or if there is a workaround. @weiji14 and @srmsoumya? (Line 50 in b8aa8cd)
Yeah, we did not save the AdamW optimizer state, so it won't be possible to resume training from that checkpoint with AdamW or any other adaptive optimization algorithm. It might be possible to resume with a non-adaptive optimizer such as Stochastic Gradient Descent, but that would require a lot of manual handling of the checkpoint loading, so it's not a straightforward workaround. That said, the original objective seems to be finetuning the checkpoint on a specific region, rather than resuming the self-supervised training. The entrypoint shouldn't be trainer.py, but a separate finetuning script (which could technically still reuse elements of the MAE training loop).
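The distinction matters because AdamW keeps per-parameter moment estimates that are part of the optimization trajectory. As a hedged sketch (toy model and temp-file path are illustrative, not the actual Clay trainer code), saving the full state in plain PyTorch would look roughly like:

```python
import os
import tempfile

import torch
from torch import nn

# Toy stand-in for the model; the real checkpoint comes from the MAE trainer.
model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Take one step so AdamW accumulates its per-parameter state (exp_avg etc.).
optimizer.zero_grad()
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()

# Save the weights AND the optimizer state together, not just model.state_dict().
path = os.path.join(tempfile.gettempdir(), "clay_ckpt_demo.pt")
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    path,
)

# Resuming: restore both. A weights-only checkpoint would silently reset
# AdamW's moment estimates to zero, changing the effective updates.
checkpoint = torch.load(path)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
```

With this, resuming reproduces the optimizer's moment estimates exactly instead of restarting them from zero.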
Main use case is to resume training if halted (e.g. we were using Spot instances), but I can see use cases where a regional user might want to continue training with regional data. If we choose not to save the optimizers, we should document how to resume training with newly initialized optimizers.
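For the weights-only checkpoints that exist today, that documentation could sketch something like the following (toy model and temp path are hypothetical; plain SGD with momentum=0 keeps no running statistics, so re-initializing it loses nothing, unlike AdamW):

```python
import os
import tempfile

import torch
from torch import nn

# Toy stand-in for the pretrained model; the path is illustrative only.
path = os.path.join(tempfile.gettempdir(), "clay_weights_only_demo.pt")
pretrained = nn.Linear(4, 2)
torch.save(pretrained.state_dict(), path)  # weights only, no optimizer state

# Later session: restore the weights and attach a freshly initialized
# non-adaptive optimizer. SGD without momentum carries no per-parameter
# state, so resuming this way is well-defined.
model = nn.Linear(4, 2)
model.load_state_dict(torch.load(path))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.0)

# Continue training as usual from the restored weights.
optimizer.zero_grad()
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()
```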
I agree we should have a way to resume training for the checkpoints we save (or at least the last one), if that is technically possible and won't slow down training too much. |
We have addressed this for v0.2, and will for v1 as well, by storing the optimizer state during training. So I am closing this, but feel free to reopen if this issue persists in future versions of the model.
Not saving the optimizer remains the default. (Line 51 in 50094ba)
Addressed in PR #193 |
I am trying to run some more training loops for a specific region, using this notebook.
I was not happy with the clustering:
So I wanted to run a few epochs only on my target area.
When I do so, with
I get this error: