sample packing with resume_from_checkpoint #406
Comments
I think we can convert the MultipackDataloader to an IterableDataset wrapper around another dataset. Not sure how best to handle the sampler, though.
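The wrapper idea above could look something like the following. This is a minimal plain-Python sketch, not axolotl's actual implementation: `PackedIterableDataset` and its greedy packing rule are hypothetical names and logic, and the class only mirrors the torch `IterableDataset` contract (an object whose `__iter__` yields batches) without importing torch.

```python
# Hypothetical sketch: wrap a map-style dataset in an iterable that
# yields length-packed batches, mirroring torch's IterableDataset
# contract (an object whose __iter__ yields examples/batches).

class PackedIterableDataset:
    """Greedily packs variable-length samples into batches whose total
    token count stays under max_seq_len, yielding one pack at a time."""

    def __init__(self, dataset, max_seq_len):
        self.dataset = dataset          # any iterable of token lists
        self.max_seq_len = max_seq_len  # token budget per packed batch

    def __iter__(self):
        pack, pack_len = [], 0
        for sample in self.dataset:
            if pack and pack_len + len(sample) > self.max_seq_len:
                yield pack
                pack, pack_len = [], 0
            pack.append(sample)
            pack_len += len(sample)
        if pack:  # flush the final partial pack
            yield pack


# Usage: samples of lengths 3, 4, and 5 packed under a budget of 8 tokens.
samples = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11, 12]]
packs = list(PackedIterableDataset(samples, max_seq_len=8))
```

Because an iterable-style dataset produces batches itself, there is no separate sampler object for accelerate to rewrap, which is exactly why the sampler question above is the open part of this idea.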
Hit this issue as well, seeking a patch... SOS
my backtrace
This issue has been pending for weeks.
I met the same problem. Have you solved it?
Any updates on this? Having this problem on a single-GPU run.
This one bit me earlier. As a stopgap, maybe warn the user they won't be able to resume when using sample packing?
Facing a similar issue when trying to resume_from_checkpoint. Any updates on this?
This also impacted me, so I can't use sample packing, because I need to be able to resume from checkpoint.
One quick and dirty way to solve this may be to save the current epoch and index of the internal_batch_generator in the MultipackDataloader every X steps. You can quickly resume from this with some custom logic by looping over the batches until you get to the right epoch and index.
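The skip-ahead resume described above can be sketched roughly as follows. This is a hypothetical illustration, not code from axolotl: `batches_from` and `dataloader_factory` are invented names, and the approach only lines up if the factory rebuilds each epoch's batches deterministically (same seed and same packing order).

```python
# Hypothetical sketch of the "skip ahead" resume idea: periodically
# record (epoch, batch_index), then on resume iterate the batch
# generator and discard batches until that position is reached.

def batches_from(dataloader_factory, start_epoch, start_index, num_epochs):
    """Yield (epoch, index, batch), fast-forwarding past batches that
    were already trained before the checkpoint was written.
    dataloader_factory(epoch) must rebuild that epoch's batches
    deterministically for the skip to land on the right data."""
    for epoch in range(start_epoch, num_epochs):
        for index, batch in enumerate(dataloader_factory(epoch)):
            if epoch == start_epoch and index < start_index:
                continue  # already trained before the checkpoint
            yield epoch, index, batch


# Usage: resume at epoch 1, batch 2, with a toy 4-batch-per-epoch loader.
def toy_loader(epoch):
    return [f"e{epoch}b{i}" for i in range(4)]

resumed = list(batches_from(toy_loader, start_epoch=1, start_index=2, num_epochs=2))
```

The obvious cost is that resuming still iterates (and re-packs) every skipped batch, which is why this is a stopgap rather than a real fix.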
@ehartford @winglian Unless I am missing something, here is a more thorough idea of how to implement this. This is quite a large task.
Upon a crash/OOM/whatever failure, you can then resume the training by using the
I have offered a $1,000 bounty to the person who fixes this bug. |
Please check that this issue hasn't been reported before.
Expected Behavior
Resume should step over the number of batches already trained and continue from the same point in the data.
Current behaviour
Fails to load because the MultipackDataloader does not have a batch sampler. Setting the batch sampler to be the same as the sampler results in an issue in accelerate.data_loader, which reloads the dataset into a new dataloader in an attempt to skip over the batches.
Steps to reproduce
see above
Possible solution
Need to figure out a way to refactor the current dataloader, either as a better sampler or by extending the real torch dataloader.
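The "better sampler" route could be sketched as a batch sampler that exposes its position as checkpointable state, so resume restores the sampler directly instead of replaying batches. This is a hypothetical, torch-free illustration: `ResumableBatchSampler` and its `state_dict`/`load_state_dict` methods are invented here (named after the familiar torch convention), not part of axolotl or accelerate.

```python
# Hypothetical sketch: a batch sampler that tracks how many batches it
# has served, so training can checkpoint and restore its position.

class ResumableBatchSampler:
    """Yields fixed-size index batches and records its position."""

    def __init__(self, dataset_len, batch_size):
        self.batches = [
            list(range(i, min(i + batch_size, dataset_len)))
            for i in range(0, dataset_len, batch_size)
        ]
        self.batches_served = 0  # position to checkpoint

    def __iter__(self):
        # resume mid-epoch from the recorded position, then reset
        for i in range(self.batches_served, len(self.batches)):
            self.batches_served = i + 1
            yield self.batches[i]
        self.batches_served = 0

    def state_dict(self):
        return {"batches_served": self.batches_served}

    def load_state_dict(self, state):
        self.batches_served = state["batches_served"]


# Usage: serve two batches, "checkpoint", then resume on a fresh sampler.
sampler = ResumableBatchSampler(dataset_len=10, batch_size=3)
it = iter(sampler)
first_two = [next(it), next(it)]
state = sampler.state_dict()

resumed = ResumableBatchSampler(dataset_len=10, batch_size=3)
resumed.load_state_dict(state)
rest = list(resumed)
```

For sample packing the batches are built dynamically rather than from fixed index ranges, so the real version would also need to checkpoint whatever state makes the packing deterministic.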
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main
Acknowledgements