axolotl hanging during training on custom dataset (ran for 30 minutes before timing out) #592
Comments
After some discussion in TheBloke's Discord, I was told that training only gets stuck on GPUs without NVLink, which makes sense so far, as none of the GPUs available to me have NVLink. I would love it if axolotl could resolve this hanging issue, as it is otherwise great with all the integrations!
Hey Casper, I have a few ideas on how to resolve this for you. I'm unavailable for a few hours, but if you can hop on our Discord server, it's probably easier to help you out there.
The issue is resolved once #463 is merged. Training now works and continues past epoch 1.0. Here is the
Please check that this issue hasn't been reported before.
Expected Behavior
For the training to progress at a normal speed without getting stuck.
Current behaviour
Steps to reproduce
I don't know how you can reproduce this precisely since the dataset is private. Previously, #494 was reported to be the same issue, and #531 was introduced to solve it.
However, I can now see that it has not been solved. I ran into the same error after restarting from scratch, so this issue persists and was not random. This is in a multi-GPU setting (4x RTX 4090 in this case).
I tried setting sample_packing to false and also tried setting pad_to_sequence_len to false. This did not resolve the issue. It seems that it happens right as epoch 1.0 is about to hit when micro_batch_size is 1, and if I increase the micro_batch_size, it just seems to happen earlier than epoch 1.0.
These are my settings:
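As a sketch of only the overrides mentioned above (not the full config), assuming the usual axolotl YAML config layout and using just the key names from the text:

```yaml
# Overrides tried while debugging the hang (sketch; key names taken from the description above)
sample_packing: false        # tried disabling; hang persisted
pad_to_sequence_len: false   # tried disabling; hang persisted
micro_batch_size: 1          # hang appears right around epoch 1.0 at this value
```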
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main
Acknowledgements