
Model training process halted for small dataset #2029

Open
Shaw-184 opened this issue Jul 4, 2024 · 4 comments
Shaw-184 commented Jul 4, 2024

🐛 Describe the bug

I am using YOLO-NAS to train a model on a dataset with 10 images; the validation set has 1 image. The model trains for 7 epochs and then the process just halts: no errors are printed and no logs are written. I can't even kill the process.
But the same script works for larger datasets.
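
For reference, here is a stripped-down sketch of the kind of script I am running. It follows the usual YOLO-NAS fine-tuning pattern; the paths, class list, and parameter values below are placeholders, not my exact config:

```python
from super_gradients.training import Trainer, models
from super_gradients.training.dataloaders.dataloaders import (
    coco_detection_yolo_format_train,
    coco_detection_yolo_format_val,
)
from super_gradients.training.losses import PPYoloELoss
from super_gradients.training.metrics import DetectionMetrics_050
from super_gradients.training.models.detection_models.pp_yolo_e import PPYoloEPostPredictionCallback

CLASSES = ["object"]  # placeholder class list

trainer = Trainer(experiment_name="small_dataset_run", ckpt_root_dir="checkpoints")

# 10 training images and 1 validation image, YOLO-format annotations (placeholder paths).
train_loader = coco_detection_yolo_format_train(
    dataset_params={
        "data_dir": "dataset",
        "images_dir": "images/train",
        "labels_dir": "labels/train",
        "classes": CLASSES,
    },
    dataloader_params={"batch_size": 2, "num_workers": 8},  # placeholder values
)
valid_loader = coco_detection_yolo_format_val(
    dataset_params={
        "data_dir": "dataset",
        "images_dir": "images/val",
        "labels_dir": "labels/val",
        "classes": CLASSES,
    },
    dataloader_params={"batch_size": 2, "num_workers": 8},  # placeholder values
)

model = models.get("yolo_nas_s", num_classes=len(CLASSES), pretrained_weights="coco")

trainer.train(
    model=model,
    training_params={
        "max_epochs": 25,
        "lr_mode": "cosine",
        "initial_lr": 5e-4,
        "loss": PPYoloELoss(use_static_assigner=False, num_classes=len(CLASSES), reg_max=16),
        "valid_metrics_list": [
            DetectionMetrics_050(
                score_thres=0.1,
                top_k_predictions=300,
                num_cls=len(CLASSES),
                normalize_targets=True,
                post_prediction_callback=PPYoloEPostPredictionCallback(
                    score_threshold=0.01, nms_top_k=1000, max_predictions=300, nms_threshold=0.7
                ),
            )
        ],
        "metric_to_watch": "mAP@0.50",
    },
    train_loader=train_loader,
    valid_loader=valid_loader,
)
```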

Versions

super-gradients==3.7.1
torch==2.3.1
torchmetrics==0.8.0
torchvision==0.18.1

Shaw-184 (Author) commented Jul 5, 2024

@shaydeci @BloodAxe could you check this out? It's for some urgent work. Thank you.

BloodAxe (Collaborator) commented Jul 6, 2024

The more information you provide, the better. It would be great if you could share what batch size you are using and whether you are training on a single GPU or with DDP. Also, is it launched from the CLI or from Jupyter?

Shaw-184 (Author) commented Jul 8, 2024

Batch size is 2. Single-GPU training, running it from the CLI.

BloodAxe (Collaborator) commented Jul 9, 2024

PyTorch's DataLoader can misbehave when the dataloader length is smaller than num_workers. Any chance you have len(dataloader) < num_workers?
That is my only idea at the moment.
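
As a quick sanity check (the toy tensors below stand in for your images, not your actual dataset): with 10 images and batch_size=2 you only get 5 batches per epoch, so any num_workers above 5 already hits that situation.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for a 10-image dataset; batch_size=2 gives just 5 batches per epoch.
dataset = TensorDataset(torch.randn(10, 3, 64, 64))
loader = DataLoader(dataset, batch_size=2, num_workers=8)

print(f"batches per epoch: {len(loader)}, num_workers: {loader.num_workers}")
if len(loader) < loader.num_workers:
    print("len(dataloader) < num_workers -> lower num_workers for this dataset")
```

If that turns out to be the case, lowering num_workers for such a tiny dataset (for example, dataloader_params={"num_workers": 2} or even 0) would be the first thing I'd try.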
