
Model training process halted for small dataset #2029

Open
Shaw-184 opened this issue Jul 4, 2024 · 4 comments
Shaw-184 commented Jul 4, 2024

🐛 Describe the bug

I am using YOLO-NAS to train a model on a dataset with 10 images; the validation set has 1 image. The model trains for 7 epochs and then the process just halts: no errors are printed and no logs are written. I can't even kill the process.
But the same script works for larger datasets.
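
For reference, here is a stripped-down sketch of the kind of script I am running. It follows the usual YOLO-NAS fine-tuning pattern; the paths, class list, and parameter values below are placeholders, not my exact config:

```python
from super_gradients.training import Trainer, models
from super_gradients.training.dataloaders.dataloaders import (
    coco_detection_yolo_format_train,
    coco_detection_yolo_format_val,
)
from super_gradients.training.losses import PPYoloELoss
from super_gradients.training.metrics import DetectionMetrics_050
from super_gradients.training.models.detection_models.pp_yolo_e import PPYoloEPostPredictionCallback

CLASSES = ["object"]  # placeholder class list

trainer = Trainer(experiment_name="small_dataset_run", ckpt_root_dir="checkpoints")

# 10 training images and 1 validation image, YOLO-format annotations (placeholder paths).
train_loader = coco_detection_yolo_format_train(
    dataset_params={
        "data_dir": "dataset",
        "images_dir": "images/train",
        "labels_dir": "labels/train",
        "classes": CLASSES,
    },
    dataloader_params={"batch_size": 2, "num_workers": 8},  # placeholder values
)
valid_loader = coco_detection_yolo_format_val(
    dataset_params={
        "data_dir": "dataset",
        "images_dir": "images/val",
        "labels_dir": "labels/val",
        "classes": CLASSES,
    },
    dataloader_params={"batch_size": 2, "num_workers": 8},  # placeholder values
)

model = models.get("yolo_nas_s", num_classes=len(CLASSES), pretrained_weights="coco")

trainer.train(
    model=model,
    training_params={
        "max_epochs": 25,
        "lr_mode": "cosine",
        "initial_lr": 5e-4,
        "loss": PPYoloELoss(use_static_assigner=False, num_classes=len(CLASSES), reg_max=16),
        "valid_metrics_list": [
            DetectionMetrics_050(
                score_thres=0.1,
                top_k_predictions=300,
                num_cls=len(CLASSES),
                normalize_targets=True,
                post_prediction_callback=PPYoloEPostPredictionCallback(
                    score_threshold=0.01, nms_top_k=1000, max_predictions=300, nms_threshold=0.7
                ),
            )
        ],
        "metric_to_watch": "mAP@0.50",
    },
    train_loader=train_loader,
    valid_loader=valid_loader,
)
```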

Versions

super-gradients==3.7.1
torch==2.3.1
torchmetrics==0.8.0
torchvision==0.18.1

Shaw-184 (Author) commented Jul 5, 2024

@shaydeci @BloodAxe could you check this out? It's for some urgent work. Thank you.

BloodAxe (Collaborator) commented Jul 6, 2024

The more information you provide, the better. It would be great if you could share what batch size you are using and whether you are training on a single GPU or with DDP. Also, is it launched from the CLI or from Jupyter?

Shaw-184 (Author) commented Jul 8, 2024

Batch size is 2. Single-GPU training, running it from the CLI.

BloodAxe (Collaborator) commented Jul 9, 2024

PyTorch's DataLoader can misbehave when the dataloader length is smaller than num_workers. Any chance you have len(dataloader) < num_workers?
That is my only idea at the moment.
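
As a quick sanity check (the toy tensors below stand in for your images, not your actual dataset): with 10 images and batch_size=2 you only get 5 batches per epoch, so any num_workers above 5 already hits that situation.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for a 10-image dataset; batch_size=2 gives just 5 batches per epoch.
dataset = TensorDataset(torch.randn(10, 3, 64, 64))
loader = DataLoader(dataset, batch_size=2, num_workers=8)

print(f"batches per epoch: {len(loader)}, num_workers: {loader.num_workers}")
if len(loader) < loader.num_workers:
    print("len(dataloader) < num_workers -> lower num_workers for this dataset")
```

If that turns out to be the case, lowering num_workers for such a tiny dataset (for example, dataloader_params={"num_workers": 2} or even 0) would be the first thing I'd try.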
