Handling of large datasets and the cache parameter - strategies #5125
-
I am working with a large dataset of around 500,000 images, currently just exploring with an 80:20 training-validation split. I have access to a DGX-2 with fast storage.

- Is training speed linear, i.e. if I measure the time for 5 epochs, can I predict the training time for 300 epochs (see the sketch below)? For reference, COCO128 takes around 4 minutes with the default training parameters.
- If I create a new dataset that is pre-scaled to 640 px, does this improve training speed?
- What about the `--workers` parameter? I have read that a rule of thumb is about 80% of the available cores, which in my case would be 60 workers. How does this relate to the number of GPUs, e.g. one dataloader worker per GPU?
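To make the extrapolation question concrete, here is a rough sketch of what I mean (the dataset name and weights are placeholders):

```bash
# Time a short 5-epoch run on the full dataset (hypothetical my_dataset.yaml).
time python train.py --data my_dataset.yaml --weights yolov5s.pt --imgsz 640 --epochs 5
# If those 5 epochs take T minutes, a naive linear estimate for 300 epochs is
# T * 60 minutes (this ignores one-time costs such as dataset scanning/caching
# on the first epoch, so it is an upper-bound-ish guess rather than exact).
```

Thanks!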
-
@Nisse123 you can try caching to disk (`--cache disk`) so the images don't consume RAM:
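For example, a minimal sketch (the dataset name is a placeholder):

```bash
# Cache preprocessed images to local disk instead of holding them in RAM.
python train.py --data my_dataset.yaml --weights yolov5s.pt --imgsz 640 --cache disk
```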
You can experiment with additional workers, but 60 seems far too high to be useful. If you are using a DGX-2 you should also naturally be training Multi-GPU for the fastest results. See the Multi-GPU Training tutorial: YOLOv5 Tutorials
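A DDP launch on a DGX-2 might look roughly like the sketch below (GPU count, batch size, and dataset name are placeholders to adapt to your setup):

```bash
# DistributedDataParallel training across 8 of the DGX-2's 16 GPUs.
# --batch is the total batch size and is split evenly across the GPUs.
python -m torch.distributed.run --nproc_per_node 8 train.py \
    --data my_dataset.yaml --weights yolov5s.pt --imgsz 640 \
    --batch 128 --device 0,1,2,3,4,5,6,7 --cache disk --workers 8
```

Note that in DDP mode each GPU process builds its own dataloader, so `--workers` acts as a per-GPU setting rather than a global one, which is another reason a value like 60 is excessive.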