Handling of large datasets and the cache parameter - strategies #5125
-
I am working with a large dataset of around 500,000 images, currently just exploring with an 80:20 training-validation split. I have access to a DGX-2 with fast storage.

- Is training speed linear, i.e. if I measure the time for 5 epochs, can I predict the training time for 300 epochs (see the sketch below)? For reference, COCO128 takes around 4 minutes with the default training parameters.
- If I create a new dataset that is pre-scaled to 640 px, does this improve training speed?
- What about the `--workers` parameter? I have read that a rule of thumb is about 80% of the available cores, which in my case would be 60 workers. How does this relate to the number of GPUs, e.g. one dataloader worker per GPU?
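To make the extrapolation question concrete, here is a rough sketch of what I mean (the dataset name and weights are placeholders):

```bash
# Time a short 5-epoch run on the full dataset (hypothetical my_dataset.yaml).
time python train.py --data my_dataset.yaml --weights yolov5s.pt --imgsz 640 --epochs 5
# If those 5 epochs take T minutes, a naive linear estimate for 300 epochs is
# T * 60 minutes (this ignores one-time costs such as dataset scanning/caching
# on the first epoch, so it is an upper-bound-ish guess rather than exact).
```

Thanks!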
-
@Nisse123 you can try caching to disk (`--cache disk`) so the images don't consume RAM:
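For example, a minimal sketch (the dataset name is a placeholder):

```bash
# Cache preprocessed images to local disk instead of holding them in RAM.
python train.py --data my_dataset.yaml --weights yolov5s.pt --imgsz 640 --cache disk
```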
You can experiment with additional workers, but 60 seems far too high to be useful. If you are using a DGX-2 you should also naturally be training Multi-GPU for the fastest results. See the Multi-GPU Training tutorial: YOLOv5 Tutorials
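A DDP launch on a DGX-2 might look roughly like the sketch below (GPU count, batch size, and dataset name are placeholders to adapt to your setup):

```bash
# DistributedDataParallel training across 8 of the DGX-2's 16 GPUs.
# --batch is the total batch size and is split evenly across the GPUs.
python -m torch.distributed.run --nproc_per_node 8 train.py \
    --data my_dataset.yaml --weights yolov5s.pt --imgsz 640 \
    --batch 128 --device 0,1,2,3,4,5,6,7 --cache disk --workers 8
```

Note that in DDP mode each GPU process builds its own dataloader, so `--workers` acts as a per-GPU setting rather than a global one, which is another reason a value like 60 is excessive.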