Can you add a num_workers option for the Dataloader? #141
@Jaiczay if I had a dime for every time someone didn't know what some section does... I don't have time to be a teacher, and line 210 is already commented with an explanation, so I've simply added comments to the next two lines to avert the next question by someone else down the line. You'd have to convert this implementation to use the PyTorch dataloader in order to access the `num_workers` option. Do you have profiler results to show?
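For reference, a minimal sketch of what such a conversion could look like. `ImageLabelDataset` and the paths are illustrative placeholders, not this repo's actual classes:

```python
import cv2
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class ImageLabelDataset(Dataset):
    """Hypothetical stand-in for this repo's dataset code."""

    def __init__(self, img_paths, label_paths):
        self.img_paths = img_paths
        self.label_paths = label_paths

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, index):
        img = cv2.imread(self.img_paths[index])          # BGR, h x w x c, uint8
        img = img[:, :, ::-1].transpose(2, 0, 1).copy()  # to RGB, c x h x w
        targets = np.loadtxt(self.label_paths[index]).reshape(-1, 5)  # class, x, y, w, h
        return torch.from_numpy(img), torch.from_numpy(targets)


img_paths = ["images/train/000001.jpg"]    # placeholder paths
label_paths = ["labels/train/000001.txt"]

# num_workers > 0 spawns background worker processes that decode and
# prefetch the next batches while the GPU trains on the current one.
loader = DataLoader(ImageLabelDataset(img_paths, label_paths),
                    batch_size=16, shuffle=True, num_workers=4)
```

One caveat: the default collate function cannot stack the variable-length `targets` tensors, so a custom `collate_fn` is needed in practice (a sketch appears later in this thread).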
Because my GPU runs at 30-40% and only one CPU thread runs at 95%.
Ah, this is strange. Are you using a single GPU or multiple? Can you try something like https://github.com/spyder-ide/spyder-line-profiler to figure out exactly which areas are causing the slowdown?
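For readers not using Spyder: that plugin wraps the `line_profiler` package, which can also be run from the command line. A minimal sketch; the function is illustrative, not from this repo:

```python
# Run with:  kernprof -l -v train.py
# kernprof injects the `profile` decorator at runtime, so no import is needed.

@profile
def load_batch(batch_iterator):
    imgs, targets = next(batch_iterator)  # per-line timings reveal the slow call
    return imgs, targets
```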
@Jaiczay I don't have access to a local GPU, but if I run one batch of default COCO training on my MacBook Pro, I see the dataloader uses up about 240 ms. If this is true, then yes, you might be correct that the dataloader is a chokepoint in the training process. As a reference, on a GCP instance with one P100 a batch nominally takes about 600 ms to process. If I dig deeper into the dataloader, it seems the slowest part of the process is simply loading the jpegs (which are compressed, of course, hence the slow speed). Not much to do about things there unless you were to decompress all the jpegs (which might be a good idea if you plan to do lots of training).
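A sketch of the decompression idea: decode every jpeg once and cache the raw pixel array, so training reads skip the JPEG decode entirely. The COCO directory path is an assumption:

```python
import glob
import os

import cv2
import numpy as np

# One-time conversion: loading an .npy file back is essentially a raw
# disk read, with no JPEG decoding on the training hot path.
for path in glob.glob("coco/images/train2014/*.jpg"):
    img = cv2.imread(path)  # slow part: JPEG decode
    if img is not None:
        np.save(os.path.splitext(path)[0] + ".npy", img)

# In the dataloader, prefer the cached copy when it exists:
# img = np.load(npy_path) if os.path.exists(npy_path) else cv2.imread(jpg_path)
```

The trade-off is disk space: raw uint8 arrays are several times larger than the jpegs they replace.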
Good news. I was able to replace … I'll try multithreading the dataloader next, though this will surely take longer to complete.
Thanks for the quick answer! I got it running with the PyTorch dataloader, but the loss values are all NaN. At least I don't get an error any more, but I will try out your fix first before I fix that.
@Jaiczay I've got wonderful news. I re-added support for the PyTorch DataLoader, including `num_workers`. IMPORTANT: note lines 44 to 48 in 0fb2653.
https://support.apple.com/kb/SP776?locale=en_US&viewlocale=en_US
Wow, that's great! Thank you btw for the awesome repo, this is by far the best PyTorch implementation of YOLOv3!
@glenn-jocher I think you need to update test.py as well, because when I continue training I suddenly get a mAP of 0.5, and before it was around 0.94.
Hmmm, yes, I think there might be a problem in train.py, maybe in the target loading order. Since they are coming in asynchronously now, there may be some sort of issue in assigning targets to images, so your resumed training may be bringing your mAP to zero eventually. I'll try and sort it out later today. test.py currently works fine, for example with …
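One common way to keep image-target pairing explicit under asynchronous loading (a sketch of the general technique, not necessarily the fix applied here) is a custom collate function that prepends each target row with its image's index in the batch:

```python
import torch


def collate_fn(batch):
    # Each dataset item is (img, targets); targets has shape (n, 5):
    # class, x, y, w, h. Prepending a sample-index column lets the loss
    # match every target row back to its image, whatever order the
    # workers delivered the samples in.
    imgs, targets = zip(*batch)
    indexed = []
    for i, t in enumerate(targets):
        idx = torch.full((t.shape[0], 1), i, dtype=t.dtype)
        indexed.append(torch.cat((idx, t), dim=1))  # (n, 6): image index first
    # images must share one shape (e.g. letterboxed) to be stackable
    return torch.stack(imgs, 0), torch.cat(indexed, 0)


# loader = DataLoader(dataset, batch_size=16, num_workers=4, collate_fn=collate_fn)
```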
The current workaround is not to use multi-GPU.
@glenn-jocher I haven't looked into the details of the update. I just wanted to point out that if you are using the DataLoader from the PyTorch library, the worker threads might mess up the random seeds. Say you have 4 worker threads: you might end up with the same augmentation in all 4. It's also almost impossible to get deterministic behavior (if that's a concern) without modifying the DataLoader class or simply writing your own multi-processing dataloader. If all this is already considered, I guess it's all good :)
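For reference, the standard remedy is a `worker_init_fn` that reseeds each worker from PyTorch's per-worker base seed; a sketch, not code from this repo:

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader


def worker_init_fn(worker_id):
    # Without this, every worker process inherits the parent's numpy and
    # random state and can draw identical augmentation parameters.
    # torch.initial_seed() inside a worker is base_seed + worker_id,
    # so each worker (and each epoch) gets a distinct seed.
    seed = torch.initial_seed() % 2**32
    np.random.seed(seed)
    random.seed(seed)


# loader = DataLoader(dataset, num_workers=4, worker_init_fn=worker_init_fn)
```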
@Jaiczay These operations just convert BGR to RGB: the tensor is batch_size x h x w x channel, the channel dim is simply reversed, and it is then transposed to batch_size x channel x h x w.
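A minimal numpy demonstration of those two steps:

```python
import numpy as np

imgs = np.zeros((16, 416, 416, 3), dtype=np.uint8)  # batch x h x w x channel, BGR

rgb = imgs[..., ::-1]             # reverse the channel dim: BGR -> RGB
chw = rgb.transpose(0, 3, 1, 2)   # batch x h x w x c  ->  batch x c x h x w
assert chw.shape == (16, 3, 416, 416)
```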
Can you please add a num_workers option for the Dataloader to speed up the data loading process?
I tried it myself with this tutorial https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel, but I couldn't get it to work.
I don't really understand what the part at dataset.py lines 210-212 does.