Is dataset.indices broadcast necessary? #1820
@ardeal I'm not sure. This would need some empirical results to verify (testing). Can you try to run the code both ways and check whether dataset indices are updated on all workers?
Hi, I have reviewed the torch.distributed tutorial, and I have now asked the question on the PyTorch forum. According to torch's explanation, everything is done by torch itself; the only thing the user needs to do is feed the model into DDP. Please check the code: https://github.com/rwightman/pytorch-image-models/blob/master/train.py What confused me is the difference between the two repos. I am studying torch.distributed; once I make it clear, I will tell you.
@ardeal ok got it. Yes, please let us know what you find. One point is that the code you show is only used with --image-weights. This is because the image weights are reassigned after every validation at the end of each training epoch, and then dataset.indices is updated, which determines which images are loaded the next epoch. Normally dataset.indices is just range(n) for n images and the array never changes.
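To illustrate the mechanism, here is a minimal sketch, not the actual YOLOv3 dataloader (the WeightedDataset class and the toy weights are hypothetical):

```python
import random

class WeightedDataset:
    """Toy dataset: __getitem__ goes through self.indices, so reassigning
    self.indices changes which images are seen in the next epoch."""

    def __init__(self, n):
        self.n = n
        self.indices = range(n)  # default: every image once, in order

    def __getitem__(self, i):
        return self.indices[i]  # the dataloader asks for i, we remap it

    def __len__(self):
        return self.n

dataset = WeightedDataset(10)
image_weights = [1.0] * 8 + [5.0] * 2  # pretend the last two images are hard
# rand weighted idx, as in train.py: hard images are drawn more often
dataset.indices = random.choices(range(dataset.n), weights=image_weights, k=dataset.n)
```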
> which determines which images are loaded the next epoch.

You could review the pytorch-image-models repo linked above, which does as much data augmentation as your repo and doesn't broadcast the dataset. The reason I asked about broadcasting the dataset is that your repo and pytorch-image-models differ here: one broadcasts and the other doesn't.
Please check the answer on the PyTorch forum. According to the answer, you don't need to broadcast the dataset.
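For context, DistributedSampler doesn't actually broadcast anything: each rank deterministically computes its own shard of indices from a shared seed, the epoch number, and its rank, so no communication is needed. A minimal runnable sketch of the usual pattern (single-process group and toy dataset are stand-ins for the real setup):

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Single-process group for demonstration; normally torchrun sets this up.
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1)

dataset = TensorDataset(torch.arange(100))  # stand-in for the image dataset

# Each rank constructs the same sampler; its shard of indices is computed
# locally from (seed, epoch, rank, world_size) -- nothing is sent over the wire.
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffles deterministically and consistently across ranks
    for (batch,) in loader:
        pass  # training step would go here

dist.destroy_process_group()
```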
@ardeal thanks for the explanation! So is your recommendation that we update this section (train.py, Lines 252 to 265 in 1be3170):
To this?

```python
# Update image weights (optional)
if opt.image_weights:
    # Generate indices
    if RANK in [-1, 0]:
        cw = model.class_weights.cpu().numpy() * (1 - maps) ** 2 / nc  # class weights
        iw = labels_to_image_weights(dataset.labels, nc=nc, class_weights=cw)  # image weights
        dataset.indices = random.choices(range(dataset.n), weights=iw, k=dataset.n)  # rand weighted idx
```

or this?

```python
# Update image weights (optional)
if opt.image_weights:
    cw = model.class_weights.cpu().numpy() * (1 - maps) ** 2 / nc  # class weights
    iw = labels_to_image_weights(dataset.labels, nc=nc, class_weights=cw)  # image weights
    dataset.indices = random.choices(range(dataset.n), weights=iw, k=dataset.n)  # rand weighted idx
```
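For reference, the broadcast under debate works roughly like this: rank 0 draws the weighted indices and pushes them to all other ranks so every process samples the same images. A hedged sketch (RANK and dataset follow the snippets above; not quoted verbatim from train.py):

```python
import torch
import torch.distributed as dist

# Only rank 0 computed dataset.indices above; ship its result to everyone.
if RANK != -1:
    indices = (torch.tensor(dataset.indices) if RANK == 0 else torch.zeros(dataset.n)).int()
    dist.broadcast(indices, 0)  # src=0: all ranks receive rank 0's tensor
    if RANK != 0:
        dataset.indices = indices.cpu().numpy()
```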
@ardeal I've eliminated the DDP portion of the image weights code. One question I had for you on the DDP topic is my new YOLOv5 EarlyStopping PR ultralytics/yolov5#4576. It works well for single-GPU, but in DDP mode the non-zero ranks never stop. The idea is that the stopper returns stop=True after patience is exceeded, and then we break the training loop, but I think only the RANK -1, 0 processes are breaking.
Hi @glenn-jocher,
For the issue of syncing up the stop variable, I think two experiments need to be done:

```python
import torch
import torch.distributed as dist

torch.cuda.empty_cache()
dist.destroy_process_group()
```
@ardeal thanks! Yes, the problem is that the other RANKS do not receive the stop value. The dist.destroy_process_group() command needs to take place outside of the train function; I've placed it in the caller function, otherwise the process hangs and training does not finish.
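For what it's worth, one way to sync the flag is to broadcast rank 0's decision to all ranks each epoch. A minimal sketch, not necessarily what the PR merged (assumes RANK, stopper, epoch and fi from train.py, and an initialized process group):

```python
import torch.distributed as dist

# DDP path: rank 0 runs EarlyStopping, everyone else adopts its decision.
if RANK == 0:
    stop = stopper(epoch=epoch, fitness=fi)  # True once patience is exceeded
broadcast_list = [stop if RANK == 0 else None]
dist.broadcast_object_list(broadcast_list, 0)  # rank 0 -> all ranks
stop = broadcast_list[0]
if stop:
    break  # now every rank exits the epoch loop, not just rank 0
```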
Hi,
In train.py of the yolov3 code there is:

```python
sampler = torch.utils.data.distributed.DistributedSampler(dataset) if rank != -1 else None
```

Since this sampler is passed to the dataloader, I think the manual broadcast of dataset.indices is not necessary in this case, as DistributedSampler will distribute the dataset across ranks automatically?