Bringing together results on ddp on a single machine #702
I don't know if pytorch-lightning has done anything special about it. With pure PyTorch, you may use dist.all_gather. For example, if you have 2 workers and each of them evaluated 2 examples, then you can use dist.all_gather to get the 4 scores and then compute the mean validation score.
@matthew-z thanks for your reply. That seems promising. I have also found this issue requesting similar behaviour: #243
If so, where would we put output_list when using pytorch lightning? Thanks for your help! I think it is something many new users will be confused about when it comes to ddp usage.
I didn't try it with PL, but in general you may try to sync data as follows with DDP:
import torch
import torch.distributed as dist

def gather_list_and_concat(list_of_nums):
    # move the local values onto the GPU so all_gather can use the NCCL backend
    tensor = torch.Tensor(list_of_nums).cuda()
    gather_t = [torch.ones_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gather_t, tensor)
    return torch.cat(gather_t)

>>> res = gather_list_and_concat([1, 2])  # in node 1
>>> res = gather_list_and_concat([3, 4])  # in node 2
>>> res
tensor([1., 2., 3., 4.], device='cuda:0')
Hi - So no need to keep it in sync - they are all sharing the same tensor. You can create these in the model itself. I put them in a class that tracks a few other things, and keep a reference to the instance in the model. Works the same.

There are still some challenges. You need to index the data by the batch number - but each sub-process's batch numbers start at 0. And they do not all have exactly the same number of batches, as the sampler has to adapt the total batch count when things don't divide evenly. What I did was to add a dimension to capture the process rank - always 0-based. I also sized the storage as the batch count / number of processes, rounding down - and not worrying about the few missed elements of data. Of course, DistributedSampler will hand one or more sub-processes 1 or 2 more batches than you have allowed room for - obviously don't try to index a position in the tensor beyond the max index - 1.

The result statistics for me have been essentially identical to running on a single gpu - except for the time, which has gone from 15 minutes/epoch to 85 secs/epoch.

Hope this helps, seth
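A compact sketch of the storage scheme described above - the function names, the metric being stored, and the assumption that torch.distributed is already initialized (as it is inside the ddp processes) are all illustrative, not the author's actual code:
import torch
import torch.distributed as dist

def make_history(total_batch_count):
    # one row per process rank, one slot per local batch; floor division means
    # the 1-2 extra batches some ranks may receive are simply not recorded
    world_size = dist.get_world_size()
    return torch.zeros(world_size, total_batch_count // world_size)

def record(history, local_batch_idx, value):
    # each process writes only into its own rank's row, and never past the allocated room
    if local_batch_idx < history.shape[1]:
        history[dist.get_rank(), local_batch_idx] = value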
I just read the source code, and actually PL didn't do much modification to the original DDP except redirecting forward to training/validation/test_step. In DDP, I don't think different sub-processes share the same tensor. Try putting a breakpoint in training_step and running the PyCharm debugger, and you will find that the loss values are different in the sub-processes, as are the validation outputs. The logger only records the rank-0 process.
Sorry - my explanation was obviously insufficient. This definitely works - I just have to explain it better. I don't believe the logger() behavior is indicative of what happens with the tensor. Try this - in the init() of your model class, create a test tensor with 1 value per gpu you are using. Also create a flag so that this is only done once per process:
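The snippet for this step wasn't preserved in this copy of the thread; a minimal sketch of what it might look like, reusing the attribute names (testData, firstEpochFlag) mentioned later in the comment - the shapes and defaults are assumptions:
import torch
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def __init__(self, num_gpus=3):
        super().__init__()
        # one slot per gpu/process; this tensor lives on the cpu
        self.testData = torch.zeros(num_gpus)
        # per-process flag so the test write/print happens only once
        self.firstEpochFlag = True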
In on_epoch_start(), on the first epoch, write into self.testData, and print it:
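Again a sketch rather than the preserved code, continuing the class above - the assumption is that each process writes into its own rank's slot and then prints the whole tensor:
import torch.distributed as dist

# inside the same LightningModule
def on_epoch_start(self):
    if self.firstEpochFlag:
        self.firstEpochFlag = False
        rank = dist.get_rank() if dist.is_initialized() else 0
        # each process writes a value derived from its own rank into its slot
        self.testData[rank] = float(rank + 1)
        print(f"rank {rank}: testData = {self.testData}")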
With 3 gpus you will see something like this:
As you can see, the self.testData tensor is collecting ALL the values and ALL the values are seen by EACH process. And that tensor is on the cpu. If there were one in each sub-process we would see:
I can only conclude that there is only one copy of it and all the sub-processes can see it, or it is coalesced by ddp for us.

The key issue here is that you can use this to produce a process-rank-indexed tensor of result values that you store yourself - at the end of training_step() - to use to calculate statistics. This works. Before I did this, I had the same problem you had - loss values only recorded for the 0 gpu, and therefore of questionable value. With this, the statistics make sense, and are essentially identical to what I get if I run on a single gpu. I hope this is useful to you, seth

FWIW - there are some behaviors here that bear thinking hard about. As already noted, the printed testData contents only make sense if there is only one copy - or ddp merges this tensor for us. Interestingly though, there must be 3 copies of firstEpochFlag - or the printing would only have been done once, as it is set to False the first time it is seen. The choices made in DDP about which elements of the model class are copied to sub-processes must be pretty well thought through.

Again, hope this helps. I can show you my entire model class if you like - there is a lot in it ... seth
I have verified that there is one testData tensor per proc - so ddp must be coalescing them.
Lightning only routes forward to training_step, validation_step. BUT, we should be calling dist.all_gather with the outputs of training_step and validation_step if we want full-batch metrics. So, maybe someone wants to submit a PR for this?
My solution works for me - and has the benefit of allowing me to compute any statistics I like on the results. Of course, my model code has to know a little about only doing certain things on proc 0 ...
@sneiman I'm facing a similar issue. Could you show the entire model class? Aside: when I try to pass a dictionary of tensors into progress_bar or even log, I've been getting this consistent error (training using 'ddp' on 4 gpus). A follow-up question: why can't we pass dictionaries of tensors like in 'dp'? Let me know your thoughts! Here is the error:
If I am not mistaken, your error above comes from how 'ddp' handles those outputs. I will post more of my code for you to take a look at, to see how I collect information, in the next post. The overall idea is pretty simple ... declare a multi-dimensional tensor to hold the information you want to collect, with one dimension per gpu, then write into it while running, each gpu's data going into its own portion of the tensor. pl will reduce/coalesce it for you. You still have to be mindful of multiprocessing. Example to come.
Ahh, thanks for doing that. Right, that would make sense. Since each GPU has its own set of processes, it probably passes a tensor of len(n_gpus) to the callback step, correct? In the meantime I will try to implement something & update you with how that goes.
Let's see if this will help: Within my model I have an instance of a class (cleverly named 'data') which holds the data loaders, and does the work of breaking the data into train, validation and test sets. Because statistics and history change along with data format, I keep all my history and statistics calculations in the same class. It gets tweaked whenever I change the data, or use the model in a way that affects statistics or data collected. In this class I create my history tensors and related statistics variables:
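The actual code didn't survive in this copy of the thread; a minimal sketch of such a class, with invented names, shapes, and metrics purely for illustration:
import torch

class Data:
    # holds dataloaders plus per-rank history tensors (illustrative only)
    def __init__(self, world_size, batches_per_proc):
        # one row per process rank, one column per local batch index
        self.loss_history = torch.zeros(world_size, batches_per_proc)
        self.correct_history = torch.zeros(world_size, batches_per_proc)

    def capture(self, rank, batch_idx, loss, correct):
        # detach before storing so the autograd graph is not kept alive
        if batch_idx < self.loss_history.shape[1]:
            self.loss_history[rank, batch_idx] = loss.detach().cpu()
            self.correct_history[rank, batch_idx] = float(correct)

    def epoch_stats(self):
        return self.loss_history.mean().item(), self.correct_history.mean().item()

    def clear(self):
        self.loss_history.zero_()
        self.correct_history.zero_()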
Ok - that's it ... let's look at how to use them.
I think you can see the idea here. Create history TENSORS as part of your model - being in an instance that is part of your model is just as good. Write functions to put data in them, clear them, and calculate statistics. Put the data in them after every step. Calculate and use the results as you need to - generally once an epoch.

BE CAREFUL to detach those tensors before you make the cap call - or else you will have a memory leak. I also found that I needed to delete them to avoid memory leaks. Because you are using distributed processing, you need to force the processes to all finish their calculations before you clear the data. I do it at the end of on_epoch_start().
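A sketch of that synchronization point, assuming the history object from the sketch above is stored on the model as self.data - the hook and the barrier call are the ones described in the surrounding comments, the rest is illustrative:
import torch.distributed as dist

# inside the LightningModule
def on_epoch_start(self):
    if dist.is_initialized():
        # force every ddp process to finish writing its slice of the history first
        dist.barrier()
    mean_loss, accuracy = self.data.epoch_stats()
    if not dist.is_initialized() or dist.get_rank() == 0:
        print(f"epoch stats: loss={mean_loss:.4f}, acc={accuracy:.4f}")
    self.data.clear()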
I think you can get it from here ....
Thanks for this @sneiman, the fix is making a lot more sense now. I will implement a version of this for my auto segmentation pipeline & can share when completed. A question regarding that torch.distributed.barrier() call: is this called when you activate validation_end? If it is, then theoretically you could log training metrics & save when you call that? That could work, but only if you need a validation step. It would be great if lightning could do something like this automatically when the 'ddp' backend is used! I think engineering an internal solution like you presented would save a lot of headaches! Thoughts @williamFalcon?
I make the barrier and clear-data calls at the end of on_epoch_start(). That's the last moment really - and so anything that needs the data can still get it.
I am not sure if engineering this into pl is consistent with pl philosophy. Generally, pl just does things for you and makes as few demands upon you as possible - without becoming too heavy.
ddp and ddp2 cause some challenges with this, as the processes get distributed and so some of the standard call sequences don't work. I'll ponder it, though.
Interesting, thanks for the explanation! I'm working to develop a solution for 2D/3D segmentation. Have you run into hurdles with your ddp script actually training? Mine seems to freeze 1% into the first epoch.
I have not had a straight freeze. Are you on gpu? If so, check to see if it is active and has memory using nvidia-smi. If that looks fine, check your use of the barrier() call - it will wait until EVERY process created by ddp has reached it before it will let any of them continue.
@brucemuller mind submitting a PR for this? Ideally we bring results back for training_step and validation_step? This has been on my radar for a bit but I haven't had time to fix it. Would love a PR!
@williamFalcon Using @matthew-z's code worked for me, I think:
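The snippet itself wasn't preserved; a sketch of how the gather_list_and_concat idea from earlier in the thread might be applied inside a LightningModule - the metric key, the reduction, and the assumption that every rank sees the same number of batches are illustrative:
import torch
import torch.distributed as dist

# inside the LightningModule
def validation_epoch_end(self, outputs):
    # stack this process's per-batch losses, then gather across all ddp processes
    local_losses = torch.stack([o["val_loss"] for o in outputs]).cuda()
    if dist.is_initialized():
        gathered = [torch.zeros_like(local_losses) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, local_losses)
        local_losses = torch.cat(gathered)
    return {"val_loss": local_losses.mean()}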
I'm not sure I have time for a PR atm, but would like to. If I create a fork, would someone like to help make the changes (@sneiman @jmarsil)? I haven't followed everything in this thread, but would dist.all_gather be sufficient to keep pl light enough? Side note: I used ddp on 4 GPUs recently and it was actually slower than using 1 GPU. Does anyone know how I can start understanding why, or if there are any common reasons?
What do you mean by "slow"? The speed depends on what you measure. If you mean something like "it takes more time to train an epoch", you need to figure out what the processes are waiting for - e.g., data loading, param syncing, or a very slow node.
@brucemuller that could happen if you're not using the distributed sampler. We updated the docs with more details - check them out: https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html
Hi all - My experience suggests that declaring them when I create the model is sufficient - in fact, I suspect this approach works because somewhere in pl or pytorch, ALL tensors that are attributes of the model class are identified and reduced. I can imagine doing some things with decorators to make this a little less tedious - but not a lot more. Is there something I am missing about how this could be better?
Wanted to share an issue/solution that has surfaced in trying to bring objects from ddp-spawned sub-processes back to the master process. The issue is this: if you create a python multiprocessing Queue or SimpleQueue in the normal way and attempt to use it in a process created by ddp's spawn, the process dies with a SIGSEGV. The solution is to create the desired queue using the spawn context object for multiprocessing, but spawn the processes using the torch.multiprocessing module. Parroting the typical multiprocessing example:
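The example itself was lost in this copy; a sketch of the pattern described - create the queue from the 'spawn' multiprocessing context, then launch the workers with torch.multiprocessing.spawn (the worker function and its payload are illustrative):
import multiprocessing
import torch.multiprocessing as tmp

def worker(rank, queue):
    # each spawned process can safely put picklable results on this queue
    queue.put((rank, f"result from rank {rank}"))

if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")
    queue = ctx.SimpleQueue()   # created from the spawn context, not the default one
    tmp.spawn(worker, args=(queue,), nprocs=2, join=True)
    while not queue.empty():
        print(queue.get())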
I have not tested it, but I suspect the same is true for pipes and any other multiprocessing connection. Also note, I have tested this on Ubuntu 18.04, python 3.6.8, pytorch 1.4, pytorch-lightning 0.6 using a multi-core, multi-gpu machine, but not on a multi-node setup. This happens with queues created the normal way - you cannot use them in the spawned processes.

Still working on how to use this to get the stateful model and trainer back from a ddp job. There are a lot of complications because things have to be picklable to get through the queue. But at least there is a tunnel ...

For the curious, SIGSEGV is a kernel notice of a segment violation. Our spawned process's attempt to use q causes the process to use memory that it has no right to. The torch.multiprocessing.spawn call uses its own join and a wrapper to catch the exception and avoid a segfault.

@williamFalcon @Borda @brucemuller @matthew-z @jmarsil - I have collected a few tricks in using ddp - the history tensors above, this one, getting input from a sub-process, and perhaps a few others. Do you think it worthwhile to do a little write-up about them?
@brucemuller @mikerossgithub @sneiman try master now? just merged #2434 which does a reduce-mean op across all gpus for anything you return from val_epoch_end
Possibly related? I got this error on validation end after upgrading to master from 0.8.1:
What are you returning from your validation_epoch_end?
One of the tensors returned for logging is a cpu tensor... now they have to be cuda tensors... intended? #2442
Logging does not like anything other than CPU, I guess...
I have a related question:
import torch
import torch.distributed as dist

# assumes model, validation_dataloader, rank and world_size are already set up
emb_list = []
for batch_idx, sample in enumerate(validation_dataloader):
    emb = model(sample)
    dist.barrier()
    # gather this batch's embeddings from every process
    out_emb = [torch.zeros_like(emb) for _ in range(world_size)]
    dist.all_gather(out_emb, emb)
    if rank == 0:
        # re-interleave so the row order matches the original (un-sharded) dataset
        interleaved_out = torch.empty((emb.shape[0] * world_size, emb.shape[1]),
                                      device=emb.device, dtype=emb.dtype)
        for current_rank in range(world_size):
            interleaved_out[current_rank::world_size] = out_emb[current_rank]
        emb_list.append(interleaved_out.detach().clone())
if rank == 0:
    emb_list = torch.cat(emb_list)
    # Create KDTree and compute recall

The code above works with plain DistributedDataParallel. Is there any way to do the same in Pytorch Lightning?
I was able to achieve the same in pytorch lightning by calling all_gather manually. I think it would be nice to provide one hook that gathers all the outputs from every process.
good point. this is a new feature, mind opening a new GH issue?
Sure, I just created a new issue: #4175
Would this also support objects other than tensors from the outputs?
@cattaneod please tell me, does dist.all_gather() gather all results?
Yes, the code I provided in my previous comment works fine with DistributedDataParallel (I tested only with single machine, multi-GPU though).
This works only when the tensor on every device has the same size. An additional all_gather is needed to exchange the lengths first in order to get a proper result.
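A sketch of that two-step gather for variable-size tensors - first exchange lengths, then pad to the maximum before the real all_gather (the padding-based approach here is an assumption; other schemes are possible):
import torch
import torch.distributed as dist

def all_gather_variable(tensor):
    world_size = dist.get_world_size()
    # step 1: gather every rank's length
    length = torch.tensor([tensor.shape[0]], device=tensor.device)
    lengths = [torch.zeros_like(length) for _ in range(world_size)]
    dist.all_gather(lengths, length)
    max_len = int(max(l.item() for l in lengths))
    # step 2: pad to a common size, gather, then trim the padding back off
    padded = torch.zeros(max_len, *tensor.shape[1:], device=tensor.device, dtype=tensor.dtype)
    padded[: tensor.shape[0]] = tensor
    gathered = [torch.zeros_like(padded) for _ in range(world_size)]
    dist.all_gather(gathered, padded)
    return torch.cat([g[: int(l.item())] for g, l in zip(gathered, lengths)])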
I wonder, what happens if different processes' dataloaders have a different number of batches to iterate? Will dist.barrier() get stuck and never return?
Yes.
My code shows it will get stuck at dist.barrier(). How do I solve it? Can anyone help me?
I'd like to understand how to use ddp properly with multiple GPUs on a single machine as I'm unsure of how to bring results together using this method.
I'm using TensorBoard for logging
The problem seems to be that my code (below) runs on each of the three GPUs (with a third of the data each), but variables like "overall_correct" exist separately in each of the three processes, so only a third of the data gets logged. For example, my overall performance on a single GPU is 82%, but with the above process on 3 GPUs it is a third of that. I know this is a kind of silly thing, but can someone explain how I should bring together the required validation/training statistics from the sub-processes using pytorch lightning?
My process is roughly: