Fix RPC Param server example for multiple trainers #877

Open
rohan-varma wants to merge 3 commits into main

Conversation

@rohan-varma (Member) commented Jan 28, 2021

When running with multiple trainers, we ran into the following issue with the parameter server example:

Process Process-1:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "rpc_parameter_server.py", line 223, in run_worker
    run_training_loop(rank, num_gpus, train_loader, test_loader)
  File "rpc_parameter_server.py", line 182, in run_training_loop
    dist_autograd.backward(cid, [loss])
RuntimeError: Error on Node 0: one of the variables needed for gradient computation has been modified by an inplace operation: [CPUFloatType [32, 1, 3, 3]] is at version 28; expected version 27 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

At a high level, this resulted from one trainer running a backward pass while another was updating params with the optimizer. We are still coordinating with folks internally to understand whether this is expected behavior.
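
For anyone debugging a similar failure, here is a small sketch of the step the error message hints at, wrapping the example's dist_autograd.backward call with anomaly detection (context_id and loss come from the example's training loop):

```python
import torch
import torch.distributed.autograd as dist_autograd

def backward_with_anomaly_detection(context_id, loss):
    # Anomaly detection reports which operation's saved tensor was
    # modified in place between the forward and backward passes.
    with torch.autograd.detect_anomaly():
        dist_autograd.backward(context_id, [loss])
```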

In the meantime, we have changed the example as follows:

  1. Each trainer now has its own model copy on the parameter server, eliminating the issue of trainers stepping on each other's parameters.
  2. These model copies are synced at a given interval by averaging them on the PS.

I think for the most part the changes still ensure the example fulfills its main purpose, which is to demonstrate RPC and dist autograd.
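
For illustration, a minimal sketch of that approach (hypothetical names; the actual example keeps its per-trainer copies inside the ParameterServer class and syncs them on a fixed interval):

```python
import torch
import torch.nn as nn

class ParameterServer:
    def __init__(self, num_trainers, model_fn=lambda: nn.Linear(4, 2)):
        # One independent model copy per trainer rank (trainers are ranks 1..N),
        # so one trainer's backward pass never races with another's optimizer step.
        self.models = {rank: model_fn() for rank in range(1, num_trainers + 1)}

    def average_models(self):
        # Called every few iterations: average corresponding parameters across
        # all trainer copies so they converge toward a shared set of weights.
        with torch.no_grad():
            for params in zip(*(m.parameters() for m in self.models.values())):
                mean = torch.stack(params).mean(dim=0)
                for p in params:
                    p.copy_(mean)
```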

Tested by repeatedly running 4 trainers with no crash:
python3 rpc_parameter_server.py --world_size=5 --rank=2

If this looks good, we will update the corresponding tutorial in pytorch/tutorials accordingly.

@mrshenli (Contributor)

> At a high level, this resulted from one trainer running a backward pass while another was updating params with the optimizer. We are still coordinating with folks internally to understand whether this is expected behavior.

This makes sense. If the param is modified between the forward and backward passes, the autograd algorithm is no longer correct. Thanks for digging into this!

Regarding the fix, would it also work if we forced a barrier before every optimizer.step(), which guarantees no unintentional param changes? In this way, we don't need multiple model copies. If the model is on CUDA, since all updates use the same default stream, there won't be race contention issues either. Not sure about CPU models, though.

@lucasleesw

Hi, are there any examples of multi-GPU training, e.g. where each GPU runs one trainer?

@lucasleesw

Could you help me understand where the forward computation happens?
In my understanding, whenever a Trainer runs model_output = self.param_server_rref.rpc_sync().forward(self.rank, x), the forward computation is done by the "parameter server", because self.param_server_rref.owner() is the "parameter server". And the RPC docs say rref.rpc_sync() runs on the worker rref.owner(). Please correct me if I am misunderstanding, thank you!

@msaroufim (Member)

Hi @rohan-varma @mrshenli, is this an example you'd still like to see merged in?
