[WIP] Adding Distributed RL Model Example #692
base: main
Conversation
In this example, the RL model is distributed across one agent and multiple observers. Each observer has a replicated submodel of the RL model, and all of these connect to the submodel on the agent. During training, this example uses distributed autograd to set gradients for all submodels. Then, it uses RPC calls to collect gradients from all observers onto the agent, sums those gradients, applies the summed gradient to the local dummy model on the agent, and then broadcasts the model parameters back to the observers to update their models.
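The whole training step can be sketched framework-free. The `Observer` class, `get_gradients`, `update_model`, and `train_step` below are hypothetical stand-ins for the RPC calls in the example, not the PR's actual API:

```python
# Hypothetical framework-free sketch of one training step: collect grads
# from all observers, sum them, apply to the agent's local copy of the
# params, then broadcast the updated params back. Names are illustrative.
class Observer:
    def __init__(self, params):
        self.params = list(params)

    def get_gradients(self):
        # stand-in for dist_autograd.get_gradients(ctx_id) on the observer
        return [0.1 * p for p in self.params]

    def update_model(self, params):
        self.params = list(params)

def train_step(observers, lr=1.0):
    # 1. collect per-observer grads (the example does this via async RPC)
    grads = [ob.get_gradients() for ob in observers]
    # 2. sum grads elementwise across observers
    summed = [sum(g) for g in zip(*grads)]
    # 3. apply summed grads to the agent's local copy of the params
    params = [p - lr * g for p, g in zip(observers[0].params, summed)]
    # 4. broadcast updated params back to every observer
    for ob in observers:
        ob.update_model(params)
    return params

observers = [Observer([1.0, 2.0]) for _ in range(3)]
new_params = train_step(observers)
```

With three observers each reporting a tenth of the param value as its grad, the step above moves `[1.0, 2.0]` to `[0.7, 1.4]` and every observer ends up with the same params.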
cc @jiajunshen
This is a quick example showing how to use RPC to build distributed RL models. I didn't aim for efficiency or code simplicity in this WIP version, but hopefully it demonstrates the ideas.
```python
x = self.affine1(x)
x = self.dropout(x)
x = F.relu(x)
return self.affine2(x)
```
Each observer applies four layers (affine, dropout, relu, affine) in the forward pass.
```python
self.rewards = []

def forward(self, action_scores):
    return F.softmax(action_scores, dim=1)
```
The agent only applies a softmax in its forward pass.
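Putting the two halves together locally (without RPC) gives the full policy. This is a sketch: the diff only shows the forward bodies, so the layer sizes below (4 → 128 → 2, matching the classic reinforce example) and the dropout rate are my assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the split policy run locally, without RPC.
# Layer sizes and dropout rate are assumed for illustration.
class ObserverPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.affine1 = nn.Linear(4, 128)
        self.dropout = nn.Dropout(p=0.6)
        self.affine2 = nn.Linear(128, 2)

    def forward(self, x):
        x = self.affine1(x)
        x = self.dropout(x)
        x = F.relu(x)
        return self.affine2(x)

class AgentPolicy(nn.Module):
    def forward(self, action_scores):
        return F.softmax(action_scores, dim=1)

ob, agent = ObserverPolicy(), AgentPolicy()
probs = agent(ob(torch.randn(1, 4)))  # each row sums to 1
```

In the PR, the observer half runs on the observer workers and only the raw action scores travel over RPC to the agent's softmax.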
```python
self.agent_rref = RRef(self)
self.rewards = {}
self.saved_log_probs = {}
self.ob_policy = ObserverPolicy()
```
The agent also creates a dummy ObserverPolicy so that it can use the same optimizer to update all model parameters. Note that this ObserverPolicy never participates in the forward or backward pass. It is only used to apply the summed gradients.
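The dummy-model trick can be sketched without torch. The toy `Param` class and `sgd_step` below are my stand-ins, not the PR's code; the point is that one optimizer step covers both the agent's own params and the dummy observer copy:

```python
# Toy sketch of the dummy-model pattern (no torch). The dummy observer
# copy never runs forward/backward; its .grad fields are filled with the
# summed observer grads, and one optimizer step updates both models.
class Param:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

agent_params = [Param(1.0)]
dummy_ob_params = [Param(0.5)]   # mirrors ObserverPolicy on the agent

def sgd_step(params, lr=0.1):
    for p in params:
        p.value -= lr * p.grad

agent_params[0].grad = 1.0       # from the local backward pass
dummy_ob_params[0].grad = 3.0    # summed grads collected from observers
sgd_step(agent_params + dummy_ob_params)
```

After the step, both the agent param and the dummy observer param have moved, even though the dummy never computed anything itself.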
```python
grads = [fut.wait() for fut in futs]
grads = [*zip(*grads)]
grads = [sum(grad) for grad in grads]
```
The lines above just sum the grads from all observers.
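Concretely, with two observers and three parameters (toy numbers), `zip(*grads)` regroups per-observer grad lists into per-parameter tuples, which are then summed:

```python
# grads[i] is observer i's gradient list, one entry per parameter.
grads = [[1.0, 2.0, 3.0],     # observer 0
         [10.0, 20.0, 30.0]]  # observer 1
per_param = [*zip(*grads)]            # [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
summed = [sum(g) for g in per_param]  # [11.0, 22.0, 33.0]
```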
```python
# set grads for agent model
ctx_grads = dist_autograd.get_gradients(ctx_id)
for p in self.policy.parameters():
    p.grad = ctx_grads[p]
```
Then we set the grad field for all parameters. (Note that ob_policy is a dummy model, used only for gradient updates.)
```python
for ob_rref in self.ob_rrefs:
    futs.append(_async_remote_method(Observer.update_model, ob_rref, ob_params))
for fut in futs:
    fut.wait()
```
Params on both the dummy model and the AgentPolicy are updated. Now broadcast the dummy model's params to all observers so they can apply the update locally.
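The fire-all-then-wait broadcast pattern can be sketched with `concurrent.futures` in place of RPC futures; the `Observer` stand-in here is hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of the broadcast step: launch update_model on every observer
# asynchronously, then wait on all futures, mirroring rpc_async + wait.
class Observer:
    def __init__(self):
        self.params = None

    def update_model(self, params):
        self.params = list(params)

observers = [Observer() for _ in range(4)]
new_params = [0.9, 0.2]
with ThreadPoolExecutor() as pool:
    futs = [pool.submit(ob.update_model, new_params) for ob in observers]
    for fut in futs:
        fut.result()  # analogous to fut.wait() in the example
```

Launching all calls before waiting on any lets the updates run concurrently instead of serializing one round trip per observer.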
```python
for fut in futs:
    fut.wait()
```
I believe the steps above can be replaced by c10d gather/scatter and wrapped into your own version of a distributed optimizer. That would be more efficient and cleaner.
```python
self.policy = ObserverPolicy()

def get_gradients(self, ctx_id):
    all_grads = dist_autograd.get_gradients(ctx_id)
```
After the distributed backward pass, the grads for the ObserverPolicy model live in dist_autograd.get_gradients(ctx_id) on the observer. Retrieve them there and hand them to the agent.
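For the later summation on the agent to line up, every observer must flatten its grad dict in the same parameter order. A minimal sketch (the `ordered_grads` helper and the string keys are hypothetical; the real dict is keyed by parameter tensors):

```python
# Sketch: flattening each observer's grad dict in a fixed parameter order
# keeps the later zip/sum on the agent aligned across observers.
def ordered_grads(ctx_grads, param_order):
    return [ctx_grads[name] for name in param_order]

param_order = ["affine1.weight", "affine1.bias", "affine2.weight"]
ob0 = {"affine1.bias": 0.2, "affine1.weight": 0.1, "affine2.weight": 0.3}
ob1 = {"affine2.weight": 3.0, "affine1.weight": 1.0, "affine1.bias": 2.0}

grads = [ordered_grads(ob, param_order) for ob in (ob0, ob1)]
summed = [sum(g) for g in zip(*grads)]
```

Even though the two dicts list their keys in different orders, both flatten to the same parameter order, so the elementwise sum pairs the right grads.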