
[Bug]: PPO using SDE device issue. #1957

Closed
5 tasks done
llewynS opened this issue Jul 2, 2024 · 4 comments
Labels: bug (Something isn't working), custom gym env (Issue related to Custom Gym Env)

Comments

llewynS commented Jul 2, 2024

🐛 Bug

I get a device mismatch when attempting to use PPO with a MultiInput (dict) observation.

The error occurs when calling:


with torch.no_grad():
    actions = myppo.policy._predict(inp_dict, deterministic=isTraining)
File "C:\Users\User\AppData\Roaming\Python\Python311\site-packages\stable_baselines3\common\distributions.py", line 597, in get_noise
    return th.mm(latent_sde, self.exploration_mat)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

Looking at the Stable-Baselines3 source, the issue comes from this code:

def sample_weights(self, log_std: th.Tensor, batch_size: int = 1) -> None:
    """
    Sample weights for the noise exploration matrix,
    using a centered Gaussian distribution.

    :param log_std:
    :param batch_size:
    """
    std = self.get_std(log_std)
    self.weights_dist = Normal(th.zeros_like(std), std)
    # Reparametrization trick to pass gradients
    self.exploration_mat = self.weights_dist.rsample()
    # Pre-compute matrices in case of parallel exploration
    self.exploration_matrices = self.weights_dist.rsample((batch_size,))

Doing some more digging, the problem is actually in this class: it doesn't define a device, and when the network is created only standard Python types are passed in, so the pre-sampled matrices end up on the CPU. The class needs to be modified to take a device and use the device of the model it is being used with.
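
For illustration, a hypothetical helper along those lines (the name and function are made up, not part of SB3) could pull the device from the policy's parameters and move the pre-sampled matrices onto it:

def move_sde_noise_to_policy_device(policy) -> None:
    """Hypothetical helper (not in SB3): move the pre-sampled gSDE
    exploration matrices onto the same device as the policy's parameters."""
    device = next(policy.parameters()).device
    dist = policy.action_dist
    dist.exploration_mat = dist.exploration_mat.to(device)
    dist.exploration_matrices = dist.exploration_matrices.to(device)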

To Reproduce

from stable_baselines3 import PPO
import gymnasium as gym
import gymnasium_robotics  # needed to register the PointMaze envs
from stable_baselines3.common.vec_env import DummyVecEnv
from torch import tensor
# from rlos.rl.core.concreteclasses.wrappedlibs.SB3 import PPOSB3Agent
OPEN_DIVERSE_G = [[1, 1, 1, 1, 1, 1, 1],
                [1, "c", "c", 0, "c", "c", 1],
                [1, 1, 1, 1, 1, 1, 1]]
model_kwargs = {"verbose": 1, 
                "learning_rate": 3e-5,
                "policy_kwargs": dict(net_arch=[256,256]),
                "gamma": 0.95,
                "device": "cuda",
                "vf_coef": 0.5,
                "ent_coef": 0.0,
                "max_grad_norm": 0.5,
                "normalize_advantage": True,
                "n_steps": 512,
                "n_epochs": 60,
                "sde_sample_freq": 4,
                "use_sde": True,
                "gae_lambda": 0.9,
                "clip_range": 0.4}
env = gym.make('PointMaze_UMaze-v3', maze_map = OPEN_DIVERSE_G)
env = DummyVecEnv([lambda: env])
model = PPO("MultiInputPolicy", env, **model_kwargs)
observation = env.reset()
temp = {
    "observation": tensor(observation["observation"], device="cuda"),
    "achieved_goal": tensor(observation["achieved_goal"], device="cuda"),
    "desired_goal": tensor(observation["desired_goal"], device="cuda")
}
model.policy._predict(temp)

If you change the model_kwargs so that use_sde is False, it works as expected.
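
A quick check (a sketch, assuming the model created above with use_sde=True and device="cuda") shows the mismatch directly:

# The policy parameters have been moved to CUDA...
print(next(model.policy.parameters()).device)           # cuda:0
# ...but the pre-sampled gSDE exploration matrix is still on the CPU
print(model.policy.action_dist.exploration_mat.device)  # cpu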

Relevant log output / Error message

No response

System Info

No response

Checklist

  • My issue does not relate to a custom gym environment. (Use the custom gym env template instead)
  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • I have provided a minimal and working example to reproduce the bug
  • I've used the markdown code blocks for both code and stack traces.
llewynS added the bug label on Jul 2, 2024
llewynS (Author) commented Jul 2, 2024

After a lot of digging around, I've noted that the policy is created on the CPU and then moved to the selected device in the on-policy algorithm class at this line.

This works for the action distributions that don't use gSDE, but the gSDE one keeps pre-sampled exploration matrices that are not moved along with the policy. A workaround to get it to work is to do this:

from stable_baselines3 import PPO
import gymnasium as gym
import gymnasium_robotics  # needed to register the PointMaze envs
from stable_baselines3.common.vec_env import DummyVecEnv
from torch import tensor
# from rlos.rl.core.concreteclasses.wrappedlibs.SB3 import PPOSB3Agent
OPEN_DIVERSE_G = [[1, 1, 1, 1, 1, 1, 1],
                [1, "c", "c", 0, "c", "c", 1],
                [1, 1, 1, 1, 1, 1, 1]]
model_kwargs = {"verbose": 1, 
                "learning_rate": 3e-5,
                "policy_kwargs": dict(net_arch=[256,256]),
                "gamma": 0.95,
                "device": "cuda",
                "vf_coef": 0.5,
                "ent_coef": 0.0,
                "max_grad_norm": 0.5,
                "normalize_advantage": True,
                "n_steps": 512,
                "n_epochs": 60,
                "sde_sample_freq": 4,
                "use_sde": True,
                "gae_lambda": 0.9,
                "clip_range": 0.4}
env = gym.make('PointMaze_UMaze-v3', maze_map = OPEN_DIVERSE_G)
env = DummyVecEnv([lambda: env])
model = PPO("MultiInputPolicy", env, **model_kwargs)
observation = env.reset()
temp = {
    "observation": tensor(observation["observation"], device="cuda"),
    "achieved_goal": tensor(observation["achieved_goal"], device="cuda"),
    "desired_goal": tensor(observation["desired_goal"], device="cuda")
}
model.policy.action_dist.exploration_mat = model.policy.action_dist.exploration_mat.to("cuda")
model.policy._predict(temp)

araffin added the custom gym env label on Jul 2, 2024
araffin (Member) commented Jul 5, 2024

Hello,
why would you try to access the private method model.policy._predict(temp)?
Does it crash if you call .learn() before?

This looks similar to #44 but should have been fixed in #45.

Maybe calling reset_noise() before predict would solve that.
Also, I would recommend using deterministic=True at test time; gSDE is meant to improve the smoothness of the action noise during training.
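
For reference, a minimal sketch of the public API (on Pendulum, for simplicity): model.predict() accepts numpy observations and moves them to the model's device internally, so no manual tensor handling is needed.

import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")
model = PPO("MlpPolicy", env, use_sde=True)
obs, _ = env.reset()
# predict() handles preprocessing and device placement internally
action, _state = model.predict(obs, deterministic=True)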

llewynS (Author) commented Jul 6, 2024

Hi,

As you suggested to me in this feature request, it is so I can use tensors directly without having to detach them or put them on the CPU.

Yes, where I actually use it in my code I call model.learn(0) first and still get the error.

model.policy.action_dist.exploration_mat = model.policy.action_dist.exploration_mat.to("cuda") resolves the issue, but it would be better if the library did not require this, imo.

araffin (Member) commented Jul 29, 2024

Hello,
as I wrote, it seems that calling model.policy.reset_noise() before the first predict solves the issue, since it re-samples the exploration matrix on the policy's current device (btw, the provided code was not working):

import gymnasium as gym
import numpy as np
import torch as th
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")
model = PPO("MlpPolicy", env, use_sde=True, seed=1, verbose=1)
obs, _ = env.reset()

device = model.device

# Single observation
tensor = th.as_tensor(obs[np.newaxis, ...]).to(device)
# Multiple observations
multi_obs = th.cat([tensor] * 5, dim=0).to(device)

# Sample noise for gSDE on the correct device
model.policy.reset_noise()
with th.no_grad():
    model.policy._predict(tensor)
    model.policy._predict(multi_obs)

araffin closed this as completed Aug 13, 2024