
[Bug]: PPO using SDE device issue. #1957

Closed
5 tasks done
llewynS opened this issue Jul 2, 2024 · 4 comments
Labels: bug (Something isn't working), custom gym env (Issue related to Custom Gym Env)

Comments

llewynS commented Jul 2, 2024

🐛 Bug

I get a device mismatch when attempting to use PPO with a MultiInput (dict) observation.

The error occurs when calling:


with torch.no_grad():
    actions = myppo.policy._predict(inp_dict, deterministic=isTraining)
File "C:\Users\User\AppData\Roaming\Python\Python311\site-packages\stable_baselines3\common\distributions.py", line 597, in get_noise
    return th.mm(latent_sde, self.exploration_mat)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

Looking at the Stable-Baselines3 source, the issue comes from this code:

def sample_weights(self, log_std: th.Tensor, batch_size: int = 1) -> None:
    """
    Sample weights for the noise exploration matrix,
    using a centered Gaussian distribution.

    :param log_std:
    :param batch_size:
    """
    std = self.get_std(log_std)
    self.weights_dist = Normal(th.zeros_like(std), std)
    # Reparametrization trick to pass gradients
    self.exploration_mat = self.weights_dist.rsample()
    # Pre-compute matrices in case of parallel exploration
    self.exploration_matrices = self.weights_dist.rsample((batch_size,))

Doing some more digging, the problem is actually in this class: it doesn't define a device, and when the network is created only standard Python types are passed in, so the pre-sampled matrices end up on the CPU. The class needs to be modified to take a device and use the device of the model it is being used with.
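
For illustration, a hypothetical helper along those lines (the name and function are made up, not part of SB3) could pull the device from the policy's parameters and move the pre-sampled matrices onto it:

def move_sde_noise_to_policy_device(policy) -> None:
    """Hypothetical helper (not in SB3): move the pre-sampled gSDE
    exploration matrices onto the same device as the policy's parameters."""
    device = next(policy.parameters()).device
    dist = policy.action_dist
    dist.exploration_mat = dist.exploration_mat.to(device)
    dist.exploration_matrices = dist.exploration_matrices.to(device)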

To Reproduce

from stable_baselines3 import PPO
import gymnasium as gym
import gymnasium_robotics  # needed to register the PointMaze envs
from stable_baselines3.common.vec_env import DummyVecEnv
from torch import tensor
# from rlos.rl.core.concreteclasses.wrappedlibs.SB3 import PPOSB3Agent
OPEN_DIVERSE_G = [[1, 1, 1, 1, 1, 1, 1],
                [1, "c", "c", 0, "c", "c", 1],
                [1, 1, 1, 1, 1, 1, 1]]
model_kwargs = {"verbose": 1, 
                "learning_rate": 3e-5,
                "policy_kwargs": dict(net_arch=[256,256]),
                "gamma": 0.95,
                "device": "cuda",
                "vf_coef": 0.5,
                "ent_coef": 0.0,
                "max_grad_norm": 0.5,
                "normalize_advantage": True,
                "n_steps": 512,
                "n_epochs": 60,
                "sde_sample_freq": 4,
                "use_sde": True,
                "gae_lambda": 0.9,
                "clip_range": 0.4}
env = gym.make('PointMaze_UMaze-v3', maze_map = OPEN_DIVERSE_G)
env = DummyVecEnv([lambda: env])
model = PPO("MultiInputPolicy", env, **model_kwargs)
observation = env.reset()
temp = {
    "observation": tensor(observation["observation"], device="cuda"),
    "achieved_goal": tensor(observation["achieved_goal"], device="cuda"),
    "desired_goal": tensor(observation["desired_goal"], device="cuda")
}
model.policy._predict(temp)

If you change the model_kwargs so that use_sde is False, it works as expected.
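
A quick check (a sketch, assuming the model created above with use_sde=True and device="cuda") shows the mismatch directly:

# The policy parameters have been moved to CUDA...
print(next(model.policy.parameters()).device)           # cuda:0
# ...but the pre-sampled gSDE exploration matrix is still on the CPU
print(model.policy.action_dist.exploration_mat.device)  # cpu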

Relevant log output / Error message

No response

System Info

No response

Checklist

  • My issue does not relate to a custom gym environment. (Use the custom gym env template instead)
  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • I have provided a minimal and working example to reproduce the bug
  • I've used the markdown code blocks for both code and stack traces.
llewynS added the bug label on Jul 2, 2024
llewynS (Author) commented Jul 2, 2024

After a lot of digging around, I've noted that the policy is created on the CPU and then moved to the selected device in the on-policy algorithm class at this line.

This works for the action distributions that don't use gSDE, but the gSDE one keeps pre-sampled exploration matrices that are not moved along with the policy. A workaround to get it to work is to do this:

from stable_baselines3 import PPO
import gymnasium as gym
import gymnasium_robotics  # needed to register the PointMaze envs
from stable_baselines3.common.vec_env import DummyVecEnv
from torch import tensor
# from rlos.rl.core.concreteclasses.wrappedlibs.SB3 import PPOSB3Agent
OPEN_DIVERSE_G = [[1, 1, 1, 1, 1, 1, 1],
                [1, "c", "c", 0, "c", "c", 1],
                [1, 1, 1, 1, 1, 1, 1]]
model_kwargs = {"verbose": 1, 
                "learning_rate": 3e-5,
                "policy_kwargs": dict(net_arch=[256,256]),
                "gamma": 0.95,
                "device": "cuda",
                "vf_coef": 0.5,
                "ent_coef": 0.0,
                "max_grad_norm": 0.5,
                "normalize_advantage": True,
                "n_steps": 512,
                "n_epochs": 60,
                "sde_sample_freq": 4,
                "use_sde": True,
                "gae_lambda": 0.9,
                "clip_range": 0.4}
env = gym.make('PointMaze_UMaze-v3', maze_map = OPEN_DIVERSE_G)
env = DummyVecEnv([lambda: env])
model = PPO("MultiInputPolicy", env, **model_kwargs)
observation = env.reset()
temp = {
    "observation": tensor(observation["observation"], device="cuda"),
    "achieved_goal": tensor(observation["achieved_goal"], device="cuda"),
    "desired_goal": tensor(observation["desired_goal"], device="cuda")
}
model.policy.action_dist.exploration_mat = model.policy.action_dist.exploration_mat.to("cuda")
model.policy._predict(temp)

araffin added the custom gym env label on Jul 2, 2024
araffin (Member) commented Jul 5, 2024

Hello,
why would you try to access the private method model.policy._predict(temp)?
Does it crash if you call .learn() before?

This looks similar to #44 but should have been fixed in #45.

Maybe calling reset_noise() before predict would solve that.
Also, I would recommend using deterministic=True at test time; gSDE is meant to improve the smoothness of the action noise during training.
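
For reference, a minimal sketch of the public API (on Pendulum, for simplicity): model.predict() accepts numpy observations and moves them to the model's device internally, so no manual tensor handling is needed.

import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")
model = PPO("MlpPolicy", env, use_sde=True)
obs, _ = env.reset()
# predict() handles preprocessing and device placement internally
action, _state = model.predict(obs, deterministic=True)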

llewynS (Author) commented Jul 6, 2024

Hi,

As you suggested to me in this feature request, it is so I can use tensors directly without having to detach them or put them on the CPU.

Yes, where I actually use it in my code I call model.learn(0) first and still get the error.

model.policy.action_dist.exploration_mat = model.policy.action_dist.exploration_mat.to("cuda") resolves the issue, but it would be better if the library did not require this, imo.

araffin (Member) commented Jul 29, 2024

Hello,
as I wrote, it seems that calling model.policy.reset_noise() before the first predict solves the issue, since it re-samples the exploration matrix on the policy's current device (btw, the provided code was not working):

import gymnasium as gym
import numpy as np
import torch as th
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")
model = PPO("MlpPolicy", env, use_sde=True, seed=1, verbose=1)
obs, _ = env.reset()

device = model.device

# Single observation
tensor = th.as_tensor(obs[np.newaxis, ...]).to(device)
# Multiple observations
multi_obs = th.cat([tensor] * 5, dim=0).to(device)

# Sample noise for gSDE on the correct device
model.policy.reset_noise()
with th.no_grad():
    model.policy._predict(tensor)
    model.policy._predict(multi_obs)

araffin closed this as completed Aug 13, 2024