SB2 vs SB3 - Performance difference #1124

Closed
MatPoliquin opened this issue Oct 16, 2022 · 11 comments
Labels: more information needed, question

Comments

MatPoliquin commented Oct 16, 2022

❓ Question

EDIT: After doing some more digging, I updated the post title and added more details using a newer version of SB3 (1.6.2).

I am using the OpenAI gym-retro env to train on games and migrated from SB2 to SB3 1.6.2. I noticed the training FPS dropped significantly, from about 1300 FPS to 900 FPS.

  • gym-retro env: Pong-Atari2600
  • num_env==24
  • PPO
  • CnnPolicy

Using Nvidia Nsight, I profiled both versions (the reports are in the Google Drive link below; you need Nsight to view them):
https://drive.google.com/drive/folders/1Lqxf-qKXTj__Hp8WUXgNHejZaJGy8oct?usp=sharing

Here are the parameters I use for PPO with SB3 (with SB2 I just use the default parameters provided by the library):

PPO(
    policy=args.nn,
    env=env,
    verbose=1,
    n_steps=128,
    n_epochs=4,
    batch_size=256,
    learning_rate=2.5e-4,
    clip_range=0.2,
    vf_coef=0.5,
    ent_coef=0.01,
    max_grad_norm=0.5,
    clip_range_vf=None,
)
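For comparison, a sketch of the SB2 PPO2 defaults this relies on (as listed in the ppo2.py file linked later in this thread):

# SB2 (stable-baselines) PPO2 defaults, for reference:
# gamma=0.99, n_steps=128, ent_coef=0.01, learning_rate=2.5e-4,
# vf_coef=0.5, max_grad_norm=0.5, lam=0.95, nminibatches=4,
# noptepochs=4, cliprange=0.2, cliprange_vf=None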

My specs:

  • Dual Xeon 2666v3
  • RTX 2060 Super 8g
  • Ubuntu 20.04
  • stable-baselines3 1.6.2
  • gym 0.26.2

Code I use to wrap the retro env (same for both SB2 and SB3 cases):

# imports shown for the SB3 case
import os
import retro

from stable_baselines3.common.atari_wrappers import ClipRewardEnv, WarpFrame
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import SubprocVecEnv, VecFrameStack, VecTransposeImage
# StochasticFrameSkip is a custom wrapper (adapted from the gym-retro examples)


def make_retro(*, game, state=None, num_players, max_episode_steps=4500, **kwargs):
    if state is None:
        state = retro.State.DEFAULT
    env = retro.make(game, state, **kwargs, players=num_players)
    return env


def init_env(output_path, num_env, state, num_players, args, use_frameskip=True, use_display=False):
    seed = 0
    start_index = 0
    start_method = None
    allow_early_resets = True

    def make_env(rank):
        def _thunk():
            env = make_retro(
                game=args.env,
                use_restricted_actions=retro.Actions.FILTERED,
                state=state,
                num_players=num_players,
            )
            env.seed(seed + rank)
            env = Monitor(env, output_path and os.path.join(output_path, str(rank)), allow_early_resets=allow_early_resets)
            if use_frameskip:
                env = StochasticFrameSkip(env, n=4, stickprob=0.25)
            env = WarpFrame(env)
            env = ClipRewardEnv(env)
            return env
        return _thunk

    # set_global_seeds(seed)  # SB2-era helper, no longer needed with SB3

    env = SubprocVecEnv([make_env(i + start_index) for i in range(num_env)], start_method=start_method)
    env = VecFrameStack(env, n_stack=4)
    env = VecTransposeImage(env)
    return env

Checklist

  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • If code there is, it is minimal and working
  • If code there is, it is formatted using the markdown code blocks for both code and stack traces.
MatPoliquin added the question label Oct 16, 2022
araffin added the more information needed label Oct 16, 2022

araffin commented Oct 17, 2022

Hello,
Please provide minimal code to reproduce the issue.
I guess you are not using SubprocVecEnv? (you should try it)

> I noticed the training FPS dropped significantly, from about 1300 FPS to 900 FPS.

There might be a performance drop because PyTorch uses eager evaluation, but probably not that much.

> It seems VecTransposeImage has a high CPU usage (as expected for a large number of envs == 24). Are there plans to do this operation on the GPU instead?

Are you sure the slowness is actually due to VecTransposeImage?
One thing you can try is setting OMP_NUM_THREADS to a lower value (start with 1); see #413 and #283.
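For illustration, a minimal way to try that (the script name below is just a placeholder):

# Option 1: set the variable before launching the training script
#   OMP_NUM_THREADS=1 python my_train_script.py
# Option 2: limit the CPU threads PyTorch uses, early in the script
import torch
torch.set_num_threads(1)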

MatPoliquin (Author) commented:

> Hello, Please provide minimal code to reproduce the issue. I guess you are not using SubprocVecEnv? (you should try it)

I already use SubprocVecEnv; I edited my post to add the code that sets up the env.

> I noticed the training FPS dropped significantly, from about 1300 FPS to 900 FPS.

> There might be a performance drop because PyTorch uses eager evaluation, but probably not that much.

Good point, that might explain at least part of it.

> It seems VecTransposeImage has a high CPU usage (as expected for a large number of envs == 24). Are there plans to do this operation on the GPU instead?

> Are you sure the slowness is actually due to VecTransposeImage? One thing you can try is setting OMP_NUM_THREADS to a lower value (start with 1); see #413 and #283.

I tried setting OMP_NUM_THREADS to lower values but it doesn't make much of a difference since I use the GPU. It only makes a difference if I force PyTorch to use the CPU, as expected.

MatPoliquin changed the title from VecTransposeImage on GPU to Difference of performance between SB1 and SB3 on Oct 19, 2022
MatPoliquin changed the title from Difference of performance between SB1 and SB3 to SB1 vs SB3 - Performance difference on Oct 19, 2022
araffin changed the title from SB1 vs SB3 - Performance difference to SB2 vs SB3 - Performance difference on Oct 19, 2022

araffin commented Oct 19, 2022

Related: #90 and #122 (comment) (there I provide a Colab notebook for comparison)

> I noticed the training FPS dropped significantly, from about 1300 FPS to 900 FPS.

Did you double-check that the hyperparameters were equivalent?
What were you using with SB2 PPO?

EDIT: the 1.4x difference seems to match the results I got with the Colab notebooks

MatPoliquin (Author) commented:

> Related: #90 and #122 (comment) (there I provide a Colab notebook for comparison)

> I noticed the training FPS dropped significantly, from about 1300 FPS to 900 FPS.

> Did you double-check that the hyperparameters were equivalent? What were you using with SB2 PPO?

I was using the default parameters:
https://github.com/hill-a/stable-baselines/blob/45beb246833b6818e0f3fc1f44336b1c52351170/stable_baselines/ppo2/ppo2.py#L53

The only parameter I am not sure about is batch_size; I experimented with different values (256, 512, 1024) and the performance is still lower.

> EDIT: the 1.4x difference seems to match the results I got with the Colab notebooks

Interesting, I did not see this, so basically my results are probably normal.


araffin commented Oct 20, 2022

> The only parameter I am not sure about is batch_size; I experimented with different values (256, 512, 1024) and the performance is still lower.

See the conversion for the batch size: https://stable-baselines3.readthedocs.io/en/master/guide/migration.html#ppo
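For the settings in this thread, a sketch of that conversion (assuming SB2's defaults of n_steps=128 and nminibatches=4, and the 24 envs mentioned above):

# SB2 PPO2 splits each rollout into `nminibatches` minibatches;
# SB3 PPO takes an explicit `batch_size` instead:
#   batch_size = n_steps * n_envs // nminibatches
batch_size = 128 * 24 // 4  # = 768 for the setup described in this issue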

> Interesting, I did not see this, so basically my results are probably normal.

yes...


araffin commented Oct 22, 2022

Reading https://pytorch.org/blog/accelerating-pytorch-vision-models-with-channels-last-on-cpu/, we should probably try the channels-last memory format; it is only a few lines of code to change (https://pytorch.org/blog/tensor-memory-format-matters/) and the shape of the tensors stays the same.
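A tiny sketch of what that means in practice (only the memory layout changes, the reported shape does not):

import torch

x = torch.randn(24, 4, 84, 84)  # NCHW batch of stacked frames
x_cl = x.to(memory_format=torch.channels_last)
print(x_cl.shape)  # torch.Size([24, 4, 84, 84]) -- same shape as before
print(x_cl.is_contiguous(memory_format=torch.channels_last))  # True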


araffin commented Oct 22, 2022

@MatPoliquin all you need to do is apparently:

x = x.to(memory_format=torch.channels_last)
model = model.to(memory_format=torch.channels_last)

I would be happy to receive your feedback if you give it a try ;)


MatPoliquin commented Nov 7, 2022

> @MatPoliquin all you need to do is apparently:
> x = x.to(memory_format=torch.channels_last)
> model = model.to(memory_format=torch.channels_last)
> I would be happy to receive your feedback if you give it a try ;)

So these changes should be made in on_policy_algorithm.py?

I modified the code below, but I'm not quite sure if it's correct.

line 102:

def _setup_model(self) -> None:
        self._setup_lr_schedule()
        self.set_random_seed(self.seed)

        buffer_cls = DictRolloutBuffer if isinstance(self.observation_space, gym.spaces.Dict) else RolloutBuffer

        self.rollout_buffer = buffer_cls(
            self.n_steps,
            self.observation_space,
            self.action_space,
            device=self.device,
            gamma=self.gamma,
            gae_lambda=self.gae_lambda,
            n_envs=self.n_envs,
        )

       
        self.rollout_buffer = self.rollout_buffer.to(memory_format=torch.channels_last)


        self.policy = self.policy_class(  # pytype:disable=not-instantiable
            self.observation_space,
            self.action_space,
            self.lr_schedule,
            use_sde=self.use_sde,
            **self.policy_kwargs  # pytype:disable=not-instantiable
        )
        self.policy = self.policy.to(self.device, memory_format=torch.channels_last)


araffin commented Nov 7, 2022

> gym 0.26.2

You are using the experimental branch, right?
Otherwise, SB3 is only compatible with gym 0.21 for now.

> I modified the code below, but I'm not quite sure if it's correct.

Yes, and you need to modify the rollout buffer.

I did some quick tests with the RL Zoo (which defaults to 8 envs); here is what I can recommend:

  • use SubprocVecEnv
  • limit the number of CPU threads used
  • try without the GPU

For instance, with the default command, I get around 800 FPS:
python train.py --algo ppo --env PongNoFrameskip-v4 -P --verbose 0 --eval-freq 100000

With subprocess envs, I get 1100 FPS:
OMP_NUM_THREADS=4 python train.py --algo ppo --env PongNoFrameskip-v4 -P --verbose 0 --eval-freq 100000 --vec-env subproc
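Outside the RL Zoo, a minimal sketch of those three recommendations in plain SB3 code (the env id and timesteps below are just placeholders):

import torch
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import SubprocVecEnv, VecFrameStack

torch.set_num_threads(1)  # limit the CPU threads used by PyTorch
# subprocess vec env instead of the default DummyVecEnv
env = make_atari_env("PongNoFrameskip-v4", n_envs=8, vec_env_cls=SubprocVecEnv)
env = VecFrameStack(env, n_stack=4)
model = PPO("CnnPolicy", env, device="cpu", verbose=0)  # try without the GPU
model.learn(total_timesteps=100_000)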

You could also try to add CNN support to the experimental SBX: araffin/sbx#6 and araffin/sbx#4

(SBX PPO is ~2x faster than SB3 PPO, but it has fewer features)


araffin commented Dec 17, 2022

> Reading https://pytorch.org/blog/accelerating-pytorch-vision-models-with-channels-last-on-cpu/, we should probably try the channels-last memory format; it is only a few lines of code to change (https://pytorch.org/blog/tensor-memory-format-matters/) and the shape of the tensors stays the same.

So, I tested that but it didn't help much.

What gave me an 8% speed boost was setting copy=False when creating the tensors (see https://github.com/DLR-RM/stable-baselines3/compare/feat/non-blocking?expand=1).
With that and subprocess envs, I can get ~1800 FPS using

python -m rl_zoo3.train --algo ppo --env PongNoFrameskip-v4 --verbose 0 -P --seed 1 -n 60000 --vec-env subproc --eval-freq -1
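For illustration, a minimal sketch of the underlying idea (not the exact diff from the linked branch): avoid forcing a copy when converting a NumPy batch to a tensor, and move it to the GPU asynchronously:

import numpy as np
import torch as th

device = "cuda" if th.cuda.is_available() else "cpu"
obs = np.zeros((24, 4, 84, 84), dtype=np.float32)  # a batch of stacked frames
# th.tensor(...) always copies the data; th.as_tensor(...) reuses the
# NumPy buffer when possible, saving one copy per conversion
obs_tensor = th.as_tensor(obs).to(device, non_blocking=True)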


araffin commented Jun 15, 2023

Small update on this: there is now an experimental SB3 + Jax = SBX version here: https://github.com/araffin/sbx

With the proper hyperparameters, SAC can run 20x faster than its PyTorch equivalent =): https://twitter.com/araffin2/status/1590714601754497024

araffin closed this as completed Sep 17, 2023