SB2 vs SB3 - Performance difference #1124

Closed
MatPoliquin opened this issue Oct 16, 2022 · 11 comments
Labels: more information needed, question

Comments

MatPoliquin commented Oct 16, 2022

❓ Question

EDIT: After doing some more digging, I updated the post title and added more details using a newer version of SB3 (1.6.2).

I am using the OpenAI gym-retro env to train on games and migrated from SB2 to SB3 1.6.2. I noticed the training FPS dropped significantly, from about 1300 FPS to 900 FPS.

  • gym-retro env: Pong-Atari2600
  • num_env==24
  • PPO
  • CnnPolicy

Using Nvidia Nsight, I profiled both versions (the reports are in the Google Drive link below; you need Nsight to view them):
https://drive.google.com/drive/folders/1Lqxf-qKXTj__Hp8WUXgNHejZaJGy8oct?usp=sharing

Here are the parameters I use for PPO with SB3 (with SB2 I just use the default parameters provided by the library):

PPO(
    policy=args.nn,
    env=env,
    verbose=1,
    n_steps=128,
    n_epochs=4,
    batch_size=256,
    learning_rate=2.5e-4,
    clip_range=0.2,
    vf_coef=0.5,
    ent_coef=0.01,
    max_grad_norm=0.5,
    clip_range_vf=None,
)
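For comparison, a sketch of the SB2 PPO2 defaults this relies on (as listed in the ppo2.py file linked later in this thread):

# SB2 (stable-baselines) PPO2 defaults, for reference:
# gamma=0.99, n_steps=128, ent_coef=0.01, learning_rate=2.5e-4,
# vf_coef=0.5, max_grad_norm=0.5, lam=0.95, nminibatches=4,
# noptepochs=4, cliprange=0.2, cliprange_vf=None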

My specs:

  • Dual Xeon 2666v3
  • RTX 2060 Super 8g
  • Ubuntu 20.04
  • stable-baselines3 1.6.2
  • gym 0.26.2

Code I use to wrap the retro env (same for both SB2 and SB3 cases):

# imports shown for the SB3 case
import os
import retro

from stable_baselines3.common.atari_wrappers import ClipRewardEnv, WarpFrame
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import SubprocVecEnv, VecFrameStack, VecTransposeImage
# StochasticFrameSkip is a custom wrapper (adapted from the gym-retro examples)


def make_retro(*, game, state=None, num_players, max_episode_steps=4500, **kwargs):
    if state is None:
        state = retro.State.DEFAULT
    env = retro.make(game, state, **kwargs, players=num_players)
    return env


def init_env(output_path, num_env, state, num_players, args, use_frameskip=True, use_display=False):
    seed = 0
    start_index = 0
    start_method = None
    allow_early_resets = True

    def make_env(rank):
        def _thunk():
            env = make_retro(
                game=args.env,
                use_restricted_actions=retro.Actions.FILTERED,
                state=state,
                num_players=num_players,
            )
            env.seed(seed + rank)
            env = Monitor(env, output_path and os.path.join(output_path, str(rank)), allow_early_resets=allow_early_resets)
            if use_frameskip:
                env = StochasticFrameSkip(env, n=4, stickprob=0.25)
            env = WarpFrame(env)
            env = ClipRewardEnv(env)
            return env
        return _thunk

    # set_global_seeds(seed)  # SB2-era helper, no longer needed with SB3

    env = SubprocVecEnv([make_env(i + start_index) for i in range(num_env)], start_method=start_method)
    env = VecFrameStack(env, n_stack=4)
    env = VecTransposeImage(env)
    return env

Checklist

  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • If code there is, it is minimal and working
  • If code there is, it is formatted using the markdown code blocks for both code and stack traces.
MatPoliquin added the question label Oct 16, 2022
araffin added the more information needed label Oct 16, 2022

araffin commented Oct 17, 2022

Hello,
Please provide minimal code to reproduce the issue.
I guess you are not using SubprocVecEnv? (you should try it)

> I noticed the training FPS dropped significantly, from about 1300 FPS to 900 FPS.

There might be a performance drop because PyTorch uses eager evaluation, but probably not that much.

> It seems VecTransposeImage has a high CPU usage (as expected for a large number of envs == 24). Are there plans to do this operation on the GPU instead?

Are you sure the slowness is actually due to VecTransposeImage?
One thing you can try is setting OMP_NUM_THREADS to a lower value (start with 1); see #413 and #283.
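For illustration, a minimal way to try that (the script name below is just a placeholder):

# Option 1: set the variable before launching the training script
#   OMP_NUM_THREADS=1 python my_train_script.py
# Option 2: limit the CPU threads PyTorch uses, early in the script
import torch
torch.set_num_threads(1)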

MatPoliquin (Author) commented:

> Hello, Please provide minimal code to reproduce the issue. I guess you are not using SubprocVecEnv? (you should try it)

I already use SubprocVecEnv; I edited my post to add the code that sets up the env.

> I noticed the training FPS dropped significantly, from about 1300 FPS to 900 FPS.

> There might be a performance drop because PyTorch uses eager evaluation, but probably not that much.

Good point, that might explain at least part of it.

> It seems VecTransposeImage has a high CPU usage (as expected for a large number of envs == 24). Are there plans to do this operation on the GPU instead?

> Are you sure the slowness is actually due to VecTransposeImage? One thing you can try is setting OMP_NUM_THREADS to a lower value (start with 1); see #413 and #283.

I tried setting OMP_NUM_THREADS to lower values but it doesn't make much of a difference since I use the GPU. It only makes a difference if I force PyTorch to use the CPU, as expected.

MatPoliquin changed the title from VecTransposeImage on GPU to Difference of performance between SB1 and SB3 on Oct 19, 2022
MatPoliquin changed the title from Difference of performance between SB1 and SB3 to SB1 vs SB3 - Performance difference on Oct 19, 2022
araffin changed the title from SB1 vs SB3 - Performance difference to SB2 vs SB3 - Performance difference on Oct 19, 2022

araffin commented Oct 19, 2022

Related: #90 and #122 (comment) (there I provide a Colab notebook for comparison)

> I noticed the training FPS dropped significantly, from about 1300 FPS to 900 FPS.

Did you double-check that the hyperparameters were equivalent?
What were you using with SB2 PPO?

EDIT: the 1.4x difference seems to match the results I got with the Colab notebooks

MatPoliquin (Author) commented:

> Related: #90 and #122 (comment) (there I provide a Colab notebook for comparison)

> I noticed the training FPS dropped significantly, from about 1300 FPS to 900 FPS.

> Did you double-check that the hyperparameters were equivalent? What were you using with SB2 PPO?

I was using the default parameters:
https://github.com/hill-a/stable-baselines/blob/45beb246833b6818e0f3fc1f44336b1c52351170/stable_baselines/ppo2/ppo2.py#L53

The only parameter I am not sure about is batch_size; I experimented with different values (256, 512, 1024) and the performance is still lower.

> EDIT: the 1.4x difference seems to match the results I got with the Colab notebooks

Interesting, I did not see this, so basically my results are probably normal.


araffin commented Oct 20, 2022

> The only parameter I am not sure about is batch_size; I experimented with different values (256, 512, 1024) and the performance is still lower.

See the conversion for the batch size: https://stable-baselines3.readthedocs.io/en/master/guide/migration.html#ppo
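For the settings in this thread, a sketch of that conversion (assuming SB2's defaults of n_steps=128 and nminibatches=4, and the 24 envs mentioned above):

# SB2 PPO2 splits each rollout into `nminibatches` minibatches;
# SB3 PPO takes an explicit `batch_size` instead:
#   batch_size = n_steps * n_envs // nminibatches
batch_size = 128 * 24 // 4  # = 768 for the setup described in this issue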

> Interesting, I did not see this, so basically my results are probably normal.

yes...


araffin commented Oct 22, 2022

Reading https://pytorch.org/blog/accelerating-pytorch-vision-models-with-channels-last-on-cpu/, we should probably try the channels-last memory format; it is only a few lines of code to change (https://pytorch.org/blog/tensor-memory-format-matters/) and the shape of the tensors stays the same.
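A tiny sketch of what that means in practice (only the memory layout changes, the reported shape does not):

import torch

x = torch.randn(24, 4, 84, 84)  # NCHW batch of stacked frames
x_cl = x.to(memory_format=torch.channels_last)
print(x_cl.shape)  # torch.Size([24, 4, 84, 84]) -- same shape as before
print(x_cl.is_contiguous(memory_format=torch.channels_last))  # True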


araffin commented Oct 22, 2022

@MatPoliquin all you need to do is apparently:

x = x.to(memory_format=torch.channels_last)
model = model.to(memory_format=torch.channels_last)

I would be happy to receive your feedback if you give it a try ;)


MatPoliquin commented Nov 7, 2022

> @MatPoliquin all you need to do is apparently:
> x = x.to(memory_format=torch.channels_last)
> model = model.to(memory_format=torch.channels_last)
> I would be happy to receive your feedback if you give it a try ;)

So these changes should be made in on_policy_algorithm.py?

I modified the code below, but I'm not quite sure if it's correct.

line 102:

def _setup_model(self) -> None:
        self._setup_lr_schedule()
        self.set_random_seed(self.seed)

        buffer_cls = DictRolloutBuffer if isinstance(self.observation_space, gym.spaces.Dict) else RolloutBuffer

        self.rollout_buffer = buffer_cls(
            self.n_steps,
            self.observation_space,
            self.action_space,
            device=self.device,
            gamma=self.gamma,
            gae_lambda=self.gae_lambda,
            n_envs=self.n_envs,
        )

       
        self.rollout_buffer = self.rollout_buffer.to(memory_format=torch.channels_last)


        self.policy = self.policy_class(  # pytype:disable=not-instantiable
            self.observation_space,
            self.action_space,
            self.lr_schedule,
            use_sde=self.use_sde,
            **self.policy_kwargs  # pytype:disable=not-instantiable
        )
        self.policy = self.policy.to(self.device, memory_format=torch.channels_last)


araffin commented Nov 7, 2022

> gym 0.26.2

You are using the experimental branch, right?
Otherwise, SB3 is only compatible with gym 0.21 for now.

> I modified the code below, but I'm not quite sure if it's correct.

Yes, and you need to modify the rollout buffer.

I did some quick tests with the RL Zoo (which defaults to 8 envs); here is what I can recommend:

  • use SubprocVecEnv
  • limit the number of CPU threads used
  • try without the GPU

For instance, with the default command, I get around 800 FPS:
python train.py --algo ppo --env PongNoFrameskip-v4 -P --verbose 0 --eval-freq 100000

With subprocess envs, I get 1100 FPS:
OMP_NUM_THREADS=4 python train.py --algo ppo --env PongNoFrameskip-v4 -P --verbose 0 --eval-freq 100000 --vec-env subproc
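Outside the RL Zoo, a minimal sketch of those three recommendations in plain SB3 code (the env id and timesteps below are just placeholders):

import torch
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import SubprocVecEnv, VecFrameStack

torch.set_num_threads(1)  # limit the CPU threads used by PyTorch
# subprocess vec env instead of the default DummyVecEnv
env = make_atari_env("PongNoFrameskip-v4", n_envs=8, vec_env_cls=SubprocVecEnv)
env = VecFrameStack(env, n_stack=4)
model = PPO("CnnPolicy", env, device="cpu", verbose=0)  # try without the GPU
model.learn(total_timesteps=100_000)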

You could also try to add CNN support to the experimental SBX: araffin/sbx#6 and araffin/sbx#4

(SBX PPO is ~2x faster than SB3 PPO, but it has fewer features)


araffin commented Dec 17, 2022

> Reading https://pytorch.org/blog/accelerating-pytorch-vision-models-with-channels-last-on-cpu/, we should probably try the channels-last memory format; it is only a few lines of code to change (https://pytorch.org/blog/tensor-memory-format-matters/) and the shape of the tensors stays the same.

So, I tested that but it didn't help much.

What gave me an 8% speed boost was setting copy=False when creating the tensors (see https://github.com/DLR-RM/stable-baselines3/compare/feat/non-blocking?expand=1).
With that and subprocess envs, I can get ~1800 FPS using

python -m rl_zoo3.train --algo ppo --env PongNoFrameskip-v4 --verbose 0 -P --seed 1 -n 60000 --vec-env subproc --eval-freq -1
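For illustration, a minimal sketch of the underlying idea (not the exact diff from the linked branch): avoid forcing a copy when converting a NumPy batch to a tensor, and move it to the GPU asynchronously:

import numpy as np
import torch as th

device = "cuda" if th.cuda.is_available() else "cpu"
obs = np.zeros((24, 4, 84, 84), dtype=np.float32)  # a batch of stacked frames
# th.tensor(...) always copies the data; th.as_tensor(...) reuses the
# NumPy buffer when possible, saving one copy per conversion
obs_tensor = th.as_tensor(obs).to(device, non_blocking=True)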


araffin commented Jun 15, 2023

Small update on this: there is now an experimental SB3 + Jax = SBX version here: https://github.com/araffin/sbx

With the proper hyperparameters, SAC can run 20x faster than its PyTorch equivalent =): https://twitter.com/araffin2/status/1590714601754497024

araffin closed this as completed Sep 17, 2023