Implement HER #120
Conversation
docs/misc/changelog.rst
Outdated
@@ -25,6 +25,7 @@ New Features:
- Refactored opening paths for saving and loading to use strings, pathlib or io.BufferedIOBase (@PartiallyTyped)
- Added ``DDPG`` algorithm as a special case of ``TD3``.
- Introduced ``BaseModel`` abstract parent for ``BasePolicy``, which critics inherit from.
- Added Hindsight Experience Replay ``HER``. (@megan-klaiber)
you will also need to update the documentation: add HER to the module docs and to the examples (you can mostly copy-paste what was done in the SB2 documentation ;))
stable_baselines3/her/her.py
Outdated
use_sde: bool = False,
sde_sample_freq: int = -1,
use_sde_at_warmup: bool = False,
sde_support: bool = True,
sde_support should not be here
stable_baselines3/her/her.py
Outdated
    self.goal_strategy, GoalSelectionStrategy
), "Invalid goal selection strategy," "please use one of {}".format(list(GoalSelectionStrategy))

self.env = ObsWrapper(env)
you should wrap it only afterwards, and check whether the wrapper is needed or not
stable_baselines3/her/her.py
Outdated
assert isinstance(
    self.goal_strategy, GoalSelectionStrategy
), "Invalid goal selection strategy," "please use one of {}".format(list(GoalSelectionStrategy))
- ), "Invalid goal selection strategy," "please use one of {}".format(list(GoalSelectionStrategy))
+ ), f"Invalid goal selection strategy, please use one of {list(GoalSelectionStrategy)}"

we require Python 3.6+, so you can use f-strings
stable_baselines3/her/her.py
Outdated
# get arguments for the model initialization
model_signature = signature(model.__init__)
arguments = locals()
model_init_dict = {
you need that because HER inherits from the off-policy class? I would make it inherit from BaseAlgorithm then.
It seems that you are initializing two models (and two replay buffers, including one that you don't use)
or maybe keep OffPolicyAlgorithm as the base class but initialize an empty buffer, so you can re-use learn() from the base class
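A rough sketch of that second option, with toy classes (nothing here is the actual SB3 code; names and signatures are illustrative only):

```python
class OffPolicyAlgorithmLike:
    """Toy stand-in for an off-policy base class."""

    def __init__(self, buffer_size: int):
        # normally a large preallocated replay buffer
        self.replay_buffer = [None] * buffer_size
        self.num_timesteps = 0

    def learn(self, total_timesteps: int) -> None:
        # shared training loop that subclasses re-use unchanged
        self.num_timesteps += total_timesteps


class HERLike(OffPolicyAlgorithmLike):
    """Toy HER: keeps the base class but skips its buffer allocation."""

    def __init__(self):
        # empty buffer so learn() can still be inherited from the base
        super().__init__(buffer_size=0)
        # HER stores whole episodes in its own structure instead
        self.episode_buffer = []
```

The point of the sketch: nothing in `learn()` needs to be duplicated in the subclass when the base buffer is simply left empty.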
# buffer with episodes
self.buffer = []
# TODO just for typing reason , need another solution
?
# TODO just for typing reason , need another solution
self.observations = np.zeros((self.buffer_size, self.n_envs,) + self.obs_shape, dtype=observation_space.dtype)
self.goal_strategy = goal_strategy
self.her_ratio = 1 - (1.0 / (1 + her_ratio))
missing comment, looks weird compared to what is described in the docstring
]

# concatenate observation with (desired) goal
obs = [np.concatenate([o["observation"], o["desired_goal"]], axis=1) for o in observations]
please avoid one-character variables; you can use obs_ instead
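For illustration, with made-up shapes, the concatenation above turns each dict observation into one flat per-env array:

```python
import numpy as np

# two hypothetical dict observations: a 3-dim observation and a 2-dim goal,
# each with a leading env dimension of 1
observations = [
    {"observation": np.ones((1, 3)), "desired_goal": np.zeros((1, 2))},
    {"observation": np.ones((1, 3)), "desired_goal": np.zeros((1, 2))},
]

# concatenate observation with (desired) goal along the feature axis
obs_ = [np.concatenate([o["observation"], o["desired_goal"]], axis=1) for o in observations]
# each entry now has shape (1, 5): 3 observation features + 2 goal features
```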
her_episode_lengths = episode_lengths[her_idxs]

# get new goals with goal selection strategy
if self.goal_strategy == GoalSelectionStrategy.FINAL:
this logic cannot be shared with the "offline" version?
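The offline and online samplers could indeed share one goal-selection helper. A minimal sketch of such shared logic (hypothetical function name and signature, not the PR's actual code):

```python
from enum import Enum

import numpy as np


class GoalSelectionStrategy(Enum):
    FINAL = 0    # replay with the final achieved goal of the episode
    FUTURE = 1   # replay with a goal achieved after the transition
    EPISODE = 2  # replay with any goal achieved during the episode


def sample_goal_indices(strategy, transition_idxs, episode_length, rng):
    # For each transition, return the index of the step whose achieved
    # goal will be substituted as the new desired goal.
    if strategy == GoalSelectionStrategy.FINAL:
        return np.full_like(transition_idxs, episode_length - 1)
    if strategy == GoalSelectionStrategy.FUTURE:
        # any step strictly after the transition
        # (assumes no transition is the last step of the episode)
        return rng.integers(transition_idxs + 1, episode_length)
    if strategy == GoalSelectionStrategy.EPISODE:
        return rng.integers(0, episode_length, size=len(transition_idxs))
    raise ValueError(f"Invalid goal selection strategy {strategy}")
```

Both sampling paths could then call this one helper with their own transition indices.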
stable_baselines3/her/obs_wrapper.py
Outdated
def close(self):
    return self.venv.close()

def get_attr(self, attr_name, indices=None):
you don't need to re-implement those as they are already in the base wrapper class, no?
stable_baselines3/her/her.py
Outdated
self.model._last_original_obs, new_obs_, reward_ = observation, new_obs, reward

# add current transition to episode storage
self.episode_storage.append((self.model._last_original_obs, buffer_action, reward_, new_obs_, done))
Would be clearer to use a NamedTuple (cf. what is done for the replay buffer)
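As a sketch of that NamedTuple suggestion (field names here are illustrative, not the PR's actual ones):

```python
from typing import Any, NamedTuple

import numpy as np


class Transition(NamedTuple):
    # one (s, a, r, s', done) entry of the episode storage
    obs: Any
    action: np.ndarray
    reward: float
    next_obs: Any
    done: bool


# fields are then accessed by name instead of by tuple position:
transition = Transition(
    obs={"observation": np.zeros(3), "desired_goal": np.zeros(2)},
    action=np.array([0.5]),
    reward=-1.0,
    next_obs={"observation": np.ones(3), "desired_goal": np.zeros(2)},
    done=False,
)
```

Reading `transition.reward` is harder to get wrong than remembering that index 2 holds the reward.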
stable_baselines3/her/her.py
Outdated
self.model.actor.reset_noise()

# Select action randomly or according to policy
action, buffer_action = self.model._sample_action(learning_starts, action_noise)
after thinking more about it, I think we need to define __getattr__ to automatically retrieve the attribute from self.model if present. This would allow writing self._sample_action() directly.
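A minimal sketch of that __getattr__ delegation, with toy classes standing in for the real model and wrapper:

```python
class ToyModel:
    """Stand-in for the wrapped off-policy model."""

    def _sample_action(self, learning_starts, action_noise=None):
        return "action", "buffer_action"


class HERWrapper:
    def __init__(self, model):
        self.model = model

    def __getattr__(self, name):
        # __getattr__ is only invoked when normal attribute lookup fails,
        # so anything HERWrapper does not define itself is transparently
        # fetched from the wrapped model
        return getattr(self.model, name)


her = HERWrapper(ToyModel())
# self.model._sample_action(...) can now be written as self._sample_action(...)
action, buffer_action = her._sample_action(100)
```

One caveat worth noting: because `__getattr__` fires on any failed lookup, typos in attribute names surface as errors on the inner model rather than on the wrapper.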
new_rewards = np.array(rewards)
new_rewards[her_idxs] = [
    self.env.env_method("compute_reward", ag, her_new_goals, None)
    for ag, new_goal in zip(achieved_goals, her_new_goals)
Please avoid names without meaning: achieved_goal instead of ag ;)
self.buffer[idx] = episode
self.n_transitions_stored -= self.buffer[idx] - episode_length

if self.n_transitions_stored == self.size():
can be simplified
def get_current_size(self):
    return self.n_transitions_stored

def get_transitions_stored(self):
Maybe make self.n_transitions_stored private and create a getter using @property
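That suggestion, sketched on a toy class (attribute names chosen for illustration):

```python
class EpisodeBuffer:
    def __init__(self):
        self._n_transitions_stored = 0  # private attribute

    @property
    def n_transitions_stored(self) -> int:
        # read-only getter, replacing get_transitions_stored()
        return self._n_transitions_stored

    def add(self, n: int) -> None:
        self._n_transitions_stored += n


buffer = EpisodeBuffer()
buffer.add(5)
# accessed as an attribute, not a method call
assert buffer.n_transitions_stored == 5
```

Since no setter is defined, assigning to `buffer.n_transitions_stored` raises AttributeError, which protects the counter from accidental external writes.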
def get_transitions_stored(self):
    return self.n_transitions_stored

def clear_buffer(self):
you need to re-initialize the number of transitions stored too, no?
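i.e. something along these lines, assuming the counter attribute name used elsewhere in the buffer (a sketch, not the PR's code):

```python
class HerReplayBufferSketch:
    def __init__(self):
        self.buffer = []              # list of stored episodes
        self.n_transitions_stored = 0

    def add_episode(self, episode):
        self.buffer.append(episode)
        self.n_transitions_stored += len(episode)

    def clear_buffer(self):
        self.buffer = []
        # re-initialize the transition counter as well, otherwise the
        # buffer still reports a non-zero size after being cleared
        self.n_transitions_stored = 0
```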
def get_torch_variables(self) -> Tuple[List[str], List[str]]:
    return self.model.get_torch_variables()

def save(
is there no quicker way of doing that (without duplicating too much code)?
I would store the HER-specific arguments in the model (self.model.__dict__), see what is done in SB2.
# sample virtual transitions and store them in replay buffer
self._sample_her_transitions()
# clear storage for current episode
self._episode_storage.reset()
this is not properly defined in the HER replay buffer
LGTM =)
Tested on simulated envs and on a real robot, time to merge now.
depends on which Fetch env; please look at the RL zoo: https://github.com/DLR-RM/rl-baselines3-zoo.
* Added working her version, Online sampling is missing.
* Updated test_her.
* Added first version of online her sampling. Still problems with tensor dimensions.
* Reformat
* Fixed tests
* Added some comments.
* Updated changelog.
* Add missing init file
* Fixed some small bugs.
* Reduced arguments for HER, small changes.
* Added getattr. Fixed bug for online sampling.
* Updated save/load functions. Small changes.
* Added her to init.
* Updated save method.
* Updated her ratio.
* Move obs_wrapper
* Added DQN test.
* Fix potential bug
* Offline and online her share same sample_goal function.
* Changed lists into arrays.
* Updated her test.
* Fix online sampling
* Fixed action bug. Updated time limit for episodes.
* Updated convert_dict method to take keys as arguments.
* Renamed obs dict wrapper.
* Seed bit flipping env
* Remove get_episode_dict
* Add fast online sampling version
* Added documentation.
* Vectorized reward computation
* Vectorized goal sampling
* Update time limit for episodes in online her sampling.
* Fix max episode length inference
* Bug fix for Fetch envs
* Fix for HER + gSDE
* Reformat (new black version)
* Added info dict to compute new reward. Check her_replay_buffer again.
* Fix info buffer
* Updated done flag.
* Fixes for gSDE
* Offline her version uses now HerReplayBuffer as episode storage.
* Fix num_timesteps computation
* Fix get torch params
* Vectorized version for offline sampling.
* Modified offline her sampling to use sample method of her_replay_buffer
* Updated HER tests.
* Updated documentation
* Cleanup docstrings
* Updated to review comments
* Fix pytype
* Update according to review comments.
* Removed random goal strategy. Updated sample transitions.
* Updated migration. Removed time signal removal.
* Update doc
* Fix potential load issue
* Add VecNormalize support for dict obs
* Updated saving/loading replay buffer for HER.
* Fix test memory usage
* Fixed save/load replay buffer.
* Fixed save/load replay buffer
* Fixed transition index after loading replay buffer in online sampling
* Better error handling
* Add tests for get_time_limit
* More tests for VecNormalize with dict obs
* Update doc
* Improve HER description
* Add test for sde support
* Add comments
* Add comments
* Remove check that was always valid
* Fix for terminal observation
* Updated buffer size in offline version and reset of HER buffer
* Reformat
* Update doc
* Remove np.empty + add doc
* Fix loading
* Updated loading replay buffer
* Separate online and offline sampling + bug fixes
* Update tensorboard log name
* Version bump
* Bug fix for special case

Co-authored-by: Antonin Raffin <antonin.raffin@dlr.de>
Co-authored-by: Antonin RAFFIN <antonin.raffin@ensta.org>
It looks like the code added in this PR breaks normal envs with a dict observation space, since it assumes that whenever the observation space is a dictionary the user wants HER.
Specifically, this code:
doesn't verify at all that HER is what I want, and assumes the dict has a specific purpose, breaking training for any env with a dict obs space
Please read #216
@araffin thanks for the info and links. Might be good to throw a more readable error in that case. For now this works fine for me:

import gym
from stable_baselines3.common.vec_env import VecEnvWrapper

class FlattenVecWrapper(VecEnvWrapper):
    def __init__(self, env):
        super().__init__(env)
        self.observation_space = gym.spaces.flatten_space(self.venv.observation_space)

    def reset(self, **kwargs):
        observation = self.venv.reset(**kwargs)
        return self.observation(observation)

    def step_wait(self):
        observation, reward, done, info = self.venv.step_wait()
        return self.observation(observation), reward, done, info

    def observation(self, observation):
        return [gym.spaces.flatten(self.venv.observation_space, o) for o in observation]
Description

HER inherits from OffPolicyAlgorithm and takes the model as an argument. It also implements its own collect_rollout function.

HER can operate in two modes for now, online_sampling being True or False. If True, HER samples are added while sampling; otherwise they are added at the end of an episode. If online sampling is used, a custom HerReplayBuffer will be used, which stores the transitions episode-wise.

Motivation and Context

closes #8
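As a toy illustration of the two modes described above (the helper names and dict-based transitions are hypothetical, not the PR's API):

```python
def relabel_with_final_goal(episode):
    # hypothetical helper: create virtual transitions whose desired goal
    # is the goal actually achieved at the end of the episode
    final_goal = episode[-1]["achieved_goal"]
    return [{**transition, "desired_goal": final_goal} for transition in episode]


def store_episode(episode, replay_buffer, her_episode_buffer, online_sampling):
    if online_sampling:
        # online: keep the raw episode in a HerReplayBuffer-like structure;
        # virtual goals are substituted later, when batches are sampled
        her_episode_buffer.append(episode)
    else:
        # offline: real and virtual transitions are both created now, at the
        # end of the episode, and pushed to the regular replay buffer
        replay_buffer.extend(episode)
        replay_buffer.extend(relabel_with_final_goal(episode))
```

The trade-off sketched here: offline sampling fixes the virtual goals once at episode end, while online sampling defers relabeling, so each batch can draw fresh goals.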
Types of changes

Checklist:
- make format (required)
- make check-codestyle and make lint (required)
- make pytest and make type both pass. (required)

Missing:
Note: we are using a maximum length of 127 characters per line
Results
Results on https://github.com/eleurent/highway-env
her_parking.pdf
her.pdf