
Implement HER #120
Merged (109 commits, Oct 22, 2020)

Conversation

@megan-klaiber (Collaborator) commented Jul 23, 2020

Description

HER inherits from OffPolicyAlgorithm and takes the model as an argument. It also implements its own collect_rollout function.
HER can operate in two modes for now, controlled by online_sampling being True or False. If True, HER transitions are created while sampling from the buffer; otherwise they are added to the buffer at the end of each episode. With online sampling, a custom HerReplayBuffer is used which stores the transitions episode-wise.
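
For orientation, a minimal usage sketch of the interface described above; the argument names (online_sampling, max_episode_length, ...) are illustrative assumptions and may differ from the merged API:

    # Sketch only: how HER might be used, assuming the wrapper-style interface above.
    import gym
    import highway_env  # noqa: F401  # registers parking-v0 (the env used in the results below)

    from stable_baselines3 import HER, SAC

    env = gym.make("parking-v0")  # GoalEnv with observation/achieved_goal/desired_goal keys

    model = HER(
        "MlpPolicy",
        env,
        SAC,                     # the off-policy model class that HER wraps
        online_sampling=True,    # create virtual transitions at sampling time
        max_episode_length=100,  # needed to store transitions episode-wise
        verbose=1,
    )
    model.learn(total_timesteps=10_000)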

Motivation and Context

  • I have raised an issue to propose this change (required for new features and bug fixes)

closes #8

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist:

  • I've read the CONTRIBUTION guide (required)
  • I have updated the changelog accordingly (required).
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.
  • I have reformatted the code using make format (required)
  • I have checked the codestyle using make check-codestyle and make lint (required)
  • I have ensured make pytest and make type both pass. (required)

Missing:

  • Handle info dict in reward computation
  • Refactor offline version to use her replay buffer
  • VecNormalize support
  • more tests for VecNormalize support
  • benchmark sb2 vs sb3 on parking env
  • test to check errors for max episode length

Note: we are using a maximum length of 127 characters per line

Results

Results on https://github.com/eleurent/highway-env (parking-v0). Learning-curve plots are attached: her_parking.pdf and her.pdf.

@araffin self-requested a review on Jul 23, 2020
@araffin changed the title from "Her" to "Implement HER" on Jul 23, 2020
@@ -25,6 +25,7 @@ New Features:
- Refactored opening paths for saving and loading to use strings, pathlib or io.BufferedIOBase (@PartiallyTyped)
- Added ``DDPG`` algorithm as a special case of ``TD3``.
- Introduced ``BaseModel`` abstract parent for ``BasePolicy``, which critics inherit from.
- Added Hindsight Experience Replay ``HER``. (@megan-klaiber)
Member:

you will also need to update the documentation: add HER to the module docs and to the examples (you can mostly copy-paste what was done in the SB2 documentation ;))

use_sde: bool = False,
sde_sample_freq: int = -1,
use_sde_at_warmup: bool = False,
sde_support: bool = True,
Member:

sde support should not be here

self.goal_strategy, GoalSelectionStrategy
), "Invalid goal selection strategy," "please use one of {}".format(list(GoalSelectionStrategy))

self.env = ObsWrapper(env)
Member:

you should wrap it only afterward and check if the wrapper is needed or not
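
A minimal sketch of the kind of check being asked for, as a fragment of the constructor; it assumes the wrapper is only needed for dict (GoalEnv-style) observation spaces, which is in line with the check quoted later in this thread:

    # wrap only when the observation space is a dict (e.g. a GoalEnv);
    # otherwise keep the env as-is
    if isinstance(env.observation_space, gym.spaces.Dict):
        env = ObsWrapper(env)
    self.env = env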


assert isinstance(
self.goal_strategy, GoalSelectionStrategy
), "Invalid goal selection strategy," "please use one of {}".format(list(GoalSelectionStrategy))
Member:

Suggested change
), "Invalid goal selection strategy," "please use one of {}".format(list(GoalSelectionStrategy))
), f"Invalid goal selection strategy, please use one of {list(GoalSelectionStrategy))}"

we require python 3.6+, so you can use f-strings

# get arguments for the model initialization
model_signature = signature(model.__init__)
arguments = locals()
model_init_dict = {
Member:

you need that because HER inherits from the off-policy class? I would make it inherit from the BaseAlgorithm then.
It seems that you are initializing two models (and two replay buffers, including one that you don't use)

Member:

Or maybe keep OffPolicyAlgorithm as the base class but initialize an empty buffer, so you can re-use learn() from the base class.


# buffer with episodes
self.buffer = []
# TODO just for typing reason , need another solution
Member:

?

# TODO just for typing reason , need another solution
self.observations = np.zeros((self.buffer_size, self.n_envs,) + self.obs_shape, dtype=observation_space.dtype)
self.goal_strategy = goal_strategy
self.her_ratio = 1 - (1.0 / (1 + her_ratio))
Member:

missing comment, looks weird compared to what is described in the docstring
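
For context, a plausible reading of the line above, assuming the constructor argument is the number of virtual (HER) transitions per real transition (n_sampled_goal); the stored value is then the probability that a sampled transition gets a relabelled goal:

    n_sampled_goal = 4  # illustrative: 4 virtual transitions per real transition
    her_sample_prob = 1 - (1.0 / (1 + n_sampled_goal))  # fraction of relabelled samples
    assert abs(her_sample_prob - n_sampled_goal / (n_sampled_goal + 1)) < 1e-12  # 0.8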

]

# concatenate observation with (desired) goal
obs = [np.concatenate([o["observation"], o["desired_goal"]], axis=1) for o in observations]
Member:

please avoid one-character variable names; you can use obs_ instead

her_episode_lenghts = episode_lengths[her_idxs]

# get new goals with goal selection strategy
if self.goal_strategy == GoalSelectionStrategy.FINAL:
Member:

this logic cannot be shared with the "offline" version?

def close(self):
return self.venv.close()

def get_attr(self, attr_name, indices=None):
Member:

you don't need to re-implement those as they are already in the base wrapper class, no?

self.model._last_original_obs, new_obs_, reward_ = observation, new_obs, reward

# add current transition to episode storage
self.episode_storage.append((self.model._last_original_obs, buffer_action, reward_, new_obs_, done))
Member:

Would be clearer to use NamedTuple (cf what is done for the replay buffer)
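
A minimal sketch of that NamedTuple idea; the field names are illustrative, not the PR's final choice:

    from typing import Any, NamedTuple

    class Transition(NamedTuple):
        obs: Any
        action: Any
        reward: float
        next_obs: Any
        done: bool

    # the append above would then read:
    # self.episode_storage.append(Transition(self.model._last_original_obs, buffer_action, reward_, new_obs_, done))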

self.model.actor.reset_noise()

# Select action randomly or according to policy
action, buffer_action = self.model._sample_action(learning_starts, action_noise)
Member:

after thinking more about it, I think we need to define __getattr__ to automatically retrieve the attribute from self.model if present. This would allow writing self._sample_action() directly.
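
A minimal sketch of that idea; the guard against a missing model attribute is an assumption, not the PR's final code:

    def __getattr__(self, name: str):
        # only called when normal attribute lookup on HER itself fails;
        # guard on "model" to avoid infinite recursion before __init__ finishes
        if "model" in self.__dict__ and hasattr(self.model, name):
            return getattr(self.model, name)
        raise AttributeError(f"{type(self).__name__!r} object has no attribute {name!r}")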

new_rewards = np.array(rewards)
new_rewards[her_idxs] = [
self.env.env_method("compute_reward", ag, her_new_goals, None)
for ag, new_goal in zip(achieved_goals, her_new_goals)
Member:

Please avoid names without meaning: achieved_goal instead of ag ;)

self.buffer[idx] = episode
self.n_transitions_stored -= self.buffer[idx] - episode_length

if self.n_transitions_stored == self.size():
Member:

can be simplified

def get_current_size(self):
return self.n_transitions_stored

def get_transitions_stored(self):
Member:

Maybe make self.n_transitions_stored private and create a getter using @property
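
A minimal sketch of that suggestion, assuming the counter is renamed to a private _n_transitions_stored attribute:

    @property
    def n_transitions_stored(self) -> int:
        # read-only access to the private counter
        return self._n_transitions_stored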

def get_transitions_stored(self):
return self.n_transitions_stored

def clear_buffer(self):
Member:

you need to re-initialize the number of transitions stored too, no?
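
For illustration, clear_buffer could reset the counter along with the episode storage (a sketch, assuming the attribute names used in the snippets above):

    def clear_buffer(self) -> None:
        self.buffer = []
        # re-initialize the number of transitions stored as well
        self.n_transitions_stored = 0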

@araffin mentioned this pull request on Aug 3, 2020
def get_torch_variables(self) -> Tuple[List[str], List[str]]:
return self.model.get_torch_variables()

def save(
Member:

is there no quicker way of doing that? (without duplicating too much code)

I would store HER-specific arguments in the model (self.model.__dict__); see what is done in SB2.
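
A sketch of that idea, with illustrative attribute names (not the PR's final choice):

    # stash HER-specific settings on the wrapped model so its own save() persists them
    self.model.__dict__.update(
        {
            "online_sampling": self.online_sampling,
            "goal_selection_strategy": self.goal_selection_strategy,
            "max_episode_length": self.max_episode_length,
        }
    )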

# sample virtual transitions and store them in replay buffer
self._sample_her_transitions()
# clear storage for current episode
self._episode_storage.reset()
Member:

this is not properly defined in the HER replay buffer

@araffin (Member) left a comment:

LGTM =)

Tested on simulated envs and on a real robot, time to merge now.

@avandekleut commented:

> LGTM =)
> Tested on simulated envs and on a real robot, time to merge now.

@araffin what environments and hyperparameters did you use? I'm trying to get this working on the Fetch environments.

@araffin (Member) commented Oct 30, 2020

> @araffin what environments and hyperparameters did you use? I'm trying to get this working on the Fetch environments.

It depends on which Fetch env; please look at the rl zoo: https://github.com/DLR-RM/rl-baselines3-zoo.
For envs other than FetchReach, multiprocessing is usually required but is currently not implemented; otherwise you can take a look at the SB2 zoo:
araffin/rl-baselines-zoo#53
and
araffin/rl-baselines-zoo#50

leor-c pushed a commit to leor-c/stable-baselines3 that referenced this pull request Nov 12, 2020
* Added working her version, Online sampling is missing.

* Updated test_her.

* Added first version of online her sampling. Still problems with tensor dimensions.

* Reformat

* Fixed tests

* Added some comments.

* Updated changelog.

* Add missing init file

* Fixed some small bugs.

* Reduced arguments for HER, small changes.

* Added getattr. Fixed bug for online sampling.

* Updated save/load functions. Small changes.

* Added her to init.

* Updated save method.

* Updated her ratio.

* Move obs_wrapper

* Added DQN test.

* Fix potential bug

* Offline and online her share same sample_goal function.

* Changed lists into arrays.

* Updated her test.

* Fix online sampling

* Fixed action bug. Updated time limit for episodes.

* Updated convert_dict method to take keys as arguments.

* Renamed obs dict wrapper.

* Seed bit flipping env

* Remove get_episode_dict

* Add fast online sampling version

* Added documentation.

* Vectorized reward computation

* Vectorized goal sampling

* Update time limit for episodes in online her sampling.

* Fix max episode length inference

* Bug fix for Fetch envs

* Fix for HER + gSDE

* Reformat (new black version)

* Added info dict to compute new reward. Check her_replay_buffer again.

* Fix info buffer

* Updated done flag.

* Fixes for gSDE

* Offline her version uses now HerReplayBuffer as episode storage.

* Fix num_timesteps computation

* Fix get torch params

* Vectorized version for offline sampling.

* Modified offline her sampling to use sample method of her_replay_buffer

* Updated HER tests.

* Updated documentation

* Cleanup docstrings

* Updated to review comments

* Fix pytype

* Update according to review comments.

* Removed random goal strategy. Updated sample transitions.

* Updated migration. Removed time signal removal.

* Update doc

* Fix potential load issue

* Add VecNormalize support for dict obs

* Updated saving/loading replay buffer for HER.

* Fix test memory usage

* Fixed save/load replay buffer.

* Fixed save/load replay buffer

* Fixed transition index after loading replay buffer in online sampling

* Better error handling

* Add tests for get_time_limit

* More tests for VecNormalize with dict obs

* Update doc

* Improve HER description

* Add test for sde support

* Add comments

* Add comments

* Remove check that was always valid

* Fix for terminal observation

* Updated buffer size in offline version and reset of HER buffer

* Reformat

* Update doc

* Remove np.empty + add doc

* Fix loading

* Updated loading replay buffer

* Separate online and offline sampling + bug fixes

* Update tensorboard log name

* Version bump

* Bug fix for special case

Co-authored-by: Antonin Raffin <antonin.raffin@dlr.de>
Co-authored-by: Antonin RAFFIN <antonin.raffin@ensta.org>
@phiresky commented:

It looks like the code added in this PR breaks normal envs with a dict observation space, since it assumes that whenever the observation space is a dictionary the user wants HER:

 File ".../lib/python3.8/site-packages/stable_baselines3/common/base_class.py", line 203, in _wrap_env
    env = ObsDictWrapper(env)
  File ".../lib/python3.8/site-packages/stable_baselines3/common/vec_env/obs_dict_wrapper.py", line 28, in __init__
    self.obs_dim = venv.observation_space.spaces["observation"].shape[0]
KeyError: 'observation'

specifically, this code:

        # check if wrapper for dict support is needed when using HER
        if isinstance(env.observation_space, gym.spaces.dict.Dict):
            env = ObsDictWrapper(env)

doesn't verify at all that HER is what I want, and assumes the dict has a specific purpose, breaking training on any env with a dict obs space

@araffin (Member) commented Nov 30, 2020

> It looks like the code added in this PR breaks normal envs with a dict observation space, since it assumes that whenever the observation space is a dictionary the user wants HER:

Please read #216.
Dict observation spaces other than GoalEnv are currently not supported by SB3 (but there is an active PR for that: #243).

@phiresky commented:

@araffin thanks for the info and links. Might be good to throw a more readable error in that case. For now this works fine for me:

import gym
from stable_baselines3.common.vec_env import VecEnvWrapper


class FlattenVecWrapper(VecEnvWrapper):
    """Flatten dict observations coming out of a vectorized env."""

    def __init__(self, env):
        super().__init__(env)
        self.observation_space = gym.spaces.flatten_space(self.venv.observation_space)

    def reset(self, **kwargs):
        observation = self.venv.reset(**kwargs)
        return self.observation(observation)

    def step_wait(self):
        observation, reward, done, info = self.venv.step_wait()
        return self.observation(observation), reward, done, info

    def observation(self, observation):
        # flatten each sub-env's observation into a 1D array
        return [gym.spaces.flatten(self.venv.observation_space, o) for o in observation]

leor-c pushed a commit to leor-c/stable-baselines3 that referenced this pull request Aug 26, 2021