
N-step updates for off-policy methods #81

Closed
wants to merge 47 commits into from

Conversation

m-rph
Contributor

@m-rph m-rph commented Jun 30, 2020

Changes

OffPolicyAlgorithm

  • Now accepts the buffer class as an argument
    • Rationale: allows other types of replay buffers besides the current implementations
  • Now accepts keyword arguments for the replay buffer
    • Rationale: makes it easier to pass arguments because we don't need to check every scenario.
    • I haven't moved buffer_size into it (a rough sketch of the resulting signature follows below)
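
A hedged sketch of what the resulting constructor could look like (the parameter names buffer_class and buffer_kwargs are placeholders and may not match the PR's exact signature):

# Hedged sketch of the described OffPolicyAlgorithm change; buffer_class and
# buffer_kwargs are placeholder names, not necessarily the PR's exact API.
from typing import Any, Dict, Optional, Type


class OffPolicyAlgorithm:
    def __init__(
        self,
        policy: str,
        env: Any,
        buffer_size: int = int(1e6),  # kept as a top-level argument, not moved into buffer_kwargs
        buffer_class: Optional[Type] = None,  # e.g. ReplayBuffer or NStepReplayBuffer
        buffer_kwargs: Optional[Dict[str, Any]] = None,  # e.g. {"n_step": 5, "gamma": 0.99}
        **kwargs: Any,
    ) -> None:
        self.buffer_class = buffer_class
        self.buffer_kwargs = buffer_kwargs or {}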

NStepReplayBuffer

  • Derives from ReplayBuffer and adds two more arguments, n_step and gamma.
  • Overrides _get_samples(self, batch_inds, env) -> ReplayBufferSamples to compute n-step returns.
    • Computation is done purely with NumPy, without Python loops, which makes the implementation simpler and faster.

Computing n-step returns

  1. Find the correct transition indices
  2. Create a 2D mask for the rewards using the done flags
  3. Filter the mask so that self.pos is ignored, because it contains older experience
  4. Multiply the mask by the rewards and the discount factors, then sum across the n-step axis (see the sketch below)
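
A minimal NumPy sketch of these four steps (a hedged illustration assuming a single environment, so rewards and dones are 1-D over the buffer; the function and variable names mirror the buffer attributes, but this is not the PR's actual code):

import numpy as np


def nstep_returns(rewards, dones, batch_inds, n_step, gamma, pos, buffer_size):
    """Vectorized n-step returns for sampled start indices (single-env sketch)."""
    # 1. Candidate transition indices for each sample: shape (batch_size, n_step).
    steps = np.arange(n_step)
    inds = (batch_inds[:, None] + steps[None, :]) % buffer_size

    # 2. Mask out steps after the first done in each row (the done step itself is kept):
    #    shape (batch_size, n_step) of {0, 1}.
    dones_window = dones[inds]
    mask = (np.cumsum(dones_window, axis=1) - dones_window == 0).astype(np.float64)

    # 3. Also stop once the window reaches self.pos, which holds older, unrelated experience.
    mask *= (np.cumsum(inds == pos, axis=1) == 0)

    # 4. Multiply the mask by the rewards and the discount factors, then sum across steps.
    returns = (mask * rewards[inds] * gamma ** steps).sum(axis=1)

    # Index of the last transition actually used, needed for the bootstrap next-observation.
    last = np.clip(mask.sum(axis=1).astype(int) - 1, 0, None)
    next_inds = inds[np.arange(len(batch_inds)), last]
    return returns, next_inds

With n_step = 1 this reduces to the immediate reward and next_inds == batch_inds, which is a quick sanity check for the masking.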

Motivation and Context

  • I have raised an issue to propose this change (required for new features and bug fixes)

Closes #47

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist:

  • I've read the CONTRIBUTION guide (required)
  • I have updated the changelog accordingly (required).
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.
  • I have checked the codestyle using make lint
  • I have ensured make pytest and make type both pass.

N-step implementation details

@araffin
Member

araffin commented Jun 30, 2020

Funny, I also recently gave it a try here: https://github.com/DLR-RM/stable-baselines3/tree/feat/n-steps

@m-rph
Contributor Author

m-rph commented Jun 30, 2020

The implementation appears mostly the same. I would like to add the class and its arguments as parameters to OffPolicyAlgorithm, along with a variable named replay_args, in a similar fashion to how policies are passed. This will make it easy to add other methods as well. IIRC an energy-based replay buffer was also considered at some point.

@araffin araffin added the experimental Experimental Feature label Jul 1, 2020
@m-rph
Contributor Author

m-rph commented Jul 4, 2020

@araffin
So I was working on this and there's a bug with both your and my implementations.
The bug is due to % self.buffer_size.

next_idx = batch_inds.copy()
# Compute n-step discounted return and associated next-observation
last_return = 0.0
for step in reversed(range(self.n_steps)):
    current_step = (batch_inds + step) % self.buffer_size
    if step == self.n_steps - 1:
        next_idx = current_step.copy()
        last_return = self._normalize_reward(self.rewards[current_step], env)
    else:
        next_non_terminal = (1.0 - self.dones[(current_step + 1) % self.buffer_size]).flatten()
        next_idx = next_idx * next_non_terminal + current_step * (1 - next_non_terminal)
        last_return = (self._normalize_reward(self.rewards[current_step], env)
                       + self.gamma * last_return * next_non_terminal.reshape(-1, 1))

The bug is basically the same as with memory optimization. To be exact, wrapping around the buffer with % self.buffer_size gives no guarantee that the transition is part of the same trajectory.
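
A tiny illustration with made-up numbers (a hypothetical setup, not taken from either branch):

# Hypothetical setup: buffer_size = 5, write head self.pos = 1, n_steps = 3.
# Sampling index 4 touches indices 4, 0, 1 after the modulo. Indices 4 and 0 are
# chronologically consecutive, but index 1 (== self.pos) still holds the oldest
# experience from the previous pass through the buffer, so it is not part of the
# same trajectory even though no done flag separates it from index 0.
buffer_size, pos, n_steps = 5, 1, 3
batch_ind = 4
touched = [(batch_ind + step) % buffer_size for step in range(n_steps)]
print(touched)  # [4, 0, 1] -> the window runs into the write head and mixes trajectories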

My implementation is WIP and there are a couple of issues that I am working on.

**Edit:**

Completed it. Removed the loops, added tests; it now uses just NumPy arrays. Should be quite fast.

@araffin
Member

araffin commented Jul 10, 2020

The bug is basically the same as with memory optimization.

🙈
Yep, I only did a quick test with it and could not see any improvement yet.

Completed it. Removed the loops, added tests; it now uses just NumPy arrays. Should be quite fast.

Sounds good =)

Ping me when you want a review

@m-rph
Contributor Author

m-rph commented Jul 10, 2020

I merged master and removed the generic from OffPolicyAlgorithm since it wasn't discussed. I think it's ready for a review.

PS: sorry for the autoformatting >.< ...

@araffin
Member

araffin commented Jul 10, 2020

PS: sorry for the autoformatting >.< ...

I will do a PR soon for that ;)
Apparently it will be with black.

@araffin araffin self-requested a review July 10, 2020 08:39
stable_baselines3/common/buffers.py: multiple review comments (outdated, resolved)
@Miffyli
Collaborator

Miffyli commented Jul 11, 2020

At a quick glance it looks good, like araffin already mentioned, but I would be even more exhaustive with the comments. The numpy tricks are compact and can be performant, but the code becomes harder to understand. I would also include the shape of arrays in the comments, with a definition of the contents, e.g. # (batch_size, num_actions) of {0,1}.

@araffin araffin mentioned this pull request Jul 15, 2020
12 tasks
@m-rph
Contributor Author

m-rph commented Aug 29, 2020

Fair point! I have a couple of weeks before grad classes pick up the pace so hopefully I will have everything tied up before then.

@m-rph
Contributor Author

m-rph commented Aug 29, 2020

@araffin
I am considering the following changes:

  1. Use a component, call it a "batch generator", that produces the batch given the indices.
  2. Instead of subclassing the buffer class for the different buffers, just pass buffer_kwargs; the constructor of the buffer will then create the correct component.

The component can be anything from simple n_steps to n_steps with entropy.

This removes the need to create all buffer combinations, e.g. if we were to have prioritized replay, we wouldn't need to create n_step_prioritized_replay.

What do you think?

@araffin
Member

araffin commented Aug 29, 2020

Use a component called say "batch generator" that given the indices produces the batch.

How is that different from _get_samples() ?

Instead of subclassing the buffer class for the different buffers, just pass buffer_kwargs, the constructor of the buffer will then create the correct component.

Not sure what you mean, it sounds like a _get_samples() with a lot of branching?

@m-rph
Contributor Author

m-rph commented Aug 29, 2020

How is that different from _get_samples() ?

_get_samples() is bound to the class of the buffer. I would like to use composition instead of sub-classing the buffer to create new _get_samples().

A very rough example:

class ReplayBuffer(BaseBuffer):
    def __init__(self, ..., batch_method, **batch_kwargs):
        self._sampler = Sampler.make(batch_method, **batch_kwargs)  # will create the correct sampler depending on args
        # make is a class method.

The consequence here is that we can do:

env = ...
agent_classes = [TD3, SAC] # SAC will add `use_entropy=True` in `buffer_kwargs` unless specified otherwise
for cls in agent_classes:
    agent = cls(env, buffer_kwargs={'batch_method': 'nsteps', 'nsteps': 5})  # all these can be trivially logged because they are strings and ints

The pros are:

  • We remove the need to provide the buffer class
  • The sampler can be swapped on the fly
  • We don't need to create multiple classes for buffers that just combine other classes: e.g. for N-step + PER we would not have to add NStepPer and EntropyNStepPer, and for Retrace(lambda) we would not have to add RetracePer or EntropicRetracePer
  • We can make it trivial to create new sampler classes and register them.
  • SAC will not actually require a new buffer class of its own that would need to be specified in its arguments to ensure correctness.
  • The number of classes is linear in the number of sampler methods.

The cons are:

  • More types
  • Need to write a function to construct the correct sampler (fairly easy; see the sketch below)
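
For instance, a hedged sketch of such a construction function using a small registry (the names Sampler, NStepSampler, and the "nsteps" key are illustrative, not SB3 classes):

from typing import Any, Callable, Dict, Type


class Sampler:
    _registry: Dict[str, Type["Sampler"]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Type["Sampler"]], Type["Sampler"]]:
        # Decorator that records a sampler class under a string key.
        def decorator(sampler_cls: Type["Sampler"]) -> Type["Sampler"]:
            cls._registry[name] = sampler_cls
            return sampler_cls
        return decorator

    @classmethod
    def make(cls, batch_method: str, **kwargs: Any) -> "Sampler":
        # Look up the registered sampler class and instantiate it with its own kwargs.
        return cls._registry[batch_method](**kwargs)


@Sampler.register("nsteps")
class NStepSampler(Sampler):
    def __init__(self, n_steps: int = 1, gamma: float = 0.99) -> None:
        self.n_steps = n_steps
        self.gamma = gamma

A buffer constructor could then call Sampler.make("nsteps", n_steps=5) and delegate the return computation in _get_samples() to the resulting object.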

@araffin
Member

araffin commented Sep 16, 2020

I would like to use composition instead of sub-classing the buffer to create new _get_samples().

I see, the idea would be to create a class in charge of sampling instead of putting everything in the _get_samples() method. (Sorry for the delay, I was on holiday.)
That way the replay buffer is composed of two parts: a storage (which would be a tree for PER) and a sampler.

Sounds reasonable. @Miffyli or @AdamGleave, do you have any comments on that?

@Miffyli
Collaborator

Miffyli commented Sep 16, 2020

Just to reiterate: the storage would just store the samples and possibly priorities, and the sampler reads samples from the storage and computes the desired return (i.e. single-step, n-step, Retrace, etc.)? This sounds good to me if they can be separated cleanly, as then other samplers can be added without changing the storage code (as mentioned by PartiallyTyped). I am mainly wondering whether there could be some memory/performance benefit if the two were combined, but I still prefer separating them if possible.

One thing: it should be called something other than "sampler" if the storage defines how samples are drawn, while this "sampler" just computes the returns and the correct successor states.

@m-rph
Contributor Author

m-rph commented Sep 16, 2020

To address the issue of overhead: it would be minimal, perhaps an additional function call or attribute lookup.

PS: I couldn't come up with a better name.

@AdamGleave
Collaborator

It seems reasonable to separate these. I second @Miffyli's confusion regarding the name "sampler" for something that computes returns. Perhaps "buffer" and "postprocessor" (instead of "sampler").
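
To make the proposed split concrete, here is a hedged interface sketch (Storage, Postprocessor, and NStepPostprocessor are placeholder names following the suggestion above, not SB3 API):

from typing import Dict, Protocol

import numpy as np


class Storage(Protocol):
    """Only stores transitions (and, for PER, priorities in e.g. a sum-tree)."""

    def add(self, obs: np.ndarray, next_obs: np.ndarray, action: np.ndarray,
            reward: np.ndarray, done: np.ndarray) -> None: ...

    def get(self, batch_inds: np.ndarray) -> Dict[str, np.ndarray]: ...


class Postprocessor(Protocol):
    """Turns raw transitions into training targets (1-step, n-step, Retrace, ...)."""

    def __call__(self, storage: Storage, batch_inds: np.ndarray) -> Dict[str, np.ndarray]: ...


class NStepPostprocessor:
    def __init__(self, n_step: int, gamma: float) -> None:
        self.n_step = n_step
        self.gamma = gamma

    def __call__(self, storage: Storage, batch_inds: np.ndarray) -> Dict[str, np.ndarray]:
        batch = storage.get(batch_inds)
        # ... compute n-step returns from `batch` as sketched earlier in the thread ...
        return batch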

@geyang

geyang commented Dec 18, 2021

For n-step, testing it on DMC Quadruped would be good, because n-step is expected to show some gain there.

@m-rph
Contributor Author

m-rph commented Apr 25, 2022

Hey @araffin, I have been quite busy with grad school, so the branch has become stale.
What would be the minimum requirements to merge this?

@araffin
Member

araffin commented Apr 25, 2022

Hey @araffin, I have been quite busy with grad school, so the branch has become stale. What would be the minimum requirements to merge this?

Hey,

I have been quite busy with grad school, so the branch has become stale.

No worries.

Let me try to get back to you this week if I have time.
In the meantime, I tested a more general version in contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/tree/feat/peng-q-lambda
but haven't had time to benchmark it properly.

What would be the minimum requirements to merge this?

Now I'm thinking a better place for this would be SB3-contrib (what do you think, @Miffyli?),
and in any case, some benchmarking plus enough testing would be needed before merging it.
The first thing you need to do is resolve the merge conflicts, I think... the biggest difference now is the handling of the time limit.

@emrul

emrul commented Feb 10, 2023

Just out of curiosity, I tried this buffer implementation and I think it does not handle multiple environments correctly. Specifically, the line:

not_dones = np.squeeze(1 - self.dones[indices], axis=-1)

raises a ValueError: cannot select an axis to squeeze out which has size not equal to one.
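
A minimal reproduction of the shape problem (this assumes that with n_envs parallel environments self.dones[indices] has shape (batch_size, n_envs)):

import numpy as np

batch_size, n_envs = 4, 2
dones = np.zeros((batch_size, n_envs))  # what self.dones[indices] would look like with 2 envs
np.squeeze(1 - dones, axis=-1)          # ValueError: cannot select an axis to squeeze out ...

With a single environment the last axis has size one and the squeeze succeeds, which is why the issue only shows up with vectorized environments.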

As an aside, please don't take this as criticism of this project, which I enjoy using, but it's quite sad to see potentially really great contributions/PRs sitting open for nearly 3 years. In that time a lot of people could perhaps have benefitted; even if this had been rushed into the contrib project, it would have been more useful than as-is.

@m-rph
Contributor Author

m-rph commented Feb 11, 2023 via email

@emrul

emrul commented Feb 11, 2023

I'm still a bit new to NumPy and PyTorch manipulation, but I'm keen to help with little things if you need (e.g. docs, etc.).

@richardjozsa

What is missing from this PR for it to be complete?

@emrul

emrul commented Oct 19, 2023

I'm not sure, to be honest. I couldn't get it to work (with vectorised environments, as previously mentioned) and wasn't sure how to go about fixing it.

@richardjozsa

I'm not sure, to be honest. I couldn't get it to work (with vectorised environments, as previously mentioned) and wasn't sure how to go about fixing it.

Yeah, I also tested it; it doesn't work with vectorized or multiprocessed environments currently. That is why I asked: maybe someone knows where the problem is, so we could get it fixed and maybe merged. It would be a great addition to the Q-learning section.

@m-rph m-rph closed this Nov 12, 2023
Labels
experimental Experimental Feature
Development

Successfully merging this pull request may close these issues.

[feature-request] N-step returns for TD methods