
N-step updates for off-policy methods #81

Closed
wants to merge 47 commits into from

Conversation

m-rph
Contributor

@m-rph m-rph commented Jun 30, 2020

Changes

OffPolicyAlgorithm

  • Now accepts the buffer class as an argument
    • Rationale: allows other types of replay buffers besides the current implementations
  • Now accepts keyword arguments for the replay buffer
    • Rationale: makes it easier to pass arguments because we don't need to check every scenario.
    • I haven't moved buffer_size into it (a rough sketch of the resulting signature follows below)
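
A hedged sketch of what the resulting constructor could look like (the parameter names buffer_class and buffer_kwargs are placeholders and may not match the PR's exact signature):

# Hedged sketch of the described OffPolicyAlgorithm change; buffer_class and
# buffer_kwargs are placeholder names, not necessarily the PR's exact API.
from typing import Any, Dict, Optional, Type


class OffPolicyAlgorithm:
    def __init__(
        self,
        policy: str,
        env: Any,
        buffer_size: int = int(1e6),  # kept as a top-level argument, not moved into buffer_kwargs
        buffer_class: Optional[Type] = None,  # e.g. ReplayBuffer or NStepReplayBuffer
        buffer_kwargs: Optional[Dict[str, Any]] = None,  # e.g. {"n_step": 5, "gamma": 0.99}
        **kwargs: Any,
    ) -> None:
        self.buffer_class = buffer_class
        self.buffer_kwargs = buffer_kwargs or {}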

NStepReplayBuffer

  • Derives from ReplayBuffer and adds two more arguments, n_step and gamma.
  • Overrides _get_samples(self, batch_inds, env) -> ReplayBufferSamples to compute n-step returns.
    • Computation is done purely with NumPy, without Python loops, which makes the implementation simpler and faster.

Computing n-step returns

  1. Find the correct transition indices
  2. Create a 2D mask for the rewards using the done flags
  3. Filter the mask so that self.pos is ignored, because it contains older experience
  4. Multiply the mask by the rewards and the discount factors, then sum across the n-step axis (see the sketch below)
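
A minimal NumPy sketch of these four steps (a hedged illustration assuming a single environment, so rewards and dones are 1-D over the buffer; the function and variable names mirror the buffer attributes, but this is not the PR's actual code):

import numpy as np


def nstep_returns(rewards, dones, batch_inds, n_step, gamma, pos, buffer_size):
    """Vectorized n-step returns for sampled start indices (single-env sketch)."""
    # 1. Candidate transition indices for each sample: shape (batch_size, n_step).
    steps = np.arange(n_step)
    inds = (batch_inds[:, None] + steps[None, :]) % buffer_size

    # 2. Mask out steps after the first done in each row (the done step itself is kept):
    #    shape (batch_size, n_step) of {0, 1}.
    dones_window = dones[inds]
    mask = (np.cumsum(dones_window, axis=1) - dones_window == 0).astype(np.float64)

    # 3. Also stop once the window reaches self.pos, which holds older, unrelated experience.
    mask *= (np.cumsum(inds == pos, axis=1) == 0)

    # 4. Multiply the mask by the rewards and the discount factors, then sum across steps.
    returns = (mask * rewards[inds] * gamma ** steps).sum(axis=1)

    # Index of the last transition actually used, needed for the bootstrap next-observation.
    last = np.clip(mask.sum(axis=1).astype(int) - 1, 0, None)
    next_inds = inds[np.arange(len(batch_inds)), last]
    return returns, next_inds

With n_step = 1 this reduces to the immediate reward and next_inds == batch_inds, which is a quick sanity check for the masking.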

Motivation and Context

  • I have raised an issue to propose this change (required for new features and bug fixes)

Closes #47

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist:

  • I've read the CONTRIBUTION guide (required)
  • I have updated the changelog accordingly (required).
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.
  • I have checked the codestyle using make lint
  • I have ensured make pytest and make type both pass.

N-step implementation details

@araffin
Member

araffin commented Jun 30, 2020

Funny, I also recently gave it a try here: https://github.com/DLR-RM/stable-baselines3/tree/feat/n-steps

@m-rph
Contributor Author

m-rph commented Jun 30, 2020

The implementation appears mostly the same. I would like to add the class and its arguments as parameters to OffPolicyAlgorithm, along with a variable named replay_args, in a similar fashion to how policies are passed. This will make it easy to add other methods as well. IIRC an energy-based replay buffer was also considered at some point.

@araffin araffin added the experimental Experimental Feature label Jul 1, 2020
@m-rph
Contributor Author

m-rph commented Jul 4, 2020

@araffin
So I was working on this and there's a bug with both your and my implementations.
The bug is due to % self.buffer_size.

next_idx = batch_inds.copy()
# Compute n-step discounted return and associated next-observation
last_return = 0.0
for step in reversed(range(self.n_steps)):
    current_step = (batch_inds + step) % self.buffer_size
    if step == self.n_steps - 1:
        next_idx = current_step.copy()
        last_return = self._normalize_reward(self.rewards[current_step], env)
    else:
        next_non_terminal = (1.0 - self.dones[(current_step + 1) % self.buffer_size]).flatten()
        next_idx = next_idx * next_non_terminal + current_step * (1 - next_non_terminal)
        last_return = (self._normalize_reward(self.rewards[current_step], env)
                       + self.gamma * last_return * next_non_terminal.reshape(-1, 1))

The bug is basically the same as with memory optimization. To be exact, wrapping around the buffer with % self.buffer_size gives no guarantee that the transition is part of the same trajectory.
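
A tiny illustration with made-up numbers (a hypothetical setup, not taken from either branch):

# Hypothetical setup: buffer_size = 5, write head self.pos = 1, n_steps = 3.
# Sampling index 4 touches indices 4, 0, 1 after the modulo. Indices 4 and 0 are
# chronologically consecutive, but index 1 (== self.pos) still holds the oldest
# experience from the previous pass through the buffer, so it is not part of the
# same trajectory even though no done flag separates it from index 0.
buffer_size, pos, n_steps = 5, 1, 3
batch_ind = 4
touched = [(batch_ind + step) % buffer_size for step in range(n_steps)]
print(touched)  # [4, 0, 1] -> the window runs into the write head and mixes trajectories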

My implementation is WIP and there are a couple of issues that I am working on.

**Edit:**

Completed it. Removed the loops, added tests; it now uses just NumPy arrays. Should be quite fast.

@araffin
Member

araffin commented Jul 10, 2020

The bug is basically the same as with memory optimization.

🙈
Yep, I only did a quick test with it and could not see any improvement yet.

Completed it. Removed the loops, added tests; it now uses just NumPy arrays. Should be quite fast.

Sounds good =)

Ping me when you want a review

@m-rph
Contributor Author

m-rph commented Jul 10, 2020

I merged master and removed the generic from OffPolicyAlgorithm since it wasn't discussed. I think it's ready for a review.

PS: sorry for the autoformatting >.< ...

@araffin
Member

araffin commented Jul 10, 2020

PS: sorry for the autoformatting >.< ...

I will do a PR soon for that ;)
Apparently it will be with black.

@araffin araffin self-requested a review July 10, 2020 08:39
stable_baselines3/common/buffers.py: multiple review comments (outdated, resolved)
@Miffyli
Collaborator

Miffyli commented Jul 11, 2020

At a quick glance it looks good, like araffin already mentioned, but I would be even more exhaustive with the comments. The numpy tricks are compact and can be performant, but the code becomes harder to understand. I would also include the shape of arrays in the comments, with a definition of the contents, e.g. # (batch_size, num_actions) of {0,1}.

@araffin araffin mentioned this pull request Jul 15, 2020
12 tasks
@m-rph
Contributor Author

m-rph commented Aug 29, 2020

Fair point! I have a couple of weeks before grad classes pick up the pace so hopefully I will have everything tied up before then.

@m-rph
Contributor Author

m-rph commented Aug 29, 2020

@araffin
I am considering the following changes:

  1. Use a component, call it a "batch generator", that produces the batch given the indices.
  2. Instead of subclassing the buffer class for the different buffers, just pass buffer_kwargs; the constructor of the buffer will then create the correct component.

The component can be anything from simple n_steps to n_steps with entropy.

This removes the need to create all buffer combinations, e.g. if we were to have prioritized replay, we wouldn't need to create n_step_prioritized_replay.

What do you think?

@araffin
Member

araffin commented Aug 29, 2020

Use a component called say "batch generator" that given the indices produces the batch.

How is that different from _get_samples() ?

Instead of subclassing the buffer class for the different buffers, just pass buffer_kwargs, the constructor of the buffer will then create the correct component.

Not sure what you mean, it sounds like a _get_samples() with a lot of branching?

@m-rph
Contributor Author

m-rph commented Aug 29, 2020

How is that different from _get_samples() ?

_get_samples() is bound to the class of the buffer. I would like to use composition instead of sub-classing the buffer to create new _get_samples().

A very rough example:

class ReplayBuffer(BaseBuffer):
    def __init__(self, ..., batch_method, **batch_kwargs):
        self._sampler = Sampler.make(batch_method, **batch_kwargs)  # will create the correct sampler depending on args
        # make is a class method.

The consequence here is that we can do:

env = ...
agent_classes = [TD3, SAC] # SAC will add `use_entropy=True` in `buffer_kwargs` unless specified otherwise
for cls in agent_classes:
    agent = cls(env, buffer_kwargs={'batch_method': 'nsteps', 'nsteps': 5})  # all these can be trivially logged because they are strings and ints

The pros are:

  • We remove the need to provide the buffer class
  • The sampler can be swapped on the fly
  • We don't need to create multiple classes for buffers that just combine other classes: e.g. for N-step + PER we would not have to add NStepPer and EntropyNStepPer, and for Retrace(lambda) we would not have to add RetracePer or EntropicRetracePer
  • We can make it trivial to create new sampler classes and register them.
  • SAC will not actually require a new buffer class of its own that would need to be specified in its arguments to ensure correctness.
  • The number of classes is linear in the number of sampler methods.

The cons are:

  • More types
  • Need to write a function to construct the correct sampler (fairly easy; see the sketch below)
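
For instance, a hedged sketch of such a construction function using a small registry (the names Sampler, NStepSampler, and the "nsteps" key are illustrative, not SB3 classes):

from typing import Any, Callable, Dict, Type


class Sampler:
    _registry: Dict[str, Type["Sampler"]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Type["Sampler"]], Type["Sampler"]]:
        # Decorator that records a sampler class under a string key.
        def decorator(sampler_cls: Type["Sampler"]) -> Type["Sampler"]:
            cls._registry[name] = sampler_cls
            return sampler_cls
        return decorator

    @classmethod
    def make(cls, batch_method: str, **kwargs: Any) -> "Sampler":
        # Look up the registered sampler class and instantiate it with its own kwargs.
        return cls._registry[batch_method](**kwargs)


@Sampler.register("nsteps")
class NStepSampler(Sampler):
    def __init__(self, n_steps: int = 1, gamma: float = 0.99) -> None:
        self.n_steps = n_steps
        self.gamma = gamma

A buffer constructor could then call Sampler.make("nsteps", n_steps=5) and delegate the return computation in _get_samples() to the resulting object.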

@araffin
Member

araffin commented Sep 16, 2020

I would like to use composition instead of sub-classing the buffer to create new _get_samples().

I see, the idea would be to create a class in charge of sampling instead of putting everything in the _get_samples() method. (Sorry for the delay, I was on holiday.)
That way the replay buffer is composed of two parts: a storage (which would be a tree for PER) and a sampler.

Sounds reasonable. @Miffyli or @AdamGleave, do you have any comments on that?

@Miffyli
Collaborator

Miffyli commented Sep 16, 2020

Just to reiterate: the storage would just store the samples and possibly priorities, and the sampler reads samples from the storage and computes the desired return (i.e. single-step, n-step, Retrace, etc.)? This sounds good to me if they can be separated cleanly, as then other samplers can be added without changing the storage code (as mentioned by PartiallyTyped). I am mainly wondering whether there could be some memory/performance benefit if the two were combined, but I still prefer separating them if possible.

One thing: it should be called something other than "sampler" if the storage defines how samples are drawn, while this "sampler" just computes the returns and the correct successor states.

@m-rph
Contributor Author

m-rph commented Sep 16, 2020

To address the issue of overhead: it would be minimal, perhaps an additional function call or attribute lookup.

PS: I couldn't come up with a better name.

@AdamGleave
Collaborator

It seems reasonable to separate these. I second @Miffyli's confusion regarding the name "sampler" for something that computes returns. Perhaps "buffer" and "postprocessor" (instead of "sampler").
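
To make the proposed split concrete, here is a hedged interface sketch (Storage, Postprocessor, and NStepPostprocessor are placeholder names following the suggestion above, not SB3 API):

from typing import Dict, Protocol

import numpy as np


class Storage(Protocol):
    """Only stores transitions (and, for PER, priorities in e.g. a sum-tree)."""

    def add(self, obs: np.ndarray, next_obs: np.ndarray, action: np.ndarray,
            reward: np.ndarray, done: np.ndarray) -> None: ...

    def get(self, batch_inds: np.ndarray) -> Dict[str, np.ndarray]: ...


class Postprocessor(Protocol):
    """Turns raw transitions into training targets (1-step, n-step, Retrace, ...)."""

    def __call__(self, storage: Storage, batch_inds: np.ndarray) -> Dict[str, np.ndarray]: ...


class NStepPostprocessor:
    def __init__(self, n_step: int, gamma: float) -> None:
        self.n_step = n_step
        self.gamma = gamma

    def __call__(self, storage: Storage, batch_inds: np.ndarray) -> Dict[str, np.ndarray]:
        batch = storage.get(batch_inds)
        # ... compute n-step returns from `batch` as sketched earlier in the thread ...
        return batch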

@geyang

geyang commented Dec 18, 2021

For n-step, testing it on DMC Quadruped would be good, because n-step is expected to show some gain there.

@m-rph
Contributor Author

m-rph commented Apr 25, 2022

Hey @araffin, I have been quite busy with grad school, so the branch has become stale.
What would be the minimum requirements to merge this?

@araffin
Member

araffin commented Apr 25, 2022

Hey @araffin, I have been quite busy with grad school, so the branch has become stale. What would be the minimum requirements to merge this?

Hey,

I have been quite busy with grad school, so the branch has become stale.

No worries.

Let me try to get back to you this week if I have time.
In the meantime, I tested a more general version in contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/tree/feat/peng-q-lambda
but haven't had time to benchmark it properly.

What would be the minimum requirements to merge this?

Now I'm thinking a better place for this would be SB3-contrib (what do you think, @Miffyli?),
and in any case, some benchmarking plus enough testing would be needed before merging it.
The first thing you need to do is resolve the merge conflicts, I think... the biggest difference now is the handling of the time limit.

@emrul

emrul commented Feb 10, 2023

Just out of curiosity, I tried this buffer implementation and I think it does not handle multiple environments correctly. Specifically, the line:

not_dones = np.squeeze(1 - self.dones[indices], axis=-1)

raises a ValueError: cannot select an axis to squeeze out which has size not equal to one.
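
A minimal reproduction of the shape problem (this assumes that with n_envs parallel environments self.dones[indices] has shape (batch_size, n_envs)):

import numpy as np

batch_size, n_envs = 4, 2
dones = np.zeros((batch_size, n_envs))  # what self.dones[indices] would look like with 2 envs
np.squeeze(1 - dones, axis=-1)          # ValueError: cannot select an axis to squeeze out ...

With a single environment the last axis has size one and the squeeze succeeds, which is why the issue only shows up with vectorized environments.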

As an aside, please don't take this as criticism of this project, which I enjoy using, but it's quite sad to see potentially really great contributions/PRs sitting open for nearly 3 years. In that time a lot of people could perhaps have benefitted; even if this had been rushed into the contrib project, it would have been more useful than as-is.

@m-rph
Contributor Author

m-rph commented Feb 11, 2023 via email

@emrul

emrul commented Feb 11, 2023

I'm still a bit new to NumPy and PyTorch manipulation, but I'm keen to help with little things if you need (e.g. docs, etc.).

@richardjozsa

What is missing from this PR for it to be complete?

@emrul

emrul commented Oct 19, 2023

I'm not sure, to be honest. I couldn't get it to work (with vectorised environments, as previously mentioned) and wasn't sure how to go about fixing it.

@richardjozsa

I'm not sure, to be honest. I couldn't get it to work (with vectorised environments, as previously mentioned) and wasn't sure how to go about fixing it.

Yeah, I also tested it; it doesn't work with vectorized or multiprocessed environments currently. That is why I asked: maybe someone knows where the problem is, so we could get it fixed and maybe merged. It would be a great addition to the Q-learning section.

@m-rph m-rph closed this Nov 12, 2023
Labels
experimental Experimental Feature
Development

Successfully merging this pull request may close these issues.

[feature-request] N-step returns for TD methods