N-step updates for off-policy methods #81
Conversation
Funny, I also recently gave it a try here: https://github.com/DLR-RM/stable-baselines3/tree/feat/n-steps
The implementation appears mostly the same. I would like to add the class and the arguments as parameters to the
@araffin (see stable-baselines3/stable_baselines3/common/buffers.py, lines 441 to 454 in 41eb0cb)
The bug is basically the same as with memory optimization. To be exact, going around the buffer (`% self.buffer_size`) gives no guarantee that the transition is part of the same trajectory. My implementation is WIP and there are a couple of issues that I am working on. **Edit:** Completed it. Removed the loops and added tests; it now uses just numpy arrays, so it should be quite fast.
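A hypothetical sketch (not the branch's actual code) of the boundary issue described above: an n-step lookahead has to be truncated at the first `done`, otherwise naively wrapping indices with `% buffer_size` can walk into a different trajectory:

```python
import numpy as np

def n_step_indices(start, n_steps, dones, buffer_size):
    """Indices for an n-step window starting at `start`, truncated at the
    first terminal so the window never mixes two trajectories."""
    indices = (start + np.arange(n_steps)) % buffer_size  # naive wrap-around
    terminals = np.flatnonzero(dones[indices])
    if terminals.size > 0:
        indices = indices[: terminals[0] + 1]  # keep the terminal step, drop the rest
    return indices

dones = np.zeros(8, dtype=bool)
dones[5] = True  # episode ends at index 5
# A window starting at 4 would naively cover [4, 5, 6], mixing trajectories:
print(n_step_indices(4, 3, dones, buffer_size=8))  # -> [4 5]
```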
🙈
Sounds good =) Ping me when you want a review
I merged master and removed the generic. P.S. sorry for the autoformatting >.< ...
I will do a PR soon for that ;)
On a quick glimpse it looks good, like araffin already mentioned, but I would be even more exhaustive with the comments. The numpy tricks are compact and can be performant, but the code becomes harder to understand. I would also include the shape of the arrays in the comments, with a definition of their contents, e.g.
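For instance (a hypothetical illustration, not code from the branch), a shape comment makes a vectorized return computation self-explanatory:

```python
import numpy as np

gamma = 0.99
# rewards: (batch_size, n_steps), reward at each step of the sampled window
rewards = np.array([[1.0, 1.0, 1.0],
                    [0.5, 0.0, 2.0]])
# discounts: (n_steps,), i.e. [1, gamma, gamma**2]
discounts = gamma ** np.arange(rewards.shape[1])
# returns: (batch_size,), discounted sum over the window (broadcasts over rows)
returns = (rewards * discounts).sum(axis=1)
```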
Fair point! I have a couple of weeks before grad classes pick up the pace, so hopefully I will have everything tied up before then.
@araffin
The component can be anything, from simple to elaborate. This removes the need to create all buffer combinations; e.g. if we were to have prioritized replay, we wouldn't need to create a separate combined buffer class. What do you think?
How is that different from
Not sure what you mean, it sounds like a
A very rough example:

```python
class ReplayBuffer(BaseBuffer):
    def __init__(self, ..., batch_method, **batch_kwargs):
        # `make` is a class method that creates the correct sampler depending on args
        self._sampler = Sampler.make(batch_method, **batch_kwargs)
```

The consequence here is that we can do:

```python
env = ...
agent_classes = [TD3, SAC]  # SAC will add `use_entropy=True` in `buffer_kwargs` unless specified otherwise
for cls in agent_classes:
    # all these can be trivially logged because they are strings and ints
    agent = cls(env, buffer_kwargs={'batch_method': 'nsteps', 'nsteps': 5})
```

The pros are:

The cons are:
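One way such a `make` factory could look (a sketch under the assumption of a simple string-keyed registry; none of these names come from the PR):

```python
class Sampler:
    _registry = {}

    @classmethod
    def register(cls, name):
        """Class decorator that records a subclass under a string key."""
        def wrap(subclass):
            cls._registry[name] = subclass
            return subclass
        return wrap

    @classmethod
    def make(cls, batch_method, **kwargs):
        """Create the correct sampler from a string and plain kwargs."""
        return cls._registry[batch_method](**kwargs)

@Sampler.register('nsteps')
class NStepSampler(Sampler):
    def __init__(self, nsteps=1):
        self.nsteps = nsteps

sampler = Sampler.make('nsteps', nsteps=5)
print(type(sampler).__name__, sampler.nsteps)  # -> NStepSampler 5
```

Because the configuration is just strings and ints, it can be passed through `buffer_kwargs` and logged verbatim, which is the advantage described above.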
I see, the idea would be to create a class in charge of sampling instead of putting everything in the buffer class. Sounds reasonable. @Miffyli or @AdamGleave, do you have any comment on that?
Just to reiterate: the storage would just store the samples and possibly priorities, and the sampler reads samples from the storage and computes the desired return (i.e. single-step, n-step, Retrace, etc.)?

This sounds good to me if they can be separated cleanly, as then other samplers can be added without changing the storage code (as mentioned by PartiallyTyped). I am mainly worried whether there could be some memory/performance improvement if the two were combined, but I still prefer separating the two if possible.

One thing: it should be called something other than "sampler" if the storage defines how samples are sampled; this "sampler" just computes the returns and returns the correct successor states.
To address the issue of overhead: it would be minimal, perhaps an additional function call or attribute lookup. P.S. I couldn't come up with a better name.
It seems reasonable to separate these. I second @Miffyli's confusion regarding the name "sampler" for something that computes returns. Perhaps "buffer" and "postprocessor" (instead of "sampler").
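A sketch of that separation with the suggested naming (hypothetical classes, not SB3 code): the buffer only stores raw transitions, and a postprocessor turns sampled indices into training targets:

```python
import numpy as np

class Buffer:
    """Stores raw transitions; knows nothing about returns."""
    def __init__(self, size):
        self.rewards = np.zeros(size)
        self.dones = np.zeros(size, dtype=bool)
        self.pos = 0

    def add(self, reward, done):
        self.rewards[self.pos] = reward
        self.dones[self.pos] = done
        self.pos += 1

class NStepPostprocessor:
    """Reads from the buffer and computes an n-step return for an index."""
    def __init__(self, n_steps, gamma):
        self.n_steps, self.gamma = n_steps, gamma

    def compute_return(self, buffer, idx):
        ret = 0.0
        for k in range(self.n_steps):
            ret += self.gamma ** k * buffer.rewards[idx + k]
            if buffer.dones[idx + k]:  # stop at the episode boundary
                break
        return ret

buf = Buffer(4)
for r, d in [(1.0, False), (1.0, False), (1.0, True), (5.0, False)]:
    buf.add(r, d)
post = NStepPostprocessor(n_steps=3, gamma=0.5)
print(post.compute_return(buf, 0))  # 1 + 0.5*1 + 0.25*1 = 1.75
```

Swapping in a single-step or Retrace postprocessor would then leave the storage code untouched, which is the point made above.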
For n-step, testing it on DMC Quadruped would be good, because n-step is expected to show some gain there.
Hey @araffin, I have been quite busy with grad school, so the branch has become stale.
Hey,
no worries. Let me try to get back to you this week if I have time.
Right now, I'm thinking the better place would be SB3-contrib (what do you think @Miffyli?)
Just out of curiosity I tried this buffer implementation and I think it does not handle multiple environments correctly. Specifically the line:

```python
not_dones = np.squeeze(1 - self.dones[indices], axis=-1)
```

causes a `ValueError: cannot select an axis to squeeze out which has size not equal to one` error.

As an aside, please don't take this as criticism of this project, which I enjoy using, but it's quite sad to see potentially really great contributions/PRs sitting open for nearly 3 years. In that time perhaps a lot of people could have benefitted; even if this was rushed into the contrib project, it would have been more useful than as-is.
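A minimal reproduction of the shape problem (the shapes here are assumptions based on SB3 storing `dones` per environment, i.e. `(buffer_size, n_envs)`): squeezing the last axis only works when `n_envs == 1`:

```python
import numpy as np

dones_single = np.zeros((128, 1))  # (buffer_size, n_envs) with n_envs == 1
dones_multi = np.zeros((128, 4))   # n_envs == 4, e.g. a vectorized environment
indices = np.array([0, 1, 2])

np.squeeze(1 - dones_single[indices], axis=-1)  # OK: (3, 1) -> (3,)
try:
    np.squeeze(1 - dones_multi[indices], axis=-1)  # (3, 4): last axis has size 4
except ValueError as e:
    print(e)  # the "cannot select an axis to squeeze out" error reported above
```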
I had totally forgotten about this between work and my master’s. I will spend a weekend or two to fix and ship it, indeed it’s a shame to leave it here.
I'm still a bit new to numpy and pytorch manipulation but keen to help with little things if you need (e.g. docs, etc.)
What is missing from this PR for it to be complete?
I'm not sure, to be honest; I couldn't get it to work (with vectorised environments, as previously mentioned) and wasn't sure how to go about fixing it.
Yeah, I also tested it; it can't work with vectorized or multiprocessed environments currently. This is why I asked: maybe someone knows where the problem is and we could get it fixed and maybe merged. It would be a great addition to the Q-learning section.
Changes
- `OffPolicyAlgorithm`: `buffer_size` in it.
- `NStepReplayBuffer` with `n_step` and `gamma`.
- `_get_samples(self, batch_inds, env) -> ReplayBufferSamples` to compute NStep returns.

Computing Nstep returns
- `self.pos` is ignored because it contains older experience.

Motivation and Context
Closes #47

Types of changes

Checklist:
- `make lint`, `make pytest` and `make type` both pass.

Nstep Implementation details
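As an illustrative sketch of the n-step return computation described in the changes above (variable names are assumptions, not the PR's code), the return accumulates discounted rewards and truncates at the first terminal:

```python
import numpy as np

def nstep_return(rewards, dones, start, n_step, gamma):
    """Discounted n-step return from `start`, truncated at the first done."""
    ret, discount = 0.0, 1.0
    for k in range(n_step):
        ret += discount * rewards[start + k]
        if dones[start + k]:  # do not bootstrap past an episode boundary
            break
        discount *= gamma
    return ret

rewards = np.array([1.0, 1.0, 1.0, 1.0])
dones = np.array([False, False, True, False])
print(nstep_return(rewards, dones, 0, n_step=4, gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```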