[feature-request] N-step returns for TD methods #821

Closed
m-rph opened this issue Apr 21, 2020 · 6 comments
Labels
enhancement (New feature or request), v3 (Discussion about V3)

Comments

m-rph commented Apr 21, 2020

N-step returns allow for much better stability and improve performance when training DQN, DDPG, etc., so it would be quite useful to have this feature.

A simple implementation of this would be a wrapper around ReplayBuffer, so it would work with both prioritized and uniform sampling. The wrapper keeps a queue of observed experiences, computes the n-step returns, and adds the resulting transitions to the buffer (sketched below).

An issue with this approach is that it will not work with MPI. An alternative solution is to have the agent keep track of the experiences; while that would work with MPI, it would not work with VecEnv.

However, MPI is dropped in V3, so perhaps this can be postponed to v3.1?
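A minimal sketch of this wrapper idea, assuming a plain `buffer.add(obs, action, reward, next_obs, done)` interface (the class name and method signatures are illustrative, not the actual ReplayBuffer API):

```python
from collections import deque


class NStepReplayWrapper:
    """Hypothetical wrapper that turns 1-step transitions into n-step ones
    before handing them to an underlying replay buffer (uniform or prioritized)."""

    def __init__(self, buffer, n_steps=3, gamma=0.99):
        self.buffer = buffer
        self.n_steps = n_steps
        self.gamma = gamma
        self.queue = deque()

    def add(self, obs, action, reward, next_obs, done):
        self.queue.append((obs, action, reward, next_obs, done))
        if done:
            # Episode ended: flush the queue so the tail transitions are not lost.
            while self.queue:
                self._emit()
                self.queue.popleft()
        elif len(self.queue) == self.n_steps:
            # Full window: emit the oldest transition and slide the window.
            self._emit()
            self.queue.popleft()

    def _emit(self):
        # Sum the discounted rewards over the stored steps, starting at the
        # oldest one, and keep s_{t+n} (or the terminal state) as next_obs.
        obs, action = self.queue[0][0], self.queue[0][1]
        n_step_return, discount = 0.0, 1.0
        for _, _, reward, next_obs, done in self.queue:
            n_step_return += discount * reward
            discount *= self.gamma
            if done:
                break
        self.buffer.add(obs, action, n_step_return, next_obs, done)
```

Since the wrapper only changes what gets written into the buffer, prioritized and uniform sampling stay untouched.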

@Miffyli Miffyli added the enhancement New feature or request label Apr 21, 2020
Miffyli (Collaborator) commented Apr 21, 2020

Indeed, this would be a sensible upgrade we could include in v3.1. However, it will also require a bit of algorithm-specific handling when it comes to bootstrapping with the current value function (estimating the value from the N-th step onwards).

@Miffyli Miffyli added the v3 Discussion about V3 label Apr 21, 2020
m-rph (Author) commented Apr 21, 2020

Why so? Having (s_t, a_t, r_{t:t+n}, done, s_{t+n}) in the replay buffer, we can use s_{t+n} as the next state for the target network. This is how it was done in Rainbow, so I assume it will be okay?
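For reference, a hedged sketch of the resulting DQN-style target, assuming the buffer stores the truncated discounted return and s_{t+n} as described above (the batch fields and `target_q_net` are placeholders, not library internals):

```python
import torch


def n_step_dqn_targets(target_q_net, batch, gamma, n_steps):
    # batch.n_step_returns : sum_{k=0}^{n-1} gamma^k * r_{t+k}
    # batch.next_obs       : s_{t+n} (or the terminal state)
    # batch.dones          : 1.0 if the n-step window hit a terminal state, else 0.0
    with torch.no_grad():
        next_q = target_q_net(batch.next_obs).max(dim=1).values
        # Bootstrap from s_{t+n}, discounted by gamma^n, unless terminal.
        return batch.n_step_returns + (gamma ** n_steps) * (1.0 - batch.dones) * next_q
```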

Miffyli (Collaborator) commented Apr 21, 2020

Ah yes, if the replay buffer provides it like that, and we then have one function to compute the n-step returns (as in the Rainbow and other DeepMind codebases), then yup, it shouldn't be more complicated than that. Things could get a bit hairier with recurrent policies, but that is a problem for when this is actually implemented.

araffin (Collaborator) commented Apr 21, 2020

I'm totally for this feature (it is on my internal roadmap for v3), but, as @Miffyli suggests, more after the first stable release.

m-rph (Author) commented Apr 29, 2020

In the initial post I considered using just the truncated sum of the returns; however, it would be invaluable to also have Retrace(λ), Tree Backup(λ), and importance sampling. I understand that this introduces a non-trivial amount of additional complexity.

Edit: Off-policy methods with Retrace (and friends) combined with a replay buffer can use LazyArrays to store the observations with minimal memory overhead. They will also need to store the behaviour action probabilities, which, while simple enough, prevents the algorithms from being plug and play.
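To make the extra bookkeeping concrete, a small sketch of the truncated importance weights c_i = λ · min(1, π(a_i|s_i) / μ(a_i|s_i)) used by Retrace(λ), assuming the buffer also stores the behaviour probabilities μ (function and argument names are illustrative):

```python
import numpy as np


def retrace_weights(pi_probs, mu_probs, lam=1.0):
    """Truncated importance weights c_i = lam * min(1, pi(a_i|s_i) / mu(a_i|s_i)).

    pi_probs: probabilities of the stored actions under the current policy.
    mu_probs: probabilities of the same actions under the behaviour policy
              (the extra quantity the replay buffer would have to store).
    """
    return lam * np.minimum(1.0, pi_probs / mu_probs)
```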

araffin (Collaborator) commented Jun 1, 2020

Closing this, as it is now on the roadmap for SB3 v1.1+.
