[feature-request] N-step returns for TD methods #821
Comments
Indeed, this would be a sensible upgrade we could include in v3.1. However, it will also require a bit of algorithm-specific tuning when it comes to bootstrapping with the current value function (estimating the value from the N-th step onwards).
Why so? Having
Ah yes, if the replay buffer provides it like that, and we then have one function to compute the n-step returns (as in Rainbow and other DeepMind codebases), then yup, it shouldn't be more complicated than that. Things could get a bit hairier with recurrent policies, but that is a problem for when this actually gets implemented.
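For illustration, a minimal sketch of such a single helper, assuming the buffer hands back the n rewards, the done flags, and a bootstrap value estimate for the state at the N-th step (the names `n_step_return`, `next_value`, etc. are hypothetical, not SB3 API):

```python
def n_step_return(rewards, dones, next_value, gamma=0.99):
    """Hypothetical helper: truncated n-step return bootstrapped with the
    current value function,
    G = r_0 + g*r_1 + ... + g^(n-1)*r_(n-1) + g^n * V(s_n),
    cut short at episode boundaries."""
    ret, discount = 0.0, 1.0
    for reward, done in zip(rewards, dones):
        ret += discount * reward
        if done:  # episode ended before the N-th step: no bootstrap
            return ret
        discount *= gamma
    # estimate the value from the N-th step onwards with V(s_n)
    return ret + discount * next_value
```

Such a function would be called once per sampled transition at update time, with `next_value` coming from the (target) value network.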
I'm totally for this feature (it is on my internal roadmap for v3), but as @Miffyli suggests, more after the first stable release.
In the initial post I considered using just the truncated sum of the returns; however, it would be invaluable to also have Retrace(λ), Tree-backup(λ), and importance sampling. That said, I understand this introduces a non-trivial amount of additional complexity. Edit: off-policy methods with Retrace (and friends) with a replay buffer can use LazyArrays to store the observations with minimal memory overhead. They will also need to store action probabilities, which, while simple enough, prevents the algorithms from being plug-and-play.
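To make the extra complexity concrete, here is a rough sketch of Retrace(λ) targets under the assumption above that behaviour-policy probabilities are stored alongside the actions (all names, such as `retrace_targets` and `mu_probs`, are hypothetical, not SB3 API):

```python
import numpy as np

def retrace_targets(q, v, rewards, pi_probs, mu_probs, gamma=0.99, lam=1.0):
    """Hypothetical sketch of Retrace(lambda) targets for one stored trajectory.

    q:        Q(s_t, a_t) for the actions actually taken, shape (T,)
    v:        expected Q under the target policy, E_pi[Q(s_t, .)],
              shape (T + 1,) so v[T] bootstraps past the trajectory end
    rewards:  r_t, shape (T,)
    pi_probs: pi(a_t | s_t) under the current policy, shape (T,)
    mu_probs: behaviour-policy probabilities stored in the buffer, shape (T,)
    """
    T = len(rewards)
    # truncated importance weights c_t = lam * min(1, pi / mu)
    c = lam * np.minimum(1.0, pi_probs / mu_probs)
    targets = np.empty(T)
    acc = 0.0
    for t in reversed(range(T)):  # backward recursion over TD errors
        delta = rewards[t] + gamma * v[t + 1] - q[t]
        acc = delta + (gamma * c[t + 1] * acc if t + 1 < T else 0.0)
        targets[t] = q[t] + acc
    return targets
```

The truncated sum from the initial post corresponds to dropping the importance weights entirely; Retrace keeps the update off-policy-safe at the cost of storing `mu_probs`.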
Closing this, as it is now on the roadmap for SB3 v1.1+.
N-step returns allow for much better stability and improve performance when training DQN, DDPG, etc., so it would be quite useful to have this feature.
A simple implementation of this would be a wrapper around `ReplayBuffer`, so it would work with both prioritized and uniform sampling. The wrapper keeps a queue of observed experiences, computes the returns, and adds the experience to the buffer (see the sketch below). An issue with this approach is that it will not work with MPI.
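A minimal sketch of that wrapper idea, assuming an underlying buffer that exposes an `add(obs, action, reward, next_obs, done)` method (`NStepWrapper` and its interface are hypothetical, not SB3's actual API):

```python
from collections import deque

class NStepWrapper:
    """Hypothetical sketch: queue up single-step transitions and push
    n-step transitions into the wrapped buffer (uniform or prioritized)."""

    def __init__(self, buffer, n_steps=3, gamma=0.99):
        self.buffer = buffer
        self.n_steps = n_steps
        self.gamma = gamma
        self.queue = deque()

    def add(self, obs, action, reward, next_obs, done):
        self.queue.append((obs, action, reward, next_obs, done))
        if done:
            while self.queue:  # flush all pending transitions at episode end
                self._push_oldest()
        elif len(self.queue) == self.n_steps:
            self._push_oldest()

    def _push_oldest(self):
        obs, action = self.queue[0][0], self.queue[0][1]
        ret, discount = 0.0, 1.0
        for _, _, reward, next_obs, done in self.queue:
            ret += discount * reward
            discount *= self.gamma
            if done:
                break
        # next_obs/done come from the last step summed over, so the learner
        # can bootstrap V(s_{t+n}) at update time (skipped when done)
        self.buffer.add(obs, action, ret, next_obs, done)
        self.queue.popleft()
```

One design note: flushing at episode end stores shorter-than-n returns for the tail of each episode, which matches the truncated-sum behaviour discussed above.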
An alternative solution is to have the agent keep track of the experiences; while that would work for MPI, it would not work with VecEnv. However, MPI is dropped from v3, so perhaps this can be postponed to v3.1?