[feature-request] N-step returns for TD methods #821
Comments
Indeed, this would be a sensible upgrade we could include in v3.1. However, it will also require a bit of algorithm-specific tuning when it comes to bootstrapping with the current value function (estimating the value from the N-th step onwards).
Why so? Having
Ah yes, if the replay buffer provides it like that, and we then have one function to compute the n-step returns (as in Rainbow and other DeepMind codebases), then yup, it shouldn't be more complicated than that. Things could get a bit hairier with recurrent policies, but that is a problem for when this actually gets implemented.
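For illustration, a minimal sketch of such a single helper, assuming the buffer hands back the n rewards, the done flags, and a bootstrap value estimate for the state at the N-th step (the names `n_step_return`, `next_value`, etc. are hypothetical, not SB3 API):

```python
def n_step_return(rewards, dones, next_value, gamma=0.99):
    """Hypothetical helper: truncated n-step return bootstrapped with the
    current value function,
    G = r_0 + g*r_1 + ... + g^(n-1)*r_(n-1) + g^n * V(s_n),
    cut short at episode boundaries."""
    ret, discount = 0.0, 1.0
    for reward, done in zip(rewards, dones):
        ret += discount * reward
        if done:  # episode ended before the N-th step: no bootstrap
            return ret
        discount *= gamma
    # estimate the value from the N-th step onwards with V(s_n)
    return ret + discount * next_value
```

Such a function would be called once per sampled transition at update time, with `next_value` coming from the (target) value network.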
I'm totally for this feature (it is on my internal roadmap for v3), but as @Miffyli suggests, more after the first stable release.
In the initial post I considered using just the truncated sum of the returns; however, it would be invaluable to also have Retrace(λ), Tree-backup(λ), and importance sampling. That said, I understand this introduces a non-trivial amount of additional complexity. Edit: off-policy methods with Retrace (and friends) with a replay buffer can use LazyArrays to store the observations with minimal memory overhead. They will also need to store action probabilities, which, while simple enough, prevents the algorithms from being plug-and-play.
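To make the extra complexity concrete, here is a rough sketch of Retrace(λ) targets under the assumption above that behaviour-policy probabilities are stored alongside the actions (all names, such as `retrace_targets` and `mu_probs`, are hypothetical, not SB3 API):

```python
import numpy as np

def retrace_targets(q, v, rewards, pi_probs, mu_probs, gamma=0.99, lam=1.0):
    """Hypothetical sketch of Retrace(lambda) targets for one stored trajectory.

    q:        Q(s_t, a_t) for the actions actually taken, shape (T,)
    v:        expected Q under the target policy, E_pi[Q(s_t, .)],
              shape (T + 1,) so v[T] bootstraps past the trajectory end
    rewards:  r_t, shape (T,)
    pi_probs: pi(a_t | s_t) under the current policy, shape (T,)
    mu_probs: behaviour-policy probabilities stored in the buffer, shape (T,)
    """
    T = len(rewards)
    # truncated importance weights c_t = lam * min(1, pi / mu)
    c = lam * np.minimum(1.0, pi_probs / mu_probs)
    targets = np.empty(T)
    acc = 0.0
    for t in reversed(range(T)):  # backward recursion over TD errors
        delta = rewards[t] + gamma * v[t + 1] - q[t]
        acc = delta + (gamma * c[t + 1] * acc if t + 1 < T else 0.0)
        targets[t] = q[t] + acc
    return targets
```

The truncated sum from the initial post corresponds to dropping the importance weights entirely; Retrace keeps the update off-policy-safe at the cost of storing `mu_probs`.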
Closing this, as it is now on the roadmap for SB3 v1.1+.
N-step returns allow for much better stability and improve performance when training DQN, DDPG, etc., so it would be quite useful to have this feature.
A simple implementation of this would be a wrapper around `ReplayBuffer`, so it would work with both prioritized and uniform sampling. The wrapper keeps a queue of observed experiences, computes the returns, and adds the experience to the buffer (see the sketch below). An issue with this approach is that it will not work with MPI.
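A minimal sketch of that wrapper idea, assuming an underlying buffer that exposes an `add(obs, action, reward, next_obs, done)` method (`NStepWrapper` and its interface are hypothetical, not SB3's actual API):

```python
from collections import deque

class NStepWrapper:
    """Hypothetical sketch: queue up single-step transitions and push
    n-step transitions into the wrapped buffer (uniform or prioritized)."""

    def __init__(self, buffer, n_steps=3, gamma=0.99):
        self.buffer = buffer
        self.n_steps = n_steps
        self.gamma = gamma
        self.queue = deque()

    def add(self, obs, action, reward, next_obs, done):
        self.queue.append((obs, action, reward, next_obs, done))
        if done:
            while self.queue:  # flush all pending transitions at episode end
                self._push_oldest()
        elif len(self.queue) == self.n_steps:
            self._push_oldest()

    def _push_oldest(self):
        obs, action = self.queue[0][0], self.queue[0][1]
        ret, discount = 0.0, 1.0
        for _, _, reward, next_obs, done in self.queue:
            ret += discount * reward
            discount *= self.gamma
            if done:
                break
        # next_obs/done come from the last step summed over, so the learner
        # can bootstrap V(s_{t+n}) at update time (skipped when done)
        self.buffer.add(obs, action, ret, next_obs, done)
        self.queue.popleft()
```

One design note: flushing at episode end stores shorter-than-n returns for the tail of each episode, which matches the truncated-sum behaviour discussed above.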
An alternative solution is to have the agent keep track of the experiences; while that would work for MPI, it would not work with VecEnv. However, MPI is dropped from v3, so perhaps this can be postponed to v3.1?