
Add TRPO #40

Merged
merged 33 commits on Dec 29, 2021

Conversation


@cyprienc cyprienc commented Sep 8, 2021

Description

This PR adds TRPO (Trust Region Policy Optimization): https://arxiv.org/abs/1502.05477
It is still a work in progress (see the TODO list below).
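For context, TRPO computes a natural-gradient step and then shrinks it with a backtracking line search until the surrogate objective improves and the KL constraint holds. A minimal NumPy sketch of that line search (a simplified illustration, not this PR's code; `objective_fn`, `kl_fn`, and the constants are hypothetical stand-ins):

```python
import numpy as np

def backtracking_line_search(theta, step_dir, objective_fn, kl_fn,
                             max_kl=0.01, shrink=0.8, max_iters=10):
    """Shrink the candidate step until the surrogate objective improves
    and the KL constraint is satisfied (simplified sketch)."""
    f0 = objective_fn(theta)
    step = 1.0
    for _ in range(max_iters):
        candidate = theta + step * step_dir
        # Accept only if the objective improved AND the KL budget holds
        if objective_fn(candidate) > f0 and kl_fn(candidate) <= max_kl:
            return candidate
        step *= shrink  # backtrack: try a smaller step
    return theta  # reject the update if no step satisfies both conditions
```

Rejecting the update entirely when no step passes both checks is what makes the method "trust region": a bad direction never moves the policy.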

Context

Closes #38

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist:

  • I've read the CONTRIBUTION guide (required)
  • The functionality/performance matches that of the source (required for new training algorithms or training-related features).
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have included an example of using the feature (required for new features).
  • I have included baseline results (required for new training algorithms or training-related features).
  • I have updated the documentation accordingly.
  • I have updated the changelog accordingly (required).
  • I have reformatted the code using make format (required)
  • I have checked the codestyle using make check-codestyle and make lint (required)
  • I have ensured make pytest and make type both pass. (required)

MuJoCo 1M Benchmark

MuJoCo v2.1.0, v3 envs

| Environments | TRPO         |
|--------------|--------------|
| HalfCheetah  | 1803 +/- 46  |
| Ant          | 3554 +/- 591 |
| Hopper       | 3372 +/- 215 |
| Walker2d     | 4502 +/- 234 |
| Swimmer      | 359 +/- 2    |

(Learning-curve plots attached: Results_Ant, Results_HalfCheetah, Results_Hopper, Results_Swimmer, Results_Walker2d.)

WIP - Trust Region Policy Optimization

- Currently the Hessian vector product is not working (see inline comments for more detail)
- Adding a no_grad block for the line search
- Additional assert in the conjugate solver to help debugging
- Adding ActorCriticPolicy.get_distribution
- Using the Distribution object to compute the KL divergence
- Checking for objective improvement in the line search
- Moving magic numbers to instance variables
- Improving numerical stability of the conjugate gradient algorithm
- Critic updates
- Changes around the alpha of the line search
- Adding TRPO to __init__ files
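The conjugate solver mentioned in the commits above solves the linear system H x = g without ever materializing H, using only a function that computes H @ v. A self-contained NumPy sketch of the classic algorithm (names mirror the PR's `conjugate_gradient_solver` / `matrix_vector_dot_func`, but this is an illustration under simplified assumptions, not the PR's code):

```python
import numpy as np

def conjugate_gradient_solver(matrix_vector_dot_func, b,
                              max_iter=10, residual_tol=1e-10):
    """Solve A x = b for symmetric positive-definite A, given only
    a function computing A @ v (matrix-free conjugate gradient)."""
    x = np.zeros_like(b)
    r = b.copy()        # residual b - A @ x, with x = 0
    p = r.copy()        # current search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = matrix_vector_dot_func(p)
        alpha = rs_old / (p @ Ap)      # step size along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < residual_tol:      # early exit helps numerical stability
            break
        p = r + (rs_new / rs_old) * p  # next direction, conjugate to the rest
        rs_old = rs_new
    return x
```

In TRPO, `matrix_vector_dot_func` is the Fisher/Hessian-vector product of the KL divergence, so the full second-order matrix never needs to be formed.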

araffin commented Sep 9, 2021

Thanks for the PR.
Before I help you run larger-scale experiments, could you first match the TRPO results from SB2 on simple envs (classic control: CartPole, Pendulum, LunarLander)?
You can find tuned params in the SB2 zoo: https://github.com/araffin/rl-baselines-zoo/blob/master/hyperparams/trpo.yml

Once this is done, you should focus on documentation and tests, and I will run experiments on pybullet envs + atari games ;)

- Renaming cg_solver to conjugate_gradient_solver and renaming parameter Avp_fun to matrix_vector_dot_func + docstring
- Extra comments + better variable names in trpo.py
- Defining a method for the Hessian vector product instead of an inline function
- Fix registering correct policies for TRPO and using correct policy base in constructor
- Refactoring sb3_contrib.common.policies to reuse as much code as possible from SB3
- get_distribution will be added directly to the SB3 version of ActorCriticPolicy; this commit reflects this
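The "method for the Hessian vector product" typically refers to the standard double-backprop trick: instead of forming the Hessian, differentiate the dot product of the gradient with a vector. A minimal PyTorch sketch of that trick (an illustration, not the PR's exact method):

```python
import torch

def hessian_vector_product(loss, params, vector):
    """Compute H @ vector, where H is the Hessian of loss w.r.t. params,
    via two backward passes and without materializing H."""
    # First backward pass: gradient of the loss, kept in the autograd graph
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    # Second backward pass: d/dtheta (grad . v) equals H @ v
    grad_dot_v = (flat_grad * vector).sum()
    hvp = torch.autograd.grad(grad_dot_v, params)
    return torch.cat([h.reshape(-1) for h in hvp])
```

This is the `matrix_vector_dot_func` the conjugate gradient solver needs: H is only ever touched through products H @ v, which cost about as much as two gradient evaluations.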

araffin commented Sep 13, 2021

Could you remove the protection on your master branch so I can push changes?
While waiting for that, you can find them here: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/tree/feat/get-dist

(Next time, please use another branch ;))

@cyprienc

Here are the results using the SB2 hyperparameters (I'll update the PR on the zoo with the parameters used):

| Environments   | TRPO         |
|----------------|--------------|
| CartPole-v1    | 468 +/- 27   |
| Pendulum-v0    | -330 +/- 110 |
| LunarLander-v2 | 53 +/- 57    |


araffin commented Sep 13, 2021

> Here are the results using the SB2 hyperparameters (I'll update the PR on the zoo with the parameters used):
>
> | Environments   | TRPO         |
> |----------------|--------------|
> | CartPole-v1    | 468 +/- 27   |
> | Pendulum-v0    | -330 +/- 110 |
> | LunarLander-v2 | 53 +/- 57    |

Thanks =)! Could you also add a learning curve/results comparison with SB2 here? (Plot scripts are included in the zoo; I can help with them if needed.)


araffin commented Sep 29, 2021

Hi,
So I took a closer look and started experimenting with bullet envs. After some fixes, the results look good =D (I updated the hyperparams in DLR-RM/rl-baselines3-zoo#163.)
The entropy coeff is still missing though (important for Atari games, I think).

Could you start adding the documentation page + more tests?

Btw, I'm thinking about renaming some variables (related to the backtracking line search) so we are more consistent with other implementations, but those are just details...


@araffin araffin left a comment

LGTM, thanks =)

@araffin araffin merged commit 59be198 into Stable-Baselines-Team:master Dec 29, 2021

araffin commented Oct 14, 2022

@cyprienc I think it's time to move TRPO to SB3 =)!
Could you do a PR that adds TRPO to SB3 and removes it from SB3-Contrib, while maintaining backward compatibility by doing `from stable_baselines3 import TRPO` in the init?

@cyprienc

@araffin sure, will do.

Linked issue: [Feature Request] Implement TRPO (#38)