
Two Actors at the Price of One

or: Training PPO with a DQN critic using ReLAx

This repository contains an implementation of a PPO+DDQN training loop for discrete control tasks.

Overall Idea

PPO needs an estimate of advantages to run its training process. Typically, advantages for PPO are estimated with the GAE-lambda algorithm.
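For reference, here is a minimal NumPy sketch of GAE-lambda advantage estimation; the function name and argument layout are illustrative, not the ReLAx API:

```python
import numpy as np

def gae_lambda(rewards, values, dones, gamma=0.99, lam=0.95):
    """Compute GAE-lambda advantages for one trajectory batch.

    rewards, dones: arrays of length T
    values: array of length T + 1 (includes a bootstrap value for the last state)
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    last_gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Exponentially weighted sum of future TD residuals
        last_gae = delta + gamma * lam * nonterminal * last_gae
        advantages[t] = last_gae
    return advantages
```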

This notebook explores the possibility of training PPO paired with a DDQN critic.

While PPO is trained on-policy using transitions sampled with its policy network, DQN is trained on off-policy data stored in a replay buffer (which is filled with the training batches sampled by the PPO actor).

Theoretically, such a procedure should allow us to train two agents (PPO+DQN and ArgmaxQValue+DQN) on the same samples.
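A minimal sketch of how advantages for PPO's update could be derived from a DDQN critic in a discrete-action setting, assuming hypothetical `q_net` and `policy_net` PyTorch modules (this illustrates the idea only, not ReLAx's actual API): the advantage is A(s, a) = Q(s, a) - V(s), where V(s) is the policy-weighted average of the critic's Q-values.

```python
import torch

def dqn_critic_advantages(q_net, policy_net, obs, acs):
    """Estimate advantages for the actions taken by the PPO actor
    using the DDQN critic's Q-values (hypothetical modules, for illustration)."""
    with torch.no_grad():
        q_values = q_net(obs)                        # [batch, n_actions]
        probs = torch.softmax(policy_net(obs), -1)   # PPO's action distribution
        # State value under the current policy: V(s) = sum_a pi(a|s) * Q(s, a)
        v_values = (probs * q_values).sum(dim=-1)
        # Q-value of the action actually taken by the PPO actor
        q_taken = q_values.gather(1, acs.long().unsqueeze(-1)).squeeze(-1)
        # Advantage: A(s, a) = Q(s, a) - V(s)
        advantages = q_taken - v_values
    return advantages
```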

The plot below shows smoothed training runs (evaluated on separate environments) for PPO+DQN and ArgmaxQValue+DQN:

ppo_dqn_training

As we can see, PPO+DQN outperforms ArgmaxQValue+DQN over the entire course of training.

Trained policies

PPO

ppo_actor.mp4

DQN

dqn_actor.mp4