
BCQ

1. Introduction

Batch-Constrained deep Q-learning (BCQ) [1] is a batch (offline) reinforcement learning method for continuous control. BCQ performs Q-learning while constraining the action space to eliminate actions that are unlikely to be selected by the behavioral policy $\pi_b$ and are therefore unlikely to be contained in the batch. At its core, BCQ uses a state-conditioned generative model $G_\omega(s)$ to model the distribution of data in the batch, akin to a behavioral cloning model. Because it is easier to sample from $\pi_b(a \mid s)$ than to model it exactly in a continuous action space, the policy is defined by sampling $N$ actions from $G_\omega(s)$ and selecting the highest-valued action according to a Q-network. Since BCQ was designed for continuous actions, the method also includes a perturbation model $\xi_\phi(s, a, \Phi)$, which outputs a residual in the range $[-\Phi, \Phi]$ that is added to each sampled action, and which is trained with the deterministic policy gradient. Finally, the authors use a weighted version of Clipped Double Q-learning, with weight $\lambda = 0.75$, to penalize high-variance estimates and reduce overestimation bias. Here each $a_i$ is a perturbed action sampled from the generative model at the next state $s'$:

$$
y = r + \gamma \max_{a_i} \left[ \lambda \min_{j=1,2} Q_{\theta'_j}(s', a_i) + (1 - \lambda) \max_{j=1,2} Q_{\theta'_j}(s', a_i) \right]
$$

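The weighted Clipped Double Q-learning target can be sketched in pure Python for a single transition: for each candidate action $a_i$ it mixes $\lambda \min_j Q_{\theta'_j}(s', a_i)$ with $(1 - \lambda) \max_j Q_{\theta'_j}(s', a_i)$, takes the maximum over the candidates, and performs the Bellman backup. This is a minimal sketch, not this repository's code: it assumes both target critics have already been evaluated on the $N$ perturbed candidates, and the name `bcq_target`, the `not_done` terminal mask (standard in practice, though not part of the equation) and the default $\gamma = 0.99$ are illustrative assumptions.

```python
from typing import Sequence


def bcq_target(
    reward: float,
    not_done: float,
    q1_next: Sequence[float],
    q2_next: Sequence[float],
    gamma: float = 0.99,
    lam: float = 0.75,
) -> float:
    """Weighted Clipped Double Q-learning target for one transition.

    q1_next[i] and q2_next[i] hold Q_{theta'_1}(s', a_i) and Q_{theta'_2}(s', a_i)
    for the i-th perturbed candidate action a_i at the next state s'.
    """
    if not q1_next or len(q1_next) != len(q2_next):
        raise ValueError("both critics need the same, non-zero number of candidates")
    # Per candidate: lambda * min + (1 - lambda) * max over the two target critics.
    mixed = [lam * min(q1, q2) + (1.0 - lam) * max(q1, q2) for q1, q2 in zip(q1_next, q2_next)]
    # Max over the N candidates, then the Bellman backup; not_done = 0 cuts the bootstrap.
    return reward + not_done * gamma * max(mixed)


# Two candidate actions at s'; gamma = 0.9 keeps the arithmetic easy to follow.
y = bcq_target(1.0, 1.0, q1_next=[2.0, 4.0], q2_next=[3.0, 1.0], gamma=0.9)
print(round(y, 6))  # → 3.025
```

With $\lambda = 1$ this reduces to the ordinary clipped double-Q minimum, and with $\lambda = 0$ to the more optimistic maximum; $\lambda = 0.75$ sits in between.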
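During evaluation, the policy is defined similarly: it samples $N$ actions from the generative model, perturbs them, and selects the argmax:

$$
\pi(s) = \arg\max_{a_i + \xi_\phi(s, a_i, \Phi)} Q_{\theta_1}\big(s, a_i + \xi_\phi(s, a_i, \Phi)\big), \quad \{a_i \sim G_\omega(s)\}_{i=1}^{N}
$$
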
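The evaluation-time policy (sample $N$ actions, add a bounded residual to each, keep the one the first critic rates highest) can be sketched in pure Python as below. It assumes the generative model, perturbation model and first critic are available as plain callables; `sample_action`, `perturb` and `q_value` are hypothetical names, not this repository's API. The defaults $N = 10$ and $\Phi = 0.05$ are illustrative, and `max_action` assumes a symmetric action range $[-1, 1]$. The residual is simply clipped into $[-\Phi, \Phi]$ here; a tanh-squashed network output scaled by $\Phi$ is a common alternative.

```python
import random
from typing import Callable, List, Sequence

Action = List[float]


def select_action(
    state: Sequence[float],
    sample_action: Callable[[Sequence[float]], Action],   # hypothetical stand-in for G_omega(s)
    perturb: Callable[[Sequence[float], Action], Action],  # hypothetical stand-in for xi_phi(s, a)
    q_value: Callable[[Sequence[float], Action], float],   # hypothetical stand-in for Q_theta1
    n: int = 10,
    phi: float = 0.05,
    max_action: float = 1.0,
) -> Action:
    """Sample n actions, perturb each by a residual in [-phi, phi], return the argmax under Q."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[Action] = []
    for _ in range(n):
        a = sample_action(state)
        # Clip the residual into [-phi, phi] (assumption: clipping instead of tanh scaling).
        xi = [max(-phi, min(phi, d)) for d in perturb(state, a)]
        # Add the residual and keep the perturbed action inside the valid action range.
        candidates.append([max(-max_action, min(max_action, ai + di)) for ai, di in zip(a, xi)])
    # Evaluate every perturbed candidate with the critic and keep the highest-valued one.
    return max(candidates, key=lambda a: q_value(state, a))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 1-D problem: "behavior" actions cluster near 0.3 and Q peaks at 0.35.
    action = select_action(
        state=[0.0],
        sample_action=lambda s: [rng.gauss(0.3, 0.1)],
        perturb=lambda s, a: [1.0],  # always push right by the largest allowed residual
        q_value=lambda s, a: -abs(a[0] - 0.35),
    )
    print(action)
```

Keeping the residual small is what keeps the chosen action close to the data: the generative model proposes in-distribution actions, and the perturbation can only nudge them by at most $\Phi$ before the critic picks among them.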
2. Instructions

```shell
python bcq-train.py --dataset=walker2d-random-v2 --seed=0 --gpu=0
```

3. Performance

[Figure: BCQ performance results]

Reference

  1. Fujimoto S, Meger D, Precup D. Off-policy deep reinforcement learning without exploration. In: International Conference on Machine Learning (ICML). PMLR, 2019: 2052-2062.