BEAR

1. Introduction

Bootstrapping Error Accumulation Reduction Q-Learning (BEAR, also written BEAR-QL) [1] is an actor-critic algorithm that builds on the core idea of BCQ but, instead of using a perturbation model, samples actions from a learned actor. As in BCQ, BEAR trains a generative model of the data distribution in the batch. Using this generative model, the actor is trained with the deterministic policy gradient, while minimizing the variance over an ensemble of Q-networks and constraining the maximum mean discrepancy (MMD) between the generative model and the actor through dual gradient descent:
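With the actor written as $\pi_\phi$, the generative model as $G_\omega$, the ensemble of $K$ Q-networks as $\{Q_{\theta_1}, \dots, Q_{\theta_K}\}$, the batch as $\mathcal{B}$, a variance-penalty weight $\tau$, and an MMD threshold $\epsilon$ (all notation assumed here, following [1]), the constrained objective is approximately

$$\phi \leftarrow \arg\max_{\phi} \ \mathbb{E}_{s \sim \mathcal{B},\, a \sim \pi_\phi(\cdot \mid s)} \Big[ \min_{j=1,\dots,K} Q_{\theta_j}(s, a) \;-\; \tau \operatorname{Var}_{j} Q_{\theta_j}(s, a) \Big] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \mathcal{B}} \Big[ \mathrm{MMD}\big(G_\omega(\cdot \mid s),\, \pi_\phi(\cdot \mid s)\big) \Big] \le \epsilon$$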

where the MMD is computed over some choice of kernel. The update rule for the ensemble of Q-networks matches BCQ, except that actions are sampled from the single actor network rather than being sampled from a generative model and perturbed:
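With target networks $Q_{\theta'_j}$ and candidate actions sampled from the actor (the weighted clipped double-Q form below mirrors BCQ; the minimum-weighting coefficient $\lambda$ is an assumed detail), the target value is roughly

$$y = r + \gamma \max_{a_i} \Big[ \lambda \min_{j} Q_{\theta'_j}(s', a_i) + (1 - \lambda) \max_{j} Q_{\theta'_j}(s', a_i) \Big], \qquad a_i \sim \pi_\phi(\cdot \mid s'),$$

and each $Q_{\theta_j}$ is regressed toward $y$ with a squared-error loss.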

The policy used during evaluation is defined similarly to BCQ, but again samples actions directly from the actor:
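Concretely, a small number $n$ of candidate actions is sampled from the actor at each state and the highest-valued one is executed (scoring with the first Q-network here is an assumed detail, mirroring BCQ):

$$\pi(s) = \arg\max_{a_i} Q_{\theta_1}(s, a_i), \qquad a_i \sim \pi_\phi(\cdot \mid s), \ i = 1, \dots, n.$$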

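To make the actor step concrete, the following is a minimal PyTorch-style sketch of the Lagrangian-relaxed objective above; the Gaussian-kernel MMD, the tensor shapes, and all names and hyperparameters are illustrative assumptions rather than the code actually used by bear-train.py.

```python
# Minimal PyTorch-style sketch of BEAR's constrained actor update.
# All names, shapes, and hyperparameters (sigma, eps, tau) are illustrative
# assumptions, not the exact implementation in bear-train.py.
import torch


def gaussian_mmd(x, y, sigma=20.0):
    """Squared MMD between two sets of sampled actions, shape (batch, n, act_dim)."""
    def k(a, b):
        diff = a.unsqueeze(2) - b.unsqueeze(1)            # (batch, n, m, act_dim)
        return torch.exp(-diff.pow(2).sum(-1) / (2.0 * sigma))
    return k(x, x).mean((1, 2)) - 2.0 * k(x, y).mean((1, 2)) + k(y, y).mean((1, 2))


def bear_actor_losses(q_ensemble, actor_actions, behavior_actions, log_lam,
                      eps=0.05, tau=0.5):
    """
    q_ensemble:       (K, batch) Q-values of the actor's actions under each Q-network
    actor_actions:    (batch, n, act_dim) samples from the actor pi_phi
    behavior_actions: (batch, n, act_dim) samples from the generative model G_omega
    log_lam:          learnable log Lagrange multiplier (the dual variable)
    """
    q_min = q_ensemble.min(dim=0).values                  # conservative ensemble value
    q_var = q_ensemble.var(dim=0)                         # ensemble-disagreement penalty
    mmd = gaussian_mmd(behavior_actions, actor_actions)   # support-matching term
    lam = log_lam.exp()
    # Actor: maximize (min_j Q - tau * Var_j Q) while paying lam * (MMD - eps).
    actor_loss = (-(q_min - tau * q_var) + lam.detach() * (mmd - eps)).mean()
    # Dual: gradient descent on this loss performs ascent on lam when MMD > eps.
    dual_loss = (-lam * (mmd.detach() - eps)).mean()
    return actor_loss, dual_loss


if __name__ == "__main__":
    K, batch, n, act_dim = 4, 8, 10, 6
    q = torch.randn(K, batch, requires_grad=True)          # stands in for Q_theta_j(s, a)
    a_pi = torch.randn(batch, n, act_dim, requires_grad=True)
    a_g = torch.randn(batch, n, act_dim)
    log_lam = torch.zeros(1, requires_grad=True)
    actor_loss, dual_loss = bear_actor_losses(q, a_pi, a_g, log_lam)
    actor_loss.backward()   # gradients flow back to the actor via q and a_pi
    dual_loss.backward()    # gradient step for the Lagrange multiplier
```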
2. Instructions

```
python bear-train.py --dataset=walker2d-random-v2 --seed=0 --gpu=0
```
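Judging by the flag names, --dataset selects the d4rl dataset to train on, --seed sets the random seed, and --gpu chooses the CUDA device.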

3. Performance


Reference

  1. Kumar A, Fu J, Soh M, et al. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. Advances in Neural Information Processing Systems, 2019, 32: 11784–11794.