BEAR

1. Introduction

Bootstrapping Error Accumulation Reduction Q-Learning (BEAR, also written BEAR-QL) [1] is an actor-critic algorithm that builds on the core idea of BCQ but, instead of using a perturbation model, samples actions from a learned actor. As in BCQ, BEAR trains a generative model of the data distribution in the batch. Using this generative model, the actor is trained with the deterministic policy gradient, while minimizing the variance over an ensemble of Q-networks and constraining the maximum mean discrepancy (MMD) between the generative model and the actor through dual gradient descent:
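With the actor written as $\pi_\phi$, the generative model as $G_\omega$, the ensemble of $K$ Q-networks as $\{Q_{\theta_1}, \dots, Q_{\theta_K}\}$, the batch as $\mathcal{B}$, a variance-penalty weight $\tau$, and an MMD threshold $\epsilon$ (all notation assumed here, following [1]), the constrained objective is approximately

$$\phi \leftarrow \arg\max_{\phi} \ \mathbb{E}_{s \sim \mathcal{B},\, a \sim \pi_\phi(\cdot \mid s)} \Big[ \min_{j=1,\dots,K} Q_{\theta_j}(s, a) \;-\; \tau \operatorname{Var}_{j} Q_{\theta_j}(s, a) \Big] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \mathcal{B}} \Big[ \mathrm{MMD}\big(G_\omega(\cdot \mid s),\, \pi_\phi(\cdot \mid s)\big) \Big] \le \epsilon$$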

where the MMD is computed over some choice of kernel. The update rule for the ensemble of Q-networks matches BCQ, except that actions are sampled from the single actor network rather than being sampled from a generative model and perturbed:
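With target networks $Q_{\theta'_j}$ and candidate actions sampled from the actor (the weighted clipped double-Q form below mirrors BCQ; the minimum-weighting coefficient $\lambda$ is an assumed detail), the target value is roughly

$$y = r + \gamma \max_{a_i} \Big[ \lambda \min_{j} Q_{\theta'_j}(s', a_i) + (1 - \lambda) \max_{j} Q_{\theta'_j}(s', a_i) \Big], \qquad a_i \sim \pi_\phi(\cdot \mid s'),$$

and each $Q_{\theta_j}$ is regressed toward $y$ with a squared-error loss.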

The policy used during evaluation is defined similarly to BCQ, but again samples actions directly from the actor:
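Concretely, a small number $n$ of candidate actions is sampled from the actor at each state and the highest-valued one is executed (scoring with the first Q-network here is an assumed detail, mirroring BCQ):

$$\pi(s) = \arg\max_{a_i} Q_{\theta_1}(s, a_i), \qquad a_i \sim \pi_\phi(\cdot \mid s), \ i = 1, \dots, n.$$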

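To make the actor step concrete, the following is a minimal PyTorch-style sketch of the Lagrangian-relaxed objective above; the Gaussian-kernel MMD, the tensor shapes, and all names and hyperparameters are illustrative assumptions rather than the code actually used by bear-train.py.

```python
# Minimal PyTorch-style sketch of BEAR's constrained actor update.
# All names, shapes, and hyperparameters (sigma, eps, tau) are illustrative
# assumptions, not the exact implementation in bear-train.py.
import torch


def gaussian_mmd(x, y, sigma=20.0):
    """Squared MMD between two sets of sampled actions, shape (batch, n, act_dim)."""
    def k(a, b):
        diff = a.unsqueeze(2) - b.unsqueeze(1)            # (batch, n, m, act_dim)
        return torch.exp(-diff.pow(2).sum(-1) / (2.0 * sigma))
    return k(x, x).mean((1, 2)) - 2.0 * k(x, y).mean((1, 2)) + k(y, y).mean((1, 2))


def bear_actor_losses(q_ensemble, actor_actions, behavior_actions, log_lam,
                      eps=0.05, tau=0.5):
    """
    q_ensemble:       (K, batch) Q-values of the actor's actions under each Q-network
    actor_actions:    (batch, n, act_dim) samples from the actor pi_phi
    behavior_actions: (batch, n, act_dim) samples from the generative model G_omega
    log_lam:          learnable log Lagrange multiplier (the dual variable)
    """
    q_min = q_ensemble.min(dim=0).values                  # conservative ensemble value
    q_var = q_ensemble.var(dim=0)                         # ensemble-disagreement penalty
    mmd = gaussian_mmd(behavior_actions, actor_actions)   # support-matching term
    lam = log_lam.exp()
    # Actor: maximize (min_j Q - tau * Var_j Q) while paying lam * (MMD - eps).
    actor_loss = (-(q_min - tau * q_var) + lam.detach() * (mmd - eps)).mean()
    # Dual: gradient descent on this loss performs ascent on lam when MMD > eps.
    dual_loss = (-lam * (mmd.detach() - eps)).mean()
    return actor_loss, dual_loss


if __name__ == "__main__":
    K, batch, n, act_dim = 4, 8, 10, 6
    q = torch.randn(K, batch, requires_grad=True)          # stands in for Q_theta_j(s, a)
    a_pi = torch.randn(batch, n, act_dim, requires_grad=True)
    a_g = torch.randn(batch, n, act_dim)
    log_lam = torch.zeros(1, requires_grad=True)
    actor_loss, dual_loss = bear_actor_losses(q, a_pi, a_g, log_lam)
    actor_loss.backward()   # gradients flow back to the actor via q and a_pi
    dual_loss.backward()    # gradient step for the Lagrange multiplier
```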
2. Instructions

```
python bear-train.py --dataset=walker2d-random-v2 --seed=0 --gpu=0
```
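Judging by the flag names, --dataset selects the d4rl dataset to train on, --seed sets the random seed, and --gpu chooses the CUDA device.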

3. Performance


Reference

  1. Kumar A, Fu J, Soh M, et al. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. Advances in Neural Information Processing Systems, 2019, 32: 11784–11794.