
[Question] Results vastly different for an agent created with Stable Baselines3 using hyperparameters optimized in RL Baselines3 Zoo. #458

Open
mzelazko opened this issue May 31, 2024 · 1 comment
Labels
question (Further information is requested)

Comments


mzelazko commented May 31, 2024

❓ Question

Hello,
I first optimize A2C over 1 million steps using RL Baselines3 Zoo.

First, I changed a2c.yml in RL Baselines3 Zoo to work with the RAM version of Seaquest:

atari:
  policy: 'MlpPolicy'
  n_envs: 16
  policy_kwargs: "dict(optimizer_class=RMSpropTFLike, optimizer_kwargs=dict(eps=1e-5))"

Then I ran this command:

python -m train --algo a2c --env ALE/Seaquest-ram-v5 -n 1000000 -optimize --n-trials 100 --n-startup-trials 10
--sampler tpe --pruner median  --n-evaluations 4 --n-eval-envs 16 --storage "some_valid_database" --study-name test

Top 3 results:
[screenshot: top 3 trials queried from the Optuna storage]
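I read those trials back from the Optuna storage with something like this (a minimal sketch; "some_valid_database" and "test" are placeholders matching the command above):

import optuna

# Load the finished study and print its best completed trials.
study = optuna.load_study(study_name="test", storage="some_valid_database")
completed = [t for t in study.trials if t.state == optuna.trial.TrialState.COMPLETE]
for trial in sorted(completed, key=lambda t: t.value, reverse=True)[:3]:
    print(trial.number, trial.value, trial.params)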
Then, using for example these hyperparameters:
[screenshot: hyperparameters of one of the top trials]
and this code:

import torch

from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.sb2_compat.rmsprop_tf_like import RMSpropTFLike


def linear_decay_lr(progress_remaining):
    # Linear schedule: starts at the sampled learning rate and decays to 0.
    return 0.00027232300584036946 * progress_remaining


if __name__ == "__main__":
    vec_env = make_vec_env("ALE/Seaquest-ram-v5", n_envs=16)
    model = A2C(
        "MlpPolicy",
        vec_env,
        learning_rate=linear_decay_lr,
        n_steps=256,
        gamma=0.999,
        gae_lambda=0.98,
        ent_coef=0.00001753537605091099,
        vf_coef=0.19195701505334234,
        max_grad_norm=0.5,
        use_rms_prop=True,
        normalize_advantage=False,
        verbose=1,
        tensorboard_log="./seaquest/107",
        policy_kwargs=dict(
            activation_fn=torch.nn.Tanh,
            net_arch=dict(pi=[256, 256], vf=[256, 256]),
            ortho_init=True,
            optimizer_class=RMSpropTFLike,
            optimizer_kwargs=dict(eps=1e-5),
        ),
    )
    model.learn(total_timesteps=1000000, log_interval=1)
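For a more direct comparison with the score reported by the Zoo, the trained model can also be evaluated explicitly (a minimal sketch; the 10-episode count is my own arbitrary choice, not necessarily what the Zoo used):

from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

# Separate single-env evaluation environment; episode count is arbitrary.
eval_env = make_vec_env("ALE/Seaquest-ram-v5", n_envs=1)
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")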

I get these results:
[screenshot: TensorBoard episode reward curve]

As the picture shows, the result is a long way from the 456 that RL Baselines3 Zoo achieved. I have tried other hyperparameter sets as well, but the scores are always much lower.
One thing I'm aware of that could have an impact here is the seed, as I didn't pick the same one. Nevertheless, I have tried many A2C instances and the problem remains.
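If I wanted to rule the seed out, I would pin it like this (a minimal sketch; the seed value 0 is arbitrary and not the one used by the Zoo run):

from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.utils import set_random_seed

SEED = 0  # arbitrary value, not the seed used in the Zoo run

set_random_seed(SEED)
vec_env = make_vec_env("ALE/Seaquest-ram-v5", n_envs=16, seed=SEED)
model = A2C("MlpPolicy", vec_env, seed=SEED, verbose=1)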

araffin (Member) commented Jun 7, 2024

Probably a duplicate of #314, #204, and others (see the links in those issues).
