
MC Control with Epsilon-Greedy Policies ---Epsilon Value and Best Action prob error #252

Open
hardik-kansal opened this issue Dec 23, 2023 · 2 comments

Comments

hardik-kansal commented Dec 23, 2023

  • Epsilon value is not decreased hyperbolically.
    At the end of each episode, there should be epsilon = epsilon / 1.1 (see the sketch below).
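For concreteness, a minimal sketch of what the suggested per-episode decay could look like (plain Python, not the notebook's actual code; the variable names and the loop body here are placeholders):

```python
# Sketch of the suggested decay: divide epsilon by 1.1 after every episode.
num_episodes = 500
epsilon = 1.0
decay_factor = 1.1

for i_episode in range(num_episodes):
    # ... generate an episode with the current epsilon-greedy policy
    # ... and apply the first-visit MC updates to Q here ...
    epsilon = epsilon / decay_factor  # decay at the end of the episode

print("epsilon after {} episodes: {:.6f}".format(num_episodes, epsilon))
```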
@AbhinavSharma07

Ensure proper epsilon decay by verifying the division by 1.1, the initialization, the data types, and the episode-end trigger. Adjust the decay rate if necessary.

@lucasbasquerotto

If you are referring to the 2nd exercise of the Monte Carlo methods section, https://github.com/dennybritz/reinforcement-learning/tree/master/MC (Implement the on-policy first-visit Monte Carlo Control algorithm, https://github.com/dennybritz/reinforcement-learning/blob/master/MC/MC%20Control%20with%20Epsilon-Greedy%20Policies%20Solution.ipynb), then there's no need to implement an epsilon decay.

The intention is to refine the state-action values of an epsilon-greedy policy toward the optimal policy (it won't become truly optimal, because it remains a soft policy). The requirement is to use a soft policy that approximates the greedy policy with respect to its action-state values. The epsilon-greedy policy satisfies that requirement, even with a constant epsilon.
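For reference, an epsilon-greedy policy with a constant epsilon can be written roughly as below (a sketch in the spirit of the solution notebook; the exact helper name and the structure of Q there may differ):

```python
import numpy as np

def make_epsilon_greedy_policy(Q, epsilon, nA):
    """Epsilon-greedy policy based on Q: every action gets epsilon / nA
    probability, and the greedy action gets the remaining 1 - epsilon on top,
    so the policy stays soft for any epsilon > 0."""
    def policy_fn(observation):
        action_probs = np.ones(nA, dtype=float) * epsilon / nA
        best_action = np.argmax(Q[observation])
        action_probs[best_action] += 1.0 - epsilon
        return action_probs
    return policy_fn
```

Sampling an action is then just `np.random.choice(nA, p=policy_fn(state))`, and because epsilon is constant the policy stays soft throughout training, which is exactly what on-policy first-visit MC control needs.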

Although in a real-world scenario an epsilon value with a decay would normally be better (especially in stationary environments, like the blackjack environment used in the exercise), there's no need to use decay in this exercise. In fact, I think it's better not to include decay here: the book (Chapter 5) specifies just an epsilon-greedy policy without decay, so leaving it out conforms more closely with the book and keeps the focus on the control algorithm itself rather than on the other exploration strategies that could be used (like decay schedules for ε, Upper Confidence Bound (UCB), Boltzmann exploration (softmax), etc.), even if those would be a better fit and converge faster to the optimal policy.
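Purely to illustrate one of the alternatives mentioned above, a Boltzmann (softmax) exploration rule could look like this (my own sketch, not part of the exercise):

```python
import numpy as np

def softmax_action_probs(q_values, temperature=1.0):
    """Boltzmann (softmax) exploration: P(a) is proportional to
    exp(Q(s, a) / temperature). Lower temperatures behave more greedily,
    higher temperatures explore more."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()  # subtract the max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

# Example: three action values, moderate temperature.
print(softmax_action_probs([1.0, 2.0, 0.5], temperature=0.5))
```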
