Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Questionable result in Gamblers Problem Solution #172

Open
bminixhofer opened this issue Aug 20, 2018 · 12 comments
Open

Questionable result in Gamblers Problem Solution #172

bminixhofer opened this issue Aug 20, 2018 · 12 comments

Comments

@bminixhofer
Copy link

I encountered an interesting issue in the Gamblers Problem notebook .

The graph of capital vs policy looks like this with p_h = 0.4:
image

In the book (on page 66), we see how the graph is supposed to look. The general notion appears to be the same, but the strange spikes in the notebook do not appear:

image

If we reduce theta in value_iteration_for_gamblers from 0.0001 to 1e-30 the spikes appear to become less, but they are still significant:

image

Does anyone know why this happens?
Thanks!

@puzzler10
Copy link

The book does mention that there is a family of ideal solutions, not just one. Perhaps this is another valid solution from the family? I'm not too sure but it sounds plausible.

@ArikVoronov
Copy link

I'm getting the exact book solution with p_h = 0.25:

image

But what's shown in the uploaded solution is more like what you're getting for p_h=0.4.

The difference in my code is that I do synchronous updates of V (only at the end of each state loop).

If you plot out the values of A[stake] , you'll see that in some cases there are multiple stakes for which A[s1] ~= A[s2], for example for s = 30:
image

So when you run np.argmax to get the policy, you could get several solutions but the function only returns the first one, or there is a small numerical difference which tips the scales towards some stake even though in the theoretical converged solution they are the same.

The difference in the final results is probably because of small differences in V[s] for lower states getting updated as the s loop runs. I think in principal both results are analytically correct.

@bminixhofer
Copy link
Author

bminixhofer commented Oct 2, 2018

That makes sense. But, to be honest, I don't understand why both policies (the noisy one and the one from the book) are optimal.
That might be answered by Exercise 4.8 from the book:

Exercise 4.8 
Why does the optimal policy for the gambler’s problem have such a curious
form? In particular, for capital of 50 it bets it all on one flip, but for capital of 51 it does
not. Why is this a good policy?

Does anyone know the answer to this exercise?
Intuitively, I would think that the stake should be equal to s when s < 50, and afterwards equal to 100 - s, such that when you win the bet, you exactly reach your goal.
The pattern with oscillations between 0 and ~12 with spikes at 25, 50 and 75 does not make much sense to me.

@ArikVoronov
Copy link

I think I do have one answer to that, you can check easily: Put the discount factor to ANYTHING lower than 1, and you get pretty much what you expected here : a pyramid with a maximum value at 50.
That's because the algorithm will "try" to get to the goal with as few steps as possible, and that means hoping you'll win as much as you can at every step.

On the other hand, if the p_h is <0.5 and gamma =1 , the algorithm doesn't really mind taking its time to win, and so it tries to maximize value at every step:
It's easiest to think about the first iteration of the value function: Assuming you initialize with V[s] = 0, on the first iteration, only steps >50 get any value, because only those steps can reach the reward in one step and get v[s] = 1p_h (=0.4, for example)
Now, Let's say you're at step 80 - you can bet 20 - then you either win +1
p_h value OR you lose, move to 60 and get extra 0.4*(1-p_h) value
But if you're at step 70, you can bet 30 and win +1p_h reward OR you lose and get 0. On the other hand , you could bet 20 and then win +1p_h reward or lose and get +0.4*(1-p_h)
If you lose more than you win p_h<0.5 , it's sometimes better to bet safely that you'll get another direct chance to go for the win (again, unless you're penalized by some discount).

In particular, at step 50, it doesn't matter if you lose 1 or lose 49, so you might as well go for as much as possible to maximize the potential value. on the other hand, on step 51 you could lose 1 then get back to step 50 and try again.

As I see it, the cyclical values are a balance between betting on winning and betting on losing.

@onetree1994
Copy link

@ArikVoronov That has confused me for a long time, thx for your explanation.

@Huixxi
Copy link

Huixxi commented Apr 20, 2019

Why the figures of 'value estimate' and 'final policy' become so weird when the 'ph' is bigger than 0.5?

@sanakdag
Copy link

The graph you're getting with the erroneous spikes in the first post are probably about floating point arithmetic. I was able to output the graph that looked just like the textbook by comparing values that are rounded to ten digits when taking the max over the sum for updating the value function

@nrhodes
Copy link

nrhodes commented Feb 20, 2020

The book gets the solution it does because it chooses the least amount to bet rather than the most (using argmax)

There's an alternative optimal solution that looks like an upside-down V (obtained by using argmax but choosing the highest of equal bets rather than the lowest).

And, there are mixes of those strategies. When you have an amount of money less than $50, a winning strategy is always to bet it all, but you can also use the value from the figure in the book. They key is to minimize the total number of bets (since you don't have an edge).

The original figure shown by bminixhofer is a combination of the two: as you can see it is a mixture of the book figure along with an upside-down V. As stated earlier, this is due to slightly different floating-point values for what should actually end up as equal values. Thus, sometimes that figure shows the lower of the two equally-valid bets and sometimes it shows the higher.

@nrhodes
Copy link

nrhodes commented Feb 20, 2020

The original figure shown by bminixhofer is actually correct, as there are a family of correct solutions.

Since the gambler doesn't have an edge, the optimal policy is to minimize the number of bets. With $51, optimal policy is either:

  • Bet $1 and try to get that to $50. If you lose it, bet the $50
  • Bet $49. If you lose it, you've got $2 left. Try to build it up to $100.

The book shows a policy that always shows the minimum optimal bet. Here's a figure that shows the maximum optimal bet:

image

Note that bminixhofer's figure is a mix between the two bets. That's likely caused, as sanakdag points out, because the two actions that should have identical value aren't exactly identical (off in one of the least significant digits). Thus, the figure is showing whichever action happens to be slightly higher. Rounding does address this issue.

Huixxi asks why the figure looks weird when ph is > 0.5. That's because the winning strategy is to maximize the number of bets. Thus, no matter what the capital stake, bet $1.

@umutisik
Copy link

The graph you're getting with the erroneous spikes in the first post are probably about floating point arithmetic. I was able to output the graph that looked just like the textbook by comparing values that are rounded to ten digits when taking the max over the sum for updating the value function

Wow! This was the issue I had. The plot in the book assumes that you choose the least amount when outcomes are equal, except that floating point issues sometimes make the larger bet amount better in the calculation. Thanks!

@tniccum21
Copy link

Can confirm the floating point issue - I rounded to 7 digits and now my graph looks like the book....

@lucasbasquerotto
Copy link

A bit late to the discussion, but the reason is as posted before, with some more details:

  • Choosing the smallest stake gives a graph like the book when ph < 0.5, and a value of 1 for ph >= 0.5 (actually, for ph = 0.5, any action is optimal).
  • Choosing the highest stake gives a piramid graph for ph <= 0.5, and a value of 1 for ph > 0.5, but very close to 0.5, and gives a graph similar to a piramid for high capitals as ph increases, returning to the graph of a piramid when ph=1 (every action is optimal in this case: you always win).

It's important to note that the action value for a given action might be a very little bit smaller than another action, but extremely close. Choosing the exact optimal smallest action may not be a graph like the book (it gives the graph of the solution of this repository), but the value of the policy actions like shown in the book are so close to the optimal that they may be considered equal.

You can add a value (some epsilon variable or the like) with a very small value to ignore such small differences in the floating point values.

I made a python notebook based on this exercise that gives the original graph of this repository (exact best smallest stake), the piramid graph (approximate, almost exact, best highest stake) and the graph of the book (approximate, almost exact, best smallest stake).

You can see it in the link (I used a parameter named almost_zero to define the behaviour):

https://github.com/lucasbasquerotto/ml-demos/blob/main/rl/others/DP/Gamblers%20Problem.ipynb

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants