
Policy iteration solution only shows 1 optimal solution #212

Open
duongnhatthang opened this issue Aug 16, 2019 · 2 comments

Comments

@duongnhatthang

Hey guys, I just started learning RL, so this is more of a question.
I noticed that in the slides/lecture the gridworld problem has more than one optimal solution. But in DP>policy_iteration_solution.ipynb, argmax() is used to choose only one action for each state, while there could be two actions with the same max value.

=> The final answer is still correct, but the "Policy Probability Distribution" that gets printed bugs me a little.

Should this be changed?
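
To make the difference concrete before my full function below, here is a minimal sketch (the helper name is made up, this is not the notebook's code): np.argmax() keeps only the first maximizer, while collecting every tied action and splitting the probability evenly gives the distribution I would expect.

import numpy as np

def greedy_policy_row(action_values):
    # argmax-style: probability 1 on the first best action only
    single = np.zeros_like(action_values)
    single[np.argmax(action_values)] = 1.0
    # tie-aware: split probability evenly over every best action
    best = np.isclose(action_values, action_values.max())
    split = best / best.sum()
    return single, split

# Example: actions 2 and 3 are equally good in some state
print(greedy_policy_row(np.array([-3.0, -2.0, -1.0, -1.0])))
# -> probabilities [0, 0, 1, 0] vs [0, 0, 0.5, 0.5]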

Btw, here is my code

import numpy as np

def policy_improvement(env, policy_eval_fn=policy_eval, discount_factor=1.0):
    # Start with a uniformly random policy
    policy = np.ones([env.nS, env.nA]) / env.nA

    while True:
        # Evaluate the current policy
        V = policy_eval_fn(policy, env, discount_factor)
        new_policy = np.zeros_like(policy)
        for s in range(env.nS):
            # Collect every action that achieves the maximum one-step lookahead value
            chosen_actions = []
            max_action_value = -np.inf
            for a in range(env.nA):
                action_value = 0
                for prob, next_state, reward, done in env.P[s][a]:
                    action_value += prob * (reward + discount_factor * V[next_state])
                if action_value >= max_action_value:
                    if action_value == max_action_value:
                        chosen_actions.append(a)
                    else:
                        chosen_actions = [a]
                    max_action_value = action_value
            # Split the probability evenly over all tied optimal actions
            new_policy[s][chosen_actions] = 1 / len(chosen_actions)
        policy_stable = (new_policy == policy).all()
        policy = new_policy
        if policy_stable:
            break

    return policy, policy_eval_fn(policy, env, discount_factor)

And the "Policy Probability Distribution" should look like this:

[[0.25 0.25 0.25 0.25]
 [0.   0.   0.   1.  ]
 [0.   0.   0.   1.  ]
 [0.   0.   0.5  0.5 ]
 [1.   0.   0.   0.  ]
 [0.5  0.   0.   0.5 ]
 [0.25 0.25 0.25 0.25]
 [0.   0.   1.   0.  ]
 [1.   0.   0.   0.  ]
 [0.25 0.25 0.25 0.25]
 [0.   0.5  0.5  0.  ]
 [0.   0.   1.   0.  ]
 [0.5  0.5  0.   0.  ]
 [0.   1.   0.   0.  ]
 [0.   1.   0.   0.  ]
 [0.25 0.25 0.25 0.25]]
@radumihai28

The probability distribution for the end-game (terminal) states must be 0.
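
In this 4x4 gridworld that means zeroing the rows for the two terminal corner states after the improvement step, e.g. (a small sketch, using the env.nS convention from the notebook):

def zero_terminal_rows(policy, env):
    # States 0 and env.nS - 1 are the terminal corners of this gridworld
    policy = policy.copy()
    policy[[0, env.nS - 1]] = 0.0
    return policy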

@radumihai28

def policy_eval(policy, env, V=None, discount_factor=1.0, theta=0.00001):
    """
    Evaluate a policy given an environment and a full description of the environment's dynamics.

    Args:
        policy: [S, A] shaped matrix representing the policy.
        env: OpenAI env. env.P represents the transition probabilities of the environment.
            env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).
            env.nS is the number of states in the environment.
            env.nA is the number of actions in the environment.
        V: Optional starting value function; defaults to all zeros.
        theta: We stop evaluation once our value function change is less than theta for all states.
        discount_factor: Gamma discount factor.

    Returns:
        Vector of length env.nS representing the value function.
    """
    # Start from the given value function, or an all-zero one
    if V is None:
        V = np.zeros(env.nS)
    while True:
        delta = 0
        # For each state, perform a "full backup"
        for s in range(env.nS):
            v = 0
            # Look at the possible next actions
            for a, action_prob in enumerate(policy[s]):
                # For each action, look at the possible next states...
                for prob, next_state, reward, done in env.P[s][a]:
                    # Calculate the expected value
                    v += action_prob * prob * (reward + discount_factor * V[next_state])
            # How much our value function changed (across any states)
            delta = max(delta, np.abs(v - V[s]))
            V[s] = v
        # Stop evaluating once our value function change is below a threshold
        if delta < theta:
            break
    return np.array(V)

def policy_improvement(env, policy_eval_fn=policy_eval, discount_factor=1.0):
    """
    Policy Improvement Algorithm. Iteratively evaluates and improves a policy
    until an optimal policy is found.

    Args:
        env: The OpenAI environment.
        policy_eval_fn: Policy Evaluation function that takes the arguments:
            policy, env, V, discount_factor.
        discount_factor: Gamma discount factor.

    Returns:
        A tuple (policy, V).
        policy is the optimal policy, a matrix of shape [S, A] where each state s
        contains a valid probability distribution over actions.
        V is the value function for the optimal policy.
    """
    # Start with a random policy
    policy = np.ones([env.nS, env.nA]) / env.nA

    # Find the value function for the random policy
    V = policy_eval_fn(policy, env, discount_factor=discount_factor)

    while True:
        # Update the new policy from the value function
        policy_stable = True

        # For every state
        for s in range(env.nS):
            # Remember the present policy for this state
            old_action = policy[s].copy()

            # One-step lookahead: expected value of each action
            action_values = np.zeros(env.nA)
            for a in range(env.nA):
                # For each action, look at the possible next states...
                for prob, next_state, reward, done in env.P[s][a]:
                    action_values[a] += prob * (reward + discount_factor * V[next_state])

            # Pick all best actions (ties included)
            best_actions = (action_values == action_values.max()).astype(float)

            # Turn the best actions into a probability distribution
            policy[s] = best_actions / best_actions.sum()

            # The end-game (terminal) states get 0 probability
            if s == 0 or s == env.nS - 1:
                policy[s] = 0

            # Verify whether the policy changed for this state
            if (old_action != policy[s]).any():
                policy_stable = False

        # If the policy doesn't improve any more, return the best policy and V
        if policy_stable:
            return policy, V

        # Policy evaluation with the new policy, warm-started from the previous V
        V = policy_eval_fn(policy, env, V, discount_factor)
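
For reference, this is how I would call it (a sketch, assuming the GridworldEnv from the repo's lib/envs/gridworld.py as in the notebook):

from lib.envs.gridworld import GridworldEnv

# Build the 4x4 gridworld and run policy iteration with the functions above
env = GridworldEnv()
policy, v = policy_improvement(env)

print("Policy Probability Distribution:")
print(policy)
print("Value Function:")
print(v)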
