
Policy iteration solution only shows 1 optimal solution #212

Open
duongnhatthang opened this issue Aug 16, 2019 · 2 comments

Comments

@duongnhatthang

Hey guys, I just started learning RL, so this is more of a question.
I noticed that in the slides/lecture the gridworld problem has more than one optimal solution. But in DP>policy_iteration_solution.ipynb, argmax() is used to choose only one action for each state, while there could be two actions with the same max value.

=> The final answer is still correct, but the "Policy Probability Distribution" that gets printed bugs me a little.

Should this be changed?
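
To make the difference concrete before my full function below, here is a minimal sketch (the helper name is made up, this is not the notebook's code): np.argmax() keeps only the first maximizer, while collecting every tied action and splitting the probability evenly gives the distribution I would expect.

import numpy as np

def greedy_policy_row(action_values):
    # argmax-style: probability 1 on the first best action only
    single = np.zeros_like(action_values)
    single[np.argmax(action_values)] = 1.0
    # tie-aware: split probability evenly over every best action
    best = np.isclose(action_values, action_values.max())
    split = best / best.sum()
    return single, split

# Example: actions 2 and 3 are equally good in some state
print(greedy_policy_row(np.array([-3.0, -2.0, -1.0, -1.0])))
# -> probabilities [0, 0, 1, 0] vs [0, 0, 0.5, 0.5]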

Btw, here is my code

import numpy as np

def policy_improvement(env, policy_eval_fn=policy_eval, discount_factor=1.0):
    # Start with a uniformly random policy
    policy = np.ones([env.nS, env.nA]) / env.nA

    while True:
        # Evaluate the current policy
        V = policy_eval_fn(policy, env, discount_factor)
        new_policy = np.zeros_like(policy)
        for s in range(env.nS):
            # Collect every action that achieves the maximum one-step lookahead value
            chosen_actions = []
            max_action_value = -np.inf
            for a in range(env.nA):
                action_value = 0
                for prob, next_state, reward, done in env.P[s][a]:
                    action_value += prob * (reward + discount_factor * V[next_state])
                if action_value >= max_action_value:
                    if action_value == max_action_value:
                        chosen_actions.append(a)
                    else:
                        chosen_actions = [a]
                    max_action_value = action_value
            # Split the probability evenly over all tied optimal actions
            new_policy[s][chosen_actions] = 1 / len(chosen_actions)
        policy_stable = (new_policy == policy).all()
        policy = new_policy
        if policy_stable:
            break

    return policy, policy_eval_fn(policy, env, discount_factor)

And the "Policy Probability Distribution" should look like this:

[[0.25 0.25 0.25 0.25]
 [0.   0.   0.   1.  ]
 [0.   0.   0.   1.  ]
 [0.   0.   0.5  0.5 ]
 [1.   0.   0.   0.  ]
 [0.5  0.   0.   0.5 ]
 [0.25 0.25 0.25 0.25]
 [0.   0.   1.   0.  ]
 [1.   0.   0.   0.  ]
 [0.25 0.25 0.25 0.25]
 [0.   0.5  0.5  0.  ]
 [0.   0.   1.   0.  ]
 [0.5  0.5  0.   0.  ]
 [0.   1.   0.   0.  ]
 [0.   1.   0.   0.  ]
 [0.25 0.25 0.25 0.25]]
@radumihai28

The probability distribution for the end-game (terminal) states must be 0.
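
In this 4x4 gridworld that means zeroing the rows for the two terminal corner states after the improvement step, e.g. (a small sketch, using the env.nS convention from the notebook):

def zero_terminal_rows(policy, env):
    # States 0 and env.nS - 1 are the terminal corners of this gridworld
    policy = policy.copy()
    policy[[0, env.nS - 1]] = 0.0
    return policy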

@radumihai28

def policy_eval(policy, env, V=None, discount_factor=1.0, theta=0.00001):
    """
    Evaluate a policy given an environment and a full description of the environment's dynamics.

    Args:
        policy: [S, A] shaped matrix representing the policy.
        env: OpenAI env. env.P represents the transition probabilities of the environment.
            env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).
            env.nS is the number of states in the environment.
            env.nA is the number of actions in the environment.
        V: Optional starting value function; defaults to all zeros.
        theta: We stop evaluation once our value function change is less than theta for all states.
        discount_factor: Gamma discount factor.

    Returns:
        Vector of length env.nS representing the value function.
    """
    # Start from the given value function, or an all-zero one
    if V is None:
        V = np.zeros(env.nS)
    while True:
        delta = 0
        # For each state, perform a "full backup"
        for s in range(env.nS):
            v = 0
            # Look at the possible next actions
            for a, action_prob in enumerate(policy[s]):
                # For each action, look at the possible next states...
                for prob, next_state, reward, done in env.P[s][a]:
                    # Calculate the expected value
                    v += action_prob * prob * (reward + discount_factor * V[next_state])
            # How much our value function changed (across any states)
            delta = max(delta, np.abs(v - V[s]))
            V[s] = v
        # Stop evaluating once our value function change is below a threshold
        if delta < theta:
            break
    return np.array(V)

def policy_improvement(env, policy_eval_fn=policy_eval, discount_factor=1.0):
    """
    Policy Improvement Algorithm. Iteratively evaluates and improves a policy
    until an optimal policy is found.

    Args:
        env: The OpenAI environment.
        policy_eval_fn: Policy Evaluation function that takes the arguments:
            policy, env, V, discount_factor.
        discount_factor: Gamma discount factor.

    Returns:
        A tuple (policy, V).
        policy is the optimal policy, a matrix of shape [S, A] where each state s
        contains a valid probability distribution over actions.
        V is the value function for the optimal policy.
    """
    # Start with a random policy
    policy = np.ones([env.nS, env.nA]) / env.nA

    # Find the value function for the random policy
    V = policy_eval_fn(policy, env, discount_factor=discount_factor)

    while True:
        # Update the new policy from the value function
        policy_stable = True

        # For every state
        for s in range(env.nS):
            # Remember the present policy for this state
            old_action = policy[s].copy()

            # One-step lookahead: expected value of each action
            action_values = np.zeros(env.nA)
            for a in range(env.nA):
                # For each action, look at the possible next states...
                for prob, next_state, reward, done in env.P[s][a]:
                    action_values[a] += prob * (reward + discount_factor * V[next_state])

            # Pick all best actions (ties included)
            best_actions = (action_values == action_values.max()).astype(float)

            # Turn the best actions into a probability distribution
            policy[s] = best_actions / best_actions.sum()

            # The end-game (terminal) states get 0 probability
            if s == 0 or s == env.nS - 1:
                policy[s] = 0

            # Verify whether the policy changed for this state
            if (old_action != policy[s]).any():
                policy_stable = False

        # If the policy doesn't improve any more, return the best policy and V
        if policy_stable:
            return policy, V

        # Policy evaluation with the new policy, warm-started from the previous V
        V = policy_eval_fn(policy, env, V, discount_factor)
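
For reference, this is how I would call it (a sketch, assuming the GridworldEnv from the repo's lib/envs/gridworld.py as in the notebook):

from lib.envs.gridworld import GridworldEnv

# Build the 4x4 gridworld and run policy iteration with the functions above
env = GridworldEnv()
policy, v = policy_improvement(env)

print("Policy Probability Distribution:")
print(policy)
print("Value Function:")
print(v)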
