
Monte Carlo AssertionError: defaultdict(<function mc_control_importance_sampling.<locals>.<lambda> at 0x7f31699ffe18>, {}) (<class 'collections.defaultdict'>) #231

Open
NC25 opened this issue Aug 13, 2020 · 0 comments


NC25 commented Aug 13, 2020

I have been working on a DQN using stable baselines and a discrete environment with 3 actions.

I am using the RL tutorial https://github.com/dennybritz/reinforcement-learning/blob/master/MC/MC%20Control%20with%20Epsilon-Greedy%20Policies%20Solution.ipynb for reference.


import gym
import gym_fishing  # registers the fishing-v0 environment
from stable_baselines import DQN
from stable_baselines.deepq.policies import MlpPolicy

env = gym.make('fishing-v0')
model = DQN(MlpPolicy, env, verbose=2)
trained_model = model.learn(total_timesteps=10000)

But I am having some issues with my helper functions for the Monte Carlo methods:


from collections import defaultdict

import numpy as np


def mc_control_importance_sampling(env, num_episodes, discount=0.99):
    """
    Off-policy Monte Carlo control using weighted importance sampling.
    Finds an optimal greedy policy.
    """
    # Q maps each observation to a vector of action values
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    # C accumulates the cumulative importance-sampling weights
    C = defaultdict(lambda: np.zeros(env.action_space.n))

    # learn greedy policy
    target_policy = env.step(Q)

    for i_episode in range(1, num_episodes + 1):
        if i_episode % 1 == 0:
            print("\rEpisode {}/{}.".format(i_episode, num_episodes), end="")

        # Generate an episode as a list of (obs, action, reward) tuples
        episode = []
        obs = env.reset()
        for t in range(100):
            # Sample an action from the trained DQN (the behavior policy)
            action, _states = trained_model.predict(obs)
            next_obs, reward, done, _ = env.step(action)
            episode.append((obs, action, reward))
            if done:
                break
            obs = next_obs

        # Sum of discounted returns
        G = 0.0
        # Importance-sampling weight
        W = 1.0
        for t in reversed(range(len(episode))):
            obs, action, reward = episode[t]
            G = discount * G + reward
            # Accumulate the weight for this state-action pair
            C[obs][action] += W
            # Incremental update of the action-value estimate
            Q[obs][action] += (W / C[obs][action]) * (G - Q[obs][action])

            # Stop once the behavior action diverges from the greedy target action
            if action != np.argmax(target_policy(obs)):
                break
            W = W * 1. / behavior_policy(obs)[action]

    return Q, target_policy
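
For context, in the tutorial notebook the behavior and target policies are plain functions built around Q, not something returned by the environment. Roughly, as I remember the linked solution (so the exact names may differ):

def create_random_policy(nA):
    # Behavior policy: uniform random probabilities over nA actions
    A = np.ones(nA, dtype=float) / nA
    def policy_fn(observation):
        return A
    return policy_fn

def create_greedy_policy(Q):
    # Target policy: probability 1 on the argmax action for each state
    def policy_fn(state):
        A = np.zeros_like(Q[state], dtype=float)
        A[np.argmax(Q[state])] = 1.0
        return A
    return policy_fn

In my version I am trying to use the trained DQN (trained_model.predict) as the behavior policy instead of create_random_policy.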

When I call the function,

Q, policy = mc_control_importance_sampling(env, num_episodes=500000)

I get the error


AssertionError                            Traceback (most recent call last)
<ipython-input-58-eb968b9ff6e3> in <module>()
----> 1 Q, policy = mc_control_importance_sampling(env, num_episodes=500000)

1 frames
/content/gym_fishing/gym_fishing/envs/fishing_env.py in step(self, action)
     76     def step(self, action):
     77 
---> 78         assert self.action_space.contains(action), "%r (%s) invalid"%(action, type(action))
     79 
     80         if self.n_actions > 3:

AssertionError: defaultdict(<function mc_control_importance_sampling.<locals>.<lambda> at 0x7f31699ffe18>, {}) (<class 'collections.defaultdict'>) invalid
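
For what it is worth, the assert in fishing_env.py checks env.action_space.contains(action), which as far as I understand expects a single action from the action space (an integer, since this is a discrete space with 3 actions), e.g.:

print(env.action_space.contains(env.action_space.sample()))  # True
print(env.action_space.contains(1))                          # True for the 3-action space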
I am not sure how to fix this. Any help would be appreciated, thanks.
