Changing max_depth and planning_time for POMCP #32

Closed
Hororohoruru opened this issue Aug 24, 2023 · 10 comments
@Hororohoruru (Contributor) commented Aug 24, 2023

Hello!

I am working on the same problem discussed in #27. Since the problem includes a limit on the number of time steps per trial, I am trying to model it as a finite-horizon POMDP. As such, I would like to initialize POMCP with a max_depth equal to the maximum number of time steps in a trial, and then change it at every time step so that each simulation takes the horizon into account when planning online. I noticed that the belief for a given state sometimes reaches 1 and the agent still does not make a decision until the last or second-to-last time step, and I thought this may be a potential cause.

However, I am getting an error saying that POMCP has no attribute max_depth (or _max_depth). What can I do?

In a similar fashion, I would also like to change the planning time: at the beginning of each trial the model has 0.5 seconds to plan, and then 0.1 seconds at each subsequent time step.

@zkytony (Collaborator) commented Aug 24, 2023

> then change it at every time step so that each simulation takes the horizon into account when planning online.

I’m not following. Pseudocode would help. What exactly goes on in the for loop?

> I noticed that the belief for a given state sometimes reaches 1 and the agent still does not make a decision until the last or second-to-last time step, and I thought this may be a potential cause.

Same comment as above.

> However, I am getting an error saying that POMCP has no attribute max_depth (or _max_depth). What can I do?

I think I get what you're trying to do. You'd like to change POMCP's hyperparameters, such as depth or simulation time, between planning steps. Currently, POMCP/POUCT doesn't support changing hyperparameters after creation. But as I commented in the other thread, it is not costly to create these instances. The search tree is saved in the agent, not in the planner.

@Hororohoruru (Contributor, Author) commented Aug 24, 2023

I think you understood, but let me provide pseudocode anyway:

for trial_n in range(n_trials):
    # Get information about the true state and apply the transition
    next_label = int(y_test[trial_n])
    true_state = TDState(next_label, 0)  # (state, time_step)
    bci_problem.env.apply_transition(true_state)

    for step_n in range(total_steps):
        # Here I would like to change the max_depth hyperparameter
        remaining_steps = pomdp_steps - step_n
        POMCP._max_depth = remaining_steps  # this raises the AttributeError mentioned above

        # Same for the planning time: 0.5 s on the first step, 0.1 s afterwards
        if remaining_steps == pomdp_steps:
            planning_time = 0.5
        else:
            planning_time = 0.1

        POMCP._planning_time = planning_time

> I noticed that the belief for a given state sometimes reaches 1 and the agent still does not make a decision until the last or second-to-last time step, and I thought this may be a potential cause.

Regarding this, here is the belief printout I produce for every trial at every time step:

TRIAL 11 (true state s_0-t_0)
--------------------

  STEP 0 (6 steps remaining)
  Current belief:
        s_0-t_0 -> 0.086
        s_1-t_0 -> 0.091
        s_2-t_0 -> 0.069
        s_3-t_0 -> 0.083
        s_4-t_0 -> 0.095
        s_5-t_0 -> 0.095
        s_6-t_0 -> 0.085
        s_7-t_0 -> 0.084
        s_8-t_0 -> 0.078
        s_9-t_0 -> 0.063
        s_10-t_0 -> 0.079
        s_11-t_0 -> 0.092

  Action: a_wait
  Reward: -1.0. Transition to s_0-t_1
  Observation: o_1
Particle reinvigoration for 818 particles

  STEP 1 (5 steps remaining)
  Current belief:
        s_0-t_1 -> 0.02
        s_1-t_1 -> 0.671
        s_2-t_1 -> 0.015
        s_3-t_1 -> 0.019
        s_4-t_1 -> 0.072
        s_5-t_1 -> 0.022
        s_6-t_1 -> 0.026
        s_7-t_1 -> 0.036
        s_8-t_1 -> 0.05
        s_9-t_1 -> 0.013
        s_10-t_1 -> 0.025
        s_11-t_1 -> 0.031

  Action: a_wait
  Reward: -1.0. Transition to s_0-t_2
  Observation: o_3
Particle reinvigoration for 905 particles

  STEP 2 (4 steps remaining)
  Current belief:
        s_0-t_2 -> 0.021
        s_1-t_2 -> 0.581
        s_2-t_2 -> 0.011
        s_3-t_2 -> 0.25
        s_4-t_2 -> 0.033
        s_5-t_2 -> 0.019
        s_6-t_2 -> 0.018
        s_7-t_2 -> 0.021
        s_8-t_2 -> 0.014
        s_9-t_2 -> 0.01
        s_11-t_2 -> 0.022

  Action: a_wait
  Reward: -1.0. Transition to s_0-t_3
  Observation: o_1
Particle reinvigoration for 48 particles

  STEP 3 (3 steps remaining)
  Current belief:
        s_0-t_3 -> 0.001
        s_1-t_3 -> 0.964
        s_2-t_3 -> 0.002
        s_3-t_3 -> 0.025
        s_4-t_3 -> 0.001
        s_5-t_3 -> 0.001
        s_6-t_3 -> 0.001
        s_7-t_3 -> 0.002
        s_9-t_3 -> 0.001
        s_11-t_3 -> 0.002

  Action: a_wait
  Reward: -1.0. Transition to s_0-t_4
  Observation: o_1

  STEP 4 (2 steps remaining)
  Current belief:
        s_1-t_4 -> 1.0

  Action: a_wait
  Reward: -1.0. Transition to s_0-t_5
  Observation: o_0
Particle reinvigoration for 951 particles

  STEP 5 (1 steps remaining)
  Current belief:
        s_1-t_5 -> 1.0

  Action: a_wait
  Reward: -1.0. Transition to s_0-t_6
  Observation: o_0
Particle reinvigoration for 927 particles

  STEP 6 (0 steps remaining)
  Current belief:
        s_1-t_6 -> 1.0

  Action: a_1
  Reward: -100.0. Transition to s_7-t_0
Particle reinvigoration for 801 particles

Trial ended with decision a_1.
Decision took 0.9999999999999999s

> ...it is not costly to create these instances. The search tree is saved in the agent, not in the planner.

Ah, I see! So the planner just reads and writes the tree from/to the agent. Then, if I understood correctly, I can just create a new instance of the planner whenever I want to change the parameters? Something like this:

for trial_n in range(n_trials):
    # Get information about the true state and apply the transition
    next_label = int(y_test[trial_n])
    true_state = TDState(next_label, 0)  # (state, time_step)
    bci_problem.env.apply_transition(true_state)

    for step_n in range(total_steps):
        # Shrink max_depth as the horizon approaches
        remaining_steps = pomdp_steps - step_n

        # Longer planning time on the first step of the trial
        if remaining_steps == pomdp_steps:
            planning_time = 0.5
        else:
            planning_time = 0.1

        # Create a fresh planner with the new hyperparameters;
        # the search tree itself lives on the agent
        planner = POMCP(max_depth=remaining_steps, discount_factor=gamma,
                        planning_time=planning_time, exploration_const=110,
                        rollout_policy=bci_problem.agent.policy_model)

        action = planner.plan(bci_problem.agent)
Would this suffice to model the problem as finite-horizon, or do I necessarily have to add a terminal state to the model?

@zkytony (Collaborator) commented Aug 24, 2023

Yes, in this case you would be planning with a finite horizon. You can give the last code block a try. By the way, if you haven't checked it out yet, the tree debugger feature should be very helpful here for inspecting the search tree and seeing whether it is doing what you expect.

@Hororohoruru (Contributor, Author) commented
Thank you. It is nice that you bring up the tree debugger, as you also mentioned it in #27 when I asked about offline planning (i.e., planning only at the beginning of the trial and then using the tree for the rest of the trial). I have checked the documentation, but I am not really sure what to look at in this case.

I guess that after each plan() call I can inspect agent.tree and check its depth to make sure it is not going past the intended horizon?

@zkytony (Collaborator) commented Aug 24, 2023

Yes, you can definitely check that. You can also use it to debug / trace down why a certain decision was made.
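
For example, roughly (see the TreeDebugger documentation for the full interface):

from pomdp_py.utils import TreeDebugger

action = planner.plan(bci_problem.agent)
dd = TreeDebugger(bci_problem.agent.tree)
print(dd.depth)  # depth of the search tree; it should not exceed the remaining steps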

@Hororohoruru (Contributor, Author) commented
> You can also use it to debug / trace down why a certain decision was made.

Could you explain further how to do this?

@zkytony (Collaborator) commented Aug 29, 2023

I'm mostly referring to traversing the tree through indexing, explained here.

You can find the definition of the search tree in Figure 1 of the POMCP paper. In a nutshell, the search tree contains two types of nodes, VNode and QNode. Each VNode corresponds to some history h, and each QNode corresponds to some history ha (that is, taking action a at history h). The action whose QNode has the maximum value is chosen.
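
Concretely, to see why an action was picked you can look at the QNodes under the root, along these lines:

# agent.tree is the root VNode for the current history h;
# its children map each action a to the QNode for ha
for action, qnode in bci_problem.agent.tree.children.items():
    print(action, qnode.num_visits, qnode.value)

The TreeDebugger's indexing is essentially a convenient way of walking down these children level by level.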

@Hororohoruru (Contributor, Author) commented
Thank you for the explanation. After exploring the tree, I realized that almost all simulations were spent on the 'wait' action (my model's equivalent of 'listen'). As a result, the value of a given action only became higher than wait's on the last step, even when all the observations were consistent with that action.

I changed the exploration constant to the default value (it was 110 before, taken from the Tiger example), and that took care of the issue with the belief staying at 1.0 for several time steps.

On a separate note, I am now experimenting with how the confusion matrix is penalized at different time steps. Since each time step obtains observations based on more data, I wanted to smooth the confusion matrix as done in Park and Kim, 2012 (bottom of page 7; the equation is not numbered). I am using the same q0 parameter (0.3) for the last time step, increasing it by 0.05 for each earlier time step. With this modification, trials that receive incorrect observations in the early time steps now take longer to increase the belief in the corresponding state, giving the model time to (hopefully) receive the correct observations at later time steps and avoid false positives.
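
To be concrete, the kind of smoothing I mean is roughly a convex combination of each row of the confusion matrix with a uniform distribution (just a sketch of my understanding, not necessarily the paper's exact equation; conf_matrix and n_steps stand for my confusion matrix and the number of time steps per trial):

import numpy as np

def smooth_confusion_matrix(conf_matrix, q):
    """Mix each row with a uniform distribution; larger q means stronger smoothing."""
    n_classes = conf_matrix.shape[1]
    return (1.0 - q) * conf_matrix + q / n_classes

# q = 0.3 at the last time step, +0.05 for every earlier time step
q_per_step = [0.3 + 0.05 * (n_steps - 1 - t) for t in range(n_steps)]
smoothed_matrices = [smooth_confusion_matrix(conf_matrix, q) for q in q_per_step]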

After I did that and explored the tree, I noticed the trials that produce false positives do so with a belief of 0.85. Could this be related to the exploration constant as well, since the default value is a small number?

@zkytony (Collaborator) commented Aug 30, 2023

> With this modification, trials that receive incorrect observations in the early time steps now take longer to increase the belief in the corresponding state, giving the model time to (hopefully) receive the correct observations at later time steps and avoid false positives.

This sounds like a good thing.

> After I did that and explored the tree, I noticed the trials that produce false positives do so with a belief of 0.85.

I don't follow what "false positives" means in this context. Are you receiving an "incorrect" observation when the belief of the true environment state is 0.85? Did you generate the observation based on a state sampled from the belief, or from the environment's true state?

Regarding the exploration constant, I can't really comment on your case. I can say that setting the exploration constant too high will result in visiting every node equally often, while setting it too low may lead to being unable to find a solution, or to finding a highly suboptimal one. There is a lot of literature on the UCB1 exploration constant. The heuristic from the POMCP paper is to set it to R_hi - R_lo (but of course this is not a rule).
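
For context, during simulations the action is chosen by maximizing the UCB1 score, roughly

    Q(h, a) + c * sqrt( log N(h) / N(h, a) )

where N(h) and N(h, a) are visit counts, so c controls how large the exploration bonus is relative to the value estimates; the R_hi - R_lo heuristic simply puts c on the same scale as the returns.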

@Hororohoruru (Contributor, Author) commented
> This sounds like a good thing.

Thank you! The only thing I don't like is that the values for the smoothing are arbitrary. I will also try penalizing the matrix row-wise, so that instead of a heuristic prior on how much more precise later steps are than earlier ones, I let the uncertainty in the observation model decide how much each class needs to be penalized at each step.

> I don't follow what "false positives" means in this context

Sorry. It means deciding to take an action that does not correspond to the true state of the environment. In the Tiger example, it would be the equivalent of the agent choosing 'open-left' when the tiger is on the right, or vice versa.

> Are you receiving an "incorrect" observation when the belief of the true environment state is 0.85?

I am receiving incorrect observations for several time steps in succession, and that causes the belief in the corresponding state to reach 0.85. My intuition is that, even if that happens in the earlier time steps, later time steps should (or are more likely to) receive the correct observation.

> There is a lot of literature on the UCB1 exploration constant. The heuristic from the POMCP paper is to set it to R_hi - R_lo (but of course this is not a rule).

According to what I read in the POMCP paper, this requires running a given problem with an exploration constant of 0 to obtain one of the two parameters. That is not really practical in the case I am studying, so I will turn to the literature and see if I can find any methods based on the observation model.
