(Redundant) code in POUCT implementation #10

Closed
zkytony opened this issue Nov 6, 2020 · 0 comments · Fixed by #12


zkytony commented Nov 6, 2020

In the _simulate function of po_uct.pyx:

        root[action].num_visits += 1
        root.value = root.value + (total_reward - root.value) / (root.num_visits)
        root[action].value = root[action].value + (total_reward - root[action].value) / (root[action].num_visits)

The values of both root (a VNode) and root[action] (a QNode) are updated from total_reward. However, the algorithm in the paper only requires updating the value of the QNode, i.e. root[action].
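For reference, the QNode update is an incremental mean of the sampled returns. Below is a minimal sketch of that update on its own, assuming a zero-initialized node; the `QNode` dataclass and `update_qnode` helper are hypothetical stand-ins, not the actual pomdp_py classes.

```python
from dataclasses import dataclass


@dataclass
class QNode:
    # Hypothetical stand-in: only the two fields touched by the update.
    num_visits: int = 0
    value: float = 0.0


def update_qnode(qnode: QNode, total_reward: float) -> None:
    """Incremental-mean update applied to the QNode only."""
    qnode.num_visits += 1
    qnode.value += (total_reward - qnode.value) / qnode.num_visits


# Made-up discounted returns observed for one action.
returns = [1.0, 4.0, -2.0, 5.0]

q = QNode()
for r in returns:
    update_qnode(q, r)

# With zero initialization, the running value equals the plain average.
mean_return = sum(returns) / len(returns)
print(q.value, mean_return)
```

With non-zero initial values or visit counts, the result would be a weighted blend of the prior and the sampled returns rather than the plain average.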

I also noticed that the original author's source code likewise does not maintain the expected discounted cumulative value in both the VNode and the QNode.

Moreover, in the current POUCT implementation in pomdp_py, commenting out root.value = ... and updating only the QNode's value, as the paper specifies, does not change the planner's output, because the planner ultimately selects an action based on the values of the QNodes that are immediate children of the root node. We should therefore remove this redundant line, since it causes confusion.
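The claim can be checked on a toy one-level tree. The sketch below is not the pomdp_py code: `VNode`, `QNode`, `backup`, `ucb_action`, `best_action`, and `run` are hypothetical, and the Gaussian reward model is made up. It assumes UCB1 selection that reads only visit counts and QNode values, and compares a run that updates root.value with one that does not.

```python
import math
import random


class QNode:
    def __init__(self):
        self.num_visits = 0
        self.value = 0.0


class VNode:
    def __init__(self, actions):
        self.num_visits = 0
        self.value = 0.0
        self.children = {a: QNode() for a in actions}

    def __getitem__(self, action):
        return self.children[action]


def ucb_action(root, c=1.4):
    """UCB1 over the root's QNodes; never reads root.value."""
    def score(a):
        q = root[a]
        if q.num_visits == 0:
            return math.inf  # try every action once first
        return q.value + c * math.sqrt(math.log(root.num_visits) / q.num_visits)
    return max(root.children, key=score)


def backup(root, action, total_reward, update_vnode_value):
    root.num_visits += 1
    root[action].num_visits += 1
    if update_vnode_value:
        # The line under discussion: a running mean stored on the VNode.
        root.value += (total_reward - root.value) / root.num_visits
    q = root[action]
    q.value += (total_reward - q.value) / q.num_visits


def best_action(root):
    """Final choice: the child QNode with the highest value."""
    return max(root.children, key=lambda a: root[a].value)


def run(update_vnode_value, seed=0, num_sims=200):
    rng = random.Random(seed)
    mean_reward = {"left": 0.0, "right": 1.0, "stay": 0.5}  # made-up rewards
    root = VNode(list(mean_reward))
    for _ in range(num_sims):
        action = ucb_action(root)
        total_reward = rng.gauss(mean_reward[action], 1.0)
        backup(root, action, total_reward, update_vnode_value)
    return best_action(root), root


action_a, root_a = run(update_vnode_value=True)
action_b, root_b = run(update_vnode_value=False)
print(action_a == action_b)
```

Because neither `ucb_action` nor `best_action` reads root.value, the two runs visit the same actions, see the same sampled rewards, and end with identical QNode statistics. A planner that does read the VNode's value elsewhere would not be covered by this argument.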
