(Redundant) code in POUCT implementation #10

zkytony · 2020-11-06T16:26:19Z

In the _simulate function of po_uct.pyx:

        root[action].num_visits += 1
        root.value = root.value + (total_reward - root.value) / (root.num_visits)
        root[action].value = root[action].value + (total_reward - root[action].value) / (root[action].num_visits)

Both the value of root (a VNode) and the value of root[action] (a QNode) are updated from total_reward. However, the algorithm in the paper only requires updating the value of the QNode, i.e. root[action].
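For reference, both value updates in the snippet are the usual incremental (running) mean. Here is a minimal sketch in plain Python, not the actual Cython code; the helper name `incremental_mean` and the sample rewards are made up for illustration:

```python
def incremental_mean(old_mean, new_sample, count):
    """Running-average update; `count` already includes `new_sample`."""
    return old_mean + (new_sample - old_mean) / count


# Hypothetical returns observed over four simulations.
rewards = [1.0, 4.0, -2.0, 5.0]

value = 0.0
for n, total_reward in enumerate(rewards, start=1):
    # Same form as: node.value + (total_reward - node.value) / node.num_visits
    value = incremental_mean(value, total_reward, n)

# The running update should agree with the plain average of the samples.
batch_mean = sum(rewards) / len(rewards)
```

So each node's value is just the average of the total_reward samples that passed through it, which is why the update takes the same form for the VNode and the QNode.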

I also noticed that the original author's source code does not maintain the expected discounted cumulative value in both the VNode and the QNode either.

Also, in the current POUCT implementation in pomdp_py, commenting out root.value = ... and only updating the QNode's value, as in the paper, does not change the output behavior of the planner, since the planner eventually outputs an action based on the values of the QNodes that are immediate children of the root node. So we should remove this redundant line, because it causes confusion.
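To illustrate the point, here is a small self-contained sketch in plain Python. It is a toy model, not the pomdp_py code: the `VNode` and `QNode` classes, the `update` and `best_action` helpers, the `update_vnode_value` flag, and the sample rewards are all simplified stand-ins, and I am assuming the root's visit count is incremented in the same place as the value update. It applies the same backups to two trees, one with the root.value line and one without, and compares the action picked from the QNode values:

```python
class QNode:
    def __init__(self):
        self.num_visits = 0
        self.value = 0.0


class VNode:
    def __init__(self, actions):
        self.num_visits = 0
        self.value = 0.0
        self.children = {a: QNode() for a in actions}

    def __getitem__(self, action):
        return self.children[action]


def update(root, action, total_reward, update_vnode_value):
    """Back up one simulated return; optionally also update the VNode value."""
    root.num_visits += 1
    root[action].num_visits += 1
    if update_vnode_value:
        # The line in question from the issue.
        root.value = root.value + (total_reward - root.value) / root.num_visits
    root[action].value = (root[action].value
                          + (total_reward - root[action].value) / root[action].num_visits)


def best_action(root):
    """Pick the action whose QNode has the highest value."""
    return max(root.children, key=lambda a: root.children[a].value)


# Hypothetical (action, total_reward) pairs from four simulations.
samples = [("left", 1.0), ("right", 3.0), ("left", 2.0), ("right", 5.0)]

with_v = VNode(["left", "right"])
without_v = VNode(["left", "right"])
for action, total_reward in samples:
    update(with_v, action, total_reward, update_vnode_value=True)
    update(without_v, action, total_reward, update_vnode_value=False)

same_choice = best_action(with_v) == best_action(without_v)
```

In this toy setup the QNode values are identical in both trees, so the selected action is the same. Only the unused root.value differs.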


zkytony mentioned this issue Jan 13, 2021

Bug fixes and interfacing with SARSOP, pomdp-solve #12

Merged

zkytony closed this as completed in #12 Jan 17, 2021
