Fix bug that prevented Executor::spin_some() from exiting. #472

tj10200 · 2018-05-04T21:33:44Z

Added batching functionality where all pending executables are
collected prior to being executed. This should prevent
long running executables from backing up the processing when calling
spin_some().

Connects to #471

Added batching functionality where all pending executables are collected prior to being executed. This should prevent long running executables from backing up the processing.

wjwwood

The basic approach is good in my opinion. There are some code-style issues which our tests and tools can help you work out (you can use the ament_uncrustify path/to/rclcpp --reformat and ament_cpplint path/to/cpplint tools to help you).

I'd also like to see a test which tests this case and ensures the desired behavior is happening. @tj10200 are you interested in doing that too?

If so, I think you could create a test case which does something like this:

create a node
create three condition variables: A, B, C
create bool test_finished = false
create a callback_count = 0
create a subscription and matching publisher
- ensure the subscription has a queue of at least 10 (either keep all or keep last >=10)
- in the callback for the subscription:
  - have it first increment callback_count
  - then notify the condition variable A
  - then if !test_finished:
    - wait on B
  - these steps allow you to known when the callback has started and then asynchronously control when the callback finishes
add node to executor (single threaded should be ok)
publish twice, to "queue up" two pieces of work
create a thread that:
- calls spin_some
- notify C
wait for A
- with a predicate that callback_count == 1
- this will let us know when spin_some has started executing the callback for the first time
publish again (queuing more work during spin_some which should not be handled)
notify B
wait for A
- with a predicate that callback_count == 2, and
- with a high timeout (~10s?)
- if timeout (callback never called again)
  - FAIL test
  - cancel executor
  - join thread
test_finished = true (to prevent callback from waiting in the future)
notify B (in case already waiting)
wait for C
- with a high timeout (~10s?)
- if timeout (spin_some never exited)
  - FAIL test
  - cancel executor
  - join thread
gtest ASSERT_EQ(callback_count, 2), if 3 then the spin_some did too much work
join thread

Sorry, I know that was a lot... lol.

I put "good first issue", but it's deceptively complex. I think if you fix the style issues and make a test then this will be a good incremental improvement.

However, reviewing this code I found three more issues with our implementation. So I'd merge this and make new issues for those.

Briefly I'll describe them here:

spinning should be checked between execution of each piece of work
- otherwise, it might take a long time for Executor::cancel to stop an in-progress spin_some

if spin_some stops before finishing all work, claimed work is not returned to the executor and may be lost

if spin_some is changed to consider spinning more often, then there is a race condition on spinning
- consider two threads both doing if (spinning.load()) executor.cancel(); executor.spin_some();
- consider this sequence:
  - >> enter thread 1
  - start spin_some()
  - spin_some() sets spinning = true
  - >> context switch to thread 2
  - calls executor.cancel() because spinning is true
  - start second spin_some()
  - second spin_some() sets spinning = true again
  - >> context switch to thread 1
  - spin_some()'s "on scope exit" sets spinning to false again
  - >> thread 1 ends, switch back to thread 2
  - second spin_some() exits because subsequent spinning check finds it false (yet no cancel has been called)
- expectation is that second spin_some() runs uninterrupted to completion
- actually second spin_some() sometimes exits early as-if canceled, despite not being canceled

Those don't need to be addressed in this pr or by you, but I will make issues for those cases when this is finished.

In summary, I'm "requesting changes (red X)" for style reasons and because I think it needs a test.

Thanks for the contribution so far!

Reformatted executor.cpp using ament_uncrustify and ament_cpplint

tj10200 · 2018-05-08T11:58:01Z

Hi @wjwwood - Yes I'd like to write a test before getting this in. I assume you want to commit the test on this same pull request?

wjwwood · 2018-05-08T18:35:45Z

Hi @wjwwood - Yes I'd like to write a test before getting this in. I assume you want to commit the test on this same pull request?

Yes, please.

tj10200 · 2018-05-11T19:50:26Z

hi @wjwwood, I was having some trouble getting this all to hook together in the unit test. I'm hoping you might be able to point out where I've messed up.

So I have something that looks like this (trimmed down version):

  rclcpp::init(0, nullptr);
  rclcpp::executors::SingleThreadedExecutor executor;
  rclcpp::Node::SharedPtr subscriber =  std::make_shared<SpinSomeTestSubscriber>();
  rclcpp::Node::SharedPtr publisher =  std::make_shared<SpinSomeTestPublisher>();
  executor.add_node(publisher);
  executor.add_node(subscriber);

  publisher->publish_test_data();

  executor.spin_some();

The problem I'm having is that when calling spin_some(), the function get_next_executable() always returns false and the subscriber callback is never called. The publisher and subscriber are the same ones under the minimal_publisher/subscriber lambda examples, I've basically just renamed them and changed publisher to use an explicit callback rather than use a timer.

Did I miss something here in the initialization steps? Looking through the other examples, I can't really see anything that's different in my UT.

tj10200 · 2018-05-17T04:00:53Z

Started looking at this again today. I noticed through gdb that occasionally the executor would pick up the published messages. Through some debugging I found that it was related to no having the intra process flag switched on when I created my nodes. Saying that, I ran into another problem immediately after initializing the nodes correctly.

In the executor on the collection of the second published message: It aborts on this line here - https://github.com/ros2/rclcpp/blob/master/rclcpp/src/rclcpp/executor.cpp#L595

I think I can make a change to the publisher and subscribers in the test to make the groups Reentrant types, but does this seem like the right behavior? It would mean that the application would abort any time a pub/sub was still mutually exclusive callback group types.

It seems that I might need to take a different approach to my original solution. I'm thinking of trying to work out a method to count the number of pending messages, and using that as a way to break out of the spin_some loop. This would be very similar to the current pull request, with the exception that the AnyExecutable objects would be collected and executed as a single operation.

Sudo Code:

Executor::spin_some()
...
  int pending_executables = get_pending_executables();
  while (spinning.load() && pending_executables) {
    //Same as https://github.com/ros2/rclcpp/blob/master/rclcpp/src/rclcpp/executor.cpp#L210
    --pending_executables;
  }
...

Thoughts?

cottsay · 2019-02-14T03:13:09Z

@tj10200, it looks like there hasn't been much progress on this in a while. Have you been able to make any progress, or do you have any questions we can help answer?

cottsay · 2019-02-21T04:28:11Z

I'm going to move this to the backlog for now, since there doesn't appear to be much activity. @tj10200, if you find time to revisit this, please feel free to let us know.

ivanpauno · 2019-11-22T16:24:11Z

#844 addressed the same problem this PR was trying to solve.
Closing as #844 has been merged.

@tj10200 Feel free to reopen or open a new issue/PR if you think that something has to be changed.

Signed-off-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>

Fix bug that prevented Executor::spin_some() from exiting.

ff9c32f

Added batching functionality where all pending executables are collected prior to being executed. This should prevent long running executables from backing up the processing.

tfoote added the in review Waiting for review (Kanban column) label May 4, 2018

wjwwood requested changes May 4, 2018

View reviewed changes

Fixing code style issues

2fdae53

Reformatted executor.cpp using ament_uncrustify and ament_cpplint

mikaelarguedas added requires-changes in progress Actively being worked on (Kanban column) and removed in review Waiting for review (Kanban column) labels May 10, 2018

mikaelarguedas requested a review from wjwwood July 5, 2018 17:48

mikaelarguedas added in review Waiting for review (Kanban column) and removed in progress Actively being worked on (Kanban column) labels Jul 5, 2018

cottsay added backlog and removed in review Waiting for review (Kanban column) labels Feb 21, 2019

ivanpauno closed this Nov 22, 2019

wuffle-ros bot removed the backlog label Nov 22, 2019

DensoADAS pushed a commit to DensoADAS/rclcpp that referenced this pull request Aug 5, 2022

catch parent exception (ros2#472)

59506e1

Signed-off-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix bug that prevented Executor::spin_some() from exiting. #472

Fix bug that prevented Executor::spin_some() from exiting. #472

tj10200 commented May 4, 2018 •

edited by wjwwood

Loading

wjwwood left a comment •

edited

Loading

tj10200 commented May 8, 2018

wjwwood commented May 8, 2018

tj10200 commented May 11, 2018

tj10200 commented May 17, 2018 •

edited

Loading

cottsay commented Feb 14, 2019

cottsay commented Feb 21, 2019

ivanpauno commented Nov 22, 2019

Fix bug that prevented Executor::spin_some() from exiting. #472

Fix bug that prevented Executor::spin_some() from exiting. #472

Conversation

tj10200 commented May 4, 2018 • edited by wjwwood Loading

wjwwood left a comment • edited Loading

Choose a reason for hiding this comment

tj10200 commented May 8, 2018

wjwwood commented May 8, 2018

tj10200 commented May 11, 2018

tj10200 commented May 17, 2018 • edited Loading

cottsay commented Feb 14, 2019

cottsay commented Feb 21, 2019

ivanpauno commented Nov 22, 2019

tj10200 commented May 4, 2018 •

edited by wjwwood

Loading

wjwwood left a comment •

edited

Loading

tj10200 commented May 17, 2018 •

edited

Loading