-
Notifications
You must be signed in to change notification settings - Fork 417
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug that prevented Executor::spin_some() from exiting. #472
Conversation
Added batching functionality where all pending executables are collected prior to being executed. This should prevent long running executables from backing up the processing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The basic approach is good in my opinion. There are some code-style issues which our tests and tools can help you work out (you can use the ament_uncrustify path/to/rclcpp --reformat
and ament_cpplint path/to/cpplint
tools to help you).
I'd also like to see a test which tests this case and ensures the desired behavior is happening. @tj10200 are you interested in doing that too?
If so, I think you could create a test case which does something like this:
- create a node
- create three condition variables: A, B, C
- create bool test_finished = false
- create a callback_count = 0
- create a subscription and matching publisher
- ensure the subscription has a queue of at least 10 (either keep all or keep last >=10)
- in the callback for the subscription:
- have it first increment callback_count
- then notify the condition variable A
- then if !test_finished:
- wait on B
- these steps allow you to known when the callback has started and then asynchronously control when the callback finishes
- add node to executor (single threaded should be ok)
- publish twice, to "queue up" two pieces of work
- create a thread that:
- calls spin_some
- notify C
- wait for A
- with a predicate that callback_count == 1
- this will let us know when spin_some has started executing the callback for the first time
- publish again (queuing more work during spin_some which should not be handled)
- notify B
- wait for A
- with a predicate that callback_count == 2, and
- with a high timeout (~10s?)
- if timeout (callback never called again)
- FAIL test
- cancel executor
- join thread
- test_finished = true (to prevent callback from waiting in the future)
- notify B (in case already waiting)
- wait for C
- with a high timeout (~10s?)
- if timeout (spin_some never exited)
- FAIL test
- cancel executor
- join thread
- gtest ASSERT_EQ(callback_count, 2), if 3 then the spin_some did too much work
- join thread
Sorry, I know that was a lot... lol.
I put "good first issue", but it's deceptively complex. I think if you fix the style issues and make a test then this will be a good incremental improvement.
However, reviewing this code I found three more issues with our implementation. So I'd merge this and make new issues for those.
Briefly I'll describe them here:
spinning
should be checked between execution of each piece of work- otherwise, it might take a long time for
Executor::cancel
to stop an in-progress spin_some
- otherwise, it might take a long time for
- if spin_some stops before finishing all work, claimed work is not returned to the executor and may be lost
- if spin_some is changed to consider spinning more often, then there is a race condition on
spinning
- consider two threads both doing
if (spinning.load()) executor.cancel(); executor.spin_some();
- consider this sequence:
- >> enter thread 1
- start spin_some()
- spin_some() sets spinning = true
- >> context switch to thread 2
- calls executor.cancel() because spinning is true
- start second spin_some()
- second spin_some() sets spinning = true again
- >> context switch to thread 1
- spin_some()'s "on scope exit" sets spinning to false again
- >> thread 1 ends, switch back to thread 2
- second spin_some() exits because subsequent spinning check finds it false (yet no cancel has been called)
- expectation is that second spin_some() runs uninterrupted to completion
- actually second spin_some() sometimes exits early as-if canceled, despite not being canceled
- consider two threads both doing
Those don't need to be addressed in this pr or by you, but I will make issues for those cases when this is finished.
In summary, I'm "requesting changes (red X)" for style reasons and because I think it needs a test.
Thanks for the contribution so far!
Reformatted executor.cpp using ament_uncrustify and ament_cpplint
Hi @wjwwood - Yes I'd like to write a test before getting this in. I assume you want to commit the test on this same pull request? |
Yes, please. |
hi @wjwwood, I was having some trouble getting this all to hook together in the unit test. I'm hoping you might be able to point out where I've messed up. So I have something that looks like this (trimmed down version):
The problem I'm having is that when calling spin_some(), the function get_next_executable() always returns false and the subscriber callback is never called. The publisher and subscriber are the same ones under the minimal_publisher/subscriber lambda examples, I've basically just renamed them and changed publisher to use an explicit callback rather than use a timer. Did I miss something here in the initialization steps? Looking through the other examples, I can't really see anything that's different in my UT. |
Started looking at this again today. I noticed through gdb that occasionally the executor would pick up the published messages. Through some debugging I found that it was related to no having the intra process flag switched on when I created my nodes. Saying that, I ran into another problem immediately after initializing the nodes correctly. In the executor on the collection of the second published message: It aborts on this line here - https://github.com/ros2/rclcpp/blob/master/rclcpp/src/rclcpp/executor.cpp#L595 I think I can make a change to the publisher and subscribers in the test to make the groups Reentrant types, but does this seem like the right behavior? It would mean that the application would abort any time a pub/sub was still mutually exclusive callback group types. It seems that I might need to take a different approach to my original solution. I'm thinking of trying to work out a method to count the number of pending messages, and using that as a way to break out of the spin_some loop. This would be very similar to the current pull request, with the exception that the AnyExecutable objects would be collected and executed as a single operation. Sudo Code:
Thoughts? |
@tj10200, it looks like there hasn't been much progress on this in a while. Have you been able to make any progress, or do you have any questions we can help answer? |
I'm going to move this to the backlog for now, since there doesn't appear to be much activity. @tj10200, if you find time to revisit this, please feel free to let us know. |
Signed-off-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>
Added batching functionality where all pending executables are
collected prior to being executed. This should prevent
long running executables from backing up the processing when calling
spin_some().
Connects to #471