Nav2 lifecycle manager won't startup nodes #2917
I have not run into this, but I'm also using Rolling binaries, not source. The binaries stopped being updated a while ago (not super long, but it has been a few weeks), since Rolling has moved to support 22.04 and releases binaries there for the Humble release in late May. More context: #2689
Yes, this is most probably a problem with a newer ROS 2 API change. I understand what was resolved in #2689, and I am not sure that nested timers belonging to different callback groups will keep working correctly. That is why this ticket was opened with the lock-synchronization approach proposed: to cover the problem initially raised in #2689 (I took the same testcase and played with it). Since it currently does not affect binary-distributed ROS 2 Rolling, we do not need to care about it too much. If the problem appears again in the near (or far) future, we can return to it.
Might be worth spending an hour figuring out what changed in Rolling to cause that issue and finding out whether it is a regression or not. If it is a regression, we need to point it out ASAP to Open Robotics as part of the Humble testing to see whether it was intentional. The released Rolling version will be updated with what you built from source, so it will be an issue in the near future. https://discourse.ros.org/t/branched-humble-call-for-beta-testing/25284 they're starting testing now; I'd try to see if you see it in the Humble setup that is intended for release.
I can reproduce the issue, but I'm not that familiar with what is supposed to be happening. That said, do feel free to open a new bug against https://github.com/ros2/rclcpp, and we can triage it during the test phase here to see if that is intentional behavior.
@AlexeyMerzlyakov, can you do this and see whether this is intentional behavior? Time is a little short during the testing phase.
The lock solution doesn't guarantee that the shared pointer has been created when … But I've had recent experiences where the executor didn't check for new timers/subscriptions/etc. if they were added from a different thread. That could be the cause of the problem. Fortunately, rclcpp has a solution for that: you just need to notify the executor. @AlexeyMerzlyakov, could you try resetting to the original code, adding these lines after creating the inner timer, and testing whether it works?
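The exact lines suggested in that comment are not preserved here; a minimal hedged sketch of what "notifying the executor" could look like, assuming the rclcpp API where the node's notify guard condition is exposed as a `rclcpp::GuardCondition` reference, might be:

```cpp
// Hedged sketch, not the original suggestion: after creating the inner timer,
// trigger the node's notify guard condition so that an already-spinning
// executor wakes up and re-collects its entities (timers, subscriptions, ...).
#include "rclcpp/rclcpp.hpp"

void notify_executor_about_new_entities(rclcpp::Node & node)
{
  // Assumes get_notify_guard_condition() returns rclcpp::GuardCondition &,
  // as in recent rclcpp versions.
  node.get_node_base_interface()->get_notify_guard_condition().trigger();
}
```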
I've come up with a small update on this problem today. Rolling back … It was discovered that the issue has appeared since the ros2/rmw_fastrtps@66926c7 commit, where a new …
Yes, that's how it should be. Thanks for noting this!
I've checked the above lines of code with the testcase from the issue header. Unfortunately, it does not work. Maybe we need to notify the inner timer's executor from the custom …
I was hoping that the node base interface notifies both callback groups (since it is the node base for both of them), but apparently it does not... A separate node for service_thread_ could be used, but this seems too complicated.
Yes, that could solve it - the variable could be set to true in the "outer" timer (since, as far as I know, there is no "callback" which is fired when the shared_ptr gets created), and the "inner" timer can be extracted out and its callback can be repeated until …
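As a loose, hedged illustration of that idea (class and member names are hypothetical, not tied to the actual nav2 code): the "outer" timer only flips a flag once the node is fully up, and the extracted "inner" timer keeps re-checking that flag on every tick until it can do its work.

```cpp
// Hypothetical sketch of the "repeat the inner callback until ready" idea.
#include <atomic>
#include <chrono>
#include "rclcpp/rclcpp.hpp"

class ExampleNode : public rclcpp::Node
{
public:
  ExampleNode()
  : Node("example_node")
  {
    // "Outer" timer: its only job is to mark that the node is up and running.
    outer_timer_ = create_wall_timer(
      std::chrono::milliseconds(10),
      [this]() {
        ready_ = true;
        outer_timer_->cancel();
      });

    // "Inner" timer, extracted to the top level: repeats until ready_ is set.
    inner_timer_ = create_wall_timer(
      std::chrono::milliseconds(10),
      [this]() {
        if (!ready_) {
          return;  // not ready yet, try again on the next tick
        }
        inner_timer_->cancel();
        // ... do the work that previously lived in the nested timer ...
      });
  }

private:
  std::atomic_bool ready_{false};
  rclcpp::TimerBase::SharedPtr outer_timer_;
  rclcpp::TimerBase::SharedPtr inner_timer_;
};
```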
But now it comes to mind: we could create the …
I think this should be the best solution: we are calling one thread's executor directly from another. I've checked it on the local testcase and in … Although it looks like the fix was found, I still do not fully understand the root cause of the problem. Working on it.
That seems like a fine solution to me. Though I would like to notify folks ASAP about this, since Humble's full release is in 2 weeks.
That sounds like a bug/change in behavior to me that we should notify folks about. @clalancette, agree?
Some info on the issue. The ros2/rmw_fastrtps@66926c7 commit just uncovered a hidden problem, but it is not the root cause of it. This commit contains the logic cycles for … This could be easily confirmed by adding …
Attaching an updated testcase with the … removed.
Yes, that would be good!
If I may chip in: as @AlexeyMerzlyakov pointed out, it can only be that ros2/rmw_fastrtps@66926c7 has surfaced an already existing problem; whether that problem lies in nav, rmw_fastrtps, or elsewhere I do not know. I think it'd be great if you could describe more precisely what is not working according to your expectations, what those expectations are, and probably provide a minimal reproducer to make understanding and debugging easier. Thanks guys!
@EduPonz, I'll try to put together the information you are requesting:
Our expectation is to see both timers, outer and nested, being executed, as happens when running with CycloneDDS. A simple repro case is attached as a zip archive at the end of this message. To execute the testcase, please build …
@AlexeyMerzlyakov the reason this code does not work is simply that it is not written correctly, i.e. not with regard to how executors work. Inside … The executor can be woken up externally using the guard condition, as I first proposed, but seemingly, as you tested it without success, it works only when there is just one executor per node (otherwise maybe only one of the executors gets notified, which might be an rclcpp issue to be raised). Another way around it is to avoid spinning executors before assigning work to them, which is the solution that finally worked (see the sketch below). I think that the reason why the code worked with CycloneDDS is that it was a bit faster/slower and …
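A hedged, illustrative sketch of that ordering (class and member names here are hypothetical, not the actual nav2 change): create the work for the dedicated executor first, and only then start the thread that spins it.

```cpp
// Hypothetical sketch: assign work to the executor before spinning it.
#include <chrono>
#include <memory>
#include <thread>
#include "rclcpp/rclcpp.hpp"

class ManagerLike : public rclcpp::Node
{
public:
  ManagerLike()
  : Node("manager_like")
  {
    callback_group_ = create_callback_group(
      rclcpp::CallbackGroupType::MutuallyExclusive, false);

    // 1. Create the timer (the executor's "work") first.
    init_timer_ = create_wall_timer(
      std::chrono::milliseconds(10),
      [this]() {RCLCPP_INFO(get_logger(), "init timer fired");},
      callback_group_);

    // 2. Only then hand the group to the executor and start spinning it.
    executor_ = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
    executor_->add_callback_group(callback_group_, get_node_base_interface());
    service_thread_ = std::make_unique<std::thread>([this]() {executor_->spin();});
  }

  ~ManagerLike() override
  {
    executor_->cancel();
    service_thread_->join();
  }

private:
  rclcpp::CallbackGroup::SharedPtr callback_group_;
  rclcpp::TimerBase::SharedPtr init_timer_;
  rclcpp::executors::SingleThreadedExecutor::SharedPtr executor_;
  std::unique_ptr<std::thread> service_thread_;
};
```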
@afrixs, thank you for the explanation. However, I still have some doubts about it. If the executor should never launch timers registered after it was created, then a second timer placed anywhere after the executor's creation shouldn't work either, e.g. as in the following part of code:
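The snippet referred to here is not preserved; as a hedged reconstruction (hypothetical names, reusing the member layout of the sketch above), the described case would be a constructor body where the executor is already spinning before either timer exists:

```cpp
// Hypothetical reconstruction: the dedicated executor is spinning in its own
// thread *before* any timer exists, and a second, non-nested timer is created
// afterwards on the same callback group.
callback_group_ = create_callback_group(
  rclcpp::CallbackGroupType::MutuallyExclusive, false);

executor_ = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
executor_->add_callback_group(callback_group_, get_node_base_interface());
service_thread_ = std::make_unique<std::thread>([this]() {executor_->spin();});

first_timer_ = create_wall_timer(
  std::chrono::seconds(1),
  [this]() {RCLCPP_INFO(get_logger(), "first timer fired");},
  callback_group_);

// Mentioned below: adding a sleep here changes the observed behavior.
// std::this_thread::sleep_for(std::chrono::seconds(1));

second_timer_ = create_wall_timer(
  std::chrono::seconds(1),
  [this]() {RCLCPP_INFO(get_logger(), "second timer fired");},
  callback_group_);
```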
However, it works on FastRTPS: it does not work only if …
For that I've added a 1 second sleep …
@AlexeyMerzlyakov hmm, that's interesting. This led me to more exploration, and I found that the guard condition actually gets triggered automatically inside … So the difference is that in your last testcase (the one with the non-nested …)
Sounds exactly like the problem ros2/rclcpp#1640 is trying to fix!
Just noting that I have this same problem in Galactic (binaries) when I switch to rmw_fastrtps_cpp to test another issue.
Bug report
Required Info:
Steps to reproduce issue
Run nav2_lifecycle_manager with any lifecycle node, e.g.:

Expected behavior
Lifecycle nodes go to the "Configured" and "Active" states:
Actual behavior
Nothing is seen in the console: the Nav2 lifecycle manager never switches nodes to the configured or active state.
Problem analysis
Nodes are not going to the configured/active states because the LifecycleManager::startup() routine is not being executed. This happens because the outer and inner wall timers belong to different callback groups:

It looks like, depending on which thread is executed first (the default Node's thread or service_thread_), the inner part of the code will or won't be executed. I've attached a simple testcase (dev_ws.zip) containing code similar to the above snippet, to easily check this situation:
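The testcase code itself is not preserved in this copy of the issue; a hedged reconstruction of the nested-timer pattern it describes (hypothetical class and member names) might be:

```cpp
// Hypothetical reconstruction of the nested-timer pattern under discussion:
// a dedicated callback group spun by its own executor thread, with an inner
// timer created from inside the outer timer's callback.
#include <chrono>
#include <memory>
#include <thread>
#include "rclcpp/rclcpp.hpp"

class NestedTimersNode : public rclcpp::Node
{
public:
  NestedTimersNode()
  : Node("nested_timers_node")
  {
    callback_group_ = create_callback_group(
      rclcpp::CallbackGroupType::MutuallyExclusive, false);

    executor_ = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
    executor_->add_callback_group(callback_group_, get_node_base_interface());
    service_thread_ = std::make_unique<std::thread>([this]() {executor_->spin();});

    outer_timer_ = create_wall_timer(
      std::chrono::seconds(1),
      [this]() {
        RCLCPP_INFO(get_logger(), "Outer timer fired");
        outer_timer_->cancel();
        // The inner timer is created from inside the outer callback and put
        // into the same dedicated callback group.
        inner_timer_ = create_wall_timer(
          std::chrono::seconds(1),
          [this]() {RCLCPP_INFO(get_logger(), "Inner timer fired");},
          callback_group_);
      },
      callback_group_);
  }

  ~NestedTimersNode() override
  {
    executor_->cancel();
    service_thread_->join();
  }

private:
  rclcpp::CallbackGroup::SharedPtr callback_group_;
  rclcpp::TimerBase::SharedPtr outer_timer_;
  rclcpp::TimerBase::SharedPtr inner_timer_;
  rclcpp::executors::SingleThreadedExecutor::SharedPtr executor_;
  std::unique_ptr<std::thread> service_thread_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<NestedTimersNode>());
  rclcpp::shutdown();
  return 0;
}
```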
There are two timers: outer and nested. In the correct case, both of them are executed:
In the failed case, only the outer wall timer's lambda is executed:
Possible solutions
Removing the inner lambda callback from callback_group_ will automatically add it to the default callback group of the Node, which resolves the situation. However, it looks like this was made intentionally in 89499ab to cover some cases (@robotechvision, could I ask you to share more details about which situation it resolves?).

If we can't move both the outer and inner wall timers into one group, another solution is to add lock-synchronization between the LifecycleManager constructor and the init_timer_ callback, in order to prevent the initial crash from #2689. Something like this:
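The proposed snippet is not preserved here; a minimal hedged sketch of the lock-synchronization idea (member names such as init_mutex_, init_cv_ and construction_done_ are hypothetical, and the constructor signature is approximated) could look like:

```cpp
// Hypothetical sketch of lock-synchronization between the constructor and the
// init_timer_ callback: the callback blocks until construction has finished.
// Assumed extra members, declared in the class:
//   std::mutex init_mutex_;
//   std::condition_variable init_cv_;
//   bool construction_done_{false};
// Note: this only avoids deadlock because init_timer_ is serviced by the
// separate service_thread_ executor, not by the thread running the constructor.
#include <chrono>
#include <condition_variable>
#include <mutex>
#include "rclcpp/rclcpp.hpp"

LifecycleManager::LifecycleManager(const rclcpp::NodeOptions & options)
: Node("lifecycle_manager", options)
{
  // callback_group_ is assumed to have been created just above, as before.
  init_timer_ = create_wall_timer(
    std::chrono::nanoseconds(10),  // fire as soon as possible
    [this]() -> void {
      {
        // Wait until the constructor below has fully finished.
        std::unique_lock<std::mutex> lock(init_mutex_);
        init_cv_.wait(lock, [this]() {return construction_done_;});
      }
      init_timer_->cancel();
      startup();
    },
    callback_group_);

  // ... the rest of the original constructor body ...

  {
    std::lock_guard<std::mutex> lock(init_mutex_);
    construction_done_ = true;
  }
  init_cv_.notify_all();
}
```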
However, I am not totally sure that this is the right solution. For example, I could be missing something, or we could play with callback groups in a way that allows a more elegant solution. @robotechvision, @SteveMacenski, do you have any other thoughts on this?