Fast-DDS Service Reliability sometimes hangs lifecycle manager #3033

Closed
Aposhian opened this issue Jun 21, 2022 · 50 comments · Fixed by #3071
Comments

@Aposhian
Contributor

Aposhian commented Jun 21, 2022

Bug report

  • Operating System:
    • Ubuntu 20.04
  • ROS2 Version:
    • rolling (focal)
  • Version or commit hash:
  • DDS implementation:
    • FastDDS

Steps to reproduce issue

Give the lifecycle manager a node to manage that is erroring out and, for whatever reason, does not create its lifecycle services properly.

Expected behavior

Failure to connect to the lifecycle state services should eventually time out, or at least indicate that the lifecycle node it is trying to manage is not working.

Actual behavior

The lifecycle manager blocks indefinitely.

Additional information

I have a hard time reproducing this reliably. I think the lifecycle services failing to come up has something to do with FastDDS shared memory. This is happening with just nav2 controller_server for me.

This results in the behavior observed in #3027, since the controller server is unable to configure and therefore never provides the FollowPath action.

@Aposhian
Contributor Author

Ok, actually I think this is partially a duplicate of #2917. However, I still think the infinite timeout is problematic.

@SteveMacenski
Member

SteveMacenski commented Jun 21, 2022

I think the lifecycle services failing to come up has something to do with FastDDS shared memory

Do you see this with Cyclone? If not, then you should file this with Fast-DDS as a problem to resolve.

This is happening with just nav2 controller_server for me.

Try reordering the lifecycle bringup sequence; my guess is it's not actually an issue with the controller server, it's just first in line.

How is it a duplicate of #2917? That was about nested timer issues.

Do you have a specific suggested change? From what I read here, it seems like the service clients aren't being created properly, and without more specific information / lines / issue descriptions, I can only guess what you mean by that. I think you mean that they're just straight up not working, which is not a Nav2 issue; it's a DDS/RMW issue. But maybe you mean something we can change here; I'm just not sure what that is without more detail.

@Aposhian
Contributor Author

No, this never happens with CycloneDDS.

@SteveMacenski
Member

SteveMacenski commented Jun 21, 2022

Please file a ticket with Fast-DDS and link back here so the threads are connected. It sounds like that's really where the problem is, unless you found something we can do in Nav2 to make it work? But it sounds like this is a communication issue that's out of this project's control. If there's something actionable for us here, I'd be happy to discuss it.

@SteveMacenski
Member

CC @MiguelCompany @EduPonz: another DDS-related issue has cropped up.

@Aposhian
Contributor Author

My suggestion is to not use wait_for_service with an essentially infinite timeout. Even if it is an RMW failure, an infinite timeout masks it and just results in a hanging process. I would propose using a large but finite timeout. It could default to something reasonable (5 minutes?), and it might make a good parameter.
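
For illustration, here is a minimal sketch of what a bounded, parameterized wait could look like. The service_wait_timeout parameter and the helper below are hypothetical, not existing Nav2 code; the 300 s default just mirrors the 5-minute suggestion above.

```cpp
#include <chrono>
#include <stdexcept>
#include <string>

#include "rclcpp/rclcpp.hpp"
#include "lifecycle_msgs/srv/get_state.hpp"

// Hypothetical helper: block for at most a configurable amount of time
// instead of waiting on the lifecycle service forever.
void wait_for_service_or_throw(
  rclcpp::Node & node,
  rclcpp::Client<lifecycle_msgs::srv::GetState> & client)
{
  // Hypothetical parameter; the default value is only illustrative.
  const double timeout_s = node.declare_parameter("service_wait_timeout", 300.0);

  if (!client.wait_for_service(std::chrono::duration<double>(timeout_s))) {
    throw std::runtime_error(
            std::string("Timed out waiting for service ") + client.get_service_name());
  }
}
```

Whether a timeout should throw, log an error, or mark the managed node as failed is a separate design choice for the lifecycle manager; the point here is only that the wait is finite and configurable.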

@SteveMacenski
Member

SteveMacenski commented Jun 21, 2022

The fact that this is an issue in Fast-DDS should still be reported to them, though. We can add a timeout so we don't hang forever in case of failure, but it is problematic that we're failing in the first place when moving to Fast-DDS.

@vinnnyr
Contributor

vinnnyr commented Jun 21, 2022

Well, I don't think they are the same thing, since they are in different places. Unless you want the idea to be "throughout Nav2, don't have infinite timeouts waiting for action servers / service servers".

@SteveMacenski
Member

Ah, you're right; backed that out.

@SteveMacenski
Member

SteveMacenski commented Jun 21, 2022

My suggestion is to not use wait_for_service with an essentially infinite timeout. Even if it is an RMW failure, an infinite timeout masks it and just results in a hanging process. I would propose using a large but finite timeout. It could default to something reasonable (5 minutes?), and it might make a good parameter.

Feel free to propose a PR. 5 minutes seems a little nuts; 10 seconds seems more reasonable, even if respawned from a remote server. But yeah, I think a parameter would be good.

@Aposhian
Contributor Author

Here is the controller server backtrace in this state, while the lifecycle manager is waiting for it:

#1  0x00007f01b3e21e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
   from /lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007f01b320997f in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) ()
   from /opt/ros/rolling/lib/librmw_fastrtps_shared_cpp.so
#3  0x00007f01b325810a in rmw_wait () from /opt/ros/rolling/lib/librmw_fastrtps_cpp.so
#4  0x00007f01b3b02577 in rcl_wait () from /opt/ros/rolling/lib/librcl.so
#5  0x00007f01b428031e in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >)
    () from /opt/ros/rolling/lib/librclcpp.so
#6  0x00007f01b42807d3 in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/rolling/lib/librclcpp.so
#7  0x00007f01b4284ead in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/rolling/lib/librclcpp.so
#8  0x00007f01b4283f28 in rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>) ()
   from /opt/ros/rolling/lib/librclcpp.so
#9  0x00005557bc0643b8 in main (argc=<optimized out>, argv=0x7ffef583d478)
    at /usr/include/c++/9/bits/shared_ptr_base.h:1020

Meanwhile, the lifecycle manager backtrace is here:

#0  0x00007fa09d893376 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fa09dd91e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
   from /lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007fa09d3e597f in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) ()
   from /opt/ros/rolling/lib/librmw_fastrtps_shared_cpp.so
#3  0x00007fa09d43410a in rmw_wait () from /opt/ros/rolling/lib/librmw_fastrtps_cpp.so
#4  0x00007fa09d9aa577 in rcl_wait () from /opt/ros/rolling/lib/librcl.so
#5  0x00007fa09e00431e in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >)
    () from /opt/ros/rolling/lib/librclcpp.so
#6  0x00007fa09e0047d3 in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/rolling/lib/librclcpp.so
#7  0x00007fa09e008ead in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/rolling/lib/librclcpp.so
#8  0x00007fa09e007f28 in rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>) ()
   from /opt/ros/rolling/lib/librclcpp.so
#9  0x00007fa09e0081d0 in rclcpp::spin(std::shared_ptr<rclcpp::Node>) () from /opt/ros/rolling/lib/librclcpp.so
#10 0x000055c20de0d492 in main (argc=<optimized out>, argv=<optimized out>) at /usr/include/c++/9/ext/atomicity.h:96

I can run ros2 service list and see

/controller_server/change_state
/controller_server/get_state

and if I try calling get_state, it appears to finish waiting, but it never gets a response from the controller server:

$ ros2 service call /controller_server/get_state lifecycle_msgs/srv/GetState "{}"
waiting for service to become available...
requester: making request: lifecycle_msgs.srv.GetState_Request()

And I can't query it via ros2 lifecycle:

$ ros2 lifecycle get /controller_server                                          
Node not found

@Aposhian
Contributor Author

It is important to note that this error condition does not happen every time. It is inconsistent.

@Aposhian
Contributor Author

Here is a first effort at a reproduction. It reproduces the problem only some of the time though. https://gist.github.com/Aposhian/043359e09a203900e8db55407a8b5e38

@MiguelCompany

@Aposhian One thing I don't understand: in the issue description you say this happens on rolling, but the gist you just shared uses ros:galactic as the base Docker image.

So my question is: Does the issue only reproduce in galactic? Does it also reproduce on humble/rolling?

@SteveMacenski
Member

Note that Rolling binaries on 20.04 are outdated, since Rolling has moved to 22.04, so there may be updates missing if you are not using 22.04 as the base or building from source.

@Aposhian
Contributor Author

Yes, I am using old rolling on focal, so closer to galactic than to humble.

@Aposhian
Contributor Author

Yes, I cannot reproduce this issue on Humble images.

@SteveMacenski
Member

Keep in mind that Galactic is EOL in 5 months.

Since this is a Galactic-specific problem, what do you want to do about it? Would it be possible to backport the patch, @MiguelCompany? If not, we could try to patch something in just the galactic branch for you, @Aposhian, for the interim. If it's just Galactic dealing with a Fast-DDS issue, I'm OK being a bit more sloppy about it than I normally would, since its lifecycle is very limited and the change would not be pushed into future distributions; the maintenance cost and technical reach is limited.

@Aposhian
Contributor Author

@MiguelCompany suggested I try building FastDDS from source on the 2.3.x branch, which has unreleased fixes for galactic. I don't think any Nav2 action is required. I will update if that is a viable workaround for galactic.

@SteveMacenski
Member

OK - if that fixes things feel free to close when ready!

@Aposhian
Contributor Author

Ok, I tried building Fast-DDS from the 2.3.x branch, but it doesn't resolve the issue. https://gist.github.com/Aposhian/043359e09a203900e8db55407a8b5e38

rmw: rmw_cyclonedds_cpp | daemon: ✅
  Try #1: ✅
  Try #2: ✅
  Try #3: ✅
rmw: rmw_cyclonedds_cpp | daemon: ❌
  Try #1: ✅
  Try #2: ✅
  Try #3: ✅
rmw: rmw_fastrtps_cpp | daemon: ✅
  Try #1: ❌
  Try #2: ✅
  Try #3: ✅
rmw: rmw_fastrtps_cpp | daemon: ❌
  Try #1: ✅
  Try #2: ✅
  Try #3: ✅

@Aposhian
Contributor Author

Aposhian commented Jul 1, 2022

@MiguelCompany while not specific to Nav2, lifecycle services are still failing intermittently on Humble. Can we migrate/copy this issue to Fast-DDS?

@SteveMacenski
Member

SteveMacenski commented Jul 1, 2022

Agreed; some private conversations I've had in recent weeks have also raised issues with services in Humble.

@MiguelCompany what's the next step / status on this issue?

@SteveMacenski
Member

Any updates? This is a pretty critical issue and I'm trying to find out if we need to take action to resolve it 😄

@SteveMacenski
Member

SteveMacenski commented Jul 7, 2022

In the same vein as #3032, I want to get this off the queue, so I'm trying to find the actionable things we can do to resolve it. I know the main issue is Fast-DDS not handling lots of service calls well, but I suppose we can improve the way these failures surface. The lifecycle manager and the BT Action Node in the BT Navigator are the only two places where we intentionally have waits without timeouts, since these are pretty critical operations.

For this (the lifecycle manager), what should happen if a service for transitioning doesn't come up? Fail everything? I suppose we could also remove the checks for services being up in the constructors (https://github.com/ros-planning/navigation2/blob/main/nav2_util/src/lifecycle_service_client.cpp#L39), knowing that we check whether they're up before we submit requests in the change/get methods that were recently added to support respawning. I'll say, though, that the current state of the lifecycle services and the manager is a stable but carefully set up balance due to supporting Bond, Respawn, and Lifecycle all at the same time. Removing that could cause some real headaches elsewhere, but I won't be sure whether it's the best answer until I try it.
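
As a rough sketch of the "bound the wait at request time instead of blocking in the constructor" idea, something like the following could work. The function name and timeouts are illustrative; this is not the actual lifecycle_service_client.cpp code.

```cpp
#include <chrono>
#include <cstdint>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "lifecycle_msgs/srv/change_state.hpp"

// Illustrative only: bound both the wait for the server and the wait for the
// response, so a broken lifecycle service fails the call instead of hanging.
bool change_state_bounded(
  const rclcpp::Node::SharedPtr & node,
  const rclcpp::Client<lifecycle_msgs::srv::ChangeState>::SharedPtr & client,
  uint8_t transition_id,
  std::chrono::seconds timeout)
{
  if (!client->wait_for_service(timeout)) {
    RCLCPP_ERROR(
      node->get_logger(), "Service %s is not available, aborting transition",
      client->get_service_name());
    return false;
  }

  auto request = std::make_shared<lifecycle_msgs::srv::ChangeState::Request>();
  request->transition.id = transition_id;

  auto future = client->async_send_request(request);
  // Also bound the wait on the response itself.
  if (rclcpp::spin_until_future_complete(node, future, timeout) !=
    rclcpp::FutureReturnCode::SUCCESS)
  {
    RCLCPP_ERROR(node->get_logger(), "Timed out waiting for a transition response");
    return false;
  }
  return future.get()->success;
}
```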

Like in the other ticket though, I think the priority should be on working with Fast-DDS to be able to resolve this issue outright.

@Aposhian
Contributor Author

Aposhian commented Jul 7, 2022

Yeah, I'm torn as to whether Nav2 should consider this fallible as part of normal operation and handle it gracefully. But once you start to question the underlying RMW, lots of other assumptions go out the window.
Even if we don't add a configurable timeout, we could probably make the log messages more explicit about what it is waiting on, so there is no ambiguity: something like "Waiting for lifecycle service...".
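
Something like the following minimal sketch would make the hang visible in the logs (illustrative only, not taken from Nav2):

```cpp
#include <chrono>

#include "rclcpp/rclcpp.hpp"

// Illustrative only: keep waiting, but periodically say exactly which
// service we are blocked on so the hang is visible in the logs.
template<typename ServiceT>
void wait_and_log(rclcpp::Node & node, rclcpp::Client<ServiceT> & client)
{
  using namespace std::chrono_literals;
  while (rclcpp::ok() && !client.wait_for_service(2s)) {
    RCLCPP_INFO(
      node.get_logger(),
      "Waiting for lifecycle service %s to become available...",
      client.get_service_name());
  }
}
```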

@jsan-rt

jsan-rt commented Jul 8, 2022

Hi @Aposhian @SteveMacenski

Since you also said this is happening on Humble, we have moved our investigation there, with no luck so far. Using your reproducer on Humble yields no failures; we have been able to run it hundreds of times without error.

Do you have any other clues on how to reproduce this issue on Humble?

We also have this PR open that aims to bring Fast DDS's wait sets to the RMW. It's still under review, but you could try using it and see whether the situation improves.

@Aposhian
Contributor Author

Aposhian commented Jul 8, 2022

I think that sounds reasonable. However, we won't be much use for testing this right now, since we have decided to use CycloneDDS with Humble for our main operation at the moment.

@SteveMacenski
Member

OK, I can add that to my task list for early next week to add the logs / waits. I could actually do this in the BT Action Node too, so that in either case, if we're waiting and it isn't connecting, we're printing that for awareness. Do you think that's a good solution to #3032 as well?
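
For the BT Action Node case, the equivalent sketch with an rclcpp_action client might look roughly like this (the helper and its naming are hypothetical):

```cpp
#include <chrono>
#include <string>

#include "rclcpp/rclcpp.hpp"
#include "rclcpp_action/rclcpp_action.hpp"

// Hypothetical helper: same pattern as the service wait, but for an action
// server, logging which server we are still waiting on.
template<typename ActionT>
void wait_for_action_server_and_log(
  rclcpp::Node & node,
  rclcpp_action::Client<ActionT> & client,
  const std::string & action_name)
{
  using namespace std::chrono_literals;
  while (rclcpp::ok() && !client.wait_for_action_server(2s)) {
    RCLCPP_INFO(
      node.get_logger(),
      "Waiting for \"%s\" action server to become available...",
      action_name.c_str());
  }
}
```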

@Aposhian
Contributor Author

Yes I think so.

@SteveMacenski
Member

Sweet, it's on my queue for this afternoon.

@SteveMacenski
Member

#3071 implements this.

@SteveMacenski
Member

Sorry, this should stay open until the service reliability is handled in Fast-DDS

@SteveMacenski SteveMacenski reopened this Jul 13, 2022
@SteveMacenski SteveMacenski changed the title lifecycle manager wait for service should have a finite, or at least configurable, timeout. Fast-DDS Service Reliability sometimes hangs lifecycle manager Jul 13, 2022
@SteveMacenski
Member

@Aposhian the PR that supposedly fixes this was merged; have you tested it by chance?

@Aposhian
Contributor Author

No, but this is probably something we can get to testing in the next month.

@vinnnyr
Contributor

vinnnyr commented Aug 25, 2022

@SteveMacenski can you please point us to the upstream PR that we should make sure we test with?

Was it this one? ros2/rmw_fastrtps#619

@SteveMacenski
Member

yes!

@SteveMacenski
Member

@Aposhian I don't suppose you tested this as part of your migration back to Fast-DDS for the features you liked?

@Aposhian
Contributor Author

No, not yet.

@EduPonz

EduPonz commented Nov 24, 2022

Hi everyone,

The new Humble sync is out, which contains the changes made to solve this issue.

@SteveMacenski
Member

SteveMacenski commented Nov 28, 2022

@Aposhian / @vinnnyr if you could try that out, since I know you would ideally like to be using Fast-DDS, that would be fantastic. You can close this out if that resolves the issue.

@vinnnyr
Contributor

vinnnyr commented Nov 28, 2022

Will report back if we end up having time to try it. Unfortunately, both of us seem to be a bit short on time for the rest of the year, so I cannot promise anything.

@EduPonz

EduPonz commented Feb 2, 2023

Hi everyone,

On January 27th there was a Humble patch release which included additional fixes (mainly eProsima/Fast-DDS#3195). It'd be great if someone could give this a try and see whether this ticket can be closed.

@SteveMacenski
Member

Agreed!

@AlexeyMerzlyakov
Collaborator

AlexeyMerzlyakov commented Mar 7, 2023

On January 27th there was a Humble patch release which included additional fixes (mainly eProsima/Fast-DDS#3195). It'd be great if someone could give this a try and see whether this ticket can be closed.

Unfortunately, the problem was reproduced on Fast-DDS with the PR3195 patch: very rare, but it still appears, typically about once per ~50-100 Nav2 stack runs. Checked on Ubuntu 20.04 with ROS 2 Rolling built from source (20230110 version, PR3195 included).

Just for reference: for verification I've used the Nav2 system tests, which deploy the full Nav2 stack on each run. The stress-test script is attached to this message: keepout_stress.sh.txt

@ygoumaz

ygoumaz commented Mar 13, 2023

I can confirm that we still encounter this issue with the webots_ros2 package. Our tests that use the Nav2 package randomly fail. Failures occur more often on CI machines than locally. Changing to Cyclone DDS seems to resolve the issue.

@MiguelCompany

@AlexeyMerzlyakov

Just for reference: for verification I've used the Nav2 system tests, which deploy the full Nav2 stack on each run. The stress-test script is attached to this message: keepout_stress.sh.txt

This is killing the processes with kill -9, which does not give them a chance to free their resources (shared memory files, in the case of Fast DDS, for instance).

I would check two things:

  1. Change the signal being used to -2 (SIGINT)
  2. Add rm /dev/shm/* at the end of the kill_ros function

@SteveMacenski
Member

This hasn't received any updates in about a year from anyone noting an issue, and I also haven't seen it in my day-to-day development in some time.

I'm closing this ticket as fixed by Fast-DDS, since it no longer appears to be a problem of reportable frequency. Though, happy to reopen if anyone can show that this is still an issue plaguing their application.
