Fast-DDS Service Reliability sometimes hangs lifecycle manager #3033

Closed
Aposhian opened this issue Jun 21, 2022 · 50 comments · Fixed by #3071
Comments

@Aposhian
Contributor

Aposhian commented Jun 21, 2022

Bug report

  • Operating System:
    • Ubuntu 20.04
  • ROS2 Version:
    • rolling (focal)
  • Version or commit hash:
  • DDS implementation:
    • FastDDS

Steps to reproduce issue

Give the lifecycle manager a node to manage that is erroring out and, for whatever reason, does not create its lifecycle services properly.

Expected behavior

Failure to connect to the lifecycle state services should eventually time out, or at least indicate that the lifecycle node it is trying to manage is not working.

Actual behavior

The lifecycle manager blocks indefinitely.

Additional information

I have a hard time reproducing this reliably. I think the lifecycle services failing to come up has something to do with FastDDS shared memory. This is happening with just nav2 controller_server for me.

This results in the behavior observed in #3027, since the controller server is unable to configure and therefore never provides the FollowPath action.

@Aposhian
Contributor Author

Ok, actually I think this is partially a duplicate of #2917. However, I still think the infinite timeout is problematic.

@SteveMacenski
Member

SteveMacenski commented Jun 21, 2022

I think the lifecycle services failing to come up has something to do with FastDDS shared memory

Do you see this with Cyclone? If not, then you should file this with Fast-DDS as a problem to resolve.

This is happening with just nav2 controller_server for me.

Try reordering the lifecycle bringup sequence; my guess is it's not actually an issue with the controller server, it's just first in line.

How is it a duplicate of #2917? That was about nested timer issues.

Do you have a specific suggested change? From what I read here, it seems like the service clients aren't being created properly, and without more specific information / lines / issue descriptions, I can only guess what you mean by that. I think you mean that they're just straight up not working, which is not a Nav2 issue; it's a DDS/RMW issue. But maybe you mean something we can change here; I'm just not sure what that is without more detail.

@Aposhian
Contributor Author

No, this never happens with CycloneDDS.

@SteveMacenski
Member

SteveMacenski commented Jun 21, 2022

Please file a ticket with Fast-DDS and link back here so the threads are connected. It sounds like that's really where the problem is, unless you found something we can do in Nav2 to make it work? But it sounds like this is a communication issue that's out of this project's control. If there's something actionable for us here, I'd be happy to discuss it.

@SteveMacenski
Member

CC @MiguelCompany @EduPonz: another DDS-related issue has cropped up.

@Aposhian
Contributor Author

My suggestion is to not use wait_for_service with an essentially infinite timeout. Even if it is an RMW failure, an infinite timeout masks it and just results in a hanging process. I would propose using a large but finite timeout. It could default to something reasonable (5 minutes?), and it might make a good parameter.
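
For illustration, here is a minimal sketch of what a bounded, parameterized wait could look like. The service_wait_timeout parameter and the helper below are hypothetical, not existing Nav2 code; the 300 s default just mirrors the 5-minute suggestion above.

```cpp
#include <chrono>
#include <stdexcept>
#include <string>

#include "rclcpp/rclcpp.hpp"
#include "lifecycle_msgs/srv/get_state.hpp"

// Hypothetical helper: block for at most a configurable amount of time
// instead of waiting on the lifecycle service forever.
void wait_for_service_or_throw(
  rclcpp::Node & node,
  rclcpp::Client<lifecycle_msgs::srv::GetState> & client)
{
  // Hypothetical parameter; the default value is only illustrative.
  const double timeout_s = node.declare_parameter("service_wait_timeout", 300.0);

  if (!client.wait_for_service(std::chrono::duration<double>(timeout_s))) {
    throw std::runtime_error(
            std::string("Timed out waiting for service ") + client.get_service_name());
  }
}
```

Whether a timeout should throw, log an error, or mark the managed node as failed is a separate design choice for the lifecycle manager; the point here is only that the wait is finite and configurable.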

@SteveMacenski
Member

SteveMacenski commented Jun 21, 2022

The fact that this is an issue in Fast-DDS should still be reported to them, though. We can add a timeout so we don't hang forever in case of failure, but it is problematic that we're failing in the first place when moving to Fast-DDS.

@vinnnyr
Contributor

vinnnyr commented Jun 21, 2022

Well, I don't think they are the same thing, since they are in different places. Unless you want the idea to be "throughout Nav2, don't have infinite timeouts waiting for action servers / service servers".

@SteveMacenski
Member

Ah, you're right; backed that out.

@SteveMacenski
Member

SteveMacenski commented Jun 21, 2022

My suggestion is to not use wait_for_service with an essentially infinite timeout. Even if it is an RMW failure, an infinite timeout masks it and just results in a hanging process. I would propose using a large but finite timeout. It could default to something reasonable (5 minutes?), and it might make a good parameter.

Feel free to propose a PR. 5 minutes seems a little nuts; 10 seconds seems more reasonable, even if respawned from a remote server. But yeah, I think a parameter would be good.

@Aposhian
Contributor Author

Here is the controller server backtrace in this state, while the lifecycle manager is waiting for it:

#1  0x00007f01b3e21e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
   from /lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007f01b320997f in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) ()
   from /opt/ros/rolling/lib/librmw_fastrtps_shared_cpp.so
#3  0x00007f01b325810a in rmw_wait () from /opt/ros/rolling/lib/librmw_fastrtps_cpp.so
#4  0x00007f01b3b02577 in rcl_wait () from /opt/ros/rolling/lib/librcl.so
#5  0x00007f01b428031e in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >)
    () from /opt/ros/rolling/lib/librclcpp.so
#6  0x00007f01b42807d3 in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/rolling/lib/librclcpp.so
#7  0x00007f01b4284ead in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/rolling/lib/librclcpp.so
#8  0x00007f01b4283f28 in rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>) ()
   from /opt/ros/rolling/lib/librclcpp.so
#9  0x00005557bc0643b8 in main (argc=<optimized out>, argv=0x7ffef583d478)
    at /usr/include/c++/9/bits/shared_ptr_base.h:1020

Meanwhile, the lifecycle manager backtrace is here:

#0  0x00007fa09d893376 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fa09dd91e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
   from /lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007fa09d3e597f in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) ()
   from /opt/ros/rolling/lib/librmw_fastrtps_shared_cpp.so
#3  0x00007fa09d43410a in rmw_wait () from /opt/ros/rolling/lib/librmw_fastrtps_cpp.so
#4  0x00007fa09d9aa577 in rcl_wait () from /opt/ros/rolling/lib/librcl.so
#5  0x00007fa09e00431e in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >)
    () from /opt/ros/rolling/lib/librclcpp.so
#6  0x00007fa09e0047d3 in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/rolling/lib/librclcpp.so
#7  0x00007fa09e008ead in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/rolling/lib/librclcpp.so
#8  0x00007fa09e007f28 in rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>) ()
   from /opt/ros/rolling/lib/librclcpp.so
#9  0x00007fa09e0081d0 in rclcpp::spin(std::shared_ptr<rclcpp::Node>) () from /opt/ros/rolling/lib/librclcpp.so
#10 0x000055c20de0d492 in main (argc=<optimized out>, argv=<optimized out>) at /usr/include/c++/9/ext/atomicity.h:96

I can run ros2 service list and see

/controller_server/change_state
/controller_server/get_state

and if I try calling get_state, it appears to finish waiting, but it never gets a response from the controller server:

$ ros2 service call /controller_server/get_state lifecycle_msgs/srv/GetState "{}"
waiting for service to become available...
requester: making request: lifecycle_msgs.srv.GetState_Request()

And I can't query it via ros2 lifecycle:

$ ros2 lifecycle get /controller_server                                          
Node not found

@Aposhian
Contributor Author

It is important to note that this error condition does not happen every time. It is inconsistent.

@Aposhian
Contributor Author

Here is a first effort at a reproduction. It reproduces the problem only some of the time though. https://gist.github.com/Aposhian/043359e09a203900e8db55407a8b5e38

@MiguelCompany

@Aposhian One thing I don't understand: in the issue description you say this happens on rolling, but the gist you just shared uses ros:galactic as the base Docker image.

So my question is: Does the issue only reproduce in galactic? Does it also reproduce on humble/rolling?

@SteveMacenski
Member

Note that Rolling binaries on 20.04 are outdated, since Rolling has moved to 22.04, so there may be updates missing if you are not using 22.04 as the base or building from source.

@Aposhian
Contributor Author

Yes, I am using old rolling on focal, so closer to galactic than to humble.

@Aposhian
Contributor Author

Yes, I cannot reproduce this issue on Humble images.

@SteveMacenski
Member

Keep in mind that Galactic is EOL in 5 months.

Since this is a Galactic-specific problem, what do you want to do about it? Would it be possible to backport the patch, @MiguelCompany? If not, we could try to patch something in just the galactic branch for you, @Aposhian, for the interim. If it's just Galactic dealing with a Fast-DDS issue, I'm OK being a bit more sloppy about it than I normally would, since its lifecycle is very limited and the change would not be pushed into future distributions; the maintenance cost and technical reach is limited.

@Aposhian
Contributor Author

@MiguelCompany suggested I try building FastDDS from source on the 2.3.x branch, which has unreleased fixes for galactic. I don't think any Nav2 action is required. I will update if that is a viable workaround for galactic.

@SteveMacenski
Member

OK - if that fixes things feel free to close when ready!

@Aposhian
Contributor Author

Ok, I tried building Fast-DDS from the 2.3.x branch, but it doesn't resolve the issue. https://gist.github.com/Aposhian/043359e09a203900e8db55407a8b5e38

rmw: rmw_cyclonedds_cpp | daemon: ✅
  Try #1: ✅
  Try #2: ✅
  Try #3: ✅
rmw: rmw_cyclonedds_cpp | daemon: ❌
  Try #1: ✅
  Try #2: ✅
  Try #3: ✅
rmw: rmw_fastrtps_cpp | daemon: ✅
  Try #1: ❌
  Try #2: ✅
  Try #3: ✅
rmw: rmw_fastrtps_cpp | daemon: ❌
  Try #1: ✅
  Try #2: ✅
  Try #3: ✅

@Aposhian
Contributor Author

Aposhian commented Jul 1, 2022

@MiguelCompany while not specific to Nav2, lifecycle services are still failing intermittently on Humble. Can we migrate/copy this issue to Fast-DDS?

@SteveMacenski
Member

SteveMacenski commented Jul 1, 2022

Agreed; some private conversations I've had in recent weeks have also raised issues with services in Humble.

@MiguelCompany what's the next step / status on this issue?

@SteveMacenski
Member

Any updates? This is a pretty critical issue and I'm trying to find out if we need to take action to resolve it 😄

@SteveMacenski
Member

SteveMacenski commented Jul 7, 2022

In the same vein as #3032, I want to get this off the queue, so I'm trying to find the actionable things we can do to resolve it. I know the main issue is Fast-DDS not handling lots of service calls well, but I suppose we can improve the way these failures surface. The lifecycle manager and the BT Action Node in the BT Navigator are the only two places where we intentionally have waits without timeouts, since these are pretty critical operations.

For this (the lifecycle manager), what should happen if a service for transitioning doesn't come up? Fail everything? I suppose we could also remove the checks for services being up in the constructors (https://github.com/ros-planning/navigation2/blob/main/nav2_util/src/lifecycle_service_client.cpp#L39), knowing that we check whether they're up before we submit requests in the change/get methods that were recently added to support respawning. I'll say, though, that the current state of the lifecycle services and the manager is a stable but carefully set up balance due to supporting Bond, Respawn, and Lifecycle all at the same time. Removing that could cause some real headaches elsewhere, but I won't be sure whether it's the best answer until I try it.
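
As a rough sketch of the "bound the wait at request time instead of blocking in the constructor" idea, something like the following could work. The function name and timeouts are illustrative; this is not the actual lifecycle_service_client.cpp code.

```cpp
#include <chrono>
#include <cstdint>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "lifecycle_msgs/srv/change_state.hpp"

// Illustrative only: bound both the wait for the server and the wait for the
// response, so a broken lifecycle service fails the call instead of hanging.
bool change_state_bounded(
  const rclcpp::Node::SharedPtr & node,
  const rclcpp::Client<lifecycle_msgs::srv::ChangeState>::SharedPtr & client,
  uint8_t transition_id,
  std::chrono::seconds timeout)
{
  if (!client->wait_for_service(timeout)) {
    RCLCPP_ERROR(
      node->get_logger(), "Service %s is not available, aborting transition",
      client->get_service_name());
    return false;
  }

  auto request = std::make_shared<lifecycle_msgs::srv::ChangeState::Request>();
  request->transition.id = transition_id;

  auto future = client->async_send_request(request);
  // Also bound the wait on the response itself.
  if (rclcpp::spin_until_future_complete(node, future, timeout) !=
    rclcpp::FutureReturnCode::SUCCESS)
  {
    RCLCPP_ERROR(node->get_logger(), "Timed out waiting for a transition response");
    return false;
  }
  return future.get()->success;
}
```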

Like in the other ticket though, I think the priority should be on working with Fast-DDS to be able to resolve this issue outright.

@Aposhian
Contributor Author

Aposhian commented Jul 7, 2022

Yeah, I'm torn as to whether Nav2 should consider this fallible as part of normal operation and handle it gracefully. But once you start to question the underlying RMW, lots of other assumptions go out the window.
Even if we don't add a configurable timeout, we could probably make the log messages more explicit about what it is waiting on, so there is no ambiguity: something like "Waiting for lifecycle service...".
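
Something like the following minimal sketch would make the hang visible in the logs (illustrative only, not taken from Nav2):

```cpp
#include <chrono>

#include "rclcpp/rclcpp.hpp"

// Illustrative only: keep waiting, but periodically say exactly which
// service we are blocked on so the hang is visible in the logs.
template<typename ServiceT>
void wait_and_log(rclcpp::Node & node, rclcpp::Client<ServiceT> & client)
{
  using namespace std::chrono_literals;
  while (rclcpp::ok() && !client.wait_for_service(2s)) {
    RCLCPP_INFO(
      node.get_logger(),
      "Waiting for lifecycle service %s to become available...",
      client.get_service_name());
  }
}
```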

@jsan-rt

jsan-rt commented Jul 8, 2022

Hi @Aposhian @SteveMacenski

Since you also said this is happening on Humble, we have moved our investigation there, with no luck so far. Using your reproducer on Humble yields no failures; we have been able to run it hundreds of times without error.

Do you have any other clues on how to reproduce this issue on Humble?

We also have this PR open that aims to bring Fast DDS's wait sets to the RMW. It's still under review, but you could try using it and see whether the situation improves.

@Aposhian
Contributor Author

Aposhian commented Jul 8, 2022

I think that sounds reasonable. However, we won't be much use for testing this right now, since we have decided to use CycloneDDS with Humble for our main operation at the moment.

@SteveMacenski
Member

OK, I can add that to my task list for early next week to add the logs / waits. I could actually do this in the BT Action Node too, so that in either case, if we're waiting and it isn't connecting, we're printing that for awareness. Do you think that's a good solution to #3032 as well?
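
For the BT Action Node case, the equivalent sketch with an rclcpp_action client might look roughly like this (the helper and its naming are hypothetical):

```cpp
#include <chrono>
#include <string>

#include "rclcpp/rclcpp.hpp"
#include "rclcpp_action/rclcpp_action.hpp"

// Hypothetical helper: same pattern as the service wait, but for an action
// server, logging which server we are still waiting on.
template<typename ActionT>
void wait_for_action_server_and_log(
  rclcpp::Node & node,
  rclcpp_action::Client<ActionT> & client,
  const std::string & action_name)
{
  using namespace std::chrono_literals;
  while (rclcpp::ok() && !client.wait_for_action_server(2s)) {
    RCLCPP_INFO(
      node.get_logger(),
      "Waiting for \"%s\" action server to become available...",
      action_name.c_str());
  }
}
```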

@Aposhian
Contributor Author

Yes I think so.

@SteveMacenski
Member

Sweet, it's on my queue for this afternoon.

@SteveMacenski
Member

#3071 implements this.

@SteveMacenski
Member

Sorry, this should stay open until the service reliability is handled in Fast-DDS

@SteveMacenski SteveMacenski reopened this Jul 13, 2022
@SteveMacenski SteveMacenski changed the title lifecycle manager wait for service should have a finite, or at least configurable, timeout. Fast-DDS Service Reliability sometimes hangs lifecycle manager Jul 13, 2022
@SteveMacenski
Member

@Aposhian the PR that supposedly fixes this was merged; have you tested it by chance?

@Aposhian
Contributor Author

No, but this is probably something we can get to testing in the next month.

@vinnnyr
Contributor

vinnnyr commented Aug 25, 2022

@SteveMacenski can you please point us to the upstream PR that we should make sure we test with?

Was it this one? ros2/rmw_fastrtps#619

@SteveMacenski
Member

yes!

@SteveMacenski
Member

@Aposhian I don't suppose you tested this as part of your migration back to Fast-DDS for the features you liked?

@Aposhian
Contributor Author

No, not yet.

@EduPonz

EduPonz commented Nov 24, 2022

Hi everyone,

The new Humble sync is out, which contains the changes made to solve this issue.

@SteveMacenski
Member

SteveMacenski commented Nov 28, 2022

@Aposhian / @vinnnyr if you could try that out, since I know you would ideally like to be using Fast-DDS, that would be fantastic. You can close this out if that resolves the issue.

@vinnnyr
Contributor

vinnnyr commented Nov 28, 2022

Will report back if we end up having time to try it. Unfortunately, both of us seem to be a bit short on time for the rest of the year, so I cannot promise anything.

@EduPonz

EduPonz commented Feb 2, 2023

Hi everyone,

On January 27th there was a Humble patch release which included additional fixes (mainly eProsima/Fast-DDS#3195). It'd be great if someone could give this a try and see whether this ticket can be closed.

@SteveMacenski
Member

Agreed!

@AlexeyMerzlyakov
Collaborator

AlexeyMerzlyakov commented Mar 7, 2023

On January 27th there was a Humble patch release which included additional fixes (mainly eProsima/Fast-DDS#3195). It'd be great if someone could give this a try and see whether this ticket can be closed.

Unfortunately, the problem was reproduced on Fast-DDS with the PR3195 patch: very rare, but it still appears, typically about once per ~50-100 Nav2 stack runs. Checked on Ubuntu 20.04 with ROS 2 Rolling built from source (20230110 version, PR3195 included).

Just for reference: for verification I've used the Nav2 system tests, which deploy the full Nav2 stack on each run. The stress-test script is attached to this message: keepout_stress.sh.txt

@ygoumaz

ygoumaz commented Mar 13, 2023

I can confirm that we still encounter this issue with the webots_ros2 package. Our tests that use the Nav2 package randomly fail. Failures occur more often on CI machines than locally. Changing to Cyclone DDS seems to resolve the issue.

@MiguelCompany

@AlexeyMerzlyakov

Just for reference: for verification I've used the Nav2 system tests, which deploy the full Nav2 stack on each run. The stress-test script is attached to this message: keepout_stress.sh.txt

This is killing the processes with kill -9, which does not give them a chance to free their resources (shared memory files, in the case of Fast DDS, for instance).

I would check two things:

  1. Change the signal being used to -2 (SIGINT)
  2. Add rm /dev/shm/* at the end of the kill_ros function

@SteveMacenski
Member

This hasn't received any updates in about a year from anyone noting an issue, and I also haven't seen it in my day-to-day development in some time.

I'm closing this ticket as fixed by Fast-DDS, since it no longer appears to be a problem of reportable frequency. Though, happy to reopen if anyone can show that this is still an issue plaguing their application.
