
Resolve issues identified while investigating #21 (#22)

Merged: 22 commits merged into master from asorbini/fix-21 on Apr 13, 2021

Conversation

@asorbini (Collaborator) commented Apr 3, 2021

This PR includes several improvements identified while investigating issue #21:

  1. Replace the WaitSet and Conditions implementation based on DDS constructs with one based on the C++ std library.
  • The current implementation causes very high CPU usage when faced with tight loops (e.g. a fast-spinning executor).
    The reason for this is still unclear, but it probably lies in the fundamental mismatch between the ROS 2 WaitSet API and the DDS one. The RMW tries to compensate for the lack of attach() and detach() in ROS 2 by trying to cache its arguments across wait() calls. Evidently this is not enough, and it may even be partially responsible for unnecessarily added CPU cycles under certain circumstances.
  • Instead of trying to fix this implementation further (it also causes somewhat of a "concurrency nightmare" when trying to make sure cached objects are not referenced after deletion), the alternative implementation based on the C++ std library was updated to fix the standing issues with "status reporting", and it is now the default.
  • The C++ std implementation relies on DDS listeners and coordinates events using std::condition_variable and std::mutex objects. Because it uses listeners (which automatically consume status flags and reset "change" counters), it must cache and keep track of reported statuses, in order to later report the correct "status changes" to the user. A minimal sketch of this pattern follows this list.
  2. Improved implementation of RMW_Connext_Client::is_service_available()
  • The implementation of RMW_Connext_Client::is_service_available() was updated to only return true if both a "request reader" and a "reply writer" have been matched from the same remote DomainParticipant (see the second sketch following this list).
  3. Support special value RMW_DURATION_INFINITE in WaitSet::wait()
  4. Propagate correct type code for request/reply topics
  • The code was incorrectly adding a "request header" to the type propagated via discovery for request/reply topics even when the "Extended" mapping was used (the one used by default). This is now fixed.
  5. Allow deletion of "waited on" objects (alternative WaitSet implementation)
  • This implementation will most likely be removed altogether, but I made some updates to try to resolve some "catastrophic failures" when running with applications that seem to be deleting objects while also waiting on them (gzserver from Gazebo). This code didn't fully resolve the issues and is probably still bugged, which is why I ultimately went with item 1.
  • Note to reviewers: feel free to ignore changes in rmw_impl_waitset_dds.cpp and rmw_waitset_dds.hpp
  • The alternative WaitSet implementation, based on DDS objects, was completely removed and saved in branch asorbini/dds-waitsets.
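
A minimal sketch of the listener-driven wait pattern described in item 1, assuming illustrative names (RMW_Connext_StdCondition, triggered_, on_status_changed) that are not the actual rmw_connextdds identifiers:

#include <chrono>
#include <condition_variable>
#include <mutex>

class RMW_Connext_StdCondition
{
public:
  // Invoked from a DDS listener callback. Since the listener consumes the
  // status flag, the change must be cached here or it would be lost.
  void on_status_changed()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      triggered_ = true;  // cache the change for a later (or concurrent) wait()
    }
    cv_.notify_one();  // wake up a waiting wait set, if any
  }

  // Invoked by the wait set: block until a cached change is available or the
  // timeout expires. Returns true if the condition was triggered.
  bool wait(std::chrono::nanoseconds timeout)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    const bool ready =
      cv_.wait_for(lock, timeout, [this]() { return triggered_; });
    if (ready) {
      triggered_ = false;  // report each cached change exactly once
    }
    return ready;
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool triggered_{false};
};

Writing triggered_ only while holding mutex_ is what lets the wait_for() predicate observe the change reliably (see the "lost wakeup" discussion further down).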

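And a sketch of the matching rule in item 2, with hypothetical helper queries (participants_with_matched_request_reader, participants_with_matched_reply_writer) standing in for the DDS matched-endpoint status of the client's two endpoints:

#include <algorithm>
#include <cstdint>
#include <vector>

using ParticipantKey = std::uint64_t;  // stand-in for a remote DomainParticipant GUID

// Hypothetical queries: in the RMW these would be derived from the matched
// subscriptions of the request writer and matched publications of the reply reader.
std::vector<ParticipantKey> participants_with_matched_request_reader()
{
  return {};  // stub: participants whose request reader matched our request writer
}
std::vector<ParticipantKey> participants_with_matched_reply_writer()
{
  return {};  // stub: participants whose reply writer matched our reply reader
}

bool is_service_available()
{
  const auto readers = participants_with_matched_request_reader();
  const auto writers = participants_with_matched_reply_writer();
  // The service is available only if at least one remote participant has
  // matched BOTH endpoints, i.e. it offers a complete request/reply pair.
  return std::any_of(
    readers.begin(), readers.end(),
    [&writers](const ParticipantKey & p) {
      return std::find(writers.begin(), writers.end(), p) != writers.end();
    });
}
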
Initial CI validation:

  • Linux: Build Status
  • macOS: Build Status
  • Windows: Build Status

Signed-off-by: Andrea Sorbini <asorbini@rti.com>
@asorbini (Collaborator, Author) commented Apr 3, 2021

Linter errors resolved in 0b8fd8a

@asorbini (Collaborator, Author) commented Apr 3, 2021

@clalancette @ivanpauno @wjwwood a quick review would be greatly appreciated, so we can merge these changes to master, run them through the nav2 CI again, and still have some time to review the outcome before the code freeze!

CI looks good, and the few errors are linter ones that have already been resolved.

Thanks!

@asorbini marked this pull request as ready for review April 3, 2021 03:01
@ivanpauno (Member):

> The current implementation causes very high CPU usage when faced with tight loops (e.g. a fast-spinning executor).
> The reason for this is still unclear, but it probably lies in the fundamental mismatch between the ROS 2 WaitSet API and the DDS one. The RMW tries to compensate for the lack of attach() and detach() in ROS 2 by trying to cache its arguments across wait() calls. Evidently this is not enough, and it may even be partially responsible for unnecessarily added CPU cycles under certain circumstances.

This is true; I hope at some point we improve the rmw waitset API.

> This implementation will most likely be removed altogether, but I made some updates to try to resolve some "catastrophic failures" when running with applications that seem to be deleting objects while also waiting on them (gzserver from Gazebo). This code didn't fully resolve the issues and is probably still bugged, which is why I ultimately went with item 1.

If we're deleting "waited on" objects somewhere, that's a bug.
The rmw implementation doesn't need to support that case.

Signed-off-by: Andrea Sorbini <asorbini@rti.com>
@asorbini (Collaborator, Author) commented Apr 5, 2021

Addressed review comments. Running CI again to validate:

  • ci_linux: Build Status
  • ci_osx: Build Status
  • ci_windows: Build Status

> If we're deleting "waited on" objects somewhere, that's a bug.
> The rmw implementation doesn't need to support that case.

As I mentioned in this comment, based on #21, something "fishy" seemed to be happening in Gazebo's gzserver, but I haven't debugged it further. It might be that there are still other issues/race conditions lingering in the DDS-based implementation that are only triggered by that application. It would be interesting to investigate, but I'd say it's low priority if the current implementation is satisfactory. We might revisit it in the (near) future, to get to the bottom of it, or just get rid of the alternative implementation altogether.

@asorbini (Collaborator, Author) commented Apr 5, 2021

@ivanpauno after reading the article that you linked, I think the code still suffers from the "lost wakeup" problem. Maybe I'm missing something, but I'm considering refactoring to use semaphores and avoid this problem with the condition variable API altogether.

Signed-off-by: Andrea Sorbini <asorbini@rti.com>
@asorbini (Collaborator, Author) commented Apr 6, 2021

One more run of CI after removing the "lost wakeup" problem 🤞🏼

  • ci_linux: Build Status
  • ci_osx: Build Status
  • ci_windows: Build Status

@ivanpauno (Member) left a comment:

I still have some concerns related to concurrency.
I also need to take a closer look at the code again...

Comment on lines 588 to 600
std::mutex * waitset_mutex = this->waitset_mutex;
auto scope_exit = rcpputils::make_scope_exit(
  [waitset_mutex]()
  {
    if (nullptr != waitset_mutex) {
      waitset_mutex->unlock();
    }
  });
if (nullptr != waitset_mutex) {
  waitset_mutex->lock();
}
std::lock_guard<std::mutex> lock(this->mutex_internal);
this->triggered_data = true;
@ivanpauno (Member):

Similar to above, I think this should be:

Suggested change:

std::lock_guard<std::mutex> internal_lock(this->mutex_internal);
if (nullptr != this->waitset_mutex) {
  std::lock_guard<std::mutex> waitset_lock(*this->waitset_mutex);
  this->triggered_data = true;
} else {
  this->triggered_data = true;
}
if (nullptr != this->waitset_condition) {
  this->waitset_condition->notify_one();
}

@ivanpauno (Member):

> @ivanpauno after reading the article that you linked, I think the code still suffers from the "lost wakeup" problem. Maybe I'm missing something, but I'm considering refactoring to use semaphores and avoid this problem with the condition variable API altogether.

Yes, this kind of code always gets a bit tricky 😬.
The problem with semaphores is that they're not part of the C++ std library (until C++20), and getting multiplatform code (macOS, Linux, Windows) working doesn't seem to be worth it.

The trick with condition variables is to always follow a pattern like this when modifying a "condition" (i.e. something that can modify the return value of predicate in cv.wait(lock, predicate)).
Not super easy to enforce though 😅.
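
A minimal generic sketch of that pattern (not the rmw_connextdds code): the state read by the predicate is only ever written while holding the same mutex the waiter uses, so a notification cannot slip in between the predicate check and the sleep:

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool condition = false;  // anything the wait() predicate reads

void waiter()
{
  std::unique_lock<std::mutex> lock(m);
  // The predicate is re-checked under the lock, both before sleeping and
  // after every wakeup (including spurious ones).
  cv.wait(lock, []() { return condition; });
}

void notifier()
{
  {
    std::lock_guard<std::mutex> lock(m);
    condition = true;  // modify the condition only while holding the mutex
  }
  cv.notify_one();  // the waiter either already saw the flag or gets woken up
}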

@hidmic (Contributor) left a comment:

Overall looks reasonable to me too.

Signed-off-by: Andrea Sorbini <asorbini@rti.com>
@asorbini (Collaborator, Author) commented Apr 9, 2021

Removed the DDS-based implementation altogether in ce3f86d, and created branch asorbini/dds-waitsets to preserve it for future reference.

@asorbini (Collaborator, Author):

I opened rmw_connextdds#26 to introduce some QoS optimizations for the reliability protocol, which resolve the test failures in interactive_markers.

@asorbini (Collaborator, Author):

I've been investigating the invalid memory access in rclcpp.TestExecutor.add_remove_node_thread_safe, and through some printf()s and valgrind I was able to confirm that a guard condition is being accessed after having been deleted:

******* WAITSET WAIT [BEGIN] 0x5ab20b0 ******
******* GUARD CONDITION [ATTACH] 0x5ab1650 ******
******* GUARD CONDITION [ATTACH] 0x5ab17c0 ******
******* GUARD CONDITION [ATTACH] 0xea81220 ******
###### EXECUTOR NODE [REMOVE] node_1
******* GUARD CONDITION [DELETE] 0xedaadd0 ******
******* GUARD CONDITION [DELETE] 0xeda6960 ******
###### NODE BASE GUARD CONDITION [FINI] 0xea810c0
******* GUARD CONDITION [DELETE] 0xea81220 ******
###### EXECUTOR NODE [NEW] node_2
###### NODE BASE GUARD CONDITION [INIT] 0xf38f9b0
******* GUARD CONDITION [NEW] 0xf38fa90 ******
******* NODE [NEW] 0xedca270 ******
******* GUARD CONDITION [DETACH] 0x5ab1650 ******
******* GUARD CONDITION [DETACH] 0x5ab17c0 ******
******* GUARD CONDITION [DETACH] 0xea81220 ******
==203993== Thread 3:
==203993== Invalid read of size 4
==203993==    at 0x539AFC4: pthread_mutex_lock (pthread_mutex_lock.c:67)
==203993==    by 0x18E389: __gthread_mutex_lock(pthread_mutex_t*) (gthr-default.h:749)
==203993==    by 0x18ECBD: std::mutex::lock() (std_mutex.h:100)
==203993==    by 0x192031: std::lock_guard<std::mutex>::lock_guard(std::mutex&) (std_mutex.h:159)
==203993==    by 0x5DAD111: void RMW_Connext_Condition::detach<RMW_Connext_WaitSet::detach(rmw_subscriptions_t*, rmw_guard_conditions_t*, rmw_services_t*, rmw_clients_t*, rmw_events_t*, unsigned long&)::{lambda()#6}>(RMW_Connext_WaitSet::detach(rmw_subscriptions_t*, rmw_guard_conditions_t*, rmw_services_t*, rmw_clients_t*, rmw_events_t*, unsigned long&)::{lambda()#6}&&) (rmw_waitset_std.hpp:98)
==203993==    by 0x5DAAAD0: RMW_Connext_WaitSet::detach(rmw_subscriptions_t*, rmw_guard_conditions_t*, rmw_services_t*, rmw_clients_t*, rmw_events_t*, unsigned long&) (rmw_impl_waitset_std.cpp:460)
==203993==    by 0x5DAB0BA: RMW_Connext_WaitSet::wait(rmw_subscriptions_t*, rmw_guard_conditions_t*, rmw_services_t*, rmw_clients_t*, rmw_events_t*, rmw_time_t const*) (rmw_impl_waitset_std.cpp:546)
==203993==    by 0x5DBF93B: rmw_api_connextdds_wait(rmw_subscriptions_t*, rmw_guard_conditions_t*, rmw_services_t*, rmw_clients_t*, rmw_events_t*, rmw_wait_set_t*, rmw_time_t const*) (rmw_waitset.cpp:142)
==203993==    by 0x5158838: rmw_wait (rmw_api_impl_ndds.cpp:800)
==203993==    by 0x55E0853: rmw_wait (functions.cpp:492)
==203993==    by 0x511935C: rcl_wait (wait.c:609)
==203993==    by 0x4E2A58D: rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) (executor.cpp:738)

I think the problem is that when a Node object is removed from the executor and ultimately deleted, there is no guarantee that the executor thread has already returned from a concurrent wait() before the Node and its "notify guard condition" are deleted.

When the executor's waitset is awakened, it goes through its list of conditions to check their statuses and "tear down" the wait(), but it instead ends up accessing invalid memory.

Relevant code in rclcpp:

Not sure about a fix, because I'm not too familiar with rclcpp, but I think that the executor should implement some type of synchronization with its thread(s) to make sure that any concurrent WaitSet::wait() that was associated with a removed node has returned before StaticSingleThreadedExecutor::remove_node() returns.

For now, I don't think this is a bug in rmw_connextdds but an error in rclcpp since it stems from a concurrent deletion and wait().

I wonder if this is the root cause of the concurrent wait()s and deletions that were detected with the previous DDS-based implementation.
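
A hedged sketch of the kind of synchronization suggested above (not actual rclcpp code; all names are illustrative): remove_node() would trigger the executor's interrupt guard condition and then block until the executor thread confirms it has returned from rmw_wait(), after which the node's guard condition can be safely destroyed.

#include <condition_variable>
#include <mutex>

class ExecutorWaitSync
{
public:
  // Executor thread: bracket every rmw_wait() call with these two.
  void enter_wait()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    in_wait_ = true;
  }
  void exit_wait()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      in_wait_ = false;
    }
    cv_.notify_all();
  }

  // remove_node(): after interrupting the wait, block here until no wait()
  // is in flight, then it is safe to delete the node's conditions.
  void wait_until_wait_returned()
  {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this]() { return !in_wait_; });
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool in_wait_{false};
};

Note that this sketch itself follows the condition-variable pattern discussed earlier: in_wait_ is only modified while holding mutex_.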

@ivanpauno (Member):

> For now, I don't think this is a bug in rmw_connextdds but an error in rclcpp since it stems from a concurrent deletion and wait().

Yes, your description sounds like a problem in rclcpp.
I will try to figure out what the problem is.

@ivanpauno requested a review from hidmic April 12, 2021 17:37
@asorbini (Collaborator, Author):

@ivanpauno thanks, looking forward to your findings!

@ivanpauno (Member) commented Apr 13, 2021

This is ready to merge; one extra CI run to double-check that the last changes are OK:

  • ci_linux: Build Status
  • ci_osx: Build Status
  • ci_windows: Build Status

@ivanpauno (Member):

Wow, this is really reducing the number of failures when testing with rmw_connextdds only.

Great work @asorbini !!!

@ivanpauno merged commit 7bd6e5e into master Apr 13, 2021
The delete-merged-branch bot deleted the asorbini/fix-21 branch April 13, 2021 16:15
@asorbini added the dashing-backports (PR should be backported to Dashing), eloquent-backports (PR should be backported to Eloquent), and foxy-backports (PR should be backported to Foxy) labels Apr 13, 2021
asorbini added a commit that referenced this pull request Apr 13, 2021
* Use C++ std WaitSets by default

Signed-off-by: Andrea Sorbini <asorbini@rti.com>

* Use Rolling in README's Quick Start

Signed-off-by: Andrea Sorbini <asorbini@rti.com>

* Improved implementation of client::is_service_available for Connext Pro

Signed-off-by: Andrea Sorbini <asorbini@rti.com>

* Only add request header to typecode with Basic req/rep profile

Signed-off-by: Andrea Sorbini <asorbini@rti.com>

* Remove commented/unused code

Signed-off-by: Andrea Sorbini <asorbini@rti.com>

* Avoid topic name validation in get_info functions

Signed-off-by: Andrea Sorbini <asorbini@rti.com>

* Reduce shutdown period to 10ms

Signed-off-by: Andrea Sorbini <asorbini@rti.com>

* Pass HistoryQosPolicy to graph cache

* Reset error string after looking up type support

Signed-off-by: Andrea Sorbini <asorbini@rti.com>

* Remove DDS-based WaitSet implementation

Signed-off-by: Andrea Sorbini <asorbini@rti.com>
asorbini added a commit that referenced this pull request Apr 14, 2021
* Resolve issues identified while investigating #21 (#22)

* Use C++ std WaitSets by default

Signed-off-by: Andrea Sorbini <asorbini@rti.com>

* Use Rolling in README's Quick Start

Signed-off-by: Andrea Sorbini <asorbini@rti.com>

* Improved implementation of client::is_service_available for Connext Pro

Signed-off-by: Andrea Sorbini <asorbini@rti.com>

* Only add request header to typecode with Basic req/rep profile

Signed-off-by: Andrea Sorbini <asorbini@rti.com>

* Remove commented/unused code

Signed-off-by: Andrea Sorbini <asorbini@rti.com>

* Avoid topic name validation in get_info functions

Signed-off-by: Andrea Sorbini <asorbini@rti.com>

* Reduce shutdown period to 10ms

Signed-off-by: Andrea Sorbini <asorbini@rti.com>

* Pass HistoryQosPolicy to graph cache

* Reset error string after looking up type support

Signed-off-by: Andrea Sorbini <asorbini@rti.com>

* Remove DDS-based WaitSet implementation

Signed-off-by: Andrea Sorbini <asorbini@rti.com>

* Remove unavailable time helpers

Signed-off-by: Andrea Sorbini <asorbini@rti.com>

* Remove unsupported QoS event

Signed-off-by: Andrea Sorbini <asorbini@rti.com>
@asorbini added the galactic (PR pertaining to the Galactic release) label Apr 16, 2021