Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate possible segfaults in expanded executors unit tests #1230

Closed
brawner opened this issue Jul 15, 2020 · 5 comments
Closed

Investigate possible segfaults in expanded executors unit tests #1230

brawner opened this issue Jul 15, 2020 · 5 comments
Assignees
Labels
tests Failing or missing tests

Comments

@brawner
Copy link
Contributor

brawner commented Jul 15, 2020

The executors unit test is being expanded in #1222, and running it consecutively hundreds of times shows that about 5-6% it will exit with a segfault across several different executors in several different functions. Unfortunately, simple tracing with gdb is not yielding much info.

Copying the results of the investigation here:

Here were all the instances of segfaults with duplicates.

TestExecutorsStable/SingleThreadedExecutor.addTemporaryNode
TestExecutorsStable/SingleThreadedExecutor.spinSome
TestExecutorsStable/SingleThreadedExecutor.spinSome
TestExecutorsStable/SingleThreadedExecutor.spinAll
TestExecutorsStable/SingleThreadedExecutor.spinAll
TestExecutorsStable/SingleThreadedExecutor.spinAll

TestExecutors/MultiThreadedExecutor.testSpinUntilSharedFutureComplete
TestExecutorsStable/MultiThreadedExecutor.addTemporaryNode
TestExecutorsStable/MultiThreadedExecutor.spinSome
TestExecutorsStable/MultiThreadedExecutor.spinSome
TestExecutorsStable/MultiThreadedExecutor.spinSome
TestExecutorsStable/MultiThreadedExecutor.spinAll
TestExecutorsStable/MultiThreadedExecutor.spinAll
TestExecutorsStable/MultiThreadedExecutor.spinAll

TestExecutors/StaticSingleThreadedExecutor.testSpinUntilFutureComplete
TestExecutors/StaticSingleThreadedExecutor.testSpinUntilFutureComplete
TestExecutors/StaticSingleThreadedExecutor.testSpinUntilFutureCompleteNoTimeout
TestExecutors/StaticSingleThreadedExecutor.testSpinUntilFutureCompleteNoTimeout
TestExecutors/StaticSingleThreadedExecutor.testSpinUntilFutureCompleteWithTimeout
TestExecutors/StaticSingleThreadedExecutor.testSpinUntilSharedFutureComplete
TestExecutors/StaticSingleThreadedExecutor.testSpinUntilSharedFutureComplete
TestExecutors/StaticSingleThreadedExecutor.testSpinUntilSharedFutureComplete

Running valgrind on this test executable results in:

==2896250== 
==2896250== HEAP SUMMARY:
==2896250==     in use at exit: 12,390,587 bytes in 26,540 blocks
==2896250==   total heap usage: 1,762,920 allocs, 1,736,380 frees, 178,520,908 bytes allocated
==2896250== 
==2896250== LEAK SUMMARY:
==2896250==    definitely lost: 5,496 bytes in 189 blocks
==2896250==    indirectly lost: 155,231 bytes in 636 blocks
==2896250==      possibly lost: 2,204,381 bytes in 2,155 blocks
==2896250==    still reachable: 10,025,479 bytes in 23,560 blocks
==2896250==                       of which reachable via heuristic:
==2896250==                         multipleinheritance: 1,728 bytes in 9 blocks
==2896250==         suppressed: 0 bytes in 0 blocks
==2896250== Rerun with --leak-check=full to see details of leaked memory
==2896250== 
==2896250== For lists of detected and suppressed errors, rerun with: -s
==2896250== ERROR SUMMARY: 3 errors from 1 contexts (suppressed: 0 from 0)

And a read error

==2896250== Thread 19:
==2896250== Invalid read of size 8
==2896250==    at 0x6529FBB: node_listener(rmw_context_t*) (listener_thread.cpp:150)
==2896250==    by 0x652C110: void std::__invoke_impl<void, void (*)(rmw_context_t*), rmw_context_t*>(std::__invoke_other, void (*&&)(rmw_context_t*), rmw_context_t*&&) (invoke.h:60)
==2896250==    by 0x652BF86: std::__invoke_result<void (*)(rmw_context_t*), rmw_context_t*>::type std::__invoke<void (*)(rmw_context_t*), rmw_context_t*>(void (*&&)(rmw_context_t*), rmw_context_t*&&) (invoke.h:95)
==2896250==    by 0x652BE0E: void std::thread::_Invoker<std::tuple<void (*)(rmw_context_t*), rmw_context_t*> >::_M_invoke<0ul, 1ul>(std::_Index_tuple<0ul, 1ul>) (thread:244)
==2896250==    by 0x652BD24: std::thread::_Invoker<std::tuple<void (*)(rmw_context_t*), rmw_context_t*> >::operator()() (thread:251)
==2896250==    by 0x652BCD1: std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(rmw_context_t*), rmw_context_t*> > >::_M_run() (thread:195)
==2896250==    by 0x5827CB3: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==2896250==    by 0x5956608: start_thread (pthread_create.c:477)
==2896250==    by 0x5A92102: clone (clone.S:95)
==2896250==  Address 0x609cba8 is 72 bytes inside a block of size 192 free'd
==2896250==    at 0x4842A3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==2896250==    by 0x56DD14E: __default_deallocate (allocator.c:42)
==2896250==    by 0x5658538: __cleanup_context (context.c:155)
==2896250==    by 0x5657DA5: rcl_context_fini (context.c:59)
==2896250==    by 0x4FB597A: __delete_context(rcl_context_t*) (context.cpp:179)
==2896250==    by 0x4FC9AFD: std::_Sp_counted_deleter<rcl_context_t*, void (*)(rcl_context_t*), std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (shared_ptr_base.h:471)
==2896250==    by 0x27C502: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:155)
==2896250==    by 0x2750E2: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() (shared_ptr_base.h:730)
==2896250==    by 0x2743C0: std::__shared_ptr<rcl_context_t, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() (shared_ptr_base.h:1169)
==2896250==    by 0x4FBBAB8: std::__shared_ptr<rcl_context_t, (__gnu_cxx::_Lock_policy)2>::reset() (shared_ptr_base.h:1287)
==2896250==    by 0x4FB8EFE: rclcpp::Context::clean_up() (context.cpp:452)
==2896250==    by 0x4FB6074: rclcpp::Context::init(int, char const* const*, rclcpp::InitOptions const&) (context.cpp:201)
==2896250==  Block was alloc'd at
==2896250==    at 0x4843D99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==2896250==    by 0x56DD1CD: __default_zero_allocate (allocator.c:56)
==2896250==    by 0x565D282: rcl_init (init.c:88)
==2896250==    by 0x4FB61B8: rclcpp::Context::init(int, char const* const*, rclcpp::InitOptions const&) (context.cpp:204)
==2896250==    by 0x51EA08B: rclcpp::init(int, char const* const*, rclcpp::InitOptions const&) (utilities.cpp:34)
==2896250==    by 0x29135F: TestExecutors<rclcpp::executors::StaticSingleThreadedExecutor>::SetUp() (test_executors.cpp:46)
==2896250==    by 0x3C5D6B: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2447)
==2896250==    by 0x3B8176: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2483)
==2896250==    by 0x36CA2D: testing::Test::Run() (gtest.cc:2517)
==2896250==    by 0x36E12D: testing::TestInfo::Run() (gtest.cc:2703)
==2896250==    by 0x36EE64: testing::TestCase::Run() (gtest.cc:2825)
==2896250==    by 0x3883F8: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:5216)
==2896250== 

Unrelated to the segfaults, these stderr messages are also outputted.

/home/brawner/workspace/ros2_master/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:157"__rmw_take failed"
@jacobperron
Copy link
Member

Looks like I hit a related segfault in this PR job: http://build.ros2.org/job/Rpr__rclcpp__ubuntu_focal_amd64/210

@brawner
Copy link
Contributor Author

brawner commented Aug 3, 2020

Possibly related: #447

I suspect these segfaults, which occur during destruction, would occur for a test that just initializes and destroys many nodes.

@brawner
Copy link
Contributor Author

brawner commented Aug 3, 2020

@jacobperron Yup, that looks like one of these failures. Are you seeing it show up more often than the 5% I found on local testing?

@jacobperron
Copy link
Member

Are you seeing it show up more often than the 5% I found on local testing?

I just noticed it once during the Rpr job in #1257 and in searching if it was a known issue came across this ticket. Just reporting it in case it's any help.

@brawner
Copy link
Contributor Author

brawner commented Dec 2, 2020

I think these have been cleaned up finally, especially with #1404

@brawner brawner closed this as completed Dec 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests Failing or missing tests
Projects
None yet
Development

No branches or pull requests

3 participants