Unstable tests on Action CI #381
From the last day or so, these are the test results; it's not always the same tests failing. The list of unstable tests seems to include:
The timeouts probably have something to do with discovery, or with shutdown behavior not receiving the final message published by a publisher. The failures, I'm guessing, are segmentation faults of some kind. They seem common with fastrtps, and occurred once with cyclone in this list. Here's my output from the script I used to scrape, for context:
Is this on Bionic? I saw similar fastrtps test failures in rosbag2 when I was on Bionic. They disappeared when I started compiling in Focal instead.
Yeah, it's running on Bionic currently - which it should not be. I still feel that the first, most important change is to get these running on Focal. However, it's interesting that Karsten can't repro on his Bionic machine. But it's not a supported platform, so it's not worth too much effort tracking it down if it's working on Focal. I have a feeling that a few of these issues will linger on Focal, so the faster we can get that running, the better, so we can track the remainder down. This is a prerequisite to getting us there: ros-tooling/setup-ros#140
* decrease subscription expected msgs Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* change workflow to test this branch Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* change expected number too Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* linter Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* change workflow to autotest feature branch Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* revert workflow Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* run workflow more frequently to test stability Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* revert github workflow Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* cleanup Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* define constants Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* disable test on windows Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* disable play_filters_by_topic on windows Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
Still seeing Action CI fail consistently, e.g.
so I'm reopening this to track getting us to green.
I have been able to reproduce these issues locally in an Ubuntu Focal container. The segfaults happened, for example, in ... The timeouts happened in ...
I assume these timeouts are related to expecting a specific number of messages and not receiving either the messages at the beginning or the one at the end. It's worth putting a print statement on messages received, to see if we're just getting one fewer than expected. So this is either a discovery or a shutdown edge-case behavior. If the tests are testing something else (e.g. not a specific number of messages, but whether messages are received at all), then we should make those tests less fragile to these boundary behaviors - perhaps publish N*2 messages and only expect to receive N, or something along those lines.
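To make that idea concrete, here is a rough sketch (not the actual rosbag2 test code; the topic name, message type, and counts are made up) of a test that over-publishes and only asserts a lower bound on messages received, so a single message lost at a discovery or shutdown boundary does not fail it:

```cpp
// Hypothetical gtest sketch: publish 2x the messages the assertion needs,
// and only require a minimum count, so boundary losses don't flake the test.
#include <chrono>
#include <memory>
#include <thread>

#include <gtest/gtest.h>
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

using namespace std::chrono_literals;

TEST(FlakyBoundaries, tolerates_dropped_boundary_messages)
{
  rclcpp::init(0, nullptr);
  auto node = std::make_shared<rclcpp::Node>("boundary_test");

  size_t received = 0;
  auto subscription = node->create_subscription<std_msgs::msg::String>(
    "chatter", 10,
    [&received](std_msgs::msg::String::ConstSharedPtr) {++received;});
  auto publisher = node->create_publisher<std_msgs::msg::String>("chatter", 10);

  const size_t expected_minimum = 5;
  const size_t published = expected_minimum * 2;  // over-publish to absorb losses

  std_msgs::msg::String msg;
  msg.data = "test";
  for (size_t i = 0; i < published; ++i) {
    publisher->publish(msg);
    rclcpp::spin_some(node);
    std::this_thread::sleep_for(10ms);
  }

  // Spin until enough messages arrived or a deadline passes.
  const auto deadline = std::chrono::steady_clock::now() + 5s;
  while (received < expected_minimum && std::chrono::steady_clock::now() < deadline) {
    rclcpp::spin_some(node);
  }

  EXPECT_GE(received, expected_minimum);
  rclcpp::shutdown();
}
```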
In both tests that failed due to timeout that I mentioned, there was indeed an expectation for a specific number of messages, but exactly 0 messages were received (I used the debugger instead of print statements). I think we're missing waiting for ...
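For reference, a common way to wait out discovery in rclcpp is to poll the publisher's matched-subscription count before publishing. This is only an illustrative sketch; `wait_for_subscriptions` is a hypothetical helper, not a rosbag2 API:

```cpp
// Sketch: block until at least `expected_count` subscriptions have matched
// the publisher, or give up after `timeout`.
#include <chrono>
#include <memory>
#include <thread>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

using namespace std::chrono_literals;

bool wait_for_subscriptions(
  const rclcpp::PublisherBase & publisher,
  size_t expected_count,
  std::chrono::seconds timeout = 10s)
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (publisher.get_subscription_count() < expected_count) {
    if (std::chrono::steady_clock::now() > deadline) {
      return false;
    }
    std::this_thread::sleep_for(50ms);
  }
  return true;
}

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("publisher_with_discovery_wait");
  auto publisher = node->create_publisher<std_msgs::msg::String>("chatter", 10);

  // Only start publishing once at least one subscription has matched.
  if (wait_for_subscriptions(*publisher, 1)) {
    std_msgs::msg::String msg;
    msg.data = "hello";
    publisher->publish(msg);
  }

  rclcpp::shutdown();
  return 0;
}
```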
I wrote a small case that reproduces the timeout problem (messages never actually received by the subscription) at unpublished_messages.zip. I also wrote a small case that tries to reproduce the segfault problem at ...
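The attachments themselves are not quoted here, but the timeout case described is essentially a discovery race like the following sketch (node names, topic, and timings are illustrative): the subscription already exists, the publisher publishes immediately after creation, and because the endpoints have not matched yet the early samples may simply never be delivered:

```cpp
// Sketch of the "unpublished messages" race: publish before discovery has
// matched publisher and subscription, then count what actually arrives.
#include <chrono>
#include <cstdio>
#include <memory>
#include <thread>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

using namespace std::chrono_literals;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto sub_node = std::make_shared<rclcpp::Node>("repro_subscriber");
  auto pub_node = std::make_shared<rclcpp::Node>("repro_publisher");

  size_t received = 0;
  auto subscription = sub_node->create_subscription<std_msgs::msg::String>(
    "chatter", 10,
    [&received](std_msgs::msg::String::ConstSharedPtr) {++received;});

  auto publisher = pub_node->create_publisher<std_msgs::msg::String>("chatter", 10);

  // Publish right away, without waiting for the endpoints to match.
  std_msgs::msg::String msg;
  msg.data = "possibly lost";
  for (int i = 0; i < 5; ++i) {
    publisher->publish(msg);
  }

  // Spin both nodes briefly and report how many messages actually arrived;
  // depending on timing and rmw implementation, this is often fewer than 5.
  rclcpp::executors::SingleThreadedExecutor executor;
  executor.add_node(sub_node);
  executor.add_node(pub_node);
  const auto deadline = std::chrono::steady_clock::now() + 3s;
  while (std::chrono::steady_clock::now() < deadline) {
    executor.spin_some();
    std::this_thread::sleep_for(50ms);
  }

  std::printf("received %zu of 5 messages\n", received);
  rclcpp::shutdown();
  return 0;
}
```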
That's awesome! With this we can potentially report it to the rmw implementation - does it show this behavior only on Fast-RTPS?
I have seen this behavior with both FastRTPS and RTI Connext. I haven't tried with CycloneDDS.
Moving this to the top of the input queue so it can be reassigned.
Bump - watch this. The Action CI has been re-enabled; see how it acts over the next 24 hours before closing this issue.
Only one of the last 24 hourly runs failed - and that test failure is also seen on the buildfarm. Follow-up work to de-flake will be tracked by #408.
Description
The regularly scheduled CI builds are seeing periodic failure of the test
See the following overnight builds for example:
It does not fail every time, but has failed in approximately half of today's hourly runs.
The test times out when it fails; this suggests it is waiting for messages that are never delivered, or that were delivered before the subscriptions came online.
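For context, one standard ROS 2 mitigation for the "delivered before the subscription came online" case (not necessarily what these tests should adopt) is reliable, transient-local QoS, which lets a late-joining subscription still receive earlier samples. A minimal sketch, with made-up topic and node names:

```cpp
// Sketch: transient-local ("latched") QoS replays retained samples to
// subscriptions that are created after the messages were published.
#include <chrono>
#include <memory>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("latched_example");

  // Keep the last 10 samples and replay them to late joiners.
  rclcpp::QoS qos(rclcpp::KeepLast(10));
  qos.reliable().transient_local();

  auto publisher = node->create_publisher<std_msgs::msg::String>("chatter", qos);
  std_msgs::msg::String msg;
  msg.data = "published before any subscription exists";
  publisher->publish(msg);

  // A subscription created afterwards still receives the sample above,
  // once matching completes.
  auto subscription = node->create_subscription<std_msgs::msg::String>(
    "chatter", qos,
    [](std_msgs::msg::String::ConstSharedPtr m) {
      RCLCPP_INFO(rclcpp::get_logger("latched_example"), "got: %s", m->data.c_str());
    });

  const auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(2);
  while (std::chrono::steady_clock::now() < deadline) {
    rclcpp::spin_some(node);
  }

  rclcpp::shutdown();
  return 0;
}
```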
System (please complete the following information)