Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

👨‍🌾 launch_testing_ros regressions wait_for_topic_launch_test on linux #304

Open
Blast545 opened this issue Mar 29, 2022 · 17 comments · Fixed by #309
Open

👨‍🌾 launch_testing_ros regressions wait_for_topic_launch_test on linux #304

Blast545 opened this issue Mar 29, 2022 · 17 comments · Fixed by #309
Assignees

Comments

@Blast545
Copy link
Contributor

Bug report

Required Info:

  • Operating System:
    • Ubuntu jammy
  • Installation type:
    • Source, buildfarm
  • Version or commit hash:
    • Rolling, master
  • DDS implementation:
    • Default: Fast-RTPS
  • Client library (if applicable):
    • Probably rclpy

Steps to reproduce issue

Run a buildfarm job on nightly_linux_debug

Expected behavior

All tests pass.

Actual behavior

Various launch_testing_ros tests fail and throw exceptions.
See: https://ci.ros2.org/view/nightly/job/nightly_linux_debug/2251/

Additional information

wait_for_topic_launch_test.TestFixture.test_topics_unsuccessful shows:

Error log
Traceback (most recent call last):^M
  File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/launch_ros/launch_testing_ros/test/examples/wait_for_topic_launch_test.py", line 89, in test_topics_unsuccessful^M
    assert wait_for_node_object.topics_received() == expected_topics^M
AssertionError: assert {'chatter_0',..., 'chatter_4'} == {'chatter_0',..., 'chatter_4'}^M
  Extra items in the right set:^M
  'chatter_2'^M
  'chatter_1'^M
  Use -v to get more diff

...

[INFO] [python-7]: sending signal 'SIGINT' to process[python-7]^M
[INFO] [python-6]: sending signal 'SIGINT' to process[python-6]^M
[python-6] Traceback (most recent call last):^M
[python-6]   File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/launch_ros/launch_testing_ros/test/examples/talker.py", line 51, in <module>^M
[python-6]     main()^M
[python-6]   File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/launch_ros/launch_testing_ros/test/examples/talker.py", line 47, in main^M
[python-6]     rclpy.shutdown()^M
[python-6]   File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/install/rclpy/lib/python3.10/site-packages/rclpy/__init__.py", line 126, in shutdown^M
[python-6]     _shutdown(context=context)^M
[python-6]   File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/install/rclpy/lib/python3.10/site-packages/rclpy/utilities.py", line 58, in shutdown^M
[python-6]     return context.shutdown()^M
[python-6]   File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/install/rclpy/lib/python3.10/site-packages/rclpy/context.py", line 102, in shutdown^M
[python-6]     self.__context.shutdown()^M
[python-6] rclpy._rclpy_pybind11.RCLError: failed to shutdown: rcl_shutdown already called on the given context, at /home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/rcl/rcl/src/rcl/ini\
t.c:241

wait_for_topic_launch_test.TestFixture.test_topics_successful shows:

Error log
Traceback (most recent call last):^M
  File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/launch_ros/launch_testing_ros/test/examples/wait_for_topic_launch_test.py", line 61, in test_topics_successful^M
    with WaitForTopics(topic_list, timeout=2.0) as wait_for_node_object_1:^M
  File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/launch_ros/launch_testing_ros/launch_testing_ros/wait_for_topics.py", line 96, in __enter__^M
    raise RuntimeError('Did not receive messages on these topics: ',^M
RuntimeError: ('Did not receive messages on these topics: ', {'chatter_4'})
...
[python-12] Traceback (most recent call last):^M
[python-12]   File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/launch_ros/launch_testing_ros/test/examples/talker.py", line 51, in <module>^M
[python-12]     main()^M
[python-12]   File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/launch_ros/launch_testing_ros/test/examples/talker.py", line 47, in main^M
[python-12]     rclpy.shutdown()^M
[python-12]   File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/install/rclpy/lib/python3.10/site-packages/rclpy/__init__.py", line 126, in shutdown^M
[python-12]     _shutdown(context=context)^M
[python-12]   File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/install/rclpy/lib/python3.10/site-packages/rclpy/utilities.py", line 58, in shutdown^M
[python-12]     return context.shutdown()^M
[python-12]   File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/install/rclpy/lib/python3.10/site-packages/rclpy/context.py", line 102, in shutdown^M
[python-12]     self.__context.shutdown()^M
[python-12] rclpy._rclpy_pybind11.RCLError: failed to shutdown: rcl_shutdown already called on the given context, at /home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/rcl/rcl/src/rcl/in\
it.c:241
...
Traceback (most recent call last):^M
  File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/launch_ros/launch_testing_ros/test/examples/wait_for_topic_launch_test.py", line 69, in test_topics_successful^M
    assert wait_for_node_object_2.wait()^M
AssertionError: assert False^M
 +  where False = <bound method WaitForTopics.wait of <launch_testing_ros.wait_for_topics.WaitForTopics object at 0x7f02e01ab190>>()^M
 +    where <bound method WaitForTopics.wait of <launch_testing_ros.wait_for_topics.WaitForTopics object at 0x7f02e01ab190>> = <launch_testing_ros.wait_for_topics.WaitForTopics object at 0x7f02e01a\
b190>.wait^M
^M
@Blast545
Copy link
Contributor Author

Blast545 commented Apr 1, 2022

@sloretz sloretz self-assigned this Apr 6, 2022
@sloretz
Copy link
Contributor

sloretz commented Apr 6, 2022

I'm able to reproduce both failures ( assert wait_for_node_object_2.wait() and AssertionError: assert set() == {'chatter_0',..., 'chatter_4'} , but only when running stress --cpu N. With an idle system I can't reproduce at all.

@Blast545
Copy link
Contributor Author

@Blast545
Copy link
Contributor Author

@Blast545 Blast545 reopened this May 12, 2022
@Blast545
Copy link
Contributor Author

@Blast545
Copy link
Contributor Author

@Crola1702
Copy link

Flakiness ratio of the last 2 weeks:

Linux:

  • Debug (2470): 13.34%

Aarch64:

  • Debug (2151): 13.34%

Rhel:

  • Debug (1308): 26.68%
  • Repeated (1299): 6.67%

@Crola1702
Copy link

Happened today in aarch64-debug: 2197

FAIL: wait_for_topic_launch_test.TestFixture.test_topics_unsuccessful
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins-agent/workspace/nightly_linux-aarch64_debug/ws/src/ros2/launch_ros/launch_testing_ros/test/examples/wait_for_topic_launch_test.py", line 90, in test_topics_unsuccessful
    assert wait_for_node_object.topics_received() == expected_topics
AssertionError: assert {'chatter_1',..., 'chatter_4'} == {'chatter_0',..., 'chatter_4'}
  Extra items in the right set:
  'chatter_0'

@Crola1702
Copy link

This issue increased is appearance on ros2 buildfarm during the last week considerably.

Flakiness report 2023-02-21:

Job: Rci__nightly-debug_ubuntu_jammy_amd64
Last time seen: 2023-02-21
Last 30 builds: 3 builds failed, 10% with error
Last 7 builds: 3 builds failed, 42.85% with error

Job: nightly_linux_coverage
Last time seen: 2023-02-20
Last 30 builds: 7 builds failed, 23.33% with error
Last 7 builds: 4 builds failed, 57.14% with error

Job: Rci__nightly-release_ubuntu_jammy_amd64
Last time seen: 2023-02-20
Last 30 builds: 2 builds failed, 6.66% with error
Last 7 builds: 2 builds failed, 28.57% with error

Job: nightly_linux-rhel_debug
Last time seen: 2023-02-19
Last 30 builds: 3 builds failed, 10% with error
Last 7 builds: 1 builds failed, 14.28% with error

Job: nightly_linux_debug
Last time seen: 2023-02-18
Last 30 builds: 1 builds failed, 3.33% with error
Last 7 builds: 1 builds failed, 14.28% with error

Job: nightly_linux-rhel_repeated
Last time seen: 2023-02-17 23:00:01.423000
Last 30 builds: 2 builds failed, 6.66% with error
Last 7 builds: 1 builds failed, 14.28% with error

Job: nightly_linux_humble_coverage
Last time seen: 2023-02-17
Last 30 builds: 4 builds failed, 13.33% with error
Last 7 builds: 2 builds failed, 28.57% with error

Job: nightly_linux_repeated
Last time seen: 2023-02-17
Last 30 builds: 1 builds failed, 3.33% with error
Last 7 builds: 1 builds failed, 14.28% with error

Job: nightly_linux-aarch64_debug
Last time seen: 2023-02-10
Last 30 builds: 1 builds failed, 3.33% with error
Last 7 builds: 0 builds failed, 0% with error

@clalancette
Copy link
Contributor

This issue increased is appearance on ros2 buildfarm during the last week considerably.

I think the reason for that is that there were 2 weeks in there where we basically weren't running any of the rclpy-based tests (this was due to a bug in colcon that has since been fixed). So it may not be something that actually increased, we just really weren't testing it much over the last month.

Can you go back and compare with historical data? Like, what was the percentage failure in November and December 2022?

@Crola1702
Copy link

what was the percentage failure in November and December 2022?

Rci__nightly-debug_ubuntu_jammy_amd64: 6.56%
nightly_linux_coverage: 37.1%
nightly_linux-rhel_debug: 12.9%
nightly_linux_debug: 11.48%
nightly_linux-rhel_repeated: 11.29%
nightly_linux_humble_coverage: 12.9%
nightly_linux_repeated: 4.92%
nightly_linux-aarch64_debug: 9.09%
nightly_linux-aarch64_repeated: 1.37%
Rci__nightly-fastrtps_ubuntu_jammy_amd64: 1.64%

In both November and December

Yes, I think they are like the same as before

@Crola1702
Copy link

Flakiness Report 2023-03-27

job_name last_time first_time build_count failure_count failure_percentage
nightly_linux-rhel_repeated 2023-03-26 2021-12-09 30 14 46.67
nightly_linux_coverage 2023-03-26 2021-10-12 30 14 46.67
Rci__nightly-debug_ubuntu_jammy_amd64 2023-03-27 2022-03-01 31 9 29.03
Rci__nightly-fastrtps_ubuntu_jammy_amd64 2023-03-27 2022-03-01 33 9 27.27
nightly_linux_debug 2023-03-27 2021-10-31 31 7 22.58
nightly_linux-aarch64_debug 2023-03-26 2021-07-23 34 6 17.65
nightly_linux-rhel_debug 2023-03-26 2021-12-10 30 5 16.67
nightly_linux-aarch64_repeated 2023-03-26 2021-07-23 32 3 9.38
nightly_linux-rhel_release 2023-03-26 2021-12-30 30 2 6.67
Rci__nightly-release_ubuntu_jammy_amd64 2023-03-27 2022-03-01 33 2 6.06
nightly_linux_humble_coverage 2023-03-27 2022-08-08 31 1 3.23

I think these percentages are higher than ever. Some jobs almost reached 50% flakiness.

@clalancette Do you think we should disable this test?

@clalancette
Copy link
Contributor

@clalancette Do you think we should disable this test?

No, I don't think so. The ones that we actually mostly care about (nightly_linux_debug and below) are at 22%. Also, we will hopefully have some time to look at this later in April.

@clalancette
Copy link
Contributor

With #360 merged, I'm going to call this "fixed". But please reopen if this starts occuring again.

@Crola1702
Copy link

Crola1702 commented Nov 7, 2023

A similar problem is happening in Humble Coverage at a 10% flaky ratio:

Reference build: https://ci.ros2.org/job/nightly_linux_humble_coverage/594/

wait_for_topic_launch_test.TestFixture.test_topics_successful failed

FAIL: wait_for_topic_launch_test.TestFixture.test_topics_successful
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins-agent/workspace/nightly_linux_humble_coverage/ws/src/ros2/launch_ros/launch_testing_ros/test/examples/wait_for_topic_launch_test.py", line 62, in test_topics_successful
    with WaitForTopics(topic_list, timeout=2.0) as wait_for_node_object_1:
  File "/home/jenkins-agent/workspace/nightly_linux_humble_coverage/ws/src/ros2/launch_ros/launch_testing_ros/launch_testing_ros/wait_for_topics.py", line 96, in __enter__
    raise RuntimeError('Did not receive messages on these topics: ',
RuntimeError: ('Did not receive messages on these topics: ', {'chatter_1'})

I'm not sure if this is the same problem, or if it is a different but similar one

@Blast545
Copy link
Contributor Author

This seems to be happening consistently in humble coverage jobs now, see: https://ci.ros2.org/job/nightly_linux_humble_coverage/944/

@Crola1702
Copy link

This is happening as a flaky issue

Flakiness for the last 15 days:

job_name last_fail first_fail build_count failure_count failure_percentage
nightly_linux_humble_coverage 2024-10-07 2024-09-23 14 7 50.0
Ici__nightly-debug_ubuntu_jammy_amd64 2024-10-04 2024-10-04 5 1 20.0
nightly_linux_iron_coverage 2024-10-02 2024-09-22 13 3 23.08
nightly_linux-rhel_repeated 2024-09-24 2024-09-23 15 2 13.33

Last failures:

@Crola1702 Crola1702 reopened this Oct 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants