Unstable tests on Action CI #381
From the last day or so, these are the test results; it's not always the same tests failing. The list of unstable tests seems to include:
The timeouts probably have something to do with discovery, or with shutdown behavior not receiving the final message published by a publisher. The failures, I'm guessing, are segmentation faults of some kind. They seem common with fastrtps, and occurred once with cyclone in this list. Here's my output from the script I used to scrape, for context:
Is this on Bionic? I saw similar fastrtps test failures in rosbag2 when I was on Bionic. They disappeared when I started compiling in Focal instead.
Yeah, it's running on Bionic currently - which it should not be. I still feel that the first, most important change is to get these running on Focal. However, it's interesting that Karsten can't repro on his Bionic machine. But it's not a supported platform, so it's not worth too much effort tracking it down if it's working on Focal. I have a feeling that a few of these issues will linger on Focal, so the faster we can get that running, the better, so we can track the remainder down. This is a prerequisite to getting us there: ros-tooling/setup-ros#140
* decrease subscription expected msgs Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* change workflow to test this branch Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* change expected number too Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* linter Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* change workflow to autotest feature branch Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* revert workflow Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* run workflow more frequently to test stability Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* revert github workflow Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* cleanup Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* define constants Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* disable test on windows Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
* disable play_filters_by_topic on windows Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
Still seeing Action CI fail consistently, e.g.
so I'm reopening this to track getting us to green.
I have been able to reproduce these issues locally in an Ubuntu Focal container. The segfaults happened, for example, in ... The timeouts happened in ...
I assume these timeouts are related to expecting a specific number of messages and not receiving either the messages at the beginning or the one at the end. It's worth putting a print statement on messages received, to see if we're just getting one fewer than expected. So this is either a discovery or a shutdown edge-case behavior. If the tests are testing something else (e.g. not a specific number of messages, but whether messages are received at all), then we should make those tests less fragile to these boundary behaviors - perhaps publish N*2 messages and only expect to receive N, or something along those lines.
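To make that idea concrete, here is a rough sketch (not the actual rosbag2 test code; the topic name, message type, and counts are made up) of a test that over-publishes and only asserts a lower bound on messages received, so a single message lost at a discovery or shutdown boundary does not fail it:

```cpp
// Hypothetical gtest sketch: publish 2x the messages the assertion needs,
// and only require a minimum count, so boundary losses don't flake the test.
#include <chrono>
#include <memory>
#include <thread>

#include <gtest/gtest.h>
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

using namespace std::chrono_literals;

TEST(FlakyBoundaries, tolerates_dropped_boundary_messages)
{
  rclcpp::init(0, nullptr);
  auto node = std::make_shared<rclcpp::Node>("boundary_test");

  size_t received = 0;
  auto subscription = node->create_subscription<std_msgs::msg::String>(
    "chatter", 10,
    [&received](std_msgs::msg::String::ConstSharedPtr) {++received;});
  auto publisher = node->create_publisher<std_msgs::msg::String>("chatter", 10);

  const size_t expected_minimum = 5;
  const size_t published = expected_minimum * 2;  // over-publish to absorb losses

  std_msgs::msg::String msg;
  msg.data = "test";
  for (size_t i = 0; i < published; ++i) {
    publisher->publish(msg);
    rclcpp::spin_some(node);
    std::this_thread::sleep_for(10ms);
  }

  // Spin until enough messages arrived or a deadline passes.
  const auto deadline = std::chrono::steady_clock::now() + 5s;
  while (received < expected_minimum && std::chrono::steady_clock::now() < deadline) {
    rclcpp::spin_some(node);
  }

  EXPECT_GE(received, expected_minimum);
  rclcpp::shutdown();
}
```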
In both tests that failed due to timeout that I mentioned, there was indeed an expectation for a specific number of messages, but exactly 0 messages were received (I used the debugger instead of print statements). I think we're missing waiting for ...
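For reference, a common way to wait out discovery in rclcpp is to poll the publisher's matched-subscription count before publishing. This is only an illustrative sketch; `wait_for_subscriptions` is a hypothetical helper, not a rosbag2 API:

```cpp
// Sketch: block until at least `expected_count` subscriptions have matched
// the publisher, or give up after `timeout`.
#include <chrono>
#include <memory>
#include <thread>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

using namespace std::chrono_literals;

bool wait_for_subscriptions(
  const rclcpp::PublisherBase & publisher,
  size_t expected_count,
  std::chrono::seconds timeout = 10s)
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (publisher.get_subscription_count() < expected_count) {
    if (std::chrono::steady_clock::now() > deadline) {
      return false;
    }
    std::this_thread::sleep_for(50ms);
  }
  return true;
}

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("publisher_with_discovery_wait");
  auto publisher = node->create_publisher<std_msgs::msg::String>("chatter", 10);

  // Only start publishing once at least one subscription has matched.
  if (wait_for_subscriptions(*publisher, 1)) {
    std_msgs::msg::String msg;
    msg.data = "hello";
    publisher->publish(msg);
  }

  rclcpp::shutdown();
  return 0;
}
```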
I wrote a small case that reproduces the timeout problem (messages never actually received by the subscription) at unpublished_messages.zip. I also wrote a small case that tries to reproduce the segfault problem at ...
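The attachments themselves are not quoted here, but the timeout case described is essentially a discovery race like the following sketch (node names, topic, and timings are illustrative): the subscription already exists, the publisher publishes immediately after creation, and because the endpoints have not matched yet the early samples may simply never be delivered:

```cpp
// Sketch of the "unpublished messages" race: publish before discovery has
// matched publisher and subscription, then count what actually arrives.
#include <chrono>
#include <cstdio>
#include <memory>
#include <thread>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

using namespace std::chrono_literals;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto sub_node = std::make_shared<rclcpp::Node>("repro_subscriber");
  auto pub_node = std::make_shared<rclcpp::Node>("repro_publisher");

  size_t received = 0;
  auto subscription = sub_node->create_subscription<std_msgs::msg::String>(
    "chatter", 10,
    [&received](std_msgs::msg::String::ConstSharedPtr) {++received;});

  auto publisher = pub_node->create_publisher<std_msgs::msg::String>("chatter", 10);

  // Publish right away, without waiting for the endpoints to match.
  std_msgs::msg::String msg;
  msg.data = "possibly lost";
  for (int i = 0; i < 5; ++i) {
    publisher->publish(msg);
  }

  // Spin both nodes briefly and report how many messages actually arrived;
  // depending on timing and rmw implementation, this is often fewer than 5.
  rclcpp::executors::SingleThreadedExecutor executor;
  executor.add_node(sub_node);
  executor.add_node(pub_node);
  const auto deadline = std::chrono::steady_clock::now() + 3s;
  while (std::chrono::steady_clock::now() < deadline) {
    executor.spin_some();
    std::this_thread::sleep_for(50ms);
  }

  std::printf("received %zu of 5 messages\n", received);
  rclcpp::shutdown();
  return 0;
}
```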
That's awesome! With this we can potentially report it to the rmw implementation - does it show this behavior only on Fast-RTPS?
I have seen this behavior with both FastRTPS and RTI Connext. I haven't tried with CycloneDDS.
Moving this to the top of the input queue so it can be reassigned.
Bump - watch this. The Action CI has been re-enabled; see how it acts over the next 24 hours before closing this issue.
Only one of the last 24 hourly runs failed - and that test failure is also seen on the buildfarm. Follow-up work to de-flake will be tracked by #408.
Description
The regularly scheduled CI builds are seeing periodic failure of the test
See the following overnight builds for example:
It does not fail every time, but has failed in approximately half of today's hourly runs.
The test times out when it fails; this suggests it is waiting for messages that are never delivered, or that were delivered before the subscriptions came online.
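For context, one standard ROS 2 mitigation for the "delivered before the subscription came online" case (not necessarily what these tests should adopt) is reliable, transient-local QoS, which lets a late-joining subscription still receive earlier samples. A minimal sketch, with made-up topic and node names:

```cpp
// Sketch: transient-local ("latched") QoS replays retained samples to
// subscriptions that are created after the messages were published.
#include <chrono>
#include <memory>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("latched_example");

  // Keep the last 10 samples and replay them to late joiners.
  rclcpp::QoS qos(rclcpp::KeepLast(10));
  qos.reliable().transient_local();

  auto publisher = node->create_publisher<std_msgs::msg::String>("chatter", qos);
  std_msgs::msg::String msg;
  msg.data = "published before any subscription exists";
  publisher->publish(msg);

  // A subscription created afterwards still receives the sample above,
  // once matching completes.
  auto subscription = node->create_subscription<std_msgs::msg::String>(
    "chatter", qos,
    [](std_msgs::msg::String::ConstSharedPtr m) {
      RCLCPP_INFO(rclcpp::get_logger("latched_example"), "got: %s", m->data.c_str());
    });

  const auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(2);
  while (std::chrono::steady_clock::now() < deadline) {
    rclcpp::spin_some(node);
  }

  rclcpp::shutdown();
  return 0;
}
```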
System (please complete the following information)