Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[16192] fix notification lost #3087

Merged
merged 14 commits into from
Jan 6, 2023

Conversation

iuhilnehc-ynos
Copy link
Contributor

@iuhilnehc-ynos iuhilnehc-ynos commented Nov 14, 2022

Signed-off-by: Chen Lihui lihui.chen@sony.com

Description

Please refer to ros2/rclcpp#2039 (comment).

Contributor Checklist

  • Commit messages follow the project guidelines.
  • The code follows the style guidelines of this project.
  • Tests that thoroughly check the new feature have been added/Regression tests checking the bug and its fix have been added.
  • Any new/modified methods have been properly documented using Doxygen.
  • Fast DDS test suite has been run locally.
  • Changes are ABI compatible.
  • Changes are API compatible.
  • Documentation builds and tests pass locally.
  • New feature has been added to the versions.md file (if applicable).
  • New feature has been documented/Current behavior is correctly described in the documentation.

Reviewer Checklist

  • Check contributor checklist is correct.
  • Check CI results: changes do not issue any warning.
  • Check CI results: failing tests are unrelated with the changes.

Signed-off-by: Chen Lihui <lihui.chen@sony.com>
fujitatomoya
fujitatomoya previously approved these changes Nov 14, 2022
Copy link
Contributor

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good to me.

@MiguelCompany
Copy link
Member

@iuhilnehc-ynos Thank you for your contribution. Changes LGTM, but we'll need to add a regression test before merging this.

Signed-off-by: Chen Lihui <lihui.chen@sony.com>
@iuhilnehc-ynos
Copy link
Contributor Author

iuhilnehc-ynos commented Nov 15, 2022

@MiguelCompany @fujitatomoya

I have added a regression test. The test log shows that I need 176070(a random number) times to run the test to output an error in my local machine if not applying the patch.

[16:25:12 OptiPlex-7080] /home/chenlh/Projects/ROS2/eProsima/build/fastrtps/test/unittest/dds/core/condition  
chenlh condition $ mytest ./WaitSetImplTests --gtest_filter="*fix_wait_notification_lost"

...

Times 176070
Note: Google Test filter = *fix_wait_notification_lost
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from WaitSetImplTests
[ RUN      ] WaitSetImplTests.fix_wait_notification_lost
/home/chenlh/Projects/ROS2/eProsima/src/fastrtps/test/unittest/dds/core/condition/WaitSetImplTests.cpp:232: Failure
Expected equality of these values:
  ReturnCode_t::RETCODE_OK
    Which is: 0
  ret
    Which is: 4-byte object <01-00 00-00>
terminate called without an active exception

Do you have any suggestions about the regression test?

Chen Lihui added 2 commits November 15, 2022 16:57
Signed-off-by: Chen Lihui <lihui.chen@sony.com>
Signed-off-by: Chen Lihui <lihui.chen@sony.com>
@fujitatomoya
Copy link
Contributor

@iuhilnehc-ynos

Do you have any suggestions about the regression test?

not really, sorry 😢

Chen Lihui added 2 commits November 16, 2022 10:24
Signed-off-by: Chen Lihui <lihui.chen@sony.com>
Signed-off-by: Chen Lihui <lihui.chen@sony.com>
@MiguelCompany MiguelCompany changed the title fix notification lost [16192] fix notification lost Nov 16, 2022
@MiguelCompany
Copy link
Member

@richiprosima Please test this

@MiguelCompany
Copy link
Member

@iuhilnehc-ynos It seems this PR is introducing possible deadlocks. At least that is what the TSan job is complaining about. You can get the report from here.

Signed-off-by: Miguel Company <MiguelCompany@eprosima.com>
@MiguelCompany
Copy link
Member

@iuhilnehc-ynos I think I fixed the deadlock on 8c60364. Let's wait for the TSan job.

@MiguelCompany
Copy link
Member

@richiprosima Please test this

@iuhilnehc-ynos
Copy link
Contributor Author

@MiguelCompany

I just analyzed the log, but you have already fixed the potential deadlock. Thank you.

@MiguelCompany
Copy link
Member

@MiguelBarro This has a deadlock report on code related with ReadConditions. Would you mind taking a look?

@MiguelBarro
Copy link
Contributor

MiguelBarro commented Nov 18, 2022

The fix consists on assuring no notification will take place on predicate evaluation by taken the mutex during notifications (WaitSetImpl::wake_up changes).
Unfortunately, that introduces an ABBA deadlock on intraproccess scenarios:

  • Waiting thread:

    • A waitset mutex (wait())
    • B reader mutex (retrieving trigger value)
  • Writing thread:

    • B reader mutex (StatefulReader::processDataMsg directly called from writer)
    • A waitset mutex (notification to the Conditions implies wakeup)

A workaround may be to introduce an atomic_uint notification counter in the WaitSetImpl. This will be updated on wake up avoding taking the mutex. The wait predicate must check the counter and decide to reevaluate.
Another alternative is avoid taking the reader mutex on Condition notification (there is a devoted mutex for the condition collection). The issue here is that the reader mutex is recursive and taken upstream in the stack which makes it a costly refactor.
Surely I came up with a better solution over the weekend.

@mergify mergify bot mentioned this pull request Nov 19, 2022
13 tasks
Miguel Barro added 7 commits November 19, 2022 16:40
Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>
Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>
Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>
Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>
Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>
Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>
Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>
@fujitatomoya
Copy link
Contributor

@MiguelCompany @MiguelBarro anything we need to do on our side? friendly ping.

@fujitatomoya
Copy link
Contributor

friendly ping, this can resolve a couple of issues.

@MiguelCompany
Copy link
Member

@richiprosima Please test this

@MiguelCompany
Copy link
Member

@richiprosima Please test windows

@MiguelCompany
Copy link
Member

@Mergifyio backport 2.8.x 2.7.x 2.6.x

@mergify
Copy link
Contributor

mergify bot commented Jan 5, 2023

backport 2.8.x 2.7.x 2.6.x

✅ Backports have been created

@iuhilnehc-ynos
Copy link
Contributor Author

iuhilnehc-ynos commented Jan 5, 2023

I am sorry about the comment ros2/rmw_fastrtps#650 (comment).
I tested with commit 8c60364 before.

If using 34a8f8d, the wait forever issue in ros2/rmw_fastrtps#650 still exists. (just confirmed.)

I was thinking you will use #3098, which can re-check the condition every 500ms.

@MiguelCompany
Copy link
Member

@iuhilnehc-ynos Thank you for checking with 34a8f8d.

We will then update this PR to aa260da and close #3098.

Would you mind checking ros2/rmw_fastrtps#650 with aa260da to verify it solves the issue?

@MiguelCompany
Copy link
Member

@richiprosima Please test this

@MiguelCompany
Copy link
Member

@richiprosima Please test Linux

@iuhilnehc-ynos
Copy link
Contributor Author

iuhilnehc-ynos commented Jan 6, 2023

Would you mind checking ros2/rmw_fastrtps#650 with aa260da to verify it solves the issue?

I believe this PR with aa260da can fix the wait forever issue, but it leads to another issue (notification is not triggered in time) as below

[INFO] [1672971861.484102416] [sender]: Sending tick: '6357'...              // timer period is set with 10ms
[INFO] [1672971861.493948311] [sender]: Sending tick: '6358'...
[INFO] [1672971861.504098546] [sender]: Sending tick: '6359'...
[INFO] [1672971861.513917624] [sender]: Sending tick: '6360'...
[INFO] [1672971861.523958969] [sender]: Sending tick: '6361'...
[INFO] [1672971861.533877038] [sender]: Sending tick: '6362'...
[INFO] [1672971862.034249728] [sender]: Sending tick: '6363'...             // it costs about 500ms to trigger the timer callback.

@MiguelCompany MiguelCompany merged commit a9e49cd into eProsima:master Jan 6, 2023
mergify bot pushed a commit that referenced this pull request Jan 6, 2023
* fix notification lost

Signed-off-by: Chen Lihui <lihui.chen@sony.com>

* add a regression test

Signed-off-by: Chen Lihui <lihui.chen@sony.com>

* rename a variable name and update comments

Signed-off-by: Chen Lihui <lihui.chen@sony.com>

* fix uncrustify issue

Signed-off-by: Chen Lihui <lihui.chen@sony.com>

* make the regression test better

Signed-off-by: Chen Lihui <lihui.chen@sony.com>

* fix uncrustify

Signed-off-by: Chen Lihui <lihui.chen@sony.com>

* Refs #16192. Fix deadlock on WaitSetImpl.

Signed-off-by: Miguel Company <MiguelCompany@eprosima.com>

* Refs 16192. Workaround for deadlock and notify pessimization

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

* Refs 16192. Mandatory logic fix

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

* Refs 16192. linter

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

* Refs #16192. Catch up any notification lost

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

* Refs 16192. Avoid predicate evaluation on spurious wakeups.

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

* Refs 16192. Fixing gcc warnings

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

* Refs 16192. Allowing wait without previous notification

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

Signed-off-by: Chen Lihui <lihui.chen@sony.com>
Signed-off-by: Miguel Company <MiguelCompany@eprosima.com>
Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>
Co-authored-by: Miguel Company <miguelcompany@eprosima.com>
Co-authored-by: Miguel Barro <miguelbarro@eprosima.com>
(cherry picked from commit a9e49cd)
mergify bot pushed a commit that referenced this pull request Jan 6, 2023
* fix notification lost

Signed-off-by: Chen Lihui <lihui.chen@sony.com>

* add a regression test

Signed-off-by: Chen Lihui <lihui.chen@sony.com>

* rename a variable name and update comments

Signed-off-by: Chen Lihui <lihui.chen@sony.com>

* fix uncrustify issue

Signed-off-by: Chen Lihui <lihui.chen@sony.com>

* make the regression test better

Signed-off-by: Chen Lihui <lihui.chen@sony.com>

* fix uncrustify

Signed-off-by: Chen Lihui <lihui.chen@sony.com>

* Refs #16192. Fix deadlock on WaitSetImpl.

Signed-off-by: Miguel Company <MiguelCompany@eprosima.com>

* Refs 16192. Workaround for deadlock and notify pessimization

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

* Refs 16192. Mandatory logic fix

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

* Refs 16192. linter

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

* Refs #16192. Catch up any notification lost

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

* Refs 16192. Avoid predicate evaluation on spurious wakeups.

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

* Refs 16192. Fixing gcc warnings

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

* Refs 16192. Allowing wait without previous notification

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

Signed-off-by: Chen Lihui <lihui.chen@sony.com>
Signed-off-by: Miguel Company <MiguelCompany@eprosima.com>
Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>
Co-authored-by: Miguel Company <miguelcompany@eprosima.com>
Co-authored-by: Miguel Barro <miguelbarro@eprosima.com>
(cherry picked from commit a9e49cd)
mergify bot pushed a commit that referenced this pull request Jan 6, 2023
* fix notification lost

Signed-off-by: Chen Lihui <lihui.chen@sony.com>

* add a regression test

Signed-off-by: Chen Lihui <lihui.chen@sony.com>

* rename a variable name and update comments

Signed-off-by: Chen Lihui <lihui.chen@sony.com>

* fix uncrustify issue

Signed-off-by: Chen Lihui <lihui.chen@sony.com>

* make the regression test better

Signed-off-by: Chen Lihui <lihui.chen@sony.com>

* fix uncrustify

Signed-off-by: Chen Lihui <lihui.chen@sony.com>

* Refs #16192. Fix deadlock on WaitSetImpl.

Signed-off-by: Miguel Company <MiguelCompany@eprosima.com>

* Refs 16192. Workaround for deadlock and notify pessimization

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

* Refs 16192. Mandatory logic fix

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

* Refs 16192. linter

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

* Refs #16192. Catch up any notification lost

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

* Refs 16192. Avoid predicate evaluation on spurious wakeups.

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

* Refs 16192. Fixing gcc warnings

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

* Refs 16192. Allowing wait without previous notification

Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>

Signed-off-by: Chen Lihui <lihui.chen@sony.com>
Signed-off-by: Miguel Company <MiguelCompany@eprosima.com>
Signed-off-by: Miguel Barro <miguelbarro@eprosima.com>
Co-authored-by: Miguel Company <miguelcompany@eprosima.com>
Co-authored-by: Miguel Barro <miguelbarro@eprosima.com>
(cherry picked from commit a9e49cd)
MiguelCompany added a commit that referenced this pull request Jan 9, 2023
@MiguelCompany
Copy link
Member

@iuhilnehc-ynos The browser showed me your commit after merging. We're reverting the merge on #3193.

I thought about this a bit more during the weekend, and decided a new approach, based on your original solution.

After the revert is done, I will create a new PR with the changes on this branch. Could you check aginst that branch?

MiguelCompany added a commit that referenced this pull request Jan 9, 2023
@iuhilnehc-ynos
Copy link
Contributor Author

After the revert is done, I will create a new PR with the changes on this branch. Could you check aginst that branch?

It seems there is some compilation issue on that branch. Maybe you are working on it, I'll check it based on your new PR.

@MiguelCompany MiguelCompany mentioned this pull request Jan 9, 2023
11 tasks
@MiguelCompany
Copy link
Member

It seems there is some compilation issue on that branch. Maybe you are working on it, I'll check it based on your new PR.

@iuhilnehc-ynos Yeah, sorry about that. It should be fixed now. I have also created #3195 with the backport to 2.6.x, so we can have the fix on Humble too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants