Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recover models together with consumers #1077

Merged
merged 1 commit into from
Sep 1, 2021

Conversation

bollhals
Copy link
Contributor

Proposed Changes

This fixes two issues:

  1. As Change recovery process to allow configuration recovery of closed channels #1064 the exchanges, queues and bindings are recovered by a temporary model, fixing Consumer connection recovery does not stop on success when RabbitMQ node is restarted #1061
  2. Changes the model recovery to only change the _innerChannel once it recovered all state & consumers, fixing ConnectionRecovery recovers models before consumers #1076

And improves a few things along the way like

  • Changing the RecordXXX to a readonly struct
  • Locking the _recordedEntitiesLock around a bigger block of the recovery.

Types of Changes

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)

Checklist

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • All tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in related repositories

@michaelklishin
Copy link
Member

Thank you, this looks reasonable. I'm looking at a nearly 50% failure rate in a recovery test that is not new.

@bollhals
Copy link
Contributor Author

bollhals commented Aug 30, 2021

I can't see the test output anymore from the build...

They were all green locally, so I'd have to get more details on the failure

@michaelklishin michaelklishin changed the title recover models together with consumers Recover models together with consumers Aug 31, 2021
@michaelklishin
Copy link
Member

@bollhals TestDelayedBasicAckNackAfterChannelRecovery fails with a latch timeout about 30-40% of the time for me. Bumping the timeout (which was fairly reasonable anyway) did not completely help. I'll get a stack trace and this is not necessarily something novel or specific to this PR but since you have a lot of context on the new implementation, I thought I should mention it.

@michaelklishin
Copy link
Member

Doing another 10 run round in order to get a stack trace now.

@michaelklishin
Copy link
Member

A total of 1 test files matched the specified pattern.
[xUnit.net 00:02:03.33]     RabbitMQ.Client.Unit.TestConnectionRecovery.TestBasicNackAfterChannelRecovery [FAIL]
  Failed RabbitMQ.Client.Unit.TestConnectionRecovery.TestBasicNackAfterChannelRecovery [1 m]
  Error Message:
   waiting on a latch timed out
Expected: True
Actual:   False
  Stack Trace:
     at RabbitMQ.Client.Unit.IntegrationFixture.Wait(ManualResetEventSlim latch, TimeSpan timeSpan) in /path/to/rabbitmq_dotnet_client.git/projects/Unit/Fixtures.cs:line 398
   at RabbitMQ.Client.Unit.TestConnectionRecovery.TestDelayedBasicAckNackAfterChannelRecovery(TestBasicConsumer1 cons, ManualResetEventSlim latch) in /path/to/rabbitmq_dotnet_client.git/projects/Unit/TestConnectionRecovery.cs:line 1070
   at RabbitMQ.Client.Unit.TestConnectionRecovery.TestBasicNackAfterChannelRecovery() in /path/to/rabbitmq_dotnet_client.git/projects/Unit/TestConnectionRecovery.cs:line 233

@michaelklishin
Copy link
Member

This is on macOS with .NET Core 5.0.

@michaelklishin
Copy link
Member

Things pass just fine on a more powerful (more cores, more powerful cores) machine. OK, fine.

@bollhals
Copy link
Contributor Author

Hmm looking into this test, it's unclear to me how it sometimes can work...
These tests trigger the connection close & wait for the reconnect while executing the HandleBasicDeliver.
The connection close closes the models which wait for its dispatcher to be finished, which is basically deadlocking itself (due to the source of the close is the Dispatcher.HandleBasicDeliver...
I'm quite puzzled to how this sometime passes.

@michaelklishin michaelklishin merged commit d31e243 into rabbitmq:main Sep 1, 2021
@bollhals bollhals deleted the feature/fixRecovery branch September 1, 2021 20:41
@bollhals bollhals mentioned this pull request Sep 2, 2021
11 tasks
@lukebakken
Copy link
Contributor

@bollhals @michaelklishin please see #1140 and this comment:

#1140 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants