Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock when using SinglePhaseCommit with distributed transactions #1800

Closed
roji opened this issue Oct 7, 2022 · 7 comments · Fixed by #1801
Closed

Deadlock when using SinglePhaseCommit with distributed transactions #1800

roji opened this issue Oct 7, 2022 · 7 comments · Fixed by #1801
Labels
💥 Regression Regression introduced from earlier PRs

Comments

@roji
Copy link
Member

roji commented Oct 7, 2022

In .NET 7.0, the OleTx/MSDTC distributed transactions support has been ported over from .NET Framework (dotnet/runtime#715). This means that it will now be possible to use distributed transactions with .NET 7.0, on Windows only.

We've received a bug report (dotnet/runtime#76010) with two separate console programs, enlisting which propagate a transaction. The initiating program (the "outer" transaction) enlists only a single SqlConnection to the transaction, and hangs when the transaction is committed. This happens quite consistently, but not 100% reliably. To reproduce this, simply run the .NET 7 version of the application in dotnet/runtime#76010 in two separate windows; the first one should hang.

When the program hangs, two threads are stuck with the following stack traces:

Stack 1

SinglePhaseEnlistment.Committed()at C:\Users\shrojans\AppData\Roaming\JetBrains\Rider2022.2\resharper-host\SourcesCache\d42a13b4dda85c7ff6fe99f383b4a1b279abd39735cb45449b7401d99c387\SinglePhaseEnlistment.cs:line 62
SqlDelegatedTransaction.SinglePhaseCommit()at C:\Users\shrojans\AppData\Roaming\JetBrains\Rider2022.2\resharper-host\SourcesCache\7c62ff4f1cfc3e61c3e65f286d919f9f0bbb2ef6a822b99cb37623d6d3a891\SqlDelegatedTransaction.cs:line 389
TransactionStateDelegatedCommitting.EnterState()
CommittableTransaction.Commit()
TransactionScope.InternalDispose()
TransactionScope.Dispose()
Program.StartTransaction()
async Program.Main()
AsyncMethodBuilderCore.Start<System.__Canon>()
Program.Main()
Program.<Main>()

Stack 2:

DbConnectionInternal.DetachTransaction()at C:\Users\shrojans\AppData\Roaming\JetBrains\Rider2022.2\resharper-host\SourcesCache\71847c97c677567f13e9df7e4f1f44a99f349762bb77c34f819752ee84b71\DbConnectionInternal.cs:line 413
DbConnectionInternal.CleanupConnectionOnTransactionCompletion()at C:\Users\shrojans\AppData\Roaming\JetBrains\Rider2022.2\resharper-host\SourcesCache\71847c97c677567f13e9df7e4f1f44a99f349762bb77c34f819752ee84b71\DbConnectionInternal.cs:line 436
DbConnectionInternal.TransactionCompletedEvent()
TransactionStatePromotedCommitted.EnterState()
InternalTransaction.DistributedTransactionOutcome()
RealOletxTransaction.FireOutcome()
OutcomeEnlistment.InvokeOutcomeFunction()
OletxTransactionManager.ShimNotificationCallback()
PortableThreadPool.CompleteWait()
ThreadPoolWorkQueue.Dispatch()
PortableThreadPool.WorkerThread.WorkerThreadStart()
[Native to Managed Transition]

Here is the general flow leading up to this state:

  1. When the TransactionScope is disposed, we get to SqlDelegatedTransaction.SinglePhaseCommit(), which is the SqlClient part that interacts with System.Transactions.
    a. SqlDelegatedTransaction.SinglePhaseCommit() locks connection (the SqlInternalConnectionTds)
    b. It then sends Commit to SQL Server. Since the transaction is delegated, the triggers the commit with MSDTC, which causes commit notifications to be sent to the enlisted parties. This includes the running program, which starts thread 2 below running concurrently.
    b. Finally, it proceeds to call enlistment.Committed(), which is SinglePhaseEnlistment.Committed(), while keeping the lock on SqlInternalConnectionTds. That method then takes _internalEnlistment.SyncRoot, which is the InternalTransaction.

  2. The above commit causes us to get a notification from the native layer (MSDTC) that the transaction committed (triggered above in 1b).
    a. We get to InternalTransaction.DistributedTransactionOutcome(), which locks the InternalTransaction.
    b. We then get to DbConnectionInternal.DetachTransaction() (DbConnectionInternal is registered as a listener on the Transaction.TransactionCompleted event), which attempts to lock this (the SqlInternalConnectionTds).

So thread 1 locks the SqlInternalConnectionTds and then the InternalTransaction, while thread 2 - which runs concurrently - locks InternalTransaction and then SqlInternalConnectionTds. This produces a deadlock.

I'm not an expert in this code, but moving the call to enlistment.Committed() (code) outside of the lock could be a way to resolve this deadlock.

Note that I couldn't reproduce the deadlock in .NET Framework; there's likely some timing differences which make the deadlock manifest far more frequently on .NET Core, but AFAICT the bug is in Framework as well. Note that this repro uses SinglePhaseCommit - only one connection is enlisted to the TransactionScope in the application. I recommend also checking having two connections to force 2PC - the same bug could be present in that flow as well.

Many thanks to @nathangreaves for providing the original repro code.

/cc @ajcvickers

@JRahnama
Copy link
Member

JRahnama commented Oct 7, 2022

@roji thanks. We will resume back on the issues on Tuesday. Monday is a holiday in Canada.

@roji
Copy link
Member Author

roji commented Oct 9, 2022

I did a bit more investigation into this.

First, here's a minimal repro console app - target net7.0 with SDK 7.0.100-rc.1.22431.12 or later. As you can see it's very trivial - anyone using distributed transactions with .NET 7.0 is likely to hit it:

while (true)
{
    using var transaction = new CommittableTransaction();

    // Uncommenting the following makes the deadlock go away as a workaround. If the transaction is promoted before
    // the first SqlClient enlistment, it never goes into the delegated state.
    // _ = TransactionInterop.GetTransmitterPropagationToken(transaction);

    await using var conn = new SqlConnection(...);
    await conn.OpenAsync();
    conn.EnlistTransaction(transaction);

    // Enlisting the second transaction causes causes the transaction to be promoted.
    // After this, the transaction state will be "delegated" (delegated to SQL Server), and the commit below will
    // trigger a call to SqlDelegatedTransaction.SinglePhaseCommit.
    await using var conn2 = new SqlConnection("Server=localhost;Database=test;User=SA;Password=Abcd5678;Connect Timeout=60;ConnectRetryCount=0;Encrypt=false");
    await conn2.OpenAsync();
    conn2.EnlistTransaction(transaction);

    Console.WriteLine("Committing transaction...");
    transaction.Commit();
    Console.WriteLine("Transaction committed.");
}

The deadlock was introduced in #1042, when the lock scope in SinglePhaseCommit was extended to include the call to enlistment.Committed(); moving that call out of the lock scope resolves this deadlock. I don't understand #1042 well enough to know whether moving the call out is the correct fix (or whether it would reintroduce the bug that #1042 fixed) - @cheenamalhotra I think your expertise here would be required :) In any case, I've submitted #1801, which fixes the deadlock.

Note that the netfx version of SqlDelegatedTransaction doesn't include this change (and so neither did System.Data.SqlClient), which explains why this deadlock didn't occur with .NET Framework.

@lcheunglci lcheunglci added the 💥 Regression Regression introduced from earlier PRs label Oct 11, 2022
@roji
Copy link
Member Author

roji commented Oct 12, 2022

One additional note on this... SqlClient currently calls CleanupConnectionOnTransactionCompletion twice when a transaction completes:

  1. It's called directly from SqlDelegatedTransaction.SinglePhaseCommit (on this line).
  2. It's also called from DbConnectionInternal.TransactionCompletedEvent, which is triggered when the transaction is committed.

Removing one of these should make the deadlock go away. Alternatively, detecting inside that cleanup has already occured - and specifically that the transaction has already been detached - would allow skipping taking the lock, and again resolve the deadlock. This may be safer than reducing the scope of the connection lock in SinglePhaseCommit as indicated above (although I don't have the full picture here).

@cheenamalhotra
Copy link
Member

Hi @roji

Thanks for your findings! I'm a little occupied this week, and haven't got chance to dig deep, yet.

I hope you have already, but if not, I'd like to confirm #237 doesn't reoccur (fixed by #543 - found cause here) as that's the main cause we had to limit enlistment for externally Aborted transactions.

I'll try getting back to you in a day or two!

@nathangreaves
Copy link

Hi @cheenamalhotra, please could we have an update on a rough timescale for this?
It's a blocker for our org for upgrading from .net framework to .net 7

@llermanos
Copy link

Hello @cheenamalhotra .
@roji mentioned a workaround of using TransactionInterop.GetTransmitterPropagationToken(Transaction.Current) .
I don't know if it is correct to temporary apply this "fix" , or to wait for the bug to be solved.
Any ETA or time to update?.
I have been able to migrate 60% of our solution from .NET 4.8 to 7.0, but we are blocked by this issue.
Thanks!

@cheenamalhotra
Copy link
Member

@nathangreaves @llermanos

Very sorry for being caught up for longer than I thought.. I hope this would be fixed soon so you can be unblocked!
Added some notes here: #1801 (comment)

cc @David-Engel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💥 Regression Regression introduced from earlier PRs
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants