Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix potential race condition in node-relay #1716

Merged
merged 3 commits into from
Mar 8, 2021

Conversation

t-bast
Copy link
Member

@t-bast t-bast commented Mar 4, 2021

We previously relied on context.child to check whether we already had a relay handler for a given payment_hash.

Unfortunately this can return an actor that is currently stopping itself.
When that happens our relay command can end up in the dead letters and the payment will not be relayed, nor be failed upstream.

We fix that by maintaining the list of current relay handlers in the NodeRelayer and removing them from the list before stopping them. This is similar to what's done in the MultiPartPaymentFSM.

This bug has been seen in the wild by the Electrum team with an eclair trampoline node that they run. Phoenix is not affected by this, because this bug only happens when the sender uses trampoline with multiple incoming HTLCs, but without letting the trampoline node aggregate the incoming MPP.

In their case, the NodeRelayer receives for example 2 HTLCs with the same payment_hash, but each HTLC sets total_amount_msat to its amount_msat (so the trampoline node cannot aggregate them and must relay them separately).

If the first HTLC fails downstream slightly before the second HTLC reaches our node, the NodeRelayer thinks it has a handler for the payment_hash of the second HTLC but in fact it's stopping itself, and when we send it NodeRelay.Relay with the second HTLC the actor is gone and we don't correctly fail the HTLC upstream.

We previously relied on `context.child` to check whether we already had a
relay handler for a given payment_hash.

Unfortunately this could return an actor that is currently stopping itself.
When that happens our relay command can end up in the dead letters and the
payment will not be relayed, nor be failed upstream.

We fix that by maintaining the list of current relay handlers in the
NodeRelayer and removing them from the list before stopping them.
This is similar to what's done in the MultiPartPaymentFSM.
@t-bast t-bast requested a review from pm47 March 4, 2021 15:03
@codecov-io
Copy link

Codecov Report

Merging #1716 (5253c30) into master (844829a) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1716      +/-   ##
==========================================
+ Coverage   86.26%   86.29%   +0.02%     
==========================================
  Files         151      151              
  Lines       11701    11712      +11     
  Branches      501      503       +2     
==========================================
+ Hits        10094    10107      +13     
+ Misses       1607     1605       -2     
Impacted Files Coverage Δ
...cala/fr/acinq/eclair/payment/relay/NodeRelay.scala 95.41% <100.00%> (+1.71%) ⬆️
...la/fr/acinq/eclair/payment/relay/NodeRelayer.scala 100.00% <100.00%> (ø)
...q/eclair/blockchain/electrum/ElectrumWatcher.scala 74.79% <0.00%> (-2.44%) ⬇️
...main/scala/fr/acinq/eclair/io/PeerConnection.scala 82.97% <0.00%> (-0.43%) ⬇️
...main/scala/fr/acinq/eclair/router/Validation.scala 92.30% <0.00%> (+1.53%) ⬆️

@t-bast t-bast merged commit ea8f940 into master Mar 8, 2021
@t-bast t-bast deleted the trampoline-relay-race-condition branch March 8, 2021 15:14
tompro pushed a commit to tompro/eclair that referenced this pull request Mar 21, 2021
We previously relied on `context.child` to check whether we already had a
relay handler for a given payment_hash.

Unfortunately this could return an actor that is currently stopping itself.
When that happens our relay command can end up in the dead letters and the
payment will not be relayed, nor be failed upstream.

We fix that by maintaining the list of current relay handlers in the
NodeRelayer and removing them from the list before stopping them.
This is similar to what's done in the MultiPartPaymentFSM.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants