Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid acquiring another read lock while holding one to avoid potential deadlock #6200

Merged
merged 1 commit into from
Jul 30, 2024

Conversation

jimmygchen
Copy link
Member

@jimmygchen jimmygchen commented Jul 29, 2024

Issue Addressed

While testing PeerDAS I ran into a scenario that looked like a deadlock. The Lighthouse process was still live but stopped responding and logging anything.

The backtrace shows that one of the thread is waiting for a read lock

(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x0000556ea43cf66d in parking_lot::raw_rwlock::RawRwLock::lock_shared_slow ()
#2  0x0000556ea5cbb969 in eth1::service::Service::earliest_block_timestamp ()
#3  0x0000556ea54fc61e in beacon_chain::eth1_chain::Eth1Chain<T,E>::eth1_data_for_block_production ()

while another one is waiting for a write lock, potentially here

(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x0000556ea6ffc92d in parking_lot::raw_rwlock::RawRwLock::wait_for_readers ()
#2  0x0000556ea43cedf0 in parking_lot::raw_rwlock::RawRwLock::lock_exclusive_slow ()
#3  0x0000556ea5cbf35f in <futures_util::future::maybe_done::MaybeDone<Fut> as core::future::future::Future>::poll ()
#4  0x0000556ea5cb643b in <futures_util::future::poll_fn::PollFn<F> as core::future::future::Future>::poll ()
#5  0x0000556ea5ccb06d in eth1::service::Service::do_update::{{closure}} ()

I'm not sure if I've got to the bottom of the problem yet, but I'm suspecting that the code here is causing the deadlock, as it tries to acquire a read lock while we already hold one in scope. If do_update attempts to acquire a write lock when the deposit cache read lock is in place, then it will wait for the read lock to release. However the code below attempts to acquire the read lock again while the old read lock is still in scope, causing the write lock to be stuck and a dead lock in this function:

head_block_number: self.inner.block_cache.read().highest_block_number(),

This part of the code hasn't changed in ages though, so I could be wrong or running into an edge case.

@jimmygchen jimmygchen added bug Something isn't working ready-for-review The code is ready for review labels Jul 29, 2024
@michaelsproul
Copy link
Member

michaelsproul commented Jul 29, 2024

I'd need to check the drop rules, but the compiler may insert a drop for block_cache as soon as it is no longer used. Although I'm not sure of this, and it is definitely better to be explicit

Nice find!!

@michaelsproul michaelsproul changed the base branch from unstable to release-v5.3.0 July 30, 2024 05:45
@michaelsproul michaelsproul added the v5.3.0 Q3 2024 release with database changes! label Jul 30, 2024
@michaelsproul michaelsproul added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Jul 30, 2024
@michaelsproul
Copy link
Member

@mergify queue

Copy link

mergify bot commented Jul 30, 2024

queue

🛑 The pull request has been removed from the queue default

The queue conditions cannot be satisfied due to failing checks.

You can take a look at Queue: Embarked in merge queue check runs for more details.

In case of a failure due to a flaky test, you should first retrigger the CI.
Then, re-embark the pull request into the merge queue by posting the comment
@mergifyio refresh on the pull request.

@jimmygchen
Copy link
Member Author

@mergify requeue

Copy link
Collaborator

@dapplion dapplion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find! But why did this showed up in das branch and not on stable?

@jimmygchen
Copy link
Member Author

@mergify requeue

Copy link

mergify bot commented Jul 30, 2024

requeue

✅ This pull request will be re-embarked automatically

The followup queue command will be automatically executed to re-embark the pull request

Copy link

mergify bot commented Jul 30, 2024

queue

🛑 The pull request has been removed from the queue default

The queue conditions cannot be satisfied due to failing checks.

You can take a look at Queue: Embarked in merge queue check runs for more details.

In case of a failure due to a flaky test, you should first retrigger the CI.
Then, re-embark the pull request into the merge queue by posting the comment
@mergifyio refresh on the pull request.

@jimmygchen
Copy link
Member Author

jimmygchen commented Jul 30, 2024

Nice find! But why did this showed up in das branch and not on stable?

My guess is this is a super rare edge case - its really hard to hit this because there's no long running operation between the two read locks - only simple time calculation and logging - and this could only happen if another thread attempts to acquire a write lock before the 2nd read lock is being acquired.

@jimmygchen
Copy link
Member Author

@mergify refresh

Copy link

mergify bot commented Jul 30, 2024

refresh

✅ Pull request refreshed

@jimmygchen
Copy link
Member Author

@mergify requeue

Copy link

mergify bot commented Jul 30, 2024

requeue

✅ This pull request will be re-embarked automatically

The followup queue command will be automatically executed to re-embark the pull request

Copy link

mergify bot commented Jul 30, 2024

queue

🛑 The pull request has been removed from the queue default

The queue conditions cannot be satisfied due to failing checks.

You can take a look at Queue: Embarked in merge queue check runs for more details.

In case of a failure due to a flaky test, you should first retrigger the CI.
Then, re-embark the pull request into the merge queue by posting the comment
@mergifyio refresh on the pull request.

@michaelsproul
Copy link
Member

@mergify requeue

Copy link

mergify bot commented Jul 30, 2024

requeue

✅ This pull request will be re-embarked automatically

The followup queue command will be automatically executed to re-embark the pull request

Copy link

mergify bot commented Jul 30, 2024

queue

🛑 The pull request has been removed from the queue default

The queue conditions cannot be satisfied due to failing checks.

You can take a look at Queue: Embarked in merge queue check runs for more details.

In case of a failure due to a flaky test, you should first retrigger the CI.
Then, re-embark the pull request into the merge queue by posting the comment
@mergifyio refresh on the pull request.

@michaelsproul
Copy link
Member

@mergify refresh

Copy link

mergify bot commented Jul 30, 2024

refresh

✅ Pull request refreshed

@michaelsproul
Copy link
Member

@mergify queue

Copy link

mergify bot commented Jul 30, 2024

queue

🛑 The pull request has been removed from the queue default

The queue conditions cannot be satisfied due to failing checks.

You can take a look at Queue: Embarked in merge queue check runs for more details.

In case of a failure due to a flaky test, you should first retrigger the CI.
Then, re-embark the pull request into the merge queue by posting the comment
@mergifyio refresh on the pull request.

@michaelsproul
Copy link
Member

@mergify requeue

Copy link

mergify bot commented Jul 30, 2024

requeue

✅ This pull request will be re-embarked automatically

The followup queue command will be automatically executed to re-embark the pull request

Copy link

mergify bot commented Jul 30, 2024

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at 9b3b730

@mergify mergify bot merged commit 9b3b730 into sigp:release-v5.3.0 Jul 30, 2024
28 checks passed
AgeManning pushed a commit to AgeManning/lighthouse that referenced this pull request Sep 3, 2024
…l deadlock (sigp#6200)

* Avoid acquiring another read lock to avoid potential deadlock.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working ready-for-merge This PR is ready to merge. v5.3.0 Q3 2024 release with database changes!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants