
mvcc: fix a deadlock bug in mvcc #11817

Merged (2 commits) May 7, 2020

Conversation

tangcong (Contributor) commented Apr 26, 2020:

goroutine A (restore snapshot):

go.etcd.io/etcd/pkg/schedule.(*fifo).Stop(0xc36923baa0)
	/tmp/etcd-release-3.4.7/etcd/release/etcd/pkg/schedule/schedule.go:124 +0x79
go.etcd.io/etcd/mvcc.(*store).Restore(0xc0005f00f0, 0x128c0c0, 0xc49e5deb00, 0x0, 0x0)
	/tmp/etcd-release-3.4.7/etcd/release/etcd/mvcc/kvstore.go:345 +0x9e
go.etcd.io/etcd/mvcc.(*watchableStore).Restore(0xc000150140, 0x128c0c0, 0xc49e5deb00, 0x0, 0x0)
	/tmp/etcd-release-3.4.7/etcd/release/etcd/mvcc/watchable_store.go:191 +0xae
go.etcd.io/etcd/etcdserver.(*EtcdServer).applySnapshot(0xc0002a6580, 0xc00033efa0, 0xc0c0805700)
	/tmp/etcd-release-3.4.7/etcd/release/etcd/etcdserver/server.go:1207 +0xc2c
go.etcd.io/etcd/etcdserver.(*EtcdServer).applyAll(0xc0002a6580, 0xc00033efa0, 0xc0c0805700)
	/tmp/etcd-release-3.4.7/etcd/release/etcd/etcdserver/server.go:1102 +0x5d
go.etcd.io/etcd/etcdserver.(*EtcdServer).run.func8(0x1278e20, 0xc29824e040)
	/tmp/etcd-release-3.4.7/etcd/release/etcd/etcdserver/server.go:1048 +0x3c
go.etcd.io/etcd/pkg/schedule.(*fifo).run(0xc260f6c240)

goroutine B (compact job):

sync.runtime_SemacquireMutex(0xc0005f0124, 0x0)
	/usr/local/go/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(0xc0005f0120)
	/usr/local/go/src/sync/mutex.go:134 +0x109
sync.(*RWMutex).Lock(0xc0005f0120)
	/usr/local/go/src/sync/rwmutex.go:93 +0x2d
go.etcd.io/etcd/mvcc.(*store).compactBarrier(0xc0005f00f0, 0x0, 0x0, 0xc28ac7e6c0)
	/tmp/etcd-release-3.4.7/etcd/release/etcd/mvcc/kvstore.go:166 +0x55
go.etcd.io/etcd/mvcc.(*store).compact.func1(0x1278e20, 0xc15ddc41c0)
	/tmp/etcd-release-3.4.7/etcd/release/etcd/mvcc/kvstore.go:284 +0x175
go.etcd.io/etcd/pkg/schedule.(*fifo).run(0xc36923baa0)
	/tmp/etcd-release-3.4.7/etcd/release/etcd/pkg/schedule/schedule.go:157 +0xdb
created by go.etcd.io/etcd/pkg/schedule.NewFIFOScheduler
	/tmp/etcd-release-3.4.7/etcd/release/etcd/pkg/schedule/schedule.go:70 +0x13d

Goroutine A is holding the lock and has closed the stopc channel, and is waiting for the scheduled job to end.
Goroutine B is running the compaction job; because stopc is closed it returns false and calls compactBarrier, which tries to acquire the lock that A still holds, so a deadlock occurs.
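The interaction can be pictured with a minimal, self-contained sketch. The `store` type, its fields, and the `compactBarrier` signature below are illustrative stand-ins for the etcd internals, not the actual code; the `select` on `stopc` before taking the lock models the pattern of the fix:

```go
package main

import (
	"fmt"
	"sync"
)

// store is a toy model of the pattern in this PR: a mutex guarding
// state plus a stop channel that a snapshot restore closes.
type store struct {
	mu    sync.Mutex
	stopc chan struct{}
}

// compactBarrier mimics goroutine B. Before the fix it locked s.mu
// unconditionally; with the fix it first checks whether stopc is
// already closed (a restore holds the lock) and bails out instead
// of blocking. It reports whether it acquired the lock.
func (s *store) compactBarrier() bool {
	select {
	case <-s.stopc:
		// Restore is in progress and holds s.mu; locking here
		// would deadlock, so give up.
		return false
	default:
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	return true
}

func main() {
	s := &store{stopc: make(chan struct{})}

	// Goroutine A (restore): take the lock and close stopc -- the
	// exact state captured in the stack traces above.
	s.mu.Lock()
	close(s.stopc)

	done := make(chan bool)
	go func() { done <- s.compactBarrier() }() // goroutine B

	// With the stopc check, B returns instead of blocking on s.mu,
	// so this receive completes and there is no deadlock.
	fmt.Println("compact job acquired lock:", <-done)

	// Restore finishes: replace stopc and release the lock, after
	// which compaction can take the lock normally again.
	s.stopc = make(chan struct{})
	s.mu.Unlock()
	fmt.Println("compact job acquired lock:", s.compactBarrier())
}
```

Without the `select`, the spawned goroutine would block on `s.mu.Lock()` forever while the restore path waits for it, reproducing the hang in the traces.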

tangcong (Contributor, Author) commented:
This bug causes etcdserver to hang and affects all etcd3 versions. @jingyih @gyuho @xiang90

select {
case <-s.stopc:
default:
	s.mu.Lock()
A reviewer (Contributor) commented:
Can we document why this is safe? (e.g. s.stopc is only updated in the restore operation, which is called by the apply-snapshot path; compaction and apply-snapshot requests are serialized by raft and do not happen at the same time)

tangcong replied Apr 26, 2020:
Thanks, updated.

tangcong commented Apr 28, 2020

When the cluster handles many write requests, there is some probability of triggering this deadlock. Shall we backport the fix to 3.4/3.3? Thanks. @gyuho

tangcong commented May 7, 2020

Do you have any other suggestions for solving this problem? Thanks. @gyuho

gyuho commented May 7, 2020

@tangcong Thanks for the update. LGTM. Let's backport to 3.4 and 3.3.

tangcong commented May 8, 2020

@gyuho thanks. I have submitted PRs #11855, #11856, and #11857 to backport it to release-3.4/3.3.

gyuho added a commit that referenced this pull request May 8, 2020
…17-origin-release-3.3

Automated cherry pick of #11817 on release-3.3
gyuho added a commit that referenced this pull request May 8, 2020
…17-origin-release-3.4

Automated cherry pick of #11817 on release-3.4
tangcong deleted the fix-mvcc-deadlock-bug branch February 26, 2021