SIGBUS on startup in etcd-3.5.0 after filesystem rollback #13406
Comments
It seems that the data file etcd/member/snap/db is corrupted.
The low-level bolt command also fails to open that file:
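For context, here is a minimal sketch (my addition, not from the thread) of opening the file directly with the bbolt library in read-only mode; on a damaged file this is expected to fail or panic, depending on which pages are corrupted:

```go
// Minimal sketch (assumption, not the exact command used in the thread):
// open the db file directly with the bbolt library in read-only mode.
package main

import (
	"fmt"

	bolt "go.etcd.io/bbolt"
)

func main() {
	db, err := bolt.Open("etcd/member/snap/db", 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		// e.g. "invalid database" when the meta pages fail validation
		fmt.Println("open failed:", err)
		return
	}
	defer db.Close()
	fmt.Println("opened OK")
}
```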
In general, bbolt would be a better component for this type of issue.
I am pretty sure the db file etcd/member/snap/db is corrupted. The file size is 2527232 bytes, and the pageSize in the meta page is 4096, so there are 617 (2527232/4096) pages in total. But the pgid value in the meta page is 1706, which is out of range. There are also some invalid entries in the B-tree internal nodes. I provided a solution to fix the db file; please note that there may be some data loss, and I am not responsible for the data loss!
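To illustrate the arithmetic above, here is a minimal sketch (my addition) that reads meta page 0 and checks whether meta.pgid, the page high-water mark, fits inside the file. It assumes bbolt's on-disk layout (a 16-byte page header followed by the meta struct) and a little-endian host such as linux-amd64:

```go
// Minimal sketch, assuming bbolt's on-disk layout and a little-endian
// host: check whether meta.pgid is consistent with the file size.
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

func main() {
	data, err := os.ReadFile("etcd/member/snap/db")
	if err != nil {
		panic(err)
	}
	const metaOff = 16 // page header: id(8) + flags(2) + count(2) + overflow(4)
	pageSize := binary.LittleEndian.Uint32(data[metaOff+8:]) // meta.pageSize
	pgid := binary.LittleEndian.Uint64(data[metaOff+40:])    // meta.pgid
	pages := uint64(len(data)) / uint64(pageSize)
	fmt.Printf("file holds %d pages, meta.pgid = %d\n", pages, pgid)
	if pgid > pages {
		fmt.Println("meta.pgid is out of range: file truncated or corrupted")
	}
	// For this db: 2527232/4096 = 617 pages, but meta.pgid = 1706.
}
```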
@bencord0 Please let me know whether the solution works for you. Can I upload the original zip file etcd-crash-on-startup.zip into my personal public repo etcd-issues, so that others can learn from the issue?
Yes! I've managed to test this this morning, and it does fix the issue for me.
Yes, this is fine. As I outlined in the original report, this is from a test cluster, and I managed to isolate the problem to etcd alone, not the entire k0s setup. Out of curiosity, how did you find the broken pages, and how did you calculate the checksums? If I see a similar issue in the future, is it possible to do the recovery myself? Could this solution be made into a generic fsck-like tool? Given a choice, I think it could be acceptable (in some cases, and definitely as an administrator's choice) to recover etcd with data loss rather than to crash on startup.
Thanks. I only uploaded the file
The db file is actually updated using bbolt, and the data are organized as a B-tree. So you just need to analyze the data using the same logic that bbolt uses.
Please see db.go#L1221
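As a concrete illustration of the checksum calculation bbolt performs, here is a minimal sketch (my addition): the meta checksum is FNV-64a over the meta struct bytes that precede the checksum field. The offsets assume bbolt's layout, and a little-endian host is assumed:

```go
// Minimal sketch of bbolt's meta checksum: FNV-64a over the meta struct
// bytes before the checksum field (offsets assume bbolt's layout).
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"os"
)

func main() {
	data, err := os.ReadFile("etcd/member/snap/db")
	if err != nil {
		panic(err)
	}
	const metaOff = 16          // meta struct starts after the 16-byte page header
	const sumOff = metaOff + 56 // offset of meta.checksum within the page
	h := fnv.New64a()
	h.Write(data[metaOff:sumOff]) // everything in meta before the checksum
	want := binary.LittleEndian.Uint64(data[sumOff:])
	if got := h.Sum64(); got == want {
		fmt.Println("meta 0 checksum OK")
	} else {
		fmt.Printf("meta 0 checksum mismatch: got %x, want %x\n", got, want)
	}
}
```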
Yes, it's a good idea. I may submit a PR to bbolt. What do you think? @ptabor
Not crashing in such a situation, and providing more data about what's broken/inconsistent.
@ptabor Thanks for the comment. I will submit a PR later. @bencord0 I wonder how the corrupted db file was generated. Could you provide more detailed steps or info?
@ptabor I just submitted a PR bbolt/pull/294 for bolt. It's just the very first step toward fixing this issue. Once this one gets merged, there will be a couple of follow-up PRs.
@ahrtr How familiar are you with zfs snapshots? I created this from a live snapshot of a running system. etcd was installed as part of the default k0s configuration (single-host controller) in a systemd-nspawn container running Gentoo (although none of the packages here were installed via the package manager, and from what I can tell, the etcd binary matches the checksums of the tagged 3.5 release).
Someone else asked about my process for doing systemd containers recently, so I wrote it up if you're interested.
In this case, I installed the controller node in an nspawn container, and the worker node in a VM, since the kubelet currently doesn't run when nested inside an existing container.
At this point, etcd is not running, and the filesystem has been cleanly rolled back to a consistent state (i.e. all writes before
@bencord0 Thanks for the detailed info. I am not familiar with zfs. I suspect it has something to do with the live snapshotting, meaning you created the zfs snapshot while etcd was still serving requests. On the other hand, snapshotting is a generic technology: the CSI spec has a clear definition of the snapshot API, and some CSI drivers already support snapshot functionality. So I suppose the snapshot isn't the root cause, nor the zfs snapshot specifically, although a zfs snapshot may differ from a CSI snapshot.

How often do you run into this issue? Can it be reproduced easily by creating a zfs snapshot while etcd is running? I would suggest backing up etcd using etcdctl; a client-API sketch follows below.

I double-checked the meta pages: neither of them is corrupted, but the pgid in the meta page is obviously out of range. @ptabor Any thoughts on the possible reasons for the corrupted db file?
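A minimal sketch (my addition) of the programmatic counterpart of an etcdctl backup, streaming a consistent backend snapshot through the clientv3 Maintenance API; the endpoint and output path are placeholders:

```go
// Minimal sketch: stream a consistent backend snapshot from a live etcd
// via the clientv3 Maintenance API. Endpoint and output path are
// placeholder assumptions.
package main

import (
	"context"
	"io"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	rc, err := cli.Snapshot(context.Background()) // consistent copy of the db
	if err != nil {
		panic(err)
	}
	defer rc.Close()

	out, err := os.Create("backup.db")
	if err != nil {
		panic(err)
	}
	defer out.Close()
	if _, err := io.Copy(out, rc); err != nil {
		panic(err)
	}
}
```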
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Hi, I seem to have found a reproducible bug in etcd after restarting from a zfs snapshot rollback.
This occurred while I was testing a k0s based kubernetes cluster, and since there isn't much in the database, I've provided a zipfile with the data-dir if you need to debug.
etcd-crash-on-startup.zip
Here's a stack trace of the crash.
And you should be able to reproduce this on any linux-amd64 host by running the start.sh in the attached zip file.

Expected Result
etcd's storage is durable and is able to start up and recover the database after a crash event (or in my case, a zfs rollback).
Actual Result
Runtime panic in bbolt when opening the backend.
I've also tried testing the snapshot with etcdutl snapshot status ./etcd/member/snap/db, and the same crash occurs.

Steps to reproduce
You can reproduce the crash by extracting the attached zip file and running etcd --data-dir=./etcd in that directory.

This is the contents of /var/lib/k0s/ from my test host, with the etcd binary (which I think matches the released v3.5.0 distribution) and the etcd data dir. This was created from a running single-node k0s cluster controller. On my system, I take periodic snapshots of the zfs filesystem with the zfs-auto-snapshot tool. This filesystem snapshot was taken when the cluster was idle, and I would have expected all durable writes to have been flushed to disk.
Given that etcd's storage design includes a WAL, I would have expected that the service would be able to recover from a perceived crash (in this case, a point-in-time filesystem rollback, and not a bug in etcd), with an acceptable bit of data loss for any writes after the snapshot was taken.
This was a single-node cluster, and was not participating in distributed consensus.