
Ext4 filesystem going read only #3884

Closed
YannSK opened this issue Jun 23, 2021 · 4 comments

Comments

@YannSK

YannSK commented Jun 23, 2021

Describe the bug
After a few weeks of using Loki without any problem (thanks for this great software, BTW!), my ext4 filesystem used for chunk storage has gone read-only for the 2nd time in 2 days. I have to repair it with fsck and reboot to get my logs ingested again.

To Reproduce
Not sure how to reproduce; it happens during normal runtime while the load average is low (<1).

Expected behavior
The filesystem does not become corrupted.

Environment:

  • Infrastructure: KVM virtual machine dedicated to Loki, with virtio devices and external storage (SAN)
  • Deployment tool: ansible
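
For context, a Loki storage configuration consistent with the paths that appear in the logs below would look roughly like this sketch (a hypothetical reconstruction from the log paths, not the reporter's actual config; key names follow Loki 2.x boltdb-shipper conventions):

```yaml
# Hypothetical sketch reconstructed from paths in the logs below;
# not the actual configuration from this deployment.
ingester:
  wal:
    enabled: true
    dir: /tmp/wal            # source of the checkpoint.000262 messages

storage_config:
  boltdb_shipper:
    active_index_directory: /opt/loki_promtail/data/loki/boltdb-shipper-active
    shared_store: filesystem
  filesystem:
    directory: /opt/loki_promtail/data/loki/chunks   # base64-named chunk files
```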

Screenshots, Promtail config, or terminal output
Below are the logs preceding the last incident:
...
2021-06-23T08:31:24.048653+02:00 os137 loki[773]: level=info ts=2021-06-23T06:31:24.041937572Z caller=table.go:421 msg="cleaning up unwanted dbs from table index_18800"
2021-06-23T08:31:54.011436+02:00 os137 loki[773]: level=error ts=2021-06-23T06:31:54.009964963Z caller=flush.go:220 org_id=fake msg="failed to flush user" err="open /opt/loki_promtail/data/loki/chunks/ZmFrZS8zZGNiZTdmNGViYWQ5YmY1OjE3YTM3NWEzM2NiOjE3YTM3NWEzN2RlOjIwNDdmNTc=: file exists"
...
2021-06-23T08:32:23.223376+02:00 os137 loki[773]: level=info ts=2021-06-23T06:32:23.22309268Z caller=table.go:336 msg="uploading table index_18801"
2021-06-23T08:32:25.385712+02:00 os137 loki[773]: level=error ts=2021-06-23T06:32:25.384976325Z caller=flush.go:220 org_id=fake msg="failed to flush user" err="open /opt/loki_promtail/data/loki/chunks/ZmFrZS8zZGNiZTdmNGViYWQ5YmY1OjE3YTM3NWEzM2NiOjE3YTM3NWEzN2RlOjIwNDdmNTc=: file exists"
2021-06-23T08:32:48.253400+02:00 os137 loki[773]: level=info ts=2021-06-23T06:32:48.252284136Z caller=checkpoint.go:497 msg="atomic checkpoint finished" old=/tmp/wal/checkpoint.000262.tmp new=/tmp/wal/checkpoint.000262
2021-06-23T08:32:48.312392+02:00 os137 loki[773]: level=info ts=2021-06-23T06:32:48.312112456Z caller=checkpoint.go:568 msg="checkpoint done" time=4m25.045689399s
2021-06-23T08:32:53.806396+02:00 os137 kernel: [78622.908286] EXT4-fs error (device dm-0): dx_probe:856: inode #27787274: block 156904: comm loki: directory leaf block found instead of index block
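
Incidentally, the long base64 names in the "failed to flush user" errors above decode to Loki's external chunk key. A small sketch to decode the one from the log (the filename is copied verbatim from the error above; the `tenant/fingerprint:from:through:checksum` layout, with `from`/`through` as hex Unix-millisecond timestamps, is my understanding of Loki's filesystem chunk store, so treat it as an assumption):

```python
import base64
from datetime import datetime, timezone

# Chunk filename copied verbatim from the "failed to flush user" error above.
name = "ZmFrZS8zZGNiZTdmNGViYWQ5YmY1OjE3YTM3NWEzM2NiOjE3YTM3NWEzN2RlOjIwNDdmNTc="

key = base64.b64decode(name).decode()
print(key)  # fake/3dcbe7f4ebad9bf5:17a375a33cb:17a375a37de:2047f57

# Assumed layout: tenant/fingerprint:from:through:checksum,
# where from/through are hex Unix timestamps in milliseconds.
tenant, rest = key.split("/", 1)
fingerprint, frm, through, checksum = rest.split(":")
for label, hex_ms in (("from", frm), ("through", through)):
    ts = datetime.fromtimestamp(int(hex_ms, 16) / 1000, tz=timezone.utc)
    print(label, ts.isoformat())
```

The decoded tenant is `fake` (the default `org_id` when auth is disabled), and the `from`/`through` timestamps land on the morning of 2021-06-23, matching the incident window.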

fsck repaired thousands of errors in the filesystem; a few samples:

fsck.ext4 -y /dev/mapper/vg0-root

e2fsck 1.44.5 (15-Dec-2018)
/dev/mapper/vg0-root contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 2752524, i_blocks is 321136, should be 320744. Fix? yes
Pass 1D: Reconciling multiply-claimed blocks
(There are 3 inodes containing multiply-claimed blocks.)

File /opt/loki_promtail/data/loki/boltdb-shipper-active/index_18801/1624429800 (inode #2864969, mod time Wed Jun 23 08:32:53 2021)
has 558 multiply-claimed block(s), shared with 1 file(s):
/opt/loki_promtail/data/loki/chunks/ZmFrZS80NDVkZDczODU2NTU0ZGM1OjE3YTM3MjM4ZjI4OjE3YTM3OTBlODkzOmZjZGY1MTNh (inode #2931885, mod time Wed Jun 23 08:31:23 2021)
Clone multiply-claimed blocks? yes
Entry 'promtail_positions.yaml' in /var/log (28573866) has deleted/unused inode 28577456. Clear? yes
Entry 'ZmFrZS8yYTFlNWY5MTQwZDYyMzEwOjE3YTM3M2RkMDNkOjE3YTM3OTAzNTk0OjFjOGJhMTU=' in /opt/loki_promtail/data/loki/chunks (27787274) has deleted/unused inode 2931688. Clear? yes
Free blocks count wrong for group #1975 (32768, counted=22184).
Fix? yes
Free inodes count wrong for group #3983 (8192, counted=0).
Fix? yes
Free inodes count wrong (32670759, counted=25684403).
Fix? yes
/dev/mapper/vg0-root: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/vg0-root: ***** REBOOT SYSTEM *****
/dev/mapper/vg0-root: 7018061/32702464 files (0.0% non-contiguous), 53046165/130778112 blocks

@dannykopping
Contributor

Hey @YannSK - to be clear: do you think Loki is corrupting your ext4 filesystem?

@YannSK
Author

YannSK commented Jun 30, 2021

Hi, I'm running hundreds of VMs identical to this one in high-I/O-load environments and have never seen recurrent filesystem corruption like this, so yes, I think it's somehow related to Loki. I've checked what I could; the system and storage behave as I expect on this VM.
Still, if I'm the only one seeing this and you don't think it can be related to Loki, then I'll accept that and investigate further on my side.

@YannSK
Author

YannSK commented Jul 12, 2021

I cannot reproduce this using local SSD storage on the hypervisor; it's been running like a charm for a week now.
So it was storage-related after all, sorry for the noise!

@YannSK YannSK closed this as completed Jul 12, 2021
@dannykopping
Contributor

No problem @YannSK - thank you for following up 👍
