
Ext4 filesystem going read only #3884

Closed
YannSK opened this issue Jun 23, 2021 · 4 comments

Comments

@YannSK

YannSK commented Jun 23, 2021

Describe the bug
After a few weeks of using Loki without any problem (thanks for this great software, BTW!), my ext4 filesystem used for chunk storage has gone read-only for the 2nd time in 2 days. I have to repair it with fsck and reboot to get my logs ingested again.

To Reproduce
Not sure how to reproduce; it happens during normal runtime while the load average is low (<1).

Expected behavior
The filesystem does not become corrupted.

Environment:

  • Infrastructure: KVM virtual machine dedicated to Loki, with virtio devices and external storage (SAN)
  • Deployment tool: ansible
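
For context, a Loki storage configuration consistent with the paths that appear in the logs below would look roughly like this sketch (a hypothetical reconstruction from the log paths, not the reporter's actual config; key names follow Loki 2.x boltdb-shipper conventions):

```yaml
# Hypothetical sketch reconstructed from paths in the logs below;
# not the actual configuration from this deployment.
ingester:
  wal:
    enabled: true
    dir: /tmp/wal            # source of the checkpoint.000262 messages

storage_config:
  boltdb_shipper:
    active_index_directory: /opt/loki_promtail/data/loki/boltdb-shipper-active
    shared_store: filesystem
  filesystem:
    directory: /opt/loki_promtail/data/loki/chunks   # base64-named chunk files
```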

Screenshots, Promtail config, or terminal output
Below are the logs preceding the last incident:
...
2021-06-23T08:31:24.048653+02:00 os137 loki[773]: level=info ts=2021-06-23T06:31:24.041937572Z caller=table.go:421 msg="cleaning up unwanted dbs from table index_18800"
2021-06-23T08:31:54.011436+02:00 os137 loki[773]: level=error ts=2021-06-23T06:31:54.009964963Z caller=flush.go:220 org_id=fake msg="failed to flush user" err="open /opt/loki_promtail/data/loki/chunks/ZmFrZS8zZGNiZTdmNGViYWQ5YmY1OjE3YTM3NWEzM2NiOjE3YTM3NWEzN2RlOjIwNDdmNTc=: file exists"
...
2021-06-23T08:32:23.223376+02:00 os137 loki[773]: level=info ts=2021-06-23T06:32:23.22309268Z caller=table.go:336 msg="uploading table index_18801"
2021-06-23T08:32:25.385712+02:00 os137 loki[773]: level=error ts=2021-06-23T06:32:25.384976325Z caller=flush.go:220 org_id=fake msg="failed to flush user" err="open /opt/loki_promtail/data/loki/chunks/ZmFrZS8zZGNiZTdmNGViYWQ5YmY1OjE3YTM3NWEzM2NiOjE3YTM3NWEzN2RlOjIwNDdmNTc=: file exists"
2021-06-23T08:32:48.253400+02:00 os137 loki[773]: level=info ts=2021-06-23T06:32:48.252284136Z caller=checkpoint.go:497 msg="atomic checkpoint finished" old=/tmp/wal/checkpoint.000262.tmp new=/tmp/wal/checkpoint.000262
2021-06-23T08:32:48.312392+02:00 os137 loki[773]: level=info ts=2021-06-23T06:32:48.312112456Z caller=checkpoint.go:568 msg="checkpoint done" time=4m25.045689399s
2021-06-23T08:32:53.806396+02:00 os137 kernel: [78622.908286] EXT4-fs error (device dm-0): dx_probe:856: inode #27787274: block 156904: comm loki: directory leaf block found instead of index block
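
Incidentally, the long base64 names in the "failed to flush user" errors above decode to Loki's external chunk key. A small sketch to decode the one from the log (the filename is copied verbatim from the error above; the `tenant/fingerprint:from:through:checksum` layout, with `from`/`through` as hex Unix-millisecond timestamps, is my understanding of Loki's filesystem chunk store, so treat it as an assumption):

```python
import base64
from datetime import datetime, timezone

# Chunk filename copied verbatim from the "failed to flush user" error above.
name = "ZmFrZS8zZGNiZTdmNGViYWQ5YmY1OjE3YTM3NWEzM2NiOjE3YTM3NWEzN2RlOjIwNDdmNTc="

key = base64.b64decode(name).decode()
print(key)  # fake/3dcbe7f4ebad9bf5:17a375a33cb:17a375a37de:2047f57

# Assumed layout: tenant/fingerprint:from:through:checksum,
# where from/through are hex Unix timestamps in milliseconds.
tenant, rest = key.split("/", 1)
fingerprint, frm, through, checksum = rest.split(":")
for label, hex_ms in (("from", frm), ("through", through)):
    ts = datetime.fromtimestamp(int(hex_ms, 16) / 1000, tz=timezone.utc)
    print(label, ts.isoformat())
```

The decoded tenant is `fake` (the default `org_id` when auth is disabled), and the `from`/`through` timestamps land on the morning of 2021-06-23, matching the incident window.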

fsck repaired thousands of errors in the filesystem; a few samples:

fsck.ext4 -y /dev/mapper/vg0-root

e2fsck 1.44.5 (15-Dec-2018)
/dev/mapper/vg0-root contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 2752524, i_blocks is 321136, should be 320744. Fix? yes
Pass 1D: Reconciling multiply-claimed blocks
(There are 3 inodes containing multiply-claimed blocks.)

File /opt/loki_promtail/data/loki/boltdb-shipper-active/index_18801/1624429800 (inode #2864969, mod time Wed Jun 23 08:32:53 2021)
has 558 multiply-claimed block(s), shared with 1 file(s):
/opt/loki_promtail/data/loki/chunks/ZmFrZS80NDVkZDczODU2NTU0ZGM1OjE3YTM3MjM4ZjI4OjE3YTM3OTBlODkzOmZjZGY1MTNh (inode #2931885, mod time Wed Jun 23 08:31:23 2021)
Clone multiply-claimed blocks? yes
Entry 'promtail_positions.yaml' in /var/log (28573866) has deleted/unused inode 28577456. Clear? yes
Entry 'ZmFrZS8yYTFlNWY5MTQwZDYyMzEwOjE3YTM3M2RkMDNkOjE3YTM3OTAzNTk0OjFjOGJhMTU=' in /opt/loki_promtail/data/loki/chunks (27787274) has deleted/unused inode 2931688. Clear? yes
Free blocks count wrong for group #1975 (32768, counted=22184).
Fix? yes
Free inodes count wrong for group #3983 (8192, counted=0).
Fix? yes
Free inodes count wrong (32670759, counted=25684403).
Fix? yes
/dev/mapper/vg0-root: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/vg0-root: ***** REBOOT SYSTEM *****
/dev/mapper/vg0-root: 7018061/32702464 files (0.0% non-contiguous), 53046165/130778112 blocks

@dannykopping
Contributor

Hey @YannSK - to be clear: do you think Loki is corrupting your ext4 filesystem?

@YannSK
Author

YannSK commented Jun 30, 2021

Hi, I'm running hundreds of VMs identical to this one in high-I/O-load environments and have never seen recurrent filesystem corruption like this, so yes, I think it's somehow related to Loki. I've checked what I could; the system and storage behave as I expect on this VM.
Still, if I'm the only one seeing this and you don't think it can be related to Loki, then I'll accept that and investigate further on my side.

@YannSK
Author

YannSK commented Jul 12, 2021

I cannot reproduce this using local SSD storage on the hypervisor; it's been running like a charm for a week now.
So it was storage-related after all, sorry for the noise!

@YannSK YannSK closed this as completed Jul 12, 2021
@dannykopping
Contributor

No problem @YannSK - thank you for following up 👍
