
warning: fileutil: file already locked #15360

Closed
junhwong opened this issue Feb 26, 2023 · 8 comments · Fixed by #15408

Comments

@junhwong

What happened?

Got a warning-level log:

fileutil/purge.go:72 failed to lock file 
{
"path":"/wal/000000000000000c-0000000000018568.wal",
"error":"fileutil: file already locked"
}

What did you expect to happen?

I don't know what the effect of this warning is.

How can we reproduce it (as minimally and precisely as possible)?

I don't know how it came about.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
v3.5.7

Etcd configuration (command line flags or environment variables)

Minimum configuration

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

No response

Relevant log output

No response

@Elbehery
Member

I have faced this issue as well. May I ask what the root cause is?

cc @tjungblu @hasbro17 @dusk125

@jmhbnz
Member

jmhbnz commented Feb 26, 2023

Here is the link to the file responsible for that error: https://github.com/etcd-io/etcd/blob/main/client/pkg/fileutil/purge.go#L72

There was a discussion on this error last month: #15100, including some parameters that might help with preventing this error.

Can you please confirm which filesystem you are using?

@tjungblu
Contributor

I think in our case it was a customer running etcd over NFS, so it's not entirely surprising they would get an error like that.

@Elbehery
Member

Hello ✋🏽

I have a customer who faced this issue as well.

The error is raised from here, and it wraps the error from here

However, no process other than etcd had opened the path /var/lib/etcd:

[root@master-1 lib]# fuser -v /var/lib/etcd
[root@master-1 lib]# fuser -v /var/lib/etcd/member/wal/
                     USER        PID ACCESS COMMAND
/var/lib/etcd/member/wal:
                     root       3827 f.... etcd
[root@master-2 /]# fuser -v /var/lib/etcd
[root@master-2 /]# fuser -v /var/lib/etcd/member/wal/
                     USER        PID ACCESS COMMAND
/var/lib/etcd/member/wal:
                     root       4538 f.... etcd
[root@master-3 /]# fuser -v /var/lib/etcd
[root@master-3 /]# fuser -v /var/lib/etcd/member/wal
                     USER        PID ACCESS COMMAND
/var/lib/etcd/member/wal:
                     root       3494 f.... etcd

Any ideas about the cause of this ?

cc @ahrtr @spzala

@cenkalti
Member

cenkalti commented Mar 3, 2023

I can reproduce this with a fresh db when I run put workload on it. No NFS involved.

Steps:
Run etcd.

etcd Version: 3.5.7
Git SHA: 215b53cf3
Go Version: go1.19
Go OS/Arch: linux/amd64

Run benchmark:

benchmark --target-leader --conns=10 --clients=10 \
    put --key-size=8 --key-space-size=1000000000 --total=100000 --val-size=20000

Warning message observed in log:

{"level":"info","ts":"2023-03-03T18:08:57.548-0500","caller":"wal/wal.go:785","msg":"created a new WAL segment","path":"default.etcd/member/wal/0000000000000009-0000000000007061.wal"}
{"level":"info","ts":"2023-03-03T18:08:59.393-0500","caller":"wal/wal.go:785","msg":"created a new WAL segment","path":"default.etcd/member/wal/000000000000000a-0000000000007cdd.wal"}
{"level":"warn","ts":"2023-03-03T18:08:59.911-0500","caller":"fileutil/purge.go:72","msg":"failed to lock file","path":"default.etcd/member/wal/0000000000000000-0000000000000000.wal","error":"fileutil: file already locked"}
{"level":"info","ts":"2023-03-03T18:09:01.397-0500","caller":"wal/wal.go:785","msg":"created a new WAL segment","path":"default.etcd/member/wal/000000000000000b-0000000000008958.wal"}
{"level":"info","ts":"2023-03-03T18:09:03.661-0500","caller":"wal/wal.go:785","msg":"created a new WAL segment","path":"default.etcd/member/wal/000000000000000c-00000000000095d0.wal"}

@jmhbnz
Member

jmhbnz commented Mar 3, 2023

Thanks for the repro @cenkalti - confirming I can also reproduce following those steps, on a standard NVMe SSD with a standard ext4 filesystem, so this isn't just NFS related.

@ahrtr
Member

ahrtr commented Mar 6, 2023

It's expected behavior. WAL files stay locked until a snapshot (with a higher index) is created. Only 5 WAL files are retained by default, and snapshot-count is 100K by default. When the sixth WAL file is created, etcd tries to remove the oldest WAL file, but that file hasn't been released yet because the snapshot hasn't been created yet.
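To illustrate the mechanics behind the warning (my own sketch, not etcd's actual code; it assumes Linux, where etcd's fileutil uses flock-style locks): the WAL holds an exclusive lock on segments it may still need, and the purge loop only deletes a segment after taking the lock itself. A second non-blocking exclusive lock on an already-locked file is refused, which is exactly the "file already locked" condition:

```go
// Sketch: why purging a WAL segment the WAL still holds fails with
// "file already locked". Assumes Linux flock semantics; not etcd code.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	f1, err := os.CreateTemp("", "demo-*.wal")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f1.Name())

	// The WAL keeps an exclusive lock on segments it may still replay from.
	if err := syscall.Flock(int(f1.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		panic(err)
	}

	// The purge loop opens the same file and tries a non-blocking lock
	// before deleting it.
	f2, err := os.Open(f1.Name())
	if err != nil {
		panic(err)
	}
	defer f2.Close()

	err = syscall.Flock(int(f2.Fd()), syscall.LOCK_EX|syscall.LOCK_NB)
	if err != syscall.EWOULDBLOCK {
		panic(fmt.Sprintf("expected EWOULDBLOCK, got %v", err))
	}
	fmt.Println("second lock attempt refused: file already locked")
}
```

Once a snapshot advances past the segment's index, the WAL releases its lock and the next purge pass succeeds, which is why the warning is transient.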

But we can lower the default value of snapshot-count (e.g. set it to 10K by default). FYI:

DefaultSnapshotCount = 100000
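In the meantime, the threshold can already be lowered per member with the existing flag; the value below is the proposed 10K, shown for illustration:

```shell
# Snapshot every 10,000 applied entries instead of the v3.5 default of
# 100,000, so old WAL segments are released (and become purgeable) sooner.
# 10000 is the proposed lower default, used here for illustration.
etcd --snapshot-count=10000
```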

I think the majority of maintainers have already reached agreement on this, so please feel free to deliver a PR for it.

The related discussion:

@jmhbnz
Member

jmhbnz commented Mar 6, 2023

Thanks for the behavior clarification @ahrtr, the previous discussions were good context to read through. I've raised a tentative pull request to see if there is now consensus to reduce back to 10k 🙏🏻
