podman logs: error resolving symbol "sd_journal_xxx": libsystemd-journal.so.0: cannot open shared object file: ENOENT #20569
Here's a fascinating one from August, in `int f37 root`:

```
# podman [options] logs registry
[stuff]
# podman [options] logs registry
[stuff]
time="2023-08-22T07:08:16-05:00" level=error msg="Unable to close journal: error resolving symbol \"sd_journal_close\": libsystemd-journal.so: cannot open shared object file: No such file or directory"
# podman [options] logs registry
[stuff]
```

(Redacted for readability.) What fascinates me about this is that the error is transient! It happened during
Since the log is two months old and will get wiped soon, here's a more complete excerpt:
|
My only idea would be that some goroutine is switching the mnt namespace (somehow has different mounts there and does not correctly pin the thread), and due to the unknown go scheduling it would be transient. However, that does not really make sense either: the code path for podman logs is not that large, and I cannot see any namespace involved there at all. Alternatively, some external command is modifying our test suite environment while the tests are running??? What I find odd is that these are different kinds of calls that are failing, |
Plot twist: there is no such thing as |
Ugh. This could be a thread interaction between |
I am also not going to pretend I know how the go -> C interactions really work, but it would surprise me if they would cause conflicts, mainly because there is one case also with boltdb, and we shouldn't be calling any sqlite code when boltdb is used AFAIK. |
Yeah, that's not it. I spent more time on it last night and ended up more confused than ever. |
I think I know what is going on: the common error is the `libsystemd-journal.so` dlopen failure. This is always generated because, as you linked in https://github.com/coreos/go-systemd/blob/main/sdjournal/functions.go#L30-L33, it always first tries the old library before the new one. So it is normal that these messages are always generated by dlopen(). Now the interesting part is how the
So it is clear that the previous error must be cleared by calling dlerror(). And in fact the go-systemd code is doing that here: https://github.com/coreos/go-systemd/blob/d843340ab4bd3815fda02e648f9b09ae2dc722a7/internal/dlopen/dlopen.go#L62-L64

Now here is the catch: because it is go code, there is no guarantee on which thread the code is running. The go runtime dynamically schedules goroutines onto threads, and we have no control over it unless

Thus I think it is possible that between lines 62-63 or 63-64 go moves this goroutine to another thread, and now the error can no longer be read correctly. Note it will however only fail if it got moved to the same thread where the dlopen() on the old non-existing lib was called earlier. This would explain why it is so rare. From https://groups.google.com/g/golang-nuts/c/8Tdb-aCeYFY:
|
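The old-name-then-new-name fallback described above can be sketched in pure Go. This is illustrative only: `tryOpen` is a hypothetical stub standing in for the real `C.dlopen()` calls in go-systemd's dlopen package, and which library name "exists" is an assumption baked into the stub.

```go
package main

import (
	"errors"
	"fmt"
)

// tryOpen is a hypothetical stand-in for dlopen(); the real code calls
// C.dlopen() for each candidate library name. Here we assume only the
// new library name exists on the system.
func tryOpen(name string) error {
	if name == "libsystemd.so.0" {
		return nil
	}
	return errors.New("cannot open shared object file")
}

// openFirst tries each candidate name in order and returns the first
// one that opens, mirroring go-systemd's old-name-first ordering.
func openFirst(names []string) (string, error) {
	for _, n := range names {
		if err := tryOpen(n); err == nil {
			return n, nil
		}
		// Each failed dlopen() leaves an error string in the
		// thread-local dlerror() state; that is where the
		// "libsystemd-journal.so" messages originate.
	}
	return "", errors.New("no library could be opened")
}

func main() {
	// Old name first, new name second.
	lib, err := openFirst([]string{"libsystemd-journal.so.0", "libsystemd.so.0"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("opened", lib)
}
```

The point of the sketch: a failed attempt on the old name is expected and harmless by itself; it only becomes a problem when its leftover dlerror() state is read later by unrelated code.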
ARGH! I thought the |
Thank you again. I can now leave for the day with a clearer mind. |
@Luap99 great finding! Could we protect any access to these functions with |
Yes I will open a PR to go-systemd to implement this |
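The fix amounts to pinning the goroutine to its OS thread for the duration of the load/check sequence, so the dlerror() read happens on the same thread as the call that produced it. A minimal pure-Go sketch of that pattern, with a hypothetical `dlsymCheck` stub standing in for the real cgo dlsym()/dlerror() pair (the symbol name and stub behavior are assumptions, not the actual go-systemd code):

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

// dlsymCheck is a hypothetical stand-in for the real cgo sequence:
// C.dlerror() to clear prior state, C.dlsym(), then C.dlerror() to
// check. We assume only "sd_journal_open" resolves.
func dlsymCheck(name string) error {
	if name == "sd_journal_open" {
		return nil
	}
	return errors.New("symbol not found: " + name)
}

// resolveSymbol wraps the lookup with runtime.LockOSThread so the
// goroutine cannot migrate between the dlsym() call and the dlerror()
// read, which is thread-local state.
func resolveSymbol(name string) error {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	return dlsymCheck(name)
}

func main() {
	if err := resolveSymbol("sd_journal_open"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("resolved sd_journal_open")
}
```

Note that LockOSThread only prevents the goroutine from moving off its current thread; the thread-local error value itself is still per-thread, which is exactly why pinning for the whole call-then-check window is sufficient.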
The dlerror() value is stored in thread-local storage, and thus checking the error from go code is not safe, because go can schedule goroutines onto any other thread at any time. This means it is possible that between the dlsym() or dlclose() call and the dlerror() call the code gets moved to another thread, and thus the dlerror() value is not what we expect. Instead, because we now read from another thread, we get an error from a different call. A test is added which, without this fix, is able to reproduce the problem somewhat reliably. This issue was observed in the podman CI[1], which uses the sdjournal API. [1] containers/podman#20569 Signed-off-by: Paul Holzinger <pholzing@redhat.com>
I created coreos/go-systemd#431 which also includes a test with a reproducer. |
This fixes a CI flake. go-systemd was not handling the dlerror() function correctly, which led to wrong errors being reported. [NO NEW TESTS NEEDED] Fixes containers#20569 Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Two in the last week and a half.
Both under vfs, which may be coincidence but I think it isn't. Never mind, it's not a vfs thing.
Reminiscent of #17042, but different enough to merit its own issue. Also reminiscent of #20282.
Seen in: int+sys podman+remote fedora-37+fedora-38 root+rootless host sqlite