podman logs: error resolving symbol "sd_journal_xxx": libsystemd-journal.so.0: cannot open shared object file: ENOENT #20569

Closed
edsantiago opened this issue Nov 2, 2023 · 12 comments · Fixed by #20581

@edsantiago (Member) commented Nov 2, 2023

Two in the last week and a half. Both under vfs, which may be coincidence but I think it isn't. Never mind, it's not a vfs thing:

<+073ms> # # podman logs 68fee835c41fcfbab5a673ec3457204394563e38857e979747d8ca457ef72608
<+086ms> # Error: error resolving symbol "sd_journal_open": libsystemd-journal.so.0: cannot open shared object file: No such file or directory

and:

# podman-remote [options] logs testPod-init-test
time="2023-11-02T06:49:37-05:00" level=error msg="Failed to get journal entry: failed to move journal cursor to next entry: error resolving symbol \"sd_journal_next\": libsystemd-journal.so.0: cannot open shared object file: No such file or directory"

Reminiscent of #17042, but different enough to merit its own issue. Also reminiscent of #20282.

Seen in: int+sys podman+remote fedora-37+fedora-38 root+rootless host sqlite

edsantiago added the flakes label Nov 2, 2023
@edsantiago (Member, Author)

Here's a fascinating one from August, int f37 root:

# podman [options] logs registry
[stuff]

# podman [options] logs registry
[stuff]
time="2023-08-22T07:08:16-05:00" level=error msg="Unable to close journal: error resolving symbol \"sd_journal_close\": libsystemd-journal.so: cannot open shared object file: No such file or directory"

# podman [options] logs registry
[stuff]

Redacted for readability. What fascinates me about this is that the error is transient! It happened during WaitContainerReady(), which runs repeated podman logs with no error checks. In this instance there were 5 successful podman logs commands, then the barf one, then ~13 subsequent podman logs without any errors.

Since the log is two months old and will get wiped soon, here's a more complete excerpt:

           # podman [options] logs registry
           time="2023-08-22T12:08:10.915479433Z" level=warning msg="No HTTP secret provided - generated random secret. This may cause problems with uploads if multiple registries are behind a load-balancer. To provide a shared secret, fill in http.secret in the configuration file or set the REGISTRY_HTTP_SECRET environment variable." go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.915627177Z" level=info msg="redis not configured" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.919444463Z" level=info msg="Starting upload purge in 1m0s" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.919746173Z" level=info msg="using inmemory blob descriptor cache" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.924434903Z" level=info msg="restricting TLS version to tls1.2 or higher" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.924544899Z" level=info msg="restricting TLS cipher suites to: TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_AES_128_GCM_SHA256,TLS_CHACHA20_POLY1305_SHA256,TLS_AES_256_GCM_SHA384" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.924665552Z" level=fatal msg="open /certs/domain.crt: permission denied" 
           # podman [options] logs registry
           time="2023-08-22T12:08:10.915479433Z" level=warning msg="No HTTP secret provided - generated random secret. This may cause problems with uploads if multiple registries are behind a load-balancer. To provide a shared secret, fill in http.secret in the configuration file or set the REGISTRY_HTTP_SECRET environment variable." go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.915627177Z" level=info msg="redis not configured" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.919444463Z" level=info msg="Starting upload purge in 1m0s" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.919746173Z" level=info msg="using inmemory blob descriptor cache" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.924434903Z" level=info msg="restricting TLS version to tls1.2 or higher" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.924544899Z" level=info msg="restricting TLS cipher suites to: TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_AES_128_GCM_SHA256,TLS_CHACHA20_POLY1305_SHA256,TLS_AES_256_GCM_SHA384" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.924665552Z" level=fatal msg="open /certs/domain.crt: permission denied" 
           time="2023-08-22T07:08:16-05:00" level=error msg="Unable to close journal: error resolving symbol \"sd_journal_close\": libsystemd-journal.so: cannot open shared object file: No such file or directory"
           # podman [options] logs registry
           time="2023-08-22T12:08:10.915479433Z" level=warning msg="No HTTP secret provided - generated random secret. This may cause problems with uploads if multiple registries are behind a load-balancer. To provide a shared secret, fill in http.secret in the configuration file or set the REGISTRY_HTTP_SECRET environment variable." go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.915627177Z" level=info msg="redis not configured" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.919444463Z" level=info msg="Starting upload purge in 1m0s" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.919746173Z" level=info msg="using inmemory blob descriptor cache" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.924434903Z" level=info msg="restricting TLS version to tls1.2 or higher" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.924544899Z" level=info msg="restricting TLS cipher suites to: TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_AES_128_GCM_SHA256,TLS_CHACHA20_POLY1305_SHA256,TLS_AES_256_GCM_SHA384" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.924665552Z" level=fatal msg="open /certs/domain.crt: permission denied" 

@Luap99 (Member) commented Nov 2, 2023

My only idea would be that some goroutine is switching the mnt namespace (somehow has different mounts there and does not correctly pin the thread), and due to the non-deterministic go scheduling it would be transient. However, that does not really make sense either; the code path for podman logs is not that large and I cannot see any namespace switching involved there at all.

Alternatively, some external command is modifying our test suite environment while the tests are running??? What I find odd is that different kinds of calls are failing; sd_journal_open is always the first call, which means that sometimes it worked and only a later sd_journal call failed.

@edsantiago (Member, Author)

Plot twist: there is no such thing as libsystemd-journal.so.0, and there has not been since 2014. This is a ghost (Happy Halloween!) and seems to be coming from https://github.com/coreos/go-systemd/blob/main/sdjournal/functions.go#L30-L33, possibly from its GetSymbolPointer() or dlopen(). Or possibly not. I'm stuck, and am giving up for the moment. Hope this helps someone track it down further.
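
For anyone following along, here is a rough sketch of the fallback pattern in question, assuming the candidate names and their order match the linked functions.go lines; openLibsystemd() is a made-up helper for illustration, not go-systemd's actual API:

// Illustrative sketch: the legacy journal library names are tried before the
// merged libsystemd, so a failed dlopen() of libsystemd-journal.so.0 is
// expected and normally harmless. The failure only matters if its dlerror()
// text is later read by an unrelated call.
package main

/*
#cgo LDFLAGS: -ldl
#include <dlfcn.h>
#include <stdlib.h>
*/
import "C"

import (
	"fmt"
	"unsafe"
)

// Candidate names, legacy (systemd < 209) first, as in sdjournal/functions.go.
var libsystemdNames = []string{
	"libsystemd-journal.so.0",
	"libsystemd-journal.so",
	"libsystemd.so.0",
	"libsystemd.so",
}

// openLibsystemd is a hypothetical helper, not go-systemd's GetHandle().
func openLibsystemd() (unsafe.Pointer, error) {
	for _, name := range libsystemdNames {
		cname := C.CString(name)
		h := C.dlopen(cname, C.RTLD_LAZY)
		C.free(unsafe.Pointer(cname))
		if h != nil {
			return h, nil
		}
		// This dlopen() failed; its error text now sits in this OS thread's
		// dlerror() slot until something reads or overwrites it.
	}
	return nil, fmt.Errorf("unable to open a libsystemd library")
}

func main() {
	h, err := openLibsystemd()
	if err != nil {
		fmt.Println(err)
		return
	}
	defer C.dlclose(h)
	fmt.Println("libsystemd loaded")
}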

@edsantiago (Member, Author)

Ugh. This could be a thread interaction between go-systemd (above) and unixDlError() in https://github.com/mattn/go-sqlite3/blob/master/sqlite3-binding.c . And at this point this is way over my head.

@Luap99 (Member) commented Nov 3, 2023

I am also not going to pretend I know how the go -> C interactions really work, but it would surprise me if they caused conflicts, mainly because there is one case with boltdb as well, and we shouldn't be calling any sqlite code when boltdb is used AFAIK.

@edsantiago (Member, Author)

Yeah, that's not it. I spent more time on it last night and ended up more confused than ever.

@Luap99 (Member) commented Nov 3, 2023

I think I know what is going on. The common error is libsystemd-journal.so.0: cannot open shared object file: No such file or directory, or sometimes libsystemd-journal.so without the .0.

This is always generated because, as you linked in https://github.com/coreos/go-systemd/blob/main/sdjournal/functions.go#L30-L33, it always first tries the old library before the new one. So it is normal that these messages are always generated by dlopen().

Now the interesting part is how the dlerror() handling works. If you look at the man page it is marked as MT-Safe, so we can assume the error message is stored local to the thread. Now if we follow the example in dlopen(3):

dlerror();    /* Clear any existing error */

cosine = (double (*)(double)) dlsym(handle, "cos");
error = dlerror();
if (error != NULL) {
    fprintf(stderr, "%s\n", error);
    exit(EXIT_FAILURE);
}

So it is clear that the previous error must be cleared by calling dlerror(). And in fact the go-systemd code is doing that here: https://github.com/coreos/go-systemd/blob/d843340ab4bd3815fda02e648f9b09ae2dc722a7/internal/dlopen/dlopen.go#L62-L64

Now here is the catch: because this is go code, there is no guarantee which thread the code is running on. The go runtime dynamically schedules goroutines onto threads and we have no control over it unless runtime.LockOSThread() is used, which is not used anywhere here.

Thus I think it is possible that between lines 62-63 or 63-64 go moves this to another thread and now the error can no longer be read correctly. Note that it will only fail if it got moved to the very thread where the dlopen() on the old non-existing lib was called earlier. This would explain why it is so rare.

From https://groups.google.com/g/golang-nuts/c/8Tdb-aCeYFY:

The Go runtime creates threads as needed and can
move a goroutine from one thread to another at any time.
In particular, every call to C.foo, even within a single
goroutine, might happen on a different OS thread.

If the library is squirreling away OS-thread-local data
then you need runtime.LockOSThread. If not, then
you just need a mutex.
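
To make that concrete, here is a minimal cgo sketch of the pinning idea being discussed (an illustration only, not go-systemd's actual code; lookupSymbol is a made-up helper): keep the goroutine on one OS thread for the whole clear / dlsym() / dlerror() sequence, so the error we read belongs to our own call.

// Minimal sketch, assuming a fix along the lines discussed above.
package main

/*
#cgo LDFLAGS: -ldl
#include <dlfcn.h>
#include <stdlib.h>
*/
import "C"

import (
	"fmt"
	"runtime"
	"unsafe"
)

// lookupSymbol resolves name from handle. Without LockOSThread the goroutine
// could migrate between the dlsym() and dlerror() calls and pick up a stale
// error left behind by an earlier failed dlopen() on that other thread.
func lookupSymbol(handle unsafe.Pointer, name string) (unsafe.Pointer, error) {
	cname := C.CString(name)
	defer C.free(unsafe.Pointer(cname))

	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	C.dlerror() // clear any existing error on this thread
	sym := C.dlsym(handle, cname)
	if e := C.dlerror(); e != nil {
		return nil, fmt.Errorf("resolving symbol %q: %s", name, C.GoString(e))
	}
	return sym, nil
}

func main() {
	lib := C.CString("libsystemd.so.0") // assumed to be installed, for illustration
	defer C.free(unsafe.Pointer(lib))

	h := C.dlopen(lib, C.RTLD_NOW)
	if h == nil {
		fmt.Println("dlopen failed:", C.GoString(C.dlerror()))
		return
	}
	defer C.dlclose(h)

	if _, err := lookupSymbol(h, "sd_journal_open"); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("sd_journal_open resolved")
}

The same pinning would be needed around dlopen() and dlclose() wherever their dlerror() result is checked.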

@edsantiago (Member, Author)

ARGH! I thought the libsystemdMutex.Lock() in getFunction() would be enough. Go is way too subtle for me. Thank you for finding that. I need to leave for the day; would you be willing to report that to go-systemd? (If you don't have time, no worries, I can do it on Sunday.)

@edsantiago (Member, Author)

Thank you again. I can now leave for the day with a clearer mind.

@giuseppe (Member) commented Nov 3, 2023

@Luap99 great finding!

Could we protect any access to these functions with runtime.LockOSThread()/runtime.UnlockOSThread()?

@Luap99 (Member) commented Nov 3, 2023

Yes, I will open a PR to go-systemd to implement this.

Luap99 self-assigned this Nov 3, 2023
Luap99 added a commit to Luap99/go-systemd that referenced this issue Nov 3, 2023
The dlerror() value is stored in thread-local storage, and thus checking the error from go code is not safe because go can schedule goroutines onto any other thread at any time. This means it is possible that between the dlsym() or dlclose() call and the dlerror() read the code gets moved to another thread, and thus the dlerror() value is not what we expect. Instead, because we now read from another thread, it will be an error from a different call.

A test is added which without this fix is able to reproduce the
problem somewhat reliably.

This issue was observed in the podman CI[1] which uses the sdjournal
API.

[1] containers/podman#20569

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@Luap99 (Member) commented Nov 3, 2023

I created coreos/go-systemd#431 which also includes a test with a reproducer.
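
For the curious, a hypothetical reproducer along those lines (not the actual test from that PR, just an illustration of the race): a few goroutines keep poisoning OS threads' dlerror() slots with dlopen() calls on a library that does not exist, while the main goroutine resolves a real symbol without thread pinning and occasionally reads the stale error.

// Hypothetical reproducer sketch: without runtime.LockOSThread() around the
// dlsym()/dlerror() pair, the main goroutine can migrate onto a thread whose
// dlerror() slot still holds the failure from an unrelated dlopen().
package main

/*
#cgo LDFLAGS: -ldl
#include <dlfcn.h>
#include <stdlib.h>
*/
import "C"

import (
	"fmt"
	"strings"
	"sync"
	"unsafe"
)

func main() {
	stop := make(chan struct{})
	var wg sync.WaitGroup

	// Poison the dlerror() slot of as many OS threads as possible.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			missing := C.CString("libthis-does-not-exist.so.0")
			defer C.free(unsafe.Pointer(missing))
			for {
				select {
				case <-stop:
					return
				default:
					C.dlopen(missing, C.RTLD_NOW) // fails; error stays in this thread's TLS
				}
			}
		}()
	}

	lib := C.CString("libc.so.6")
	sym := C.CString("getpid")
	defer C.free(unsafe.Pointer(lib))
	defer C.free(unsafe.Pointer(sym))

	h := C.dlopen(lib, C.RTLD_NOW)
	if h == nil {
		fmt.Println("dlopen libc failed:", C.GoString(C.dlerror()))
		close(stop)
		wg.Wait()
		return
	}
	defer C.dlclose(h)

	for i := 0; i < 100000; i++ {
		C.dlerror() // clear; may run on a different thread than the calls below
		p := C.dlsym(h, sym)
		if e := C.dlerror(); p != nil && e != nil {
			// dlsym succeeded, yet we read an error: it is a stale one
			// from a poisoned thread.
			if strings.Contains(C.GoString(e), "does-not-exist") {
				fmt.Println("stale dlerror observed:", C.GoString(e))
				break
			}
		}
	}
	close(stop)
	wg.Wait()
}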

Luap99 added a commit to Luap99/libpod that referenced this issue Nov 3, 2023
This fixes a CI flake. go-systemd was not handling the dlerror() function correctly, which led to wrong errors being reported.

[NO NEW TESTS NEEDED]

Fixes containers#20569

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
github-actions bot added the locked - please file new issue/PR label Feb 3, 2024
github-actions bot locked as resolved and limited conversation to collaborators Feb 3, 2024