podman logs: error resolving symbol "sd_journal_xxx": libsystemd-journal.so.0: cannot open shared object file: ENOENT #20569

Closed
edsantiago opened this issue Nov 2, 2023 · 12 comments · Fixed by #20581

@edsantiago (Member) commented Nov 2, 2023

Two in the last week and a half. Both under vfs, which may be coincidence but I think it isn't. Never mind, it's not a vfs thing:

<+073ms> # # podman logs 68fee835c41fcfbab5a673ec3457204394563e38857e979747d8ca457ef72608
<+086ms> # Error: error resolving symbol "sd_journal_open": libsystemd-journal.so.0: cannot open shared object file: No such file or directory

and:

# podman-remote [options] logs testPod-init-test
time="2023-11-02T06:49:37-05:00" level=error msg="Failed to get journal entry: failed to move journal cursor to next entry: error resolving symbol \"sd_journal_next\": libsystemd-journal.so.0: cannot open shared object file: No such file or directory"

Reminiscent of #17042, but different enough to merit its own issue. Also reminiscent of #20282.

Seen in: int+sys podman+remote fedora-37+fedora-38 root+rootless host sqlite

edsantiago added the flakes label Nov 2, 2023
@edsantiago (Member, Author)

Here's a fascinating one from August, int f37 root:

# podman [options] logs registry
[stuff]

# podman [options] logs registry
[stuff]
time="2023-08-22T07:08:16-05:00" level=error msg="Unable to close journal: error resolving symbol \"sd_journal_close\": libsystemd-journal.so: cannot open shared object file: No such file or directory"

# podman [options] logs registry
[stuff]

Redacted for readability. What fascinates me about this is that the error is transient! It happened during WaitContainerReady(), which runs repeated podman logs with no error checks. In this instance there were 5 successful podman logs commands, then the barf one, then ~13 subsequent podman logs without any errors.

Since the log is two months old and will get wiped soon, here's a more complete excerpt:

           # podman [options] logs registry
           time="2023-08-22T12:08:10.915479433Z" level=warning msg="No HTTP secret provided - generated random secret. This may cause problems with uploads if multiple registries are behind a load-balancer. To provide a shared secret, fill in http.secret in the configuration file or set the REGISTRY_HTTP_SECRET environment variable." go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.915627177Z" level=info msg="redis not configured" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.919444463Z" level=info msg="Starting upload purge in 1m0s" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.919746173Z" level=info msg="using inmemory blob descriptor cache" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.924434903Z" level=info msg="restricting TLS version to tls1.2 or higher" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.924544899Z" level=info msg="restricting TLS cipher suites to: TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_AES_128_GCM_SHA256,TLS_CHACHA20_POLY1305_SHA256,TLS_AES_256_GCM_SHA384" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.924665552Z" level=fatal msg="open /certs/domain.crt: permission denied" 
           # podman [options] logs registry
           time="2023-08-22T12:08:10.915479433Z" level=warning msg="No HTTP secret provided - generated random secret. This may cause problems with uploads if multiple registries are behind a load-balancer. To provide a shared secret, fill in http.secret in the configuration file or set the REGISTRY_HTTP_SECRET environment variable." go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.915627177Z" level=info msg="redis not configured" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.919444463Z" level=info msg="Starting upload purge in 1m0s" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.919746173Z" level=info msg="using inmemory blob descriptor cache" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.924434903Z" level=info msg="restricting TLS version to tls1.2 or higher" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.924544899Z" level=info msg="restricting TLS cipher suites to: TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_AES_128_GCM_SHA256,TLS_CHACHA20_POLY1305_SHA256,TLS_AES_256_GCM_SHA384" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.924665552Z" level=fatal msg="open /certs/domain.crt: permission denied" 
           time="2023-08-22T07:08:16-05:00" level=error msg="Unable to close journal: error resolving symbol \"sd_journal_close\": libsystemd-journal.so: cannot open shared object file: No such file or directory"
           # podman [options] logs registry
           time="2023-08-22T12:08:10.915479433Z" level=warning msg="No HTTP secret provided - generated random secret. This may cause problems with uploads if multiple registries are behind a load-balancer. To provide a shared secret, fill in http.secret in the configuration file or set the REGISTRY_HTTP_SECRET environment variable." go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.915627177Z" level=info msg="redis not configured" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.919444463Z" level=info msg="Starting upload purge in 1m0s" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.919746173Z" level=info msg="using inmemory blob descriptor cache" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.924434903Z" level=info msg="restricting TLS version to tls1.2 or higher" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.924544899Z" level=info msg="restricting TLS cipher suites to: TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_AES_128_GCM_SHA256,TLS_CHACHA20_POLY1305_SHA256,TLS_AES_256_GCM_SHA384" go.version=go1.19.9 instance.id=ea44dcb1-7aeb-4748-82a6-3f2772437fa0 service=registry version=2.8.2 
           time="2023-08-22T12:08:10.924665552Z" level=fatal msg="open /certs/domain.crt: permission denied" 

@Luap99 (Member) commented Nov 2, 2023

My only idea would be that some goroutine is switching the mnt namespace (somehow has different mounts there and does not correctly pin the thread), and due to the non-deterministic go scheduling it would be transient. However, that does not really make sense either; the code path for podman logs is not that large and I cannot see any namespace switching involved there at all.

Alternatively, some external command is modifying our test suite environment while the tests are running??? What I find odd is that different kinds of calls are failing; sd_journal_open is always the first call, which means that sometimes it worked and only a later sd_journal call failed.

@edsantiago (Member, Author)

Plot twist: there is no such thing as libsystemd-journal.so.0, and there has not been since 2014. This is a ghost (Happy Halloween!) and seems to be coming from https://github.com/coreos/go-systemd/blob/main/sdjournal/functions.go#L30-L33, possibly from its GetSymbolPointer() or dlopen(). Or possibly not. I'm stuck, and am giving up for the moment. Hope this helps someone track it down further.
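
For anyone following along, here is a rough sketch of the fallback pattern in question, assuming the candidate names and their order match the linked functions.go lines; openLibsystemd() is a made-up helper for illustration, not go-systemd's actual API:

// Illustrative sketch: the legacy journal library names are tried before the
// merged libsystemd, so a failed dlopen() of libsystemd-journal.so.0 is
// expected and normally harmless. The failure only matters if its dlerror()
// text is later read by an unrelated call.
package main

/*
#cgo LDFLAGS: -ldl
#include <dlfcn.h>
#include <stdlib.h>
*/
import "C"

import (
	"fmt"
	"unsafe"
)

// Candidate names, legacy (systemd < 209) first, as in sdjournal/functions.go.
var libsystemdNames = []string{
	"libsystemd-journal.so.0",
	"libsystemd-journal.so",
	"libsystemd.so.0",
	"libsystemd.so",
}

// openLibsystemd is a hypothetical helper, not go-systemd's GetHandle().
func openLibsystemd() (unsafe.Pointer, error) {
	for _, name := range libsystemdNames {
		cname := C.CString(name)
		h := C.dlopen(cname, C.RTLD_LAZY)
		C.free(unsafe.Pointer(cname))
		if h != nil {
			return h, nil
		}
		// This dlopen() failed; its error text now sits in this OS thread's
		// dlerror() slot until something reads or overwrites it.
	}
	return nil, fmt.Errorf("unable to open a libsystemd library")
}

func main() {
	h, err := openLibsystemd()
	if err != nil {
		fmt.Println(err)
		return
	}
	defer C.dlclose(h)
	fmt.Println("libsystemd loaded")
}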

@edsantiago (Member, Author)

Ugh. This could be a thread interaction between go-systemd (above) and unixDlError() in https://github.com/mattn/go-sqlite3/blob/master/sqlite3-binding.c . And at this point this is way over my head.

@Luap99 (Member) commented Nov 3, 2023

I am also not going to pretend I know how the go -> C interactions really work, but it would surprise me if they caused conflicts, mainly because there is one case with boltdb as well, and we shouldn't be calling any sqlite code when boltdb is used AFAIK.

@edsantiago (Member, Author)

Yeah, that's not it. I spent more time on it last night and ended up more confused than ever.

@Luap99 (Member) commented Nov 3, 2023

I think I know what is going on. The common error is libsystemd-journal.so.0: cannot open shared object file: No such file or directory, or sometimes libsystemd-journal.so without the .0.

This is always generated because, as you linked in https://github.com/coreos/go-systemd/blob/main/sdjournal/functions.go#L30-L33, it always first tries the old library before the new one. So it is normal that these messages are always generated by dlopen().

Now the interesting part is how the dlerror() handling works. If you look at the man page it is marked as MT-Safe, so we can assume the error message is stored local to the thread. Now if we follow the example in dlopen(3):

dlerror();    /* Clear any existing error */

cosine = (double (*)(double)) dlsym(handle, "cos");
error = dlerror();
if (error != NULL) {
    fprintf(stderr, "%s\n", error);
    exit(EXIT_FAILURE);
}

So it is clear that the previous error must be cleared by calling dlerror(). And in fact the go-systemd code is doing that here: https://github.com/coreos/go-systemd/blob/d843340ab4bd3815fda02e648f9b09ae2dc722a7/internal/dlopen/dlopen.go#L62-L64

Now here is the catch: because this is go code, there is no guarantee which thread the code is running on. The go runtime dynamically schedules goroutines onto threads and we have no control over it unless runtime.LockOSThread() is used, which is not used anywhere here.

Thus I think it is possible that between lines 62-63 or 63-64 go moves this to another thread and now the error can no longer be read correctly. Note that it will only fail if it got moved to the very thread where the dlopen() on the old non-existing lib was called earlier. This would explain why it is so rare.

From https://groups.google.com/g/golang-nuts/c/8Tdb-aCeYFY:

The Go runtime creates threads as needed and can
move a goroutine from one thread to another at any time.
In particular, every call to C.foo, even within a single
goroutine, might happen on a different OS thread.

If the library is squirreling away OS-thread-local data
then you need runtime.LockOSThread. If not, then
you just need a mutex.
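
To make that concrete, here is a minimal cgo sketch of the pinning idea being discussed (an illustration only, not go-systemd's actual code; lookupSymbol is a made-up helper): keep the goroutine on one OS thread for the whole clear / dlsym() / dlerror() sequence, so the error we read belongs to our own call.

// Minimal sketch, assuming a fix along the lines discussed above.
package main

/*
#cgo LDFLAGS: -ldl
#include <dlfcn.h>
#include <stdlib.h>
*/
import "C"

import (
	"fmt"
	"runtime"
	"unsafe"
)

// lookupSymbol resolves name from handle. Without LockOSThread the goroutine
// could migrate between the dlsym() and dlerror() calls and pick up a stale
// error left behind by an earlier failed dlopen() on that other thread.
func lookupSymbol(handle unsafe.Pointer, name string) (unsafe.Pointer, error) {
	cname := C.CString(name)
	defer C.free(unsafe.Pointer(cname))

	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	C.dlerror() // clear any existing error on this thread
	sym := C.dlsym(handle, cname)
	if e := C.dlerror(); e != nil {
		return nil, fmt.Errorf("resolving symbol %q: %s", name, C.GoString(e))
	}
	return sym, nil
}

func main() {
	lib := C.CString("libsystemd.so.0") // assumed to be installed, for illustration
	defer C.free(unsafe.Pointer(lib))

	h := C.dlopen(lib, C.RTLD_NOW)
	if h == nil {
		fmt.Println("dlopen failed:", C.GoString(C.dlerror()))
		return
	}
	defer C.dlclose(h)

	if _, err := lookupSymbol(h, "sd_journal_open"); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("sd_journal_open resolved")
}

The same pinning would be needed around dlopen() and dlclose() wherever their dlerror() result is checked.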

@edsantiago (Member, Author)

ARGH! I thought the libsystemdMutex.Lock() in getFunction() would be enough. Go is way too subtle for me. Thank you for finding that. I need to leave for the day; would you be willing to report that to go-systemd? (If you don't have time, no worries, I can do it on Sunday.)

@edsantiago (Member, Author)

Thank you again. I can now leave for the day with a clearer mind.

@giuseppe (Member) commented Nov 3, 2023

@Luap99 great finding!

Could we protect any access to these functions with runtime.LockOSThread()/runtime.UnlockOSThread()?

@Luap99 (Member) commented Nov 3, 2023

Yes, I will open a PR to go-systemd to implement this.

Luap99 self-assigned this Nov 3, 2023
Luap99 added a commit to Luap99/go-systemd that referenced this issue Nov 3, 2023
The dlerror() value is stored in thread-local storage, and thus checking the error from go code is not safe because go can schedule goroutines onto any other thread at any time. This means it is possible that between the dlsym() or dlclose() call and the dlerror() read the code gets moved to another thread, and thus the dlerror() value is not what we expect. Instead, because we now read from another thread, it will be an error from a different call.

A test is added which without this fix is able to reproduce the
problem somewhat reliably.

This issue was observed in the podman CI[1] which uses the sdjournal
API.

[1] containers/podman#20569

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@Luap99 (Member) commented Nov 3, 2023

I created coreos/go-systemd#431 which also includes a test with a reproducer.
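
For the curious, a hypothetical reproducer along those lines (not the actual test from that PR, just an illustration of the race): a few goroutines keep poisoning OS threads' dlerror() slots with dlopen() calls on a library that does not exist, while the main goroutine resolves a real symbol without thread pinning and occasionally reads the stale error.

// Hypothetical reproducer sketch: without runtime.LockOSThread() around the
// dlsym()/dlerror() pair, the main goroutine can migrate onto a thread whose
// dlerror() slot still holds the failure from an unrelated dlopen().
package main

/*
#cgo LDFLAGS: -ldl
#include <dlfcn.h>
#include <stdlib.h>
*/
import "C"

import (
	"fmt"
	"strings"
	"sync"
	"unsafe"
)

func main() {
	stop := make(chan struct{})
	var wg sync.WaitGroup

	// Poison the dlerror() slot of as many OS threads as possible.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			missing := C.CString("libthis-does-not-exist.so.0")
			defer C.free(unsafe.Pointer(missing))
			for {
				select {
				case <-stop:
					return
				default:
					C.dlopen(missing, C.RTLD_NOW) // fails; error stays in this thread's TLS
				}
			}
		}()
	}

	lib := C.CString("libc.so.6")
	sym := C.CString("getpid")
	defer C.free(unsafe.Pointer(lib))
	defer C.free(unsafe.Pointer(sym))

	h := C.dlopen(lib, C.RTLD_NOW)
	if h == nil {
		fmt.Println("dlopen libc failed:", C.GoString(C.dlerror()))
		close(stop)
		wg.Wait()
		return
	}
	defer C.dlclose(h)

	for i := 0; i < 100000; i++ {
		C.dlerror() // clear; may run on a different thread than the calls below
		p := C.dlsym(h, sym)
		if e := C.dlerror(); p != nil && e != nil {
			// dlsym succeeded, yet we read an error: it is a stale one
			// from a poisoned thread.
			if strings.Contains(C.GoString(e), "does-not-exist") {
				fmt.Println("stale dlerror observed:", C.GoString(e))
				break
			}
		}
	}
	close(stop)
	wg.Wait()
}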

Luap99 added a commit to Luap99/libpod that referenced this issue Nov 3, 2023
This fixes a CI flake. go-systemd was not handling the dlerror() function correctly, which led to wrong errors being reported.

[NO NEW TESTS NEEDED]

Fixes containers#20569

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
github-actions bot added the locked - please file new issue/PR label Feb 3, 2024
github-actions bot locked as resolved and limited conversation to collaborators Feb 3, 2024