initrd: nm-initrd.service fails to spawn depending on console setup #943

kai-uwe-rommel · 2021-08-29T14:50:34Z

Describe the bug
I regularly deploy OKD clusters with FCOS on vSphere.
The deployment process worked fine up to FCOS 34.2021-06-26, started failing with 2021-07-11, and still fails with 2021-07-25 and 2021-08-08.
The OKD cluster VMs start from the plain FCOS OVA and get an Ignition file passed via guestinfo.ignition from vSphere.
Starting with FCOS 34.2021-07-11, the VMs simply fail to start their network.

Reproduction steps
Steps to reproduce the behavior:

create a VM from the FCOS OVA
configure an OKD ignition file via guestinfo.ignition
power on the VM
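
For step 2, a config of the kind described in this report (a small initial config that merges a remote one) might be built as in the sketch below. The URL is a placeholder, the spec version 3.2.0 is an assumption, and the base64 step assumes the config is handed to the VM through a guestinfo property that accepts base64-encoded data. None of this is taken from the attached file.

```python
import base64
import json

# Hypothetical URL; replace with the server that hosts the secondary config.
REMOTE_CONFIG_URL = "http://example.com/secondary.ign"


def build_initial_config(remote_url: str) -> dict:
    """Return a minimal Ignition config that merges one remote config."""
    return {
        "ignition": {
            "version": "3.2.0",  # assumed spec version
            "config": {"merge": [{"source": remote_url}]},
        }
    }


def encode_for_guestinfo(config: dict) -> str:
    """Serialize the config to JSON and base64-encode it."""
    raw = json.dumps(config).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


if __name__ == "__main__":
    print(encode_for_guestinfo(build_initial_config(REMOTE_CONFIG_URL)))
```

Because the config references a remote resource, the initrd has to bring up networking before Ignition can finish, which is the code path that fails in this report.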

Expected behavior
As usual, the VM should initialize its network (with DHCP) and process the ignition file, which includes merging an additional remote ignition file.

Actual behavior
The VM does not start its network. It is not pingable, and it can neither resolve nor connect to the remote HTTP server from which the initial ignition file (passed from vSphere via guestinfo.ignition) tries to merge a remote ignition file.

System details
All on a vSphere 7.0u2 cluster.
FCOS versions see above.

Ignition config
File attached here as text file.

Additional information
I tried to gather more details about why the boot process fails to start the networking by connecting a virtual serial console to the VM, in order to view serial console output and record it.
Unfortunately, when a serial console is connected to the VM, then the networking starts successfully.
So it looks like it is a timing issue and the serial console slows it down enough to let it succeed.

initial-ignition.txt

kai-uwe-rommel · 2021-08-29T15:01:37Z

What I can see in the console as it passes by quickly (as described, I cannot reproduce with a serial console and can thus not record it):

dustymabe · 2021-08-30T13:32:51Z

This may be fallout from #842 (mainly because it was first implemented in 34.20210711.3.0). If you remove OKD from the equation can you still reproduce with just a simple remote Ignition config?

kai-uwe-rommel · 2021-08-30T13:48:21Z

Without OKD, e.g. some basic standalone VM with an otherwise similar (but purely local, e.g. completely within guestinfo.ignition) ignition file (just without the merged in remote/secondary ignition file) comes up fine. I would need to construct a case where I configure a standalone VM but with a merged in add'l remote ignition file ... did not have that case yet.

dustymabe · 2021-08-30T13:56:52Z

> Without OKD, e.g. some basic standalone VM with an otherwise similar (but purely local, e.g. completely within guestinfo.ignition) ignition file (just without the merged in remote/secondary ignition file) comes up fine.

Yep. We have some logic in there that won't bring up networking at all if there are no remote references.

> I would need to construct a case where I configure a standalone VM but with a merged in add'l remote ignition file ... did not have that case yet.

+1 - it may be easy enough just to provide ignition.config.url=http://... on the kernel command line. Or it's possible it's triggered differently, but we could start there.

lucab · 2021-08-30T14:04:51Z

@kai-uwe-rommel is this node set up via DHCP (specifically, for the initrd and Ignition fetch), or does it have some custom networking configuration via guestinfo or kargs?

kai-uwe-rommel · 2021-08-30T14:08:54Z

This is via DHCP.

kai-uwe-rommel · 2021-08-30T14:09:56Z

As I tried to express above: this seems a timing issue. When I attach a serial console to debug, the problem vanishes away. Does this ring some bell for someone of you?

zfiel · 2021-08-30T15:14:16Z

> As I tried to express above: this seems a timing issue. When I attach a serial console to debug, the problem vanishes away. Does this ring some bell for someone of you?

Thank you @kai-uwe-rommel, it "solved" my problems.

kai-uwe-rommel · 2021-08-30T15:52:50Z

> > As I tried to express above: this seems a timing issue. When I attach a serial console to debug, the problem vanishes away. Does this ring some bell for someone of you?
>
> Thank you @kai-uwe-rommel, it "solved" my problems.

???

I am having a problem. In what way could this have solved your problem??

dustymabe · 2021-08-30T17:48:56Z

> I am having a problem. In what way could this have solved your problem??

Well, at the very least, using the serial console (as you suggested) would allow someone to work around the issue and get unblocked. So maybe that's what they meant.

zfiel · 2021-08-30T19:36:42Z

Yes, sorry, it did not solve the root issue, but at least I am able to get my cluster working again by attaching a serial console (as a workaround).

kai-uwe-rommel · 2021-08-31T10:05:17Z

Back to the topic - what next?
Anything I can do to help create debug logs or whatever?
The problem is not going away on its own ... :-(

dustymabe · 2021-08-31T13:35:04Z

> Anything I can do to help create debug logs or whatever?

Yep. It would still be nice to get a simple reproducer without OKD.

lucab · 2021-08-31T14:07:34Z

I did manage to reproduce this outside of OKD on latest stable (34.20210808.3.0), just with an Ignition config with a reference to a remote resource.
nm-initrd.service is conditional on the initrd neednet, and a remote reference in Ignition will plug into it.

I'm having a hard time catching the real error.
Adding a serial device makes this go away, but the same effect also happens if I drop all console= kargs in order to get an emergency shell.
As such, I suspect this is not merely a timing race which can get affected by attaching a serial device, but a plain bug related to the node console setup.
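
To see which console setup a node actually booted with, the console= arguments can be read from the kernel command line. A minimal sketch, assuming a Linux system where /proc/cmdline is readable; the function name is made up for illustration:

```python
from typing import List


def console_kargs(cmdline: str) -> List[str]:
    """Return the values of all console= arguments, in command-line order."""
    return [
        arg.split("=", 1)[1]
        for arg in cmdline.split()
        if arg.startswith("console=")
    ]


if __name__ == "__main__":
    try:
        with open("/proc/cmdline", encoding="ascii", errors="replace") as f:
            consoles = console_kargs(f.read())
    except OSError:
        consoles = []
    # An empty list means no console= karg was given at boot.
    print(consoles)
```

An empty result, or a console= value naming a device the VM does not have, would be the kind of mismatched setup suspected here.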

lucab · 2021-08-31T14:15:33Z

After a few tries, I did manage to capture the failure (cut for brevity, this was the 4th restart-on-failure of that unit):

I think this is due to nm-initrd.service unit having a StandardOutput=tty in its service definition, see https://github.com/dracutdevs/dracut/blob/055/modules.d/35network-manager/nm-initrd.service#L19.

I believe this isn't vmware-specific, but may affect any setup with a mismatched console configuration.

kai-uwe-rommel · 2021-08-31T14:27:16Z

OK. It sounds like this can/will be fixed in the near future?

lucab · 2021-08-31T14:42:52Z

Yes, I think this could be simply fixed by dropping the StandardOutput= config, although I'm not sure why it's there in the first place.
@bengal do you maybe remember why you set it up this way?

Sidenote: I won't be able to push this forward anytime soon, as I'm about to go offline for a few days.

bengal · 2021-09-06T08:36:51Z

StandardOutput=tty was added so that NM output is visible in the console (which is useful especially with rd.debug). I'll try to reproduce and find a different solution.

lucab · 2021-09-06T09:07:00Z

@bengal ack. In that case I think you can try to see whether StandardOutput=journal+console is ok too.

bengal · 2021-09-06T14:30:19Z

I reproduced the issue by adding console=null to the kernel command line.

StandardOutput=journal+console solves the problem with missing console, but also causes duplicate messages in the journal, as NM already writes to both stdout and the journal in initrd.

Perhaps we should add a configuration option (or command line switch) so that NM can write only to stdout (and then the service will use StandardOutput=journal+console)

cgwalters · 2021-09-07T15:28:55Z

> Perhaps we should add a configuration option (or command line switch) so that NM can write only to stdout (and then the service will use StandardOutput=journal+console)

Yes, that makes the most sense to me, then systemd handles everything (and presumably does so in a race-free way).

bengal · 2021-09-10T08:29:00Z

NM discussion here: https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/977

kai-uwe-rommel · 2021-09-12T10:33:01Z

How will I know when and with which release of FCOS the fix will be delivered?

sandrobonazzola · 2021-10-01T14:47:35Z

Seems like I hit this issue as well, thanks @kai-uwe-rommel for pointing here.

The network-manager module also writes logs to the console, so that it's easier to debug network-related boot issues. If systemd can't open the console, the service fails and network doesn't get configured. Add a check to disable tty output when the console is not present or not usable. coreos/fedora-coreos-tracker#943
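
The idea of checking whether the console is usable before asking for tty output can be sketched as follows. This is not the code from the dracut change, which is not shown in this thread; it only illustrates one way to test that a console device can be opened for writing, and the function name is hypothetical.

```python
import os


def console_usable(path: str = "/dev/console") -> bool:
    """Return True if `path` can be opened for writing, False otherwise."""
    try:
        # O_NOCTTY keeps the device from becoming our controlling terminal.
        fd = os.open(path, os.O_WRONLY | os.O_NOCTTY)
    except OSError:
        return False
    os.close(fd)
    return True


if __name__ == "__main__":
    print("console usable:", console_usable())
```

A service setup step could run a check like this and enable tty output only when it succeeds, so that a missing or unusable console no longer prevents the unit from starting.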

kai-uwe-rommel · 2021-10-21T11:57:39Z

Any news about when a release is to be expected?

dustymabe · 2021-10-21T13:34:22Z

waiting on @haraldh in dracutdevs/dracut#1611 (comment)

olivierlemasle · 2021-10-26T09:59:29Z

Now that the dracut fix has been merged, what are the next steps? Backporting the fix in Fedora's dracut rpm package and fast-tracking the update in fcos?

dustymabe · 2021-10-26T12:37:38Z

> Now that the dracut fix has been merged, what are the next steps? Backporting the fix in Fedora's dracut rpm package and fast-tracking the update in fcos?

yep.. any chance you want to do the backport to the rpm? See https://src.fedoraproject.org/rpms/dracut/pull-request/14# for an example.

olivierlemasle · 2021-10-26T19:12:13Z

I've successfully tested the fix (I also encountered the issue) 🎉
And I've opened a pull request to backport it in the rpm: https://src.fedoraproject.org/rpms/dracut/pull-request/17

dustymabe · 2021-10-27T16:17:59Z

Thank you so much @olivierlemasle - you're amazing!

dustymabe · 2021-10-27T18:48:20Z

In case anyone else wants to try out the fix here is a link to a rawhide stream build: https://dustymabe.fedorapeople.org/fedora-coreos-36.20211027.dev.0-vmware.x86_64.ova

AdamWill · 2021-10-28T20:18:48Z

PR merged, builds and updates done for F34, F35 and F36.

dustymabe · 2021-11-01T15:48:48Z

Fixed by:

dustymabe · 2021-11-03T16:12:58Z

The fix for this went into next stream release 35.20211029.1.0. Please try out the new release and report issues.

dustymabe · 2021-11-03T16:13:04Z

The fix for this went into testing stream release 34.20211031.2.0. Please try out the new release and report issues.

kai-uwe-rommel · 2021-11-05T20:27:52Z

I just did a cluster installation with 34.20211031.2.0 on vSphere UPI and the fix seems to work properly. All nodes started fine.

dustymabe · 2021-11-11T15:39:19Z

The fix for this went into stable stream release 34.20211031.3.0.

The network-manager module also writes logs to the console, so that it's easier to debug network-related boot issues. If systemd can't open the console, the service fails and network doesn't get configured. Add a check to disable tty output when the console is not present or not usable. coreos/fedora-coreos-tracker#943 (cherry picked from commit f6e6be245d0cda14d90a0442b688c8dca1410a2e)

The network-manager module also writes logs to the console, so that it's easier to debug network-related boot issues. If systemd can't open the console, the service fails and network doesn't get configured. Add a check to disable tty output when the console is not present or not usable. coreos/fedora-coreos-tracker#943 (cherry picked from commit f6e6be2) bsc#1201975

kai-uwe-rommel added the kind/bug label Aug 29, 2021

vrutkovs mentioned this issue Aug 30, 2021

Cannot create a new 4.7.0-0.okd-2021-08-22-163618 cluster with oVirt backend okd-project/okd#837

Closed

lucab changed the title ~~FCOS starting with 2021-07-11 fails to start networking when installing OKD~~ initrd: nm-initrd fails to spawn depending on console setup Aug 31, 2021

lucab added area/initramfs component/NetworkManager labels Aug 31, 2021

lucab changed the title ~~initrd: nm-initrd fails to spawn depending on console setup~~ initrd: nm-initrd.service fails to spawn depending on console setup Aug 31, 2021

dustymabe added the jira for syncing to jira label Aug 31, 2021

lucab mentioned this issue Oct 13, 2021

remove console=ttyS0 on metal #567

Closed

dustymabe added status/pending-testing-release Fixed upstream. Waiting on a testing release. status/pending-next-release Fixed upstream. Waiting on a next release. labels Nov 1, 2021

dustymabe added status/pending-stable-release Fixed upstream and in testing. Waiting on stable release. and removed status/pending-testing-release Fixed upstream. Waiting on a testing release. status/pending-next-release Fixed upstream. Waiting on a next release. labels Nov 3, 2021

dustymabe closed this as completed Nov 3, 2021

dustymabe removed the status/pending-stable-release Fixed upstream and in testing. Waiting on stable release. label Nov 11, 2021

kai-uwe-rommel mentioned this issue Nov 11, 2021

machine-config operator does not initialize okd-project/okd#963

Closed

dustymabe mentioned this issue Mar 21, 2022

Issues resizing the root partition #1135

Closed
