Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initrd: nm-initrd.service fails to spawn depending on console setup #943

Closed
kai-uwe-rommel opened this issue Aug 29, 2021 · 41 comments
Closed

Comments

@kai-uwe-rommel
Copy link

Describe the bug
I regularly deploy OKD clusters with FCOS on vSphere.
The deployment process worked fine up until FCOS 34.2021-06-26 and started failing with 2021-07-11 and still fails with 2021-07-25 and 2021-0808.
The OKD cluster VMs start with plain FCOS OVA and get an igition file passed via guestinfo.ignition from vSphere.
Starting with FCOS 34.2021-07-11, the VMs simply fail to start their network.

Reproduction steps
Steps to reproduce the behavior:

  1. create a VM from the FCOS OVA
  2. configure an OKD ignition file via guestinfo.ignition
  3. power on the VM

Expected behavior
As usual, the VM should initialize their network (with DHCP) and process the ignition file which includes merging an additional remote ignition file.

Actual behavior
The VM does not start its network, is not pingable and cannot resolve the remote http server (and can not connect to it) from which the initial ignition file (passed from vSphere via guestinfo.ignition) tries to merge a remote ignition file.

System details
All on a vSphere 7.0u2 cluster.
FCOS versions see above.

Ignition config
File attached here as text file.

Additional information
I tried to gather more details about why the boot process fails to start the networking by connecting a virtual serial console to the VM, in order to view serial console output and record it.
Unfortunately, when a serial console is connected to the VM, then the networking starts successfully.
So it looks like it is a timing issue and the serial console slows it down enough to let it succeed.

initial-ignition.txt

@kai-uwe-rommel
Copy link
Author

What I can see in the console as it passes by quickly (as described, I cannot reproduce with a serial console and can thus not record it):
grafik

@dustymabe
Copy link
Member

This may be fallout from #842 (mainly because it was first implemented in 34.20210711.3.0). If you remove OKD from the equation can you still reproduce with just a simple remote Ignition config?

@kai-uwe-rommel
Copy link
Author

Without OKD, e.g. some basic standalone VM with an otherwise similar (but purely local, e.g. completely within guestinfo.ignition) ignition file (just without the merged in remote/secondary ignition file) comes up fine. I would need to construct a case where I configure a standalone VM but with a merged in add'l remote ignition file ... did not have that case yet.

@dustymabe
Copy link
Member

Without OKD, e.g. some basic standalone VM with an otherwise similar (but purely local, e.g. completely within guestinfo.ignition) ignition file (just without the merged in remote/secondary ignition file) comes up fine.

Yep. We have some logic in there that won't bring up networking at all if there are no remote references.

I would need to construct a case where I configure a standalone VM but with a merged in add'l remote ignition file ... did not have that case yet.

+1 - it may be easy enough just to provide ignition.config.url=http://... on the kernel command line. Or it's possible it's triggered different, but we could start there.

@lucab
Copy link
Contributor

lucab commented Aug 30, 2021

@kai-uwe-rommel is this node set up via DHCP (specifically, for the initrd and Ignition fetch), or does it have some custom networking configuration via guestinfo or kargs?

@kai-uwe-rommel
Copy link
Author

This is via DHCP.

@kai-uwe-rommel
Copy link
Author

As I tried to express above: this seems a timing issue. When I attach a serial console to debug, the problem vanishes away. Does this ring some bell for someone of you?

@zfiel
Copy link

zfiel commented Aug 30, 2021

As I tried to express above: this seems a timing issue. When I attach a serial console to debug, the problem vanishes away. Does this ring some bell for someone of you?

Thank you @kai-uwe-rommel, it "solved" my problems.

@kai-uwe-rommel
Copy link
Author

kai-uwe-rommel commented Aug 30, 2021

As I tried to express above: this seems a timing issue. When I attach a serial console to debug, the problem vanishes away. Does this ring some bell for someone of you?

Thank you @kai-uwe-rommel, it "solved" my problems.

???

I am having a problem. In what way could this have solved your problem??

@dustymabe
Copy link
Member

dustymabe commented Aug 30, 2021

I am having a problem. In what way could this have solved your problem??

well - in the very least using the serial console (as you suggested) would allow someone to workaround the issue and get unblocked. So maybe that's what they meant.

@zfiel
Copy link

zfiel commented Aug 30, 2021

Yes sorry it dit not solved the root issue, but at least I am able to get my cluster working again, by attaching a serial console (as a workaround).

@kai-uwe-rommel
Copy link
Author

Back to the topic - what next?
Anything I can do to help create debug logs or whatever?
The problem is not going away on its own ... :-(

@dustymabe
Copy link
Member

Anything I can do to help create debug logs or whatever?

Yep. It would still be nice to get a simple reproducer without OKD.

@lucab
Copy link
Contributor

lucab commented Aug 31, 2021

I did manage to reproduce this outside of OKD on latest stable (34.20210808.3.0), just with an Ignition config with a reference to a remote resource.
nm-initrd.service is conditional on the initrd neednet, and a remote reference in Ignition will plug into it.

I'm having some hard time catching the real error.
Adding a serial device makes this go away, but the same effect also happens if I drop all console= kargs in order to get an emergency shell.
As such, I suspect this is not merely a timing race which can get affected by attaching a serial device, but a plain bug related to the node console setup.

@lucab
Copy link
Contributor

lucab commented Aug 31, 2021

After a few tries, I did manage to capture the failure (cut for brevity, this was the 4th restart-on-failure of that unit):
nm-initrd

I think this is due to nm-initrd.service unit having a StandardOutput=tty in its service definition, see https://github.com/dracutdevs/dracut/blob/055/modules.d/35network-manager/nm-initrd.service#L19.

I believe this isn't vmware-specific, but may affect any setup with a mismatched console configuration.

@kai-uwe-rommel
Copy link
Author

Ok. This sounds like this can/will be fixed in near future?

@lucab lucab changed the title FCOS starting with 2021-07-11 fails to start networking when installing OKD initrd: nm-initrd fails to spawn depending on console setup Aug 31, 2021
@lucab
Copy link
Contributor

lucab commented Aug 31, 2021

Yes, I think this could be simply fixed by dropping the StandardOutput= config, although I'm not sure why it's there in the first place.
@bengal do you maybe remember why you set it up this way?

Sidenote: I won't be able to push this forward anytime soon, as I'm about to go offline for a few days.

@lucab lucab changed the title initrd: nm-initrd fails to spawn depending on console setup initrd: nm-initrd.service fails to spawn depending on console setup Aug 31, 2021
@dustymabe dustymabe added the jira for syncing to jira label Aug 31, 2021
@bengal
Copy link

bengal commented Sep 6, 2021

StandardOutput=tty was added so that NM output is visible in the console (which is useful especially with rd.debug). I'll try to reproduce and find a different solution.

@lucab
Copy link
Contributor

lucab commented Sep 6, 2021

@bengal ack. In that case I think you can try to see whether StandardOutput=journal+console is ok too.

@bengal
Copy link

bengal commented Sep 6, 2021

I reproduced the issue by addingconsole=null to kernel command line.

StandardOutput=journal+console solves the problem with missing console, but also causes duplicate messages in the journal, as NM already writes to both stdout and the journal in initrd.

Perhaps we should add a configuration option (or command line switch) so that NM can write only to stdout (and then the service will use StandardOutput=journal+console)

@cgwalters
Copy link
Member

Perhaps we should add a configuration option (or command line switch) so that NM can write only to stdout (and then the service will use StandardOutput=journal+console)

Yes, that makes the most sense to me, then systemd handles everything (and presumably does so in a race-free way).

@bengal
Copy link

bengal commented Sep 10, 2021

@kai-uwe-rommel
Copy link
Author

How will I know when and with which release of FCOS the fix will be delivered?

@sandrobonazzola
Copy link

Seems like I hit this issue as well, thanks @kai-uwe-rommel for pointing here.

bengal added a commit to NetworkManager/dracut that referenced this issue Oct 19, 2021
The network-manager module also writes logs to the console, so that it's easier
to debug network-related boot issues. If systemd can't open the console, the
service fails and network doesn't get configured.

Add a check to disable tty output when the console is not present or not
usable.

coreos/fedora-coreos-tracker#943
@kai-uwe-rommel
Copy link
Author

Any news about when a release to be expected?

@dustymabe
Copy link
Member

waiting on @haraldh in dracutdevs/dracut#1611 (comment)

johannbg pushed a commit to dracutdevs/dracut that referenced this issue Oct 21, 2021
The network-manager module also writes logs to the console, so that it's easier
to debug network-related boot issues. If systemd can't open the console, the
service fails and network doesn't get configured.

Add a check to disable tty output when the console is not present or not
usable.

coreos/fedora-coreos-tracker#943
@olivierlemasle
Copy link

Now that the dracut fix has been merged, what are the next steps? Backporting the fix in Fedora's dracut rpm package and fast-tracking the update in fcos?

@dustymabe
Copy link
Member

Now that the dracut fix has been merged, what are the next steps? Backporting the fix in Fedora's dracut rpm package and fast-tracking the update in fcos?

yep.. any chance you want to do the backport to the rpm? See https://src.fedoraproject.org/rpms/dracut/pull-request/14# for an example.

@olivierlemasle
Copy link

I've successfully tested the fix (I also encountered the issue) 🎉
And I've opened a pull request to backport it in the rpm: https://src.fedoraproject.org/rpms/dracut/pull-request/17

@dustymabe
Copy link
Member

Thank you so much @olivierlemasle - you're amazing!

@dustymabe
Copy link
Member

In case anyone else wants to try out the fix here is a link to a rawhide stream build: https://dustymabe.fedorapeople.org/fedora-coreos-36.20211027.dev.0-vmware.x86_64.ova

@AdamWill
Copy link

PR merged, builds and updates done for F34, F35 and F36.

@dustymabe dustymabe added status/pending-testing-release Fixed upstream. Waiting on a testing release. status/pending-next-release Fixed upstream. Waiting on a next release. labels Nov 1, 2021
@dustymabe
Copy link
Member

The fix for this went into next stream release 35.20211029.1.0. Please try out the new release and report issues.

@dustymabe
Copy link
Member

The fix for this went into testing stream release 34.20211031.2.0. Please try out the new release and report issues.

@dustymabe dustymabe added status/pending-stable-release Fixed upstream and in testing. Waiting on stable release. and removed status/pending-testing-release Fixed upstream. Waiting on a testing release. status/pending-next-release Fixed upstream. Waiting on a next release. labels Nov 3, 2021
@kai-uwe-rommel
Copy link
Author

I just did a cluster installation with 34.20211031.2.0 on vSphere UPI and the fix seems to work properly. All nodes started fine.

@dustymabe
Copy link
Member

The fix for this went into stable stream release 34.20211031.3.0.

@dustymabe dustymabe removed the status/pending-stable-release Fixed upstream and in testing. Waiting on stable release. label Nov 11, 2021
Vogtinator pushed a commit to Vogtinator/dracut-1 that referenced this issue Mar 7, 2022
The network-manager module also writes logs to the console, so that it's easier
to debug network-related boot issues. If systemd can't open the console, the
service fails and network doesn't get configured.

Add a check to disable tty output when the console is not present or not
usable.

coreos/fedora-coreos-tracker#943
(cherry picked from commit f6e6be245d0cda14d90a0442b688c8dca1410a2e)
aafeijoo-suse pushed a commit to aafeijoo-suse/dracut that referenced this issue Aug 2, 2022
The network-manager module also writes logs to the console, so that it's easier
to debug network-related boot issues. If systemd can't open the console, the
service fails and network doesn't get configured.

Add a check to disable tty output when the console is not present or not
usable.

coreos/fedora-coreos-tracker#943
(cherry picked from commit f6e6be2)
aafeijoo-suse pushed a commit to aafeijoo-suse/dracut that referenced this issue Aug 4, 2022
The network-manager module also writes logs to the console, so that it's easier
to debug network-related boot issues. If systemd can't open the console, the
service fails and network doesn't get configured.

Add a check to disable tty output when the console is not present or not
usable.

coreos/fedora-coreos-tracker#943
(cherry picked from commit f6e6be2)

bsc#1201975
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants