Bootstrap machine fails to install with okd 4.7, 4.8 and 4.9 #897
tried:
|
Failing on 4.7 and 4.8 as well |
Please attach log bundle |
Masters booted, but didn't request master ignition. Check boot log on master machine, seems network is down there |
Seems to be invalid network config - it can't reach to DNS server apparently? |
the DNS is provided by the lab but here it seems it's trying to connect on [::1]:53 rather than connecting to the lab dns |
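One way to narrow this down on the affected node: a resolver that falls back to `[::1]:53` often has no usable `nameserver` entry in `/etc/resolv.conf`. That is an assumption about the cause here, not something confirmed in this thread. A minimal Python sketch that reads resolv.conf-style text and reports whether it lists no nameservers or only loopback ones (the function names are made up for illustration):

```python
import ipaddress


def parse_nameservers(resolv_conf_text):
    """Return the addresses from 'nameserver' lines, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def is_loopback(address):
    """True if the address parses as an IPv4 or IPv6 loopback address."""
    try:
        # Drop an IPv6 zone id such as "%eth0" before parsing.
        return ipaddress.ip_address(address.split("%", 1)[0]).is_loopback
    except ValueError:
        return False


def has_no_external_nameserver(resolv_conf_text):
    """True if there are no nameservers, or every one of them is loopback."""
    servers = parse_nameservers(resolv_conf_text)
    return all(is_loopback(s) for s in servers)


# Sample input only; on a real node you would read /etc/resolv.conf instead.
sample = "search lab.example\nnameserver ::1\n"
print(parse_nameservers(sample), has_no_external_nameserver(sample))
```

A loopback entry is not always wrong (for example, the systemd-resolved stub listens on 127.0.0.53), so treat a `True` result as a hint to look at the DHCP lease and NetworkManager state, not as proof.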
is there any way for configuring the dns to be used from the install-config.yaml ? |
Can you confirm that the master node is getting an IP address (i.e. can you ping it)? This does seem to indicate an issue with the network configuration. The IP configuration cannot be set via install-config. |
master0 doesn't reply to ping but I see it got both IPv4 and IPv6 from the guest agent running on it (it's a bare metal UPI but the host is virtual) |
is there a way to set up the core user password from install-config.yaml, as in https://docs.fedoraproject.org/en-US/fedora-coreos/authentication/#_using_password_authentication ? |
Unfortunately, I don't believe there are a lot of further insights to be offered without understanding why the master is unreachable. You might consider booting the FCOS live ISO to see if you can troubleshoot the network from a vantage point where you can run troubleshooting commands. install-config.yaml doesn't allow specific users to be created. Users are created via the ignition content that is produced by the installer binary. Regardless, until the master nodes retrieve their ignition, a user provided via master.ign wouldn't be applied anyway. IMO your best bet is to boot the ISO and see if you can inspect the network configuration. |
|
@dustymabe sounds like something you handled in coreos/fedora-coreos-tracker#883 |
looks like passing |
^ i.e. coreos/fedora-coreos-tracker#883 |
@LorbusChris please check also 4.8 and 4.9 as I see the issue there too |
Which is also a version from well before the change even made it into FCOS's testing-devel (on 20210709: coreos/fedora-coreos-config@876fda1) |
We have been struggling with this same issue for the last week while trying to stand up a new OKD cluster. When the nodes try to pull the ignition config from the bootstrap we get a 'connection refused' message. We are using FCOS 34.20210904.3.0 and trying to build an OKD 4.7 UPI bare metal cluster. Any feedback would be appreciated. We have validated our network configuration (DNS, DHCP, load balancer, etc.) multiple times. |
@markandrewj I managed to pass the network issue by adding I now progressed the deployment from 80% to 87% and got stuck with:
I tried attaching the log bundle here but I guess it's too large. Uploaded here: http://file.rdu.redhat.com/~sbonazzo/log-bundle-20210930072055.tar.gz but I'm not sure it's visible outside Red Hat. |
After a few reboots of masters and workers the OKD deployment reached 100%. The whole flow will need to be re-tested once @LorbusChris's patches get into a build. |
@sandrobonazzola we can give your suggestion a try, thank you. The log bundle you have linked is unfortunately not available from our network. Once the patches are applied we will probably try rebuilding the nodes again. We are building this as a test cluster to support two other OKD/OCP clusters we currently have. The plan is to use it for testing upgrades, cluster configuration changes, and other administrative tasks. |
Please see here: coreos/fedora-coreos-tracker#943 |
If that's the case, bumping the boot image now (see PRs linked above) won't solve it. |
@kai-uwe-rommel we took your suggestion and tried to ignite the cluster using the fedora-coreos-34.20210626.3.2-live.x86_64 iso. Using this version of FCOS we were able to get past the issue we were having. The cluster is currently still bootstrapping, I will follow up further after the installation completes (or doesn't complete). Thank you for the feedback. |
I had the same problem (masters querying localhost DNS instead of network DNS) on my NUCs. It seems the NICs on the NUCs are too slow during startup and I had to add a timeout during first boot:
perhaps that would be something to try out as well? |
We tried @smuda's suggestion, and we got past the original issue where we saw a 'connection refused' message. However, now when the worker nodes try to pull the config from the internal API we receive an 'internal server error' message. The master nodes were able to pull their configs though. We tried using an older FCOS image from June, which allowed the config to be pulled, but the bootstrap never completed. We also tried installing using both the 4.7 and 4.8 installer. |
The worker ignition config isn't made available until sometime around the bootstrap-complete phase of the installation. It may need some time to finish bootstrapping. A must-gather could be useful in understanding why the machine-config server isn't yet returning a worker ignition. |
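To see what the machine-config server returns for each role, you can probe it directly. The sketch below assumes the usual machine-config server endpoint on `api-int` (port 22623, path `/config/<role>`); the host name is a placeholder and the helper names are invented for this example. TLS verification is switched off because the cluster CA is normally not trusted on the machine you probe from.

```python
import ssl
import urllib.error
import urllib.request

MCS_PORT = 22623  # machine-config server port on api-int


def mcs_config_url(api_int_host, role):
    """Build the ignition URL for a role ('master' or 'worker')."""
    if role not in ("master", "worker"):
        raise ValueError(f"unexpected role: {role!r}")
    return f"https://{api_int_host}:{MCS_PORT}/config/{role}"


def probe(url, timeout=10):
    """Return the HTTP status code, or None if the connection itself fails."""
    ctx = ssl.create_default_context()
    # Diagnostic use only: skip certificate checks for the cluster-internal CA.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. a 500 while the worker config is not ready
    except (urllib.error.URLError, OSError):
        return None  # refused, timed out, or DNS failure


# Example (placeholder host name, adjust to your cluster):
# for role in ("master", "worker"):
#     print(role, probe(mcs_config_url("api-int.mycluster.example.com", role)))
```

A `None` result points at the network or load balancer, while an HTTP error code for `worker` but not for `master` would be consistent with the bootstrap simply not being far enough along yet.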
I use FCOS 34.20210626.3.1 which works for me but it needs some fixes. Once I detect the bootstrap has finished, I remove the bootstrap machine from the loadbalancer in front of api-int, thereby only including the masters. I've documented my setup at AyoyAB/okd-with-ansible. Perhaps something in that setup can give you some inspiration? Among other things there are fixes for (746, 975) FCOS issues which I've included in ansible role "0-create-local-files". |
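The "detect that bootstrap has finished, then drop the bootstrap machine from the load balancer" step could be scripted roughly as follows. This is only a sketch: it assumes `openshift-install` is on the PATH and uses the same `wait-for` subcommand shown in the issue description. `remove_bootstrap_from_lb` is a hypothetical placeholder, since the actual load balancer change depends on the setup and is not taken from the linked Ansible repository.

```python
import subprocess


def wait_for_argv(install_dir, target="bootstrap-complete", log_level="info"):
    """Build the openshift-install command line for a wait-for target."""
    if target not in ("bootstrap-complete", "install-complete"):
        raise ValueError(f"unexpected wait-for target: {target!r}")
    return [
        "openshift-install",
        "--dir", install_dir,
        "wait-for", target,
        f"--log-level={log_level}",
    ]


def wait_then(install_dir, on_complete):
    """Block until bootstrap completes, then run the given callback."""
    result = subprocess.run(wait_for_argv(install_dir), check=False)
    if result.returncode == 0:
        on_complete()
    return result.returncode


def remove_bootstrap_from_lb():
    # Hypothetical placeholder: edit the api / api-int backends here.
    print("bootstrap complete: remove the bootstrap node from the load balancer")


# Example (the directory is taken from the command in the issue description):
# wait_then("/root/install_dir", remove_bootstrap_from_lb)
```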
We were finally able to get the cluster running using openshift-install-linux-4.7.0-0.okd-2021-07-03-190901.tar.gz and fedora-coreos-34.20210626.3.2-live.x86_64.iso. I would recommend doing a test install of OKD using the bare metal UPI method, if this isn't already tested as part of the OKD release pipeline. It seems that some of the FCOS images currently have issues. |
That would indeed be great, but we don't have the capacity for this yet.
Yes. See https://github.com/openshift/okd/blob/master/FAQ.md#which-fedora-coreos-should-i-use for OKD 4.8 @sandrobonazzola do you mind if I close this one? |
@vrutkovs ok to close this one, the issue I hit is being tracked on fcos side. |
@sandrobonazzola Where can I find the 4.9 version of the installer? I don't see it in the release tags. |
@karthik101 you can get it here: https://amd64.origin.releases.ci.openshift.org/#4.9.0-0.okd but be aware it's not promoted to |
Hi, the bootstrap machine fails to install with OKD 4.7, 4.8 and 4.9.
For the tests, VMs with these compatibility modes were used:
These OKD versions have been tried (unsuccessfully):
The first part of the ignition process looks fine
The installation process ends at the same place as described by @sandrobonazzola and the symptoms are very similar. Logs:
End of journalctl output from the bootstrap machine:
|
@konup It would seem you have a very different problem, since you have ESXi (vs metal) and you seem to be able to connect from master to bootstrap during installation (vs having a network problem fetching the secondary ignition file). You'd probably be best off creating your own issue. |
OK, because this is a different problem, new issue #1093 was created. |
Describe the bug
bootstrap machine fails to install with okd 4.8 and okd 4.9.
Watching the installation with:
# openshift-install --dir /root/install_dir wait-for install-complete --log-level=debug
reports:
Version
4.9.0-0.okd-2021-09-27-224448
How reproducible
100%
Log bundle
and stays stuck there.