kvm with containerd needs more time to stop #17967

Merged: 1 commit, Jan 17, 2024

@prezha prezha commented Jan 15, 2024

we see tests flake when stopping a vm node with the kvm driver and the containerd container runtime (example), so this pr raises the stop timeout (from 60 to 120 retries, about one second apart, as the logs below show) to let the vm shut down gracefully

also, libvirt's VIR_DOMAIN_SHUTDOWN status (the domain is being shut down) maps more naturally to libmachine's state.Stopping than to state.Running, which can be misleading - eg:

// If it's not running, quickly bail out rather than delivering conflicting messages
if st.Host != state.Running.String() {
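
The remapping described above can be sketched roughly as follows. This is an illustration, not the driver's actual code: `domainState`, `machineState` and `machineStateFor` are hypothetical stand-ins for libvirt's `virDomainState` enum and libmachine's `state` package, and every mapping other than SHUTDOWN -> Stopping is an assumption made for the sake of a complete example.

```go
package main

// domainState mirrors the numeric order of libvirt's virDomainState enum.
// Hypothetical stand-in type; the real driver uses the libvirt Go bindings.
type domainState int

const (
	domainNoState     domainState = iota // VIR_DOMAIN_NOSTATE
	domainRunning                        // VIR_DOMAIN_RUNNING
	domainBlocked                        // VIR_DOMAIN_BLOCKED
	domainPaused                         // VIR_DOMAIN_PAUSED
	domainShutdown                       // VIR_DOMAIN_SHUTDOWN: the domain is being shut down
	domainShutoff                        // VIR_DOMAIN_SHUTOFF: the domain is off
	domainCrashed                        // VIR_DOMAIN_CRASHED
	domainPMSuspended                    // VIR_DOMAIN_PMSUSPENDED
)

// machineState is a hypothetical stand-in for libmachine's state.State.
type machineState string

const (
	stateNone     machineState = "None"
	stateRunning  machineState = "Running"
	statePaused   machineState = "Paused"
	stateStopped  machineState = "Stopped"
	stateStopping machineState = "Stopping"
	stateError    machineState = "Error"
)

// machineStateFor translates a libvirt domain state into a machine state.
// Only the SHUTDOWN -> Stopping case comes from this pr; the rest are assumed.
func machineStateFor(d domainState) machineState {
	switch d {
	case domainRunning, domainBlocked:
		return stateRunning
	case domainPaused, domainPMSuspended:
		return statePaused
	case domainShutdown:
		// Reporting Running here would make a vm that is mid-shutdown
		// look like one that ignored the stop request.
		return stateStopping
	case domainShutoff:
		return stateStopped
	case domainCrashed:
		return stateError
	default:
		return stateNone
	}
}
```

With a mapping like this, a guard such as the `st.Host != state.Running.String()` check quoted above treats a domain that is still shutting down as "not running" instead of reporting it as a live node.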

before

$ minikube -p ha-741094 node stop m02 -v=7 --alsologtostderr

...
I0115 19:17:18.946620   30826 out.go:177] ✋  Stopping node "ha-741094-m02"  ...
✋  Stopping node "ha-741094-m02"  ...
...
I0115 19:17:19.065289   30826 main.go:141] libmachine: Stopping "ha-741094-m02"...
I0115 19:17:19.065308   30826 main.go:141] libmachine: (ha-741094-m02) Calling .GetState
I0115 19:17:19.068085   30826 main.go:141] libmachine: (ha-741094-m02) Calling .Stop
I0115 19:17:19.075758   30826 main.go:141] libmachine: (ha-741094-m02) Waiting for machine to stop 0/60
I0115 19:17:20.078748   30826 main.go:141] libmachine: (ha-741094-m02) Waiting for machine to stop 1/60
I0115 19:17:21.081220   30826 main.go:141] libmachine: (ha-741094-m02) Waiting for machine to stop 2/60
...
I0115 19:18:16.315134   30826 main.go:141] libmachine: (ha-741094-m02) Waiting for machine to stop 57/60
I0115 19:18:17.321792   30826 main.go:141] libmachine: (ha-741094-m02) Waiting for machine to stop 58/60
I0115 19:18:18.328580   30826 main.go:141] libmachine: (ha-741094-m02) Waiting for machine to stop 59/60
I0115 19:18:19.329504   30826 stop.go:66] stop err: unable to stop vm, current state "Running"
W0115 19:18:19.329612   30826 out.go:239] 💣  Failed to stop node m02
💣  Failed to stop node m02
I0115 19:18:19.329659   30826 out.go:177] 🛑  Successfully stopped node ha-741094-m02
🛑  Successfully stopped node ha-741094-m02

after

$ minikube -p ha-741094 node stop m02 -v=7 --alsologtostderr

I0115 19:52:45.773802   24027 main.go:141] libmachine: Stopping "ha-741094-m02"...
I0115 19:52:45.773822   24027 main.go:141] libmachine: (ha-741094-m02) Calling .GetState
I0115 19:52:45.777192   24027 main.go:141] libmachine: (ha-741094-m02) Calling .Stop
I0115 19:52:45.786709   24027 main.go:141] libmachine: (ha-741094-m02) Waiting for machine to stop 0/120
I0115 19:52:46.792581   24027 main.go:141] libmachine: (ha-741094-m02) Waiting for machine to stop 1/120
I0115 19:52:47.797988   24027 main.go:141] libmachine: (ha-741094-m02) Waiting for machine to stop 2/120
...
I0115 19:54:14.201874   24027 main.go:141] libmachine: (ha-741094-m02) Waiting for machine to stop 88/120
I0115 19:54:15.205584   24027 main.go:141] libmachine: (ha-741094-m02) Waiting for machine to stop 89/120
I0115 19:54:16.212570   24027 main.go:141] libmachine: (ha-741094-m02) Waiting for machine to stop 90/120
I0115 19:54:17.216139   24027 main.go:141] libmachine: (ha-741094-m02) Calling .GetState
I0115 19:54:17.219025   24027 main.go:141] libmachine: Machine "ha-741094-m02" was stopped.
I0115 19:54:17.219039   24027 stop.go:75] duration metric: took 1m31.560021625s to stop
I0115 19:54:17.219096   24027 out.go:177] 🛑  Successfully stopped node ha-741094-m02
🛑  Successfully stopped node ha-741094-m02

note: the "Successfully stopped node" message that follows "Failed to stop node" in the before log is addressed in pr #17965

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 15, 2024
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 15, 2024

prezha commented Jan 16, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jan 16, 2024
@minikube-pr-bot

kvm2 driver with docker runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 17967) |
+----------------+----------+---------------------+
| minikube start | 51.1s    | 50.4s               |
| enable ingress | 25.2s    | 25.4s               |
+----------------+----------+---------------------+

Times for minikube start: 51.9s 53.7s 51.4s 51.1s 47.2s
Times for minikube (PR 17967) start: 49.1s 51.4s 50.2s 50.3s 51.0s

Times for minikube ingress: 23.6s 25.1s 27.6s 23.6s 26.1s
Times for minikube (PR 17967) ingress: 23.5s 26.5s 23.6s 26.5s 26.6s

docker driver with docker runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 17967) |
+----------------+----------+---------------------+
| minikube start | 24.2s    | 22.6s               |
| enable ingress | 18.6s    | 19.9s               |
+----------------+----------+---------------------+

Times for minikube start: 25.0s 24.9s 22.3s 24.4s 24.3s
Times for minikube (PR 17967) start: 21.6s 24.2s 21.7s 21.3s 24.1s

Times for minikube ingress: 17.8s 17.8s 18.8s 20.8s 17.8s
Times for minikube (PR 17967) ingress: 20.8s 20.8s 18.3s 20.9s 18.8s

docker driver with containerd runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 17967) |
+----------------+----------+---------------------+
| minikube start | 21.8s    | 22.1s               |
| enable ingress | 28.3s    | 26.1s               |
+----------------+----------+---------------------+

Times for minikube start: 24.1s 20.5s 20.8s 23.4s 20.4s
Times for minikube (PR 17967) start: 23.6s 20.4s 23.0s 23.2s 20.3s

Times for minikube ingress: 30.3s 31.3s 30.3s 18.4s 31.3s
Times for minikube (PR 17967) ingress: 18.3s 18.3s 33.3s 30.3s 30.3s


@spowelljr spowelljr left a comment

Code looks good to me, just waiting for the tests

As a side note, 1.5 mins seems like a long time for an instance to stop, but it looks like it's just a KVM limitation

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: prezha, spowelljr

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


prezha commented Jan 16, 2024

> Code looks good to me, just waiting for the tests
>
> As a side note, 1.5 mins seems like a long time for an instance to stop, but it looks like it's just a KVM limitation

thanks @spowelljr

yes, i saw that occasional behaviour (as shown in the example above) only for the kvm+containerd combo (probably containerd itself needs that additional time to stop gracefully) - this should eliminate some of the tests' flakiness in general

btw, for this pr: those three netlify tests look stuck and (most of) the other tests were completed hours ago; update: they're unstuck & done now :)

@prezha prezha merged commit 09eee30 into kubernetes:master Jan 17, 2024
38 of 50 checks passed
@minikube-pr-bot

These are the flake rates of all failed tests.

| Environment | Failed Tests | Flake Rate (%) |
| --- | --- | --- |
| Docker_Linux_crio_arm64 | TestMultiNode/serial/PingHostFrom2Pods | 0.00 |
| Docker_Linux_crio | TestMultiNode/serial/PingHostFrom2Pods | 0.00 |
| Hyper-V_Windows | TestForceSystemdEnv | 0.00 |
| Hyper-V_Windows | TestIngressAddonLegacy/serial/ValidateIngressAddonActivation | 0.00 |
| Hyper-V_Windows | TestIngressAddonLegacy/serial/ValidateIngressAddons | 0.00 |
| Hyper-V_Windows | TestIngressAddonLegacy/serial/ValidateIngressDNSAddonActivation | 0.00 |
| Hyper-V_Windows | TestIngressAddonLegacy/StartLegacyK8sCluster | 0.00 |
| Hyper-V_Windows | TestJSONOutput/pause/Command | 0.00 |
| Hyper-V_Windows | TestJSONOutput/start/Command | 0.00 |
| Hyper-V_Windows | TestJSONOutput/unpause/Command | 0.00 |
| Hyper-V_Windows | TestOffline | 0.00 |
| Hyper-V_Windows | TestScheduledStopWindows | 0.00 |
| KVM_Linux_crio | TestMultiNode/serial/PingHostFrom2Pods | 0.00 |
| Docker_Linux_containerd_arm64 | TestAddons/parallel/CloudSpanner | 2.13 |
| Hyper-V_Windows | TestCertExpiration | 4.76 |
| Hyper-V_Windows | TestNetworkPlugins/group/enable-default-cni/Start | 5.00 |
| KVM_Linux_containerd | TestAddons/parallel/Headlamp | 5.07 |
| Hyper-V_Windows | TestFunctional/parallel/ServiceCmd/Format | 9.52 |
| Hyper-V_Windows | TestFunctional/parallel/ServiceCmd/HTTPS | 9.52 |
| Hyper-V_Windows | TestFunctional/parallel/ServiceCmd/URL | 9.52 |
| Hyper-V_Windows | TestNoKubernetes/serial/StartWithK8s | 28.57 |

