Production failure: Dotmesh server pods ending up in Completed state #449

Closed
alaric-dotmesh opened this issue Jun 12, 2018 · 1 comment

In production, we saw that two of our server pods just... finished:

server-gke-dothub-cluster-default-pool-6cc559e3-d4j9   0/1       Completed   0          1h
server-gke-dothub-cluster-default-pool-6cc559e3-gmdw   1/1       Running     0          1h
server-gke-dothub-cluster-default-pool-6cc559e3-j1w6   0/1       Completed   0          1h

They're not supposed to do that. The operator didn't know what to do about this (see #448 about fixing that) because it shouldn't ever happen.
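
For spotting this state automatically, here is a minimal sketch that parses the listing above and picks out the pods reported as `Completed`. It assumes the usual five `kubectl get pods` columns (NAME, READY, STATUS, RESTARTS, AGE); `PodRow`, `parse_rows` and `completed_pods` are hypothetical names for illustration, not part of Dotmesh or its operator.

```python
from typing import List, NamedTuple


class PodRow(NamedTuple):
    """One row of `kubectl get pods` output (assumed five-column layout)."""
    name: str
    ready: str
    status: str
    restarts: int
    age: str


# The three rows quoted in this issue, copied verbatim.
LISTING = """\
server-gke-dothub-cluster-default-pool-6cc559e3-d4j9   0/1       Completed   0          1h
server-gke-dothub-cluster-default-pool-6cc559e3-gmdw   1/1       Running     0          1h
server-gke-dothub-cluster-default-pool-6cc559e3-j1w6   0/1       Completed   0          1h
"""


def parse_rows(listing: str) -> List[PodRow]:
    """Split each line on whitespace; skip the header and malformed lines."""
    rows = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        name, ready, status, restarts, age = fields
        rows.append(PodRow(name, ready, status, int(restarts), age))
    return rows


def completed_pods(rows: List[PodRow]) -> List[str]:
    """Names of pods whose STATUS column reads Completed."""
    return [row.name for row in rows if row.status == "Completed"]


if __name__ == "__main__":
    for name in completed_pods(parse_rows(LISTING)):
        print("unexpectedly completed:", name)
```

A long-running server pod should never appear in that list, so any hit is worth alerting on.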

The logs for the dead pods don't show any useful errors, but do show a strange mixture of everyday logging and what looks like a goroutine dump, annotated with what each goroutine is blocked on.

Here are the logs from all three pods:

d4j9.log
gmdw.log
j1w6.log


Godley commented Jun 12, 2018

`kubectl describe` on one of the pods also showed:

Warning  Unhealthy  38m   kubelet, gke-dothub-cluster-default-pool-6cc559e3-j1w6  Readiness probe failed: Get http://10.60.2.34:32607/check: dial tcp 10.60.2.34:32607: getsockopt: connection refused
Warning  Unhealthy  38m   kubelet, gke-dothub-cluster-default-pool-6cc559e3-j1w6  Liveness probe failed: Get http://10.60.2.34:32608/check: dial tcp 10.60.2.34:32608: getsockopt: connection refused

But... it's 38m old and there's no "last seen" there, so I'm inclined to say those failures only happened while the pod was starting up.
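
A `connection refused` in those events means nothing was listening on the probe port at that moment, which fits a server that has not finished starting. As a rough illustration, here is a sketch of an equivalent check: an HTTP GET that counts 2xx/3xx as healthy. `probe` and `unused_port` are hypothetical helpers for this example, and the real kubelet probe logic is more involved.

```python
import socket
import urllib.error
import urllib.request

# Connect directly, ignoring any proxy settings in the environment.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url: str, timeout: float = 1.0) -> str:
    """Return 'ok' for a 2xx/3xx response, otherwise a short failure reason."""
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return "ok" if 200 <= resp.status < 400 else f"status {resp.status}"
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as HTTPError; the server is up but unhealthy.
        return f"status {err.code}"
    except (urllib.error.URLError, OSError) as err:
        # Connection refused, timeouts, DNS failures: nothing answered at all.
        return f"failed: {err}"


def unused_port() -> int:
    """Ask the OS for a free local port, then release it so nothing listens there."""
    with socket.socket() as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    # Nothing listens on this port, so the probe fails the way it did at startup.
    print(probe(f"http://127.0.0.1:{unused_port()}/check"))
```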

alaric-dotmesh added a commit that referenced this issue Jun 12, 2018
…sing us to `exit 0` and discard the error code.

Still doesn't explain why the container was dying (with exit code 2, as it happens);
just why the exit code failure was being reported as successful completion.
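
The misreporting described in that commit message can be reproduced in miniature. This is only a sketch under stated assumptions: the actual wrapper is not shown in this thread, so `SWALLOWING_WRAPPER` and `PROPAGATING_WRAPPER` are hypothetical stand-ins, and the "server" is a child process that simply exits with code 2. A container whose main process exits 0 is what Kubernetes reports as `Completed`.

```python
import subprocess
import sys

# Hypothetical stand-in for the server process: it dies with exit code 2.
CHILD = "import sys; sys.exit(2)"

# Wrapper that runs the child, then exits 0 no matter what happened
# (the assumed shape of the bug: the child's exit code is discarded).
SWALLOWING_WRAPPER = (
    "import subprocess, sys\n"
    f"subprocess.run([sys.executable, '-c', {CHILD!r}])\n"
    "sys.exit(0)\n"
)

# Wrapper that passes the child's exit code through to its own caller.
PROPAGATING_WRAPPER = (
    "import subprocess, sys\n"
    f"result = subprocess.run([sys.executable, '-c', {CHILD!r}])\n"
    "sys.exit(result.returncode)\n"
)


def exit_code_of(wrapper_source: str) -> int:
    """Run a wrapper in a fresh interpreter and return the code its caller sees."""
    return subprocess.run([sys.executable, "-c", wrapper_source]).returncode


if __name__ == "__main__":
    print("swallowing wrapper: ", exit_code_of(SWALLOWING_WRAPPER))
    print("propagating wrapper:", exit_code_of(PROPAGATING_WRAPPER))
```

With the propagating form, the failure surfaces as a non-zero exit instead of looking like a clean completion.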