Production failure: Dotmesh server pods ending up in `Completed` state #449

alaric-dotmesh · 2018-06-12T12:46:32Z

In production, two of our server pods just... finished:

server-gke-dothub-cluster-default-pool-6cc559e3-d4j9   0/1       Completed   0          1h
server-gke-dothub-cluster-default-pool-6cc559e3-gmdw   1/1       Running     0          1h
server-gke-dothub-cluster-default-pool-6cc559e3-j1w6   0/1       Completed   0          1h

They're not supposed to do that. The operator didn't know what to do about this (see #448 about fixing that) because it shouldn't ever happen.
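For anyone wanting to spot this condition automatically, here is a minimal sketch that scans a pod listing like the one above for pods in the `Completed` state. It assumes the default `kubectl get pods` column order (NAME, READY, STATUS, RESTARTS, AGE); the function name `completed_pods` is made up for illustration and is not part of the operator.

```python
def completed_pods(listing: str) -> list[str]:
    """Return the names of pods whose STATUS column reads 'Completed'.

    Assumes whitespace-separated rows in the default column order:
    NAME READY STATUS RESTARTS AGE
    """
    names = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank or short lines; the header row has STATUS == "STATUS".
        if len(fields) >= 5 and fields[2] == "Completed":
            names.append(fields[0])
    return names
```

Fed the three rows above, this would flag the `d4j9` and `j1w6` pods and skip `gmdw`.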

The logs for the dead pods don't show any useful errors, but do show a strange mixture of everyday logging and what looks like a goroutine dump, annotated with what each goroutine is blocked on.

Here are the logs from all three pods:

d4j9.log
gmdw.log
j1w6.log


Godley · 2018-06-12T13:00:04Z

`kubectl describe` on one of the pods also showed:

  Warning  Unhealthy  38m   kubelet, gke-dothub-cluster-default-pool-6cc559e3-j1w6  Readiness probe failed: Get http://10.60.2.34:32607/check: dial tcp 10.60.2.34:32607: getsockopt: connection refused
  Warning  Unhealthy  38m   kubelet, gke-dothub-cluster-default-pool-6cc559e3-j1w6  Liveness probe failed: Get http://10.60.2.34:32608/check: dial tcp 10.60.2.34:32608: getsockopt: connection refused

But... it's 38m old and there's no "last seen" there, so I'm inclined to say those probe failures only happened when the pod was starting up.

…sing us to `exit 0` and discard the error code. That still doesn't explain why the container was dying (with exit code 2, as it happens); it only explains why the failure was being reported as successful completion.
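To illustrate the general failure mode (not the actual dotmesh code, which isn't shown in this thread), here is a minimal sketch of a wrapper that swallows a child's exit status next to one that propagates it. The function names `run_swallowing` and `run_propagating` are hypothetical. A container whose entrypoint behaves like the first one will be reported as `Completed` even when the real process failed.

```python
import subprocess


def run_swallowing(cmd: list[str]) -> int:
    """Anti-pattern: run cmd, ignore how it ended, always report success."""
    subprocess.run(cmd)
    return 0  # the child's real status is discarded here


def run_propagating(cmd: list[str]) -> int:
    """Run cmd and hand back the child's real exit status."""
    # Note: on POSIX, returncode is negative if the child was killed
    # by a signal; a real wrapper may want to map that to 128 + signal.
    return subprocess.run(cmd).returncode


# As a script entry point, a wrapper would typically finish with:
#     sys.exit(run_propagating(sys.argv[1:]))
```

With a child that exits with status 2, the first function still returns 0 while the second returns 2, so only the second lets Kubernetes see the pod as failed.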

alaric-dotmesh added the bug label Jun 12, 2018

alaric-dotmesh closed this as completed Jun 14, 2018
