Ark server needs to handle crashed/stopped plugin processes #481

Closed
lli-hiya opened this issue May 10, 2018 · 19 comments

@lli-hiya
Contributor

lli-hiya commented May 10, 2018

I have a question that I don't know how to debug. I noticed that Ark can suddenly fail to make its backup (after it made several successful backups), and the error log looks like this:

2018-05-10T00:04:13.972+00:00 10.1.69.73 dockerd[4085]: time="2018-05-10T00:04:13Z" level=error msg="Error uploading log file" bucket=<some-bucket> error="rpc error: code = Internal desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin763922082: connect: connection refused"" key=<some-key>-logs.gz logSource="pkg/cloudprovider/backup_service.go:148".

This error was first noticed when I was using ark v0.6; I have since upgraded to ark v0.8.1 and the error persists.

Some symptoms I noticed when the error happens:

  • The Ark client can still list backups (ark backup get works), but it cannot show backup logs (ark backup logs <backup> fails)
  • Kubectl can still show pod logs (kubectl logs -f <pod> works)
  • If I don't take any action, all future backups fail and are not uploaded to S3
  • If I bounce the pod (kubectl delete pod), everything works again; further backups succeed, maybe for a couple of days, and then the error happens again

My interpretation

  • This error is caused by a failure to connect to S3
  • It is not an authentication/authorization error, because it DOES work for a while after I bounce the pod

What I am asking here is:

  1. Has anyone seen this problem, or this kind of error logs before?
  2. What might cause Ark to suddenly lose its ability to connect to S3?

Thanks for your attention.

@ncdc
Contributor

ncdc commented May 10, 2018

Hi @lli-hiya, thanks for the question! I believe the error transport: error while dialing: dial unix /tmp/plugin763922082: connect: connection refused is coming from the communication between the Ark server process and one of its plugins. It sounds like the plugin process has exited and the Ark server isn't handling that well.

When you have a well-functioning pod, could you please kubectl exec into it and get ps information on all the running processes? And then when you start having issues, could you capture the same information?
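
For example, something along these lines should work (a sketch; it assumes the default heptio-ark namespace, and the pod name is a placeholder):

# Find the Ark server pod, then list the processes running inside it
kubectl -n heptio-ark get pods
kubectl -n heptio-ark exec <ark-pod-name> -- ps aux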

@ncdc ncdc added the question label May 10, 2018
@ncdc
Contributor

ncdc commented May 10, 2018

You can also try setting the server's log level to debug:

Run kubectl -n heptio-ark edit deploy/ark, then change the args so they look like this:

args:
  - server
  - --log-level
  - debug

This will include more information about launching and terminating plugins.
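
A non-interactive equivalent (a sketch, assuming the Ark server is the first container in the deployment's pod template) is a JSON patch:

kubectl -n heptio-ark patch deploy/ark --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args", "value": ["server", "--log-level", "debug"]}]'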

@ncdc
Contributor

ncdc commented May 10, 2018

@skriss this is interesting reading: hashicorp/go-plugin#31

@jfoy

jfoy commented May 10, 2018

@ncdc Here's what we see in a healthy pod. (I work with @lli-hiya.)

PID   USER     TIME   COMMAND
    1 nobody     0:14 /ark server --logtostderr --v 4
   19 nobody     0:00 /ark run-plugin cloudprovider aws
   52 nobody     0:00 sh
   60 nobody     0:00 ps uaxf

@ncdc
Contributor

ncdc commented May 10, 2018

@jfoy awesome, thanks! Now I'm looking forward to seeing what's in an unhealthy pod...

@lli-hiya
Contributor Author

@ncdc I will try to give that information to you later if the error happens again.

@skriss
Member

skriss commented May 10, 2018

@ncdc if this is in fact due to the plugin process exiting (sounds like it), moving to short-lived plugins for object/block store as you've proposed would probably mitigate this issue.

@lli-hiya
Contributor Author

lli-hiya commented May 11, 2018

@ncdc Since I don't know when the error will happen, as a demonstration I manually killed the process 19 nobody 0:00 /ark run-plugin cloudprovider aws, and then I observed EXACTLY the same symptoms: the Ark client hangs when showing logs, backups fail, etc.

Although this doesn't prove that our original failure was caused by the cloudprovider process being terminated, it is at least some evidence.
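
For anyone who wants to reproduce this, the steps were roughly the following (a sketch; the pod name is a placeholder and the plugin PID will differ per pod):

# List the processes inside the Ark server pod and note the PID of the cloudprovider plugin
kubectl -n heptio-ark exec <ark-pod-name> -- ps aux
# Kill that process (19 was the PID of "/ark run-plugin cloudprovider aws" above)
kubectl -n heptio-ark exec <ark-pod-name> -- kill 19
# Afterwards, ark backup logs <backup> hangs and new backups fail with the same
# "connection refused" error on the plugin's unix socket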

@ncdc ncdc changed the title [Question and need help] Backup failed and ark client stops responding Ark server needs to handle crashed/stopped plugin processes May 11, 2018
@lli-hiya
Contributor Author

Follow-up: the error happened in my original cluster over the weekend, and ps aux inside the pod shows that the cloudprovider process is gone. I think this confirms our guess: Ark failed to handle the plugin process dying.

@ncdc
Contributor

ncdc commented May 14, 2018

@lli-hiya thanks for the update. We are actively redesigning our plugin management code to be more robust and handle this sort of situation.

Is there anything in your pod's logs that indicates what happened to the plugin process?

@lli-hiya
Contributor Author

No more than what I posted before, because for some unknown reason the log level was reverted back to info. I just changed it to debug again and will update this thread if I see anything useful.

@lli-hiya
Contributor Author

@ncdc Here it is. This is the log around our daily backup time; you can see that it shows the plugin process terminating.

time="2018-05-16T00:00:46Z" level=debug msg="Backup has not expired yet, skipping" backup=heptio-ark/<xxx> expiration="2018-05-24 00:00:46 +0000 UTC" logSource="pkg/controller/gc_controller.go:129"
time="2018-05-16T00:00:46Z" level=debug msg="waiting for RPC address" logSource="pkg/plugin/logrus_adapter.go:74" path=/ark
time="2018-05-16T00:00:46Z" level=info msg="Checking for expired DeleteBackupRequests" controller=backup-deletion logSource="pkg/controller/backup_deletion_controller.go:285"
time="2018-05-16T00:00:46Z" level=info msg="Done checking for expired DeleteBackupRequests" controller=backup-deletion logSource="pkg/controller/backup_deletion_controller.go:313"
time="2018-05-16T00:00:46Z" level=debug msg="starting plugin" args="[/ark run-plugin backupitemaction pod]" logSource="pkg/plugin/logrus_adapter.go:74" path=/ark
time="2018-05-16T00:00:46Z" level=debug msg="plugin address" address=/tmp/plugin537000931 logSource="pkg/plugin/logrus_adapter.go:74" network=unix pluginName=ark
time="2018-05-16T00:00:46Z" level=debug msg="waiting for RPC address" logSource="pkg/plugin/logrus_adapter.go:74" path=/ark
time="2018-05-16T00:00:47Z" level=debug msg=gcController.enqueueAllBackups logSource="pkg/controller/gc_controller.go:88"
time="2018-05-16T00:00:47Z" level=debug msg="plugin address" address=/tmp/plugin854132941 logSource="pkg/plugin/logrus_adapter.go:74" network=unix pluginName=ark
I0516 00:00:50.982700       1 reflector.go:428] github.com/heptio/ark/pkg/generated/informers/externalversions/factory.go:73: Watch close - *v1.Backup total 2 items received
time="2018-05-16T00:01:16Z" level=debug msg="plugin process exited" logSource="pkg/plugin/logrus_adapter.go:74" path=/ark
time="2018-05-16T00:01:16Z" level=error msg="Error uploading log file" bucket=<xxx> error="rpc error: code = Unavailable desc = transport is closing" key=<xxx>/<yyy>-logs.gz logSource="pkg/cloudprovider/backup_service.go:148"
time="2018-05-16T00:01:16Z" level=info msg="Backup completed" backup=heptio-ark/<yyy> logSource="pkg/controller/backup_controller.go:375"
time="2018-05-16T00:01:17Z" level=error msg="backup failed" error="rpc error: code = Internal desc = connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin457010137: connect: connection refused\"" key=heptio-ark/<yyy> logSource="pkg/controller/backup_controller.go:273"
time="2018-05-16T00:01:17Z" level=debug msg="Updating backup's final status" key=heptio-ark/<yyy> logSource="pkg/controller/backup_controller.go:277"```

@ncdc ncdc mentioned this issue May 16, 2018
@lli-hiya
Contributor Author

Just to add to this thread: we identified the reason for the failure. Memory usage jumped beyond its limit, causing the cloudprovider process to be killed.
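
For anyone else who hits this: because the kernel OOM killer kills a single process inside the container's cgroup rather than the container itself, the pod keeps running and kubectl may not report OOMKilled. A sketch of how to confirm it (assumes shell access to the node running the Ark pod):

# On the node, the kernel log records cgroup OOM kills and names the victim process
dmesg | grep -iE "memory cgroup out of memory|killed process"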

@ncdc
Contributor

ncdc commented Jun 12, 2018

@lli-hiya thanks for the update. What limit did you have set for the pod?

@lli-hiya
Contributor Author

lli-hiya commented Jun 12, 2018

@ncdc We increased the memory limit to avoid it being killed. It now looks like this:

            limits:
              cpu: 200m
              memory: 300Mi
            requests:
              cpu: 10m
              memory: 40Mi
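
If metrics-server is installed in the cluster, actual usage can be compared against the new limit to confirm there is headroom:

kubectl -n heptio-ark top pod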

@rosskukulinski rosskukulinski added this to the v1.0.0 milestone Jun 24, 2018
@rosskukulinski
Contributor

References #94 regarding CPU/memory limits. PR #495 is in flight and should resolve this issue.

@ncdc
Contributor

ncdc commented Jul 31, 2018

Fixed by #710

@ncdc ncdc closed this as completed Jul 31, 2018
@iamakulov

I just encountered the same issue with velero 1.1.0:

$ velero version
Client:
        Version: v1.1.0
        Git commit: a357f21aec6b39a8244dd23e469cc4519f1fe608
Server:
        Version: v1.1.0

I’ll create a new issue if this reproduces again. More details are below.


A few hours earlier, backups were completing successfully. Now, backups are marked as Failed with the following in the container logs:

$ kubectl logs velero-74c7b8c545-w5cgn -n velero | grep error     
time="2019-09-07T11:11:55Z" level=error msg="Error uploading log file" backup=before-kops-1.11 bucket=chorus-backup-bucket-velero-production error="rpc error: code = Unavailable desc = transport is closing" logSource="pkg/persistence/object_store.go:196" prefix=
time="2019-09-07T11:11:55Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>" key=velero/before-kops-1.11 logSource="pkg/controller/backup_controller.go:233"
time="2019-09-07T11:12:20Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unavailable desc = transport is closing" logSource="pkg/controller/backup_sync_controller.go:180"
time="2019-09-07T11:13:20Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>" logSource="pkg/controller/backup_sync_controller.go:180"
time="2019-09-07T11:14:21Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>" logSource="pkg/controller/backup_sync_controller.go:180"
time="2019-09-07T11:15:21Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>" logSource="pkg/controller/backup_sync_controller.go:180"
time="2019-09-07T11:16:22Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>" logSource="pkg/controller/backup_sync_controller.go:180"
time="2019-09-07T11:17:23Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unavailable desc = transport is closing" logSource="pkg/controller/backup_sync_controller.go:180"

Exec-ing into the pod shows that the plugins process is not running:

nobody@velero-74c7b8c545-w5cgn:/$ ps -aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nobody       1  1.3 12.0 1128296 986756 ?      Ssl  09:02   1:52 /velero server
nobody    1485  0.0  0.0  18508  3412 ?        Ss   11:19   0:00 bash
nobody    1493  0.0  0.0  34400  2792 ?        R+   11:19   0:00 ps -aux

After deleting the pod, the new pod launched with the plugins process:

nobody@velero-74c7b8c545-scsq8:/$ ps -aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nobody       1  0.4  0.4 157140 40664 ?        Ssl  11:24   0:00 /velero server
nobody      22  0.2  0.5 157140 41816 ?        Sl   11:24   0:00 /velero run-plugins --log-level info
nobody      40  0.0  0.0  18508  3308 ?        Ss   11:25   0:00 bash
nobody      46  0.0  0.0  34400  2896 ?        R+   11:25   0:00 ps -aux

The pod has the following limits:

resources:
  limits:
    cpu: 400m
    memory: 1Gi
  requests:
    cpu: 200m
    memory: 512Mi

@carlisia
Contributor

@iamakulov for an error like the one you reported, please follow issue #1856.
