Velero CSI volume snapshot - Timeout issues aws plugin #7742
Can you help provide us with a debug bundle by running the command:
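The command itself was elided from the comment; it was most likely Velero's built-in support-bundle collector. A minimal sketch, where `<backup-name>` is a placeholder for the PartiallyFailed backup in question:

```shell
# Collect a support bundle (velero logs, node-agent logs, resource specs)
# scoped to one backup; <backup-name> is a placeholder.
velero debug --backup <backup-name>
```

`velero debug` writes a tarball named like `bundle-<timestamp>.tar.gz` into the current directory, which matches the bundle filenames posted later in this thread.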
This error seems to have nothing to do with the object storage. It looks like a connection issue from the velero pod to the k8s API server.
Also worth mentioning that we get the error even with this configured:
Hello @allenxu404, here is the bundle from a recent PartiallyFailed backup.
I'm getting this error:
I'm using Cloudflare R2, but its S3 compatibility doesn't work with Velero.
@alimoezzi I found the following issue comment from someone who had the same problem with Cloudflare's S3 API on a different project -- maybe this would resolve it for you too? hashicorp/terraform#33847 (comment) "I was able to use a Cloudflare R2 bucket as a s3 backend with terraform 1.6.6 today. In order to solve the NotImplemented: STREAMING-UNSIGNED-PAYLOAD-TRAILER error I needed to add skip_s3_checksum = true:"
@sseago It has no equivalent in velero; I also tried checksumAlgorithm: ""
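For reference, the AWS plugin does read a `checksumAlgorithm` key from the BackupStorageLocation config, which is presumably what was tried here. A hedged sketch of a BSL pointed at R2; the bucket name and account URL are placeholders, and whether an empty value actually avoids R2's STREAMING-UNSIGNED-PAYLOAD-TRAILER rejection depends on the plugin version:

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-velero-bucket              # placeholder bucket name
  config:
    region: auto                          # R2 accepts "auto" as the region
    s3ForcePathStyle: "true"
    s3Url: https://<account-id>.r2.cloudflarestorage.com  # placeholder account
    checksumAlgorithm: ""                 # empty string disables SDK checksums
```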
Hello, I tried an older version of the Velero chart, 5.2.2, app version 1.12.3, with plugins velero/velero-plugin-for-aws:v1.8.2 and velero/velero-plugin-for-csi:v0.6.3. Status is again
bundle-2024-05-08-08-31-46.tar.gz Velero config:
Which means there are 13 DataUpload errors, all with the message:
What is the size of the failed DataUpload's related PVC, for example, sports-aio-uof-adapter-feed-3-pvc?
Ok, I increased timeouts:
I also configured this on the node-agents:
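For readers following along, the prepare timeout is passed to the node-agent server as a flag. A sketch of the relevant DaemonSet container args (how these are surfaced, e.g. via helm chart values, varies by chart version, and the second flag is included as an assumption about which other timeout was raised):

```yaml
# node-agent DaemonSet (fragment): server args carrying the increased timeouts
containers:
  - name: node-agent
    args:
      - server
      - --data-mover-prepare-timeout=2h   # time allowed for the expose/prepare phase
      - --resource-timeout=30m            # assumed: generic resource-operation timeout
```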
Latest backup PartiallyFailed
sourcePVC: xxx-xxx-pvc 100Gi
sourcePVC: xxx-xxx-pvc 60Gi
Worth mentioning that with the same configuration it works on a Google bucket, but the issue with Google is that it generates high egress costs, so we changed the bucket to Cloudflare R2; and since R2 doesn't have a native plugin like GCP, we are using the AWS plugin with the CF R2 bucket.
Source PVCs that are bigger in size, 50Gi+, are Failed:
node-agent pod logs:
The DataUpload object is first in the Accepted state, and it stays in that state until it changes to Failed with the message:
Of course with increased timeouts.
The exposing process doesn't involve object storage so the timeout should be the same for the GCP bucket and the CloudFlare R2 environment.
Please check the Longhorn CSI driver log and the CSI external-snapshotter log to find more information.
The DataUpload object stalls on larger PVCs, for example 100GB, meaning the InProgress phase takes longer until all bytes are transferred. Other DataUpload objects remain in the Accepted or Prepared status, waiting in line while this is ongoing.

As you probably know, Velero creates intermediate objects (i.e., pods, PVCs, PVs) in the Velero namespace or at the cluster scope; they help the data movers move data, and they are removed after the backup completes. But when a larger PVC's DataUpload is completed and its phase changes from InProgress to Completed, I don't see new pods being created in the Velero namespace to continue the DataUpload process. From what I can see, it gets stuck in the ContainerCreating status, and new pods are not created. So all those Accepted or Prepared DataUpload objects end up Failed after some time, since no new pods are created to do the job.

The backup is in the WaitingForPluginOperations status while waiting: Backup Item Operations: 164 of 186 completed successfully, 0 failed (specify --details for more information). After restarting the node-agent pods, 2 or 3 DataUpload objects are again in the InProgress status, while the rest are marked Canceled, and of course the status of the backup is PartiallyFailed. I'm not sure if data upload to Cloudflare R2 buckets is slower than data upload to buckets on Google.
I see.
Yes, I found it, will try it.
Hello @blackpiglet I created configMap, example:
So just to confirm: we have multiple nodes in the cluster; for the example node in the ConfigMap above, the labels on the node are:
I need to mount created configMap on node-agent daemonset like so:
Can you confirm that this config is OK, especially the mountPath: /etc/velero part?
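For context, a hedged sketch of what a node-agent-config ConfigMap with per-node load concurrency can look like in Velero 1.13; the label selector, node name, and concurrency numbers here are placeholders, and the data key is simply the filename the JSON was loaded from:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-agent-config        # name the node-agent looks up in its namespace
  namespace: velero
data:
  node-agent-config.json: |      # key is arbitrary; typically the source filename
    {
      "loadConcurrency": {
        "globalConfig": 2,
        "perNodeConfig": [
          {
            "nodeSelector": {
              "matchLabels": { "kubernetes.io/hostname": "node-1" }
            },
            "number": 3
          }
        ]
      }
    }
```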
The ConfigMap content looks good (see velero/pkg/nodeagent/node_agent.go, line 38 at commit 6499444).
I will create a PR to address that: #7790.
Hello @blackpiglet, I created everything using the newest Velero helm chart, version 6.0.0, app version 1.13.0, with plugins velero/velero-plugin-for-aws:v1.9.2 and velero/velero-plugin-for-csi:v0.7.1. I also created the node-agent-config ConfigMap; I noticed the mistake, so I now reference the ConfigMap correctly.
This time I excluded the namespaces with larger PVCs from the Velero backup. But again some DataUpload objects are Failed:
@MoZadro, if you are seeing the problem only for large volumes, it could be because the temporary PVC created from the snapshot (to be used as the backup source) is not ready in time. This in turn could be due to Longhorn creating multiple replicas and copying data over. I opened longhorn/longhorn#7794 to request a more efficient process for creating a PVC from a snapshot. In the meantime, #7700 will help as and when it is resolved.
Hello, as I mentioned previously, I excluded the namespaces with larger PVCs from the Velero backup. But again some DataUpload objects are Failed, so it is not only on large volumes.
What steps did you take and what happened:
Installed latest helm chart 6.0.0 with velero image 1.13.0 and AWS plugin 1.9.2 and CSI plugin 0.7.1.
We are using Velero with CSI. Since our cluster is on Hetzner, we use the longhorn StorageClass, so we configured Velero to work with the CSI plugin, with the option of moving snapshot data to backup storage (Velero backs up K8s resources along with snapshot data to backup storage; a volume snapshot is created for the PVs, and the data from the snapshot is moved to backup storage using Kopia).
What did you expect to happen:
Our cluster is on Hetzner; we have cloud and bare-metal worker and master nodes in the Falkenstein, Germany data center, and our bucket is on Cloudflare R2. We expected that backups would be in the Completed status, not Failed or PartiallyFailed.
velero config:
I increased timeouts on nodeAgents:
'--data-mover-prepare-timeout=2h'
I also increased
--csi-snapshot-timeout=90m0s
on the velero backup.
The following information will help us better understand what's going on:
We see different timeout errors on the DataUpload objects:
With these settings, and even shorter timeouts, we had no problems with the GCP plugin and a bucket on Google. Here we are trying the AWS plugin with Cloudflare R2 S3 storage.
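To tie the flags together, a sketch of a backup invocation that uses the CSI data-movement path described above, with the increased CSI snapshot timeout; the backup and namespace names are placeholders:

```shell
# Create a backup that snapshots PVs via CSI and moves the snapshot data
# to object storage with Kopia, raising the CSI snapshot timeout to 90m.
velero backup create test-backup \
  --include-namespaces my-app \
  --snapshot-move-data \
  --csi-snapshot-timeout=90m
```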