Distinguish between different VDDK validation errors #969

Open

jonner wants to merge 6 commits into main
Conversation

@jonner (Contributor) commented Jul 23, 2024

There are multiple cases that can lead to a "VDDK Init image is invalid" error message for a migration plan. They are currently handled with a single VDDKInvalid condition.

This patch adds a new error condition, VDDKImageNotFound (and a new advisory condition, VDDKImageNotReady), to help diagnose a missing image or an incorrect image URL.

Resolves: https://issues.redhat.com/browse/MTV-1150

@ahadas (Member) left a comment

thanks @jonner for taking that issue and posting this fix!

@jonner (Contributor, Author) commented Jul 24, 2024

I just realized that this is probably not a full fix.

Example scenario:

  1. I enter URL A for the VDDK image and create a migration plan with this provider.
    • A validation job is started for this VDDK init image.
    • The image cannot be pulled (for whatever reason).
    • With the patch above, the validation job will indicate a VDDKImageNotFound error.
  2. I change the VDDK image in the provider to a different URL B.
    • Due to the VDDK URL change, we kick off another validation job.
    • The results are the same as above.
  3. Now I change the URL back to A.
    • Due to the VDDK URL change, we try to validate the VDDK init image again, so we try to ensure a validation job exists.
    • We find the cached job from step 1, which has the same labels, so we don't create a new one; no validator pod is started and we don't try to re-pull the init image (see the sketch below).
    • Since no new job is started, there are no pods to query for InitContainerStatuses, and the only thing we know is that the job failed.
    • We now report VDDKInvalid instead of VDDKImageNotFound.
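
To make step 3 concrete, here is a minimal sketch (not the controller's actual code) of the kind of label-based job lookup that leads to the cached-job reuse described above; the label keys and the md5 digest scheme are assumptions for illustration.

```go
// Sketch only: look up an existing validation job by plan UID and a digest of
// the VDDK image URL. If a job from an earlier run carries the same labels, it
// is reused and no new validator pod is ever started.
package main

import (
	"context"
	"crypto/md5"
	"fmt"

	batch "k8s.io/api/batch/v1"
	k8slabels "k8s.io/apimachinery/pkg/labels"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func findCachedValidationJob(ctx context.Context, c client.Client, namespace, planUID, vddkURL string) (*batch.Job, error) {
	jobs := &batch.JobList{}
	selector := k8slabels.SelectorFromSet(map[string]string{
		"plan":     planUID,                                     // hypothetical label key
		"vddk-url": fmt.Sprintf("%x", md5.Sum([]byte(vddkURL))), // hypothetical label key
	})
	if err := c.List(ctx, jobs, &client.ListOptions{Namespace: namespace, LabelSelector: selector}); err != nil {
		return nil, err
	}
	if len(jobs.Items) > 0 {
		// A job with the same plan/URL labels already exists (possibly a failed
		// one from an earlier attempt), so no new job or validator pod is
		// created and the image is never re-pulled.
		return &jobs.Items[0], nil
	}
	return nil, nil
}
```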


sonarcloud bot commented Jul 24, 2024

@jonner (Contributor, Author) commented Jul 24, 2024

The new patch I just pushed does not yet address the issue I discussed in this comment: #969 (comment)

@ahadas (Member) commented Jul 25, 2024

@jonner yeah, and there's an even more basic scenario that we don't handle so well:

  1. We try to validate URL `A`.
  2. The job fails on pulling the image.
  3. We sort out the issue with pulling the image from URL `A`.
     -> The job needs to be recreated in order to validate A again.

Honestly, we were not really sure whether it made sense to add this job back then, because the only VDDK-image issue we had heard of was someone creating the image on a filesystem that doesn't support symbolic links. But apparently this job does detect issues in the field, and in that case it makes less sense to expect users to clone the plan, or archive and recreate it, in order to initiate another job.

Specifically about what you wrote above ("since no new job is started, there are no pods to query about InitContainerStatuses"): wouldn't we still have the pod that ran before?

Excerpt from pkg/controller/plan/validation.go:

```go
pods := &core.PodList{}
if err = ctx.Destination.Client.List(context.TODO(), pods, &client.ListOptions{
	Namespace:     plan.Spec.TargetNamespace,
	LabelSelector: k8slabels.SelectorFromSet(map[string]string{"job-name": job.Name}),
```
A reviewer (Member) commented:
Cool, now this call would work in all scenarios, but I think it should be placed elsewhere, since in many reconcile cycles we'll find the job in JobComplete and won't need the information from the pod.
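
A minimal sketch of that ordering, assuming a controller-runtime client (function names are illustrative, not the PR's actual code): check the job's terminal condition first and list its pods only when there is something to diagnose.

```go
// Sketch: most reconcile cycles find the job complete and return early; the
// pod List call is only paid when a failure (or pending pull) needs diagnosis.
package main

import (
	"context"

	batch "k8s.io/api/batch/v1"
	core "k8s.io/api/core/v1"
	k8slabels "k8s.io/apimachinery/pkg/labels"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func jobSucceeded(job *batch.Job) bool {
	for _, cnd := range job.Status.Conditions {
		if cnd.Type == batch.JobComplete && cnd.Status == core.ConditionTrue {
			return true
		}
	}
	return false
}

func validateVddkJob(ctx context.Context, c client.Client, job *batch.Job) error {
	if jobSucceeded(job) {
		// Most reconcile cycles end here; no pod lookup is needed.
		return nil
	}
	// Only when the job has not completed do we list its pods to read
	// initContainerStatuses and work out why.
	pods := &core.PodList{}
	if err := c.List(ctx, pods, &client.ListOptions{
		Namespace:     job.Namespace,
		LabelSelector: k8slabels.SelectorFromSet(map[string]string{"job-name": job.Name}),
	}); err != nil {
		return err
	}
	_ = pods // ... inspect pods for image-pull problems ...
	return nil
}
```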

codecov bot commented Jul 25, 2024

Codecov Report

Attention: Patch coverage is 0% with 82 lines in your changes missing coverage. Please review.

Project coverage is 16.36%. Comparing base (f08a3a9) to head (0570a4d).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/controller/plan/validation.go | 0.00% | 82 Missing ⚠️ |
Additional details and impacted files
```diff
@@           Coverage Diff           @@
##             main     #969   +/-   ##
=======================================
  Coverage   16.36%   16.36%
=======================================
  Files         109      109
  Lines       19628    19682   +54
=======================================
+ Hits         3212     3221    +9
- Misses      16135    16178   +43
- Partials      281      283    +2
```

| Flag | Coverage Δ |
| --- | --- |
| unittests | 16.36% <0.00%> (+<0.01%) ⬆️ |


@jonner (Contributor, Author) commented Jul 25, 2024

> @jonner yeah, and there's an even more basic scenario that we don't handle so well:
>
> 1. We try to validate URL `A`.
> 2. The job fails on pulling the image.
> 3. We sort out the issue with pulling the image from URL `A`.
>    -> The job needs to be recreated in order to validate A again.
>
> Honestly, we were not really sure whether it made sense to add this job back then, because the only VDDK-image issue we had heard of was someone creating the image on a filesystem that doesn't support symbolic links. But apparently this job does detect issues in the field, and in that case it makes less sense to expect users to clone the plan, or archive and recreate it, in order to initiate another job.

It's true that the more basic scenario has those issues, but at least it was consistent before: it failed with VDDKInvalid on the first run, and subsequent changes didn't change that. In the scenario I described, though, it would fail with VDDKImageNotFound on the first run and then report VDDKInvalid on any subsequent check, which I think adds confusion.

> Specifically about what you wrote above ("since no new job is started, there are no pods to query about InitContainerStatuses"): wouldn't we still have the pod that ran before?

Well, you know the code better than I do, but in my testing the pod related to the job doesn't seem to be retained after the job completes. Specifically, after the job completes, the call to ctx.Destination.Client.List() returns no pods. Should the pod still be available somewhere?

@ahadas (Member) commented Jul 29, 2024

> > @jonner yeah, and there's an even more basic scenario that we don't handle so well: [scenario quoted in full above]
>
> It's true that the more basic scenario has those issues, but at least it was consistent before: it failed with VDDKInvalid on the first run, and subsequent changes didn't change that. In the scenario I described, though, it would fail with VDDKImageNotFound on the first run and then report VDDKInvalid on any subsequent check, which I think adds confusion.

yeah, I agree

> > Specifically about what you wrote above ("since no new job is started, there are no pods to query about InitContainerStatuses"): wouldn't we still have the pod that ran before?
>
> Well, you know the code better than I do, but in my testing the pod related to the job doesn't seem to be retained after the job completes. Specifically, after the job completes, the call to ctx.Destination.Client.List() returns no pods. Should the pod still be available somewhere?

Unless I'm missing something, I read the changes in this PR as introducing logic that says "once the job completes, check the status of the pod that was triggered for the job to figure out the reason for a failure", so it assumes the pod also exists when the job completes with a failure, right?
It could be that there's cleanup of a job's completed pods at some point (not by us, but by k8s/OpenShift); I think I've also seen that happen (a completed job without associated pods). But I was puzzled by the part that suggests it's always the case that completed jobs have no associated pods.

@jonner (Contributor, Author) commented Jul 29, 2024

> Unless I'm missing something, I read the changes in this PR as introducing logic that says "once the job completes, check the status of the pod that was triggered for the job to figure out the reason for a failure", so it assumes the pod also exists when the job completes with a failure, right?

Not quite. What happens is that while the job is running, validateVddkImage() checks the vddk-validator-* pod associated with the job and sets the condition to VDDKImageNotReady if the pod is waiting.

At some point the image will either be pulled (and the pod will start running), or the pod will time out waiting for the image. In the first case, the job runs to completion and either succeeds or fails based on the validation result. In the second case, the job fails.

Immediately after the job finishes (either successfully or due to timing out), the next call to validateVddkImage() will not find a pod associated with that job (in my testing, ctx.Destination.Client.List() returns no pods at this point). If the job has a JobFailed condition, the only indication of why it failed is whether I set a VDDKImageNotReady status condition on the plan during the previous reconcile (this is why I made the VDDKImageNotReady condition durable: after the pod disappears, the next reconcile still has a hint about what might have caused the job to fail).

But the above description does not apply to a cached job. In my experience, no pods are returned for a cached job, so there's no way to determine the failure reason in that case.

But maybe I am simply doing something wrong and there is a way to get pods for a completed job that I don't know about?
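
For illustration, a minimal sketch of the pod check described in the comment above, using the standard kubelet waiting reasons (ErrImagePull, ImagePullBackOff); the function name is hypothetical, not the PR's actual code.

```go
// Sketch: while the validation job is still running, a Waiting init container
// with an image-pull reason is read as "VDDK init image not ready / not found".
package main

import (
	core "k8s.io/api/core/v1"
)

// vddkImagePullBlocked reports whether the validator pod is stuck waiting on
// the VDDK init image, together with the kubelet's reason.
func vddkImagePullBlocked(pod *core.Pod) (bool, string) {
	for _, s := range pod.Status.InitContainerStatuses {
		w := s.State.Waiting
		if w == nil {
			continue
		}
		switch w.Reason {
		case "ErrImagePull", "ImagePullBackOff", "InvalidImageName":
			return true, w.Reason
		}
	}
	return false, ""
}
```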

@bkhizgiy (Member) commented:

@jonner I've been investigating the issue with pod deletion, and it turns out it happens only when the pod is never fully created. If the VDDK image is invalid (the pod is unable to pull the image) and the job reaches the VDDK_JOB_ACTIVE_DEADLINE limit, the job fails and the pod is deleted. However, if the job succeeds, the pod remains as it is. Adding extra configuration to the job spec won't help in this case.

After discussing this with @ahadas, we believe the best approach is to remove the VDDK_JOB_ACTIVE_DEADLINE limit from the job and handle it at the plan level. We'll add a condition to the plan controller that checks whether the job is still running and whether the elapsed time exceeds the VDDK_JOB_ACTIVE_DEADLINE parameter. If it does, we will propagate the pod status to the plan without failing the job and notify the user of the issue. This approach also addresses another issue: since the job keeps running, if the user updates the VDDK URL with a correct one, the validation should complete successfully under the same job and plan.

What do you think of this solution?
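
As a rough illustration of the proposal (not an actual implementation), a plan-level check might compare the running job's start time against the VDDK_JOB_ACTIVE_DEADLINE value instead of setting activeDeadlineSeconds on the job; the helper name and condition handling here are assumptions.

```go
// Sketch of a plan-level deadline check: the job itself carries no
// activeDeadlineSeconds, and the plan controller decides when the image pull
// has been pending "too long".
package main

import (
	"time"

	batch "k8s.io/api/batch/v1"
)

// validationOverdue reports whether a still-running validation job has
// exceeded the configured deadline without producing a result.
func validationOverdue(job *batch.Job, deadline time.Duration) bool {
	if job.Status.StartTime == nil {
		return false // the job has not started yet
	}
	if job.Status.Succeeded > 0 || job.Status.Failed > 0 {
		return false // the validator pod already terminated; normal result handling applies
	}
	return time.Since(job.Status.StartTime.Time) > deadline
}
```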

@jonner (Contributor, Author) commented Jul 31, 2024

Interesting idea. There are some details of your proposed solution that aren't yet 100% clear in my mind, but I will investigate. It sounds promising.

@jonner (Contributor, Author) commented Aug 7, 2024

> After discussing this with @ahadas, we believe the best approach is to remove the VDDK_JOB_ACTIVE_DEADLINE limit from the job and handle it at the plan level. We'll add a condition to the plan controller that checks whether the job is still running and whether the elapsed time exceeds the VDDK_JOB_ACTIVE_DEADLINE parameter. If it does, we will propagate the pod status to the plan without failing the job and notify the user of the issue.

A validation job is labeled with both the UID of the plan and the md5sum of the VDDK init image URL. So with the above approach, the following statement is not actually true:

> This approach also addresses another issue: since the job keeps running, if the user updates the VDDK URL with a correct one, the validation should complete successfully under the same job and plan.

When the VDDK URL is updated on the source provider, the next reconcile will look up a Job associated with the given plan and the new VDDK URL, and will find none. So the old job will keep running (trying to pull the old init image), but we will ignore it. Instead, we'll create a new Job for the new plan/URL combination and run that.

So I think handling this situation will require slightly more complicated job management (sketched below).
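
A possible shape for that job management, sketched under the assumption that validation jobs carry plan-UID and URL-digest labels (the label keys here are hypothetical): when the plan's VDDK URL changes, delete jobs that belong to the plan but were created for a different URL, so an abandoned image pull does not run forever.

```go
// Sketch only: cancel validation jobs for this plan whose URL-digest label no
// longer matches the currently configured VDDK URL.
package main

import (
	"context"
	"crypto/md5"
	"fmt"

	batch "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	k8slabels "k8s.io/apimachinery/pkg/labels"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func cancelStaleValidationJobs(ctx context.Context, c client.Client, namespace, planUID, currentURL string) error {
	current := fmt.Sprintf("%x", md5.Sum([]byte(currentURL)))
	jobs := &batch.JobList{}
	err := c.List(ctx, jobs, &client.ListOptions{
		Namespace:     namespace,
		LabelSelector: k8slabels.SelectorFromSet(map[string]string{"plan": planUID}), // hypothetical label key
	})
	if err != nil {
		return err
	}
	for i := range jobs.Items {
		job := &jobs.Items[i]
		if job.Labels["vddk-url"] == current { // hypothetical label key
			continue // this job belongs to the URL currently configured
		}
		// Delete the stale job and its pods in the background.
		if err := c.Delete(ctx, job, client.PropagationPolicy(metav1.DeletePropagationBackground)); err != nil {
			return err
		}
	}
	return nil
}
```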

Several functions accept a vddk image argument even though the vddk
image can be retrieved directly from the plan.

Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
This code previously returned nil if the source and destination
providers were not set for the plan when validating the vddk image, but
it seems to make more sense to return an error instead.

Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
Rather than passing the labels to the function, just query it using the
utility function.

Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
If the vddk validator pod fails, we don't need to keep re-trying. The
container simply checks for the existence of a file, so restarting
the pod is unlikely to change anything.

In addition, by specifying `Never` for the restart policy, the completed
pod should be retained for examination after the job fails, which can be
helpful for determining the cause of failure.

Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
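
For reference, a minimal sketch of a job spec along the lines this commit describes: restartPolicy Never and no retries, so a failed validator pod is left in place for inspection. The init container name matches the vddk-side-car container referenced elsewhere in this PR; the image, command, and other details are placeholders, not the controller's actual spec.

```go
// Sketch of the relevant job-spec details: no retries, no restart of the
// validator pod, and no job-level deadline (the plan controller enforces the
// timeout instead, per the discussion above).
package main

import (
	batch "k8s.io/api/batch/v1"
	core "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func newVddkValidationJob(namespace, vddkImage string) *batch.Job {
	backoffLimit := int32(0) // a failed file-existence check won't succeed on retry
	return &batch.Job{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "vddk-validator-",
			Namespace:    namespace,
		},
		Spec: batch.JobSpec{
			BackoffLimit: &backoffLimit,
			// Note: no ActiveDeadlineSeconds here.
			Template: core.PodTemplateSpec{
				Spec: core.PodSpec{
					RestartPolicy: core.RestartPolicyNever, // keep the failed pod around for inspection
					InitContainers: []core.Container{{
						Name:    "vddk-side-car",
						Image:   vddkImage,
						Command: []string{"sh", "-c", "test -d /opt/vmware-vix-disklib-distrib"}, // placeholder check
					}},
					Containers: []core.Container{{
						Name:    "validator",
						Image:   "registry.access.redhat.com/ubi9-minimal", // placeholder
						Command: []string{"true"},
					}},
				},
			},
		},
	}
}
```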
@jonner (Contributor, Author) commented Sep 19, 2024

So, I took a slightly different approach than the one mentioned in my last comment. In the patch series I just pushed, I did in fact remove the active deadline for the job, and I changed a few other things as well. I think it handles the scenarios discussed above; see the commit log for the "Distinguish between different VDDK validation errors" patch for more details.

codecov-commenter commented Sep 24, 2024

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 0% with 101 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@650d73d). Learn more about missing BASE report.
Report is 37 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/controller/plan/validation.go | 0.00% | 101 Missing ⚠️ |

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
```diff
@@           Coverage Diff           @@
##             main     #969   +/-   ##
=======================================
  Coverage        ?   16.27%
=======================================
  Files           ?      112
  Lines           ?    19858
  Branches        ?        0
=======================================
  Hits            ?     3232
  Misses          ?    16339
  Partials        ?      287
```

| Flag | Coverage Δ |
| --- | --- |
| unittests | 16.27% <0.00%> (?) |


There are multiple cases that can lead to a "VDDK Init image is invalid"
error message for a migration plan. They are currently handled with a
single VDDKInvalid condition. One of the most common is when the vddk
image cannot be pulled (either due to network issues or due to the user
typing an incorrect image URL). Categorizing this type of error as an
"invalid VDDK image" is confusing to the user.

When the initContainer cannot pull the VDDK init image, the
vddk-validator-* pod has something like the following status:
```yaml
initContainerStatuses:
  - name: vddk-side-car
    state:
      waiting:
        reason: ErrImagePull
        message: 'reading manifest 8.0.3.14 in default-route-openshift-image-registry.apps-crc.testing/openshift/vddk: manifest unknown'
    lastState: {}
    ready: false
    restartCount: 0
    image: 'default-route-openshift-image-registry.apps-crc.testing/openshift/vddk:8.0.3.14'
    imageID: ''
    started: false
```

We can use the existence of the 'waiting' state on the pod to indicate
that the image cannot be pulled.

Unfortunately, the validation job's pods are deleted when the job fails due to
a failure to pull the image. Because of this, once the deadline expires there
is no way to examine the pod status to see why the failure occurred.

So this patch removes the deadline from the validation job, which
requires overhauling the validation logic slightly. We add a new
advisory condition `VDDKInitImageNotReady` to indicate that we are
still waiting to pull the VDDK init image, and a new critical condition
`VDDKInitImageUnavailable` to indicate that the condition has persisted
for longer than the active deadline setting.

Since the job will now retry pulling the vddk image indefinitely (due
to the removal of the job deadline), we need to make sure that orphaned
jobs don't run forever. So when the vddk image for a plan changes, we
need to cancel all active validation jobs that are still running for the
old vddk image.

This overall approach has several advantages:
 - The user gets an early indication (via `VDDKInitImageNotReady`) that
   the image can't be pulled
 - The validation will automatically complete when any network
   interruption is resolved, without needing to delete and re-create the
   plan to start a new validation
 - The validation will no longer report a VDDKInvalid error when the
   image pull is very slow due to network issues because there is no
   longer a deadline for the job.

Resolves: https://issues.redhat.com/browse/MTV-1150

Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
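
As a rough sketch of the advisory-to-critical escalation the commit message describes, using plain metav1 conditions rather than the controller's own condition helpers (the real code sets these conditions on the Plan through its status API):

```go
// Sketch only: record an advisory "not ready" condition while the image pull
// is pending, and escalate to a critical "unavailable" condition once the pull
// has been pending longer than the configured deadline.
package main

import (
	"time"

	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func setVddkImageConditions(conditions *[]metav1.Condition, pullPending bool, pendingSince time.Time, deadline time.Duration) {
	if !pullPending {
		// The image was pulled (or the job finished); clear both conditions.
		apimeta.RemoveStatusCondition(conditions, "VDDKInitImageNotReady")
		apimeta.RemoveStatusCondition(conditions, "VDDKInitImageUnavailable")
		return
	}
	apimeta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "VDDKInitImageNotReady",
		Status:  metav1.ConditionTrue,
		Reason:  "ImagePullPending",
		Message: "The VDDK init image has not been pulled yet.",
	})
	if time.Since(pendingSince) > deadline {
		apimeta.SetStatusCondition(conditions, metav1.Condition{
			Type:    "VDDKInitImageUnavailable",
			Status:  metav1.ConditionTrue,
			Reason:  "ImagePullTimedOut",
			Message: "The VDDK init image could not be pulled within the configured deadline.",
		})
	}
}
```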

sonarcloud bot commented Sep 24, 2024
