Nightly vSphere 6.0 HA test fails with Invalid configuration for device '1' #4666
portlayer.log:
Seen in a longevity run on VC 6.5 + vSAN on iteration 14: container-logs.zip
vpxa.log:
Later:
Here's the vmdk device portion of the spec:
I don't know if this means anything or whether it would cause an error like this. However the directory … @hickeng any thoughts?
@stuclem The workaround for now is to retry creating the container.
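A minimal sketch of that retry workaround; `createContainer` here is a hypothetical stand-in for the real create call, not VIC's actual API:

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// createContainer is a hypothetical stand-in for the real container-create
// call that intermittently fails with "Invalid configuration for device '1'".
func createContainer(name string) error {
	return fmt.Errorf("Invalid configuration for device '1'")
}

// createWithRetry retries creation a few times, since the failure is
// intermittent and a fresh attempt typically succeeds.
func createWithRetry(name string, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = createContainer(name); err == nil {
			return nil
		}
		// only retry the specific intermittent failure
		if !strings.Contains(err.Error(), "Invalid configuration") {
			return err
		}
		time.Sleep(time.Duration(i+1) * time.Second)
	}
	return err
}

func main() {
	fmt.Println(createWithRetry("test-container", 3))
}
```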
@jzt The failing element is the back.parent.fileName, which refers to the image disk:
My concern, just looking at this output, is that it's using the symlink …
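For reference, that parent reference lives on the disk's backing in the vSphere API. A minimal govmomi sketch of the shape being discussed, assuming a flat-ver2 backing for illustration (VIC's actual backing type and paths may differ):

```go
package main

import (
	"fmt"

	"github.com/vmware/govmomi/vim25/types"
)

func main() {
	// Child (RW layer) disk whose backing names a parent descriptor.
	// If the host cannot resolve the parent FileName, reconfigure fails
	// with "Invalid configuration for device 'N'".
	disk := &types.VirtualDisk{
		VirtualDevice: types.VirtualDevice{
			Key: 1,
			Backing: &types.VirtualDiskFlatVer2BackingInfo{
				VirtualDeviceFileBackingInfo: types.VirtualDeviceFileBackingInfo{
					// child delta descriptor (illustrative path)
					FileName: "[nfsDatastore] vch/containerX/containerX.vmdk",
				},
				Parent: &types.VirtualDiskFlatVer2BackingInfo{
					VirtualDeviceFileBackingInfo: types.VirtualDeviceFileBackingInfo{
						// parent image descriptor (illustrative path)
						FileName: "[nfsDatastore] vch/VIC/images/scratch/scratch.vmdk",
					},
				},
			},
		},
	}

	back := disk.Backing.(*types.VirtualDiskFlatVer2BackingInfo)
	fmt.Println("back.parent.fileName =", back.Parent.FileName)
}
```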
Speculative change: this uses the UUID instead of the symlink for the endpointVM if placing the imagestore within its namespace. If this functions, then we may see similar problems if explicitly using a dedicated namespace, when we don't have an opportunity to perform this translation:

```diff
diff --git a/lib/install/management/appliance.go b/lib/install/management/appliance.go
index e50fa0b..7b38eaf 100644
--- a/lib/install/management/appliance.go
+++ b/lib/install/management/appliance.go
@@ -596,8 +596,14 @@ func (d *Dispatcher) createAppliance(conf *config.VirtualContainerHostConfigSpec
 		},
 	)
 
+	// fix up those parts of the config that depend on the final applianceVM folder name
 	conf.BootstrapImagePath = fmt.Sprintf("[%s] %s/%s", conf.ImageStores[0].Host, d.vmPathName, settings.BootstrapISO)
+	if len(conf.ImageStores[0].Path) == 0 {
+		conf.ImageStores[0].Path = d.vmPathName
+	}
+
+	// apply the fixed-up configuration
 	spec, err = d.reconfigureApplianceSpec(vm2, conf, settings)
 	if err != nil {
 		log.Errorf("Error while getting appliance reconfig spec: %s", err)
@@ -646,12 +652,7 @@ func (d *Dispatcher) reconfigureApplianceSpec(vm *vm.VirtualMachine, conf *confi
 	var devices object.VirtualDeviceList
 	var err error
 
-	spec := &types.VirtualMachineConfigSpec{
-		Name:               conf.Name,
-		GuestId:            string(types.VirtualMachineGuestOsIdentifierOtherGuest64),
-		AlternateGuestName: constants.DefaultAltVCHGuestName(),
-		Files:              &types.VirtualMachineFileInfo{VmPathName: fmt.Sprintf("[%s]", conf.ImageStores[0].Host)},
-	}
+	spec := &types.VirtualMachineConfigSpec{}
 
 	// create new devices
 	if devices, err = d.configIso(conf, vm, settings); err != nil {
diff --git a/lib/install/validate/storage.go b/lib/install/validate/storage.go
index 9731d88..abf99cc 100644
--- a/lib/install/validate/storage.go
+++ b/lib/install/validate/storage.go
@@ -44,7 +44,7 @@ func (v *Validator) storage(ctx context.Context, input *data.Data, conf *config.
 
 	// provide a default path if only a DS name is provided
 	if imageDSpath.Path == "" {
-		imageDSpath.Path = input.DisplayName
+		log.Debug("No path specified on datastore for image store - will use the endpointVM folder")
 	}
 
 	if ds != nil {
```

Can confirm on @caglar10ur's nested vSAN env that we get the imagestore referenced as expected:
and that the resulting VMDK descriptor for a containerVM is using the UUID path:
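For context on the two path forms above: on an ESX host, /vmfs/volumes/&lt;datastore-name&gt; is typically a symlink to the /vmfs/volumes/&lt;uuid&gt; directory that actually backs the datastore. A minimal Go sketch of that translation; the path is illustrative, and the change above derives the path from the endpointVM folder rather than resolving symlinks like this:

```go
package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	// Human-readable datastore path: on ESX this is usually a symlink,
	// e.g. /vmfs/volumes/nfsDatastore -> /vmfs/volumes/<uuid>
	byName := "/vmfs/volumes/nfsDatastore/vch/VIC/images"

	// EvalSymlinks resolves it to the stable UUID-based path, which is the
	// form we want recorded in the image store configuration.
	byUUID, err := filepath.EvalSymlinks(byName)
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	fmt.Println(byUUID)
}
```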
From flanders' hostd.1.gz, the most likely key message from the host:
I have been able to find the following for what I assume is the creation of the parent disk, also in hostd.1:
It seems we are also unregistering a VM that held a reference to it a little while before the failure:
Uploading hostd logs from all hosts for the time period in question; only flanders and krusty seem to be involved.
In case it's useful, I've been using the following to map times:

vpxd to krusty time offset:
vpxd to flanders time offset:
Time of interest -> enoexist (flanders):
Time of interest on krusty:
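A minimal Go sketch of this kind of time mapping, assuming the ISO-8601 UTC timestamps that vpxd/hostd logs use; all values here are placeholders, not the offsets from this run:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const layout = "2006-01-02T15:04:05.000Z" // typical vpxd/hostd timestamp shape

	// The same event as logged by vpxd and by a host (placeholder values).
	vpxd, _ := time.Parse(layout, "2017-06-01T10:00:00.000Z")
	host, _ := time.Parse(layout, "2017-06-01T10:02:30.000Z")
	offset := host.Sub(vpxd) // vpxd -> host clock offset

	// Map a vpxd time of interest onto the host's clock.
	interest, _ := time.Parse(layout, "2017-06-01T11:15:00.000Z")
	fmt.Println("look on host around:", interest.Add(offset))
}
```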
My nested vSAN env is using 6.5, if it matters.

@jzt thanks for providing the workaround. If you could also provide me a summary of the problem, I'd be most grateful, as I have no clue what this issue is about!

@stuclem The main gist of this issue is that, when attempting to create a VMDK for the RW layer of a container during container create, there exists a scenario in which the parent VMDK cannot be accessed or located, resulting in an Invalid configuration for device '1' error.
Thanks @jzt. Here's a belated attempt at a write-up:

Is this OK? Thanks!

@hickeng what do you think?

@stuclem I don't think we should add the speculation. How about:

@hickeng done!

Seen in build 9992:
Longevity failure:

This test is running VC build: Version 6.0.0 Build 5318172

Seen again during a …

Seen again during a docker run in HA regression tests. Attached are the pertinent logs from the nightly run of 10-29-2017 for VC 6.0:

Seen again during a VC 6.0 HA test on 11/1/2017:

Seen again during a VC 6.0 HA test on 11/2/2017 (second day in a row):

Seen again during a VC 6.0 nightly HA test on 11/4/2017:

Seen again during a VC 6.0 nightly HA test failure on 11/6/2017:
Set up a nimbus deployment to use for recreate grind, using the nightly scripts for precise duplication. I've seen #6263 so have put in a workaround for that. I've also seen that the NFS datastore becomes disconnected/inactive on all hosts during the HA failover, preventing continuation of the test. The nfsDatastore appears to be dropped from all the hosts (this could be related to having rerun the test multiple times against the same cluster; it may be that reconnecting the host is not reconnecting the datastore). The only remediation I've found is to manually remount the NFS datastore on an ESX host (a sketch of such a remount is at the end of this comment):
then use …. Also seen the following:
Tasks:
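As mentioned above, a hedged govmomi sketch of remounting an NFS datastore on a host; the connection details, inventory path, and NFS export are all placeholders, and this illustrates the idea rather than reproducing the exact command used in the run:

```go
package main

import (
	"context"
	"fmt"
	"net/url"

	"github.com/vmware/govmomi"
	"github.com/vmware/govmomi/find"
	"github.com/vmware/govmomi/vim25/types"
)

func main() {
	ctx := context.Background()

	// Connection and inventory names are placeholders for the test environment.
	u, _ := url.Parse("https://user:pass@vcenter.example.com/sdk")
	c, err := govmomi.NewClient(ctx, u, true)
	if err != nil {
		panic(err)
	}

	host, err := find.NewFinder(c.Client).HostSystem(ctx, "/dc/host/cluster/esx1.example.com")
	if err != nil {
		panic(err)
	}

	dss, err := host.ConfigManager().DatastoreSystem(ctx)
	if err != nil {
		panic(err)
	}

	// Remount the NFS export under its original datastore name so existing
	// paths in VM and disk configuration resolve again.
	_, err = dss.CreateNasDatastore(ctx, types.HostNasVolumeSpec{
		RemoteHost: "nfs.example.com",
		RemotePath: "/exports/nfsDatastore",
		LocalPath:  "nfsDatastore",
		AccessMode: string(types.HostMountModeReadWrite),
		Type:       string(types.HostFileSystemVolumeFileSystemTypeNFS),
	})
	fmt.Println(err)
}
```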
Having addressed various setup-related issues (#6772 and #6263), I've been unable to recreate this failure (3 clean runs in sequence at this point). Either the failure is related to an issue in a specific instance of the deployed test infrastructure and my current infrastructure doesn't present the problem, or it's a lot more intermittent than currently presented. I believe that to diagnose further we need a mechanism to leave the nimbus infrastructure in place after a failed test for inspection. At the very least, I think we'll need the NFS datastore server logs to determine whether a file access was actually attempted.
@cgtexmex I think this may be related to #4858, but possibly via a different error path. I have found the following:
In step 6 we discover that the image store is inconsistent and we try to delete it so it can be recreated; however, we have containerVMs using the store, which prevents in-use files from being erased. It looks like we cannot delete the actual data files for the VMDKs, so we have xxx-delta.vmdk still existing, but no xxx.vmdk or manifest files. I suspect that the invalid config is because of the missing vmdk descriptor file; however, I've yet to confirm that this isn't a business-as-usual quirk of an NFS datastore (see the sketch after this comment for the kind of check involved).
portlayer:
There is also evidence in a hostd.log that the image file was deleted previously; my working assumption is that this is the …
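As referenced above, a small Go sketch of the kind of check that would surface this orphaned-delta state: scan a directory for xxx-delta.vmdk extents whose xxx.vmdk descriptor is gone. The directory layout is an assumption for illustration, not VIC's actual image store schema:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// orphanedDeltas returns delta extents whose descriptor (xxx.vmdk) is missing,
// the state left behind when an in-use image store is partially deleted.
func orphanedDeltas(dir string) ([]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var orphans []string
	for _, e := range entries {
		name := e.Name()
		if !strings.HasSuffix(name, "-delta.vmdk") {
			continue
		}
		descriptor := strings.TrimSuffix(name, "-delta.vmdk") + ".vmdk"
		if _, err := os.Stat(filepath.Join(dir, descriptor)); os.IsNotExist(err) {
			orphans = append(orphans, name)
		}
	}
	return orphans, nil
}

func main() {
	// Illustrative path; point this at the image store directory in question.
	orphans, err := orphanedDeltas("/vmfs/volumes/nfsDatastore/vch/VIC/images")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("orphaned delta extents:", orphans)
}
```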
@hickeng I just glanced at the logs you posted (log entries for Dec. 7th); that run doesn't appear to have the fix for #4858, as there's no retry and the updated logging (specifically the new operation logging) isn't present. I'd really be interested to see if we hit this anymore once the nightlies are run with that code...
I can no longer reproduce this issue on latest master.