This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

deploy failure: etcd failed to start when deploy private Cluster failed because of #2242

Closed
hydracz opened this issue Oct 31, 2019 · 3 comments
Labels
bug Something isn't working

Comments


hydracz commented Oct 31, 2019

Describe the bug

I'm deploying a private cluster with aks-engine using the configuration below:

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.13",
      "kubernetesConfig": {
        "privateCluster": {
          "enabled": true,
          "jumpboxProfile": {
            "name": "jumpbox",
            "vmSize": "Standard_D2_v3",
            "osDiskSizeGB": 30,
            "username": "azureuser",
            "publicKey": "ssh-rsa .........."
          }
        }
      }
    },
    "masterProfile": {
      "count": 3,
      "dnsPrefix": "myakse",
      "vmSize": "Standard_D2_v3",
      "distro": "aks-ubuntu-18.04",
      "etcdVersion": "3.3.15",
      "vnetSubnetId": "/subscriptions/.........../subnets/MasterSubnet",
      "firstConsecutiveStaticIP": "10.100.0.240"
    },
    "agentPoolProfiles": [
      {
        "name": "agents",
        "count": 3,
        "availabilityProfile": "VirtualMachineScaleSets",
        "vmSize": "Standard_D2_v3",
        "distro": "aks-ubuntu-18.04",
        "vnetSubnetId": "/subscriptions/........../subnets/AgentSubnet"
      }
    ],
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "ssh-rsa .........."
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": ".......................",
      "secret": "........................"
    }
  }
}

The master VMs are created, but the etcd service fails to start on them, which causes the deployment to fail.

etcd.service fails to start and reports a "bad certificate" error during the TLS handshake with its peers;
the ServerName in the log is empty ("").

Steps To Reproduce

  1. aks-engine generate --api-model configdata.json
  2. az group deployment create

Expected behavior

The installation succeeds and the etcd service starts on the master VMs.

AKS Engine version
0.4.15

Kubernetes version
1.13.2

Additional context

  1. aks-engine uses a self-signed certificate that doesn't contain the ServerName, only IP addresses;
    in that case etcd fails to validate the certificate and fails to start.

  2. etcd 3.3.17 adds a new option for DAEMON_ARGS that skips verifying the SAN of the certificate. It looks like we need to include this change when deploying.

https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.3.md#v3317-2019-10-11
etcd-io/etcd#11196

@hydracz hydracz added the bug Something isn't working label Oct 31, 2019

welcome bot commented Oct 31, 2019

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.


hydracz commented Oct 31, 2019

Detailed steps to reproduce

  1. Create a configdata.json as below:
    {
      "apiVersion": "vlabs",
      "properties": {
        "orchestratorProfile": {
          "orchestratorType": "Kubernetes",
          "orchestratorRelease": "1.13",
          "kubernetesConfig": {
            "privateCluster": {
              "enabled": true,
              "jumpboxProfile": {
                "name": "jumpbox",
                "vmSize": "Standard_D2_v3",
                "osDiskSizeGB": 30,
                "username": "azureuser",
                "publicKey": "ssh-rsa .........."
              }
            }
          }
        },
        "masterProfile": {
          "count": 3,
          "dnsPrefix": "myakse",
          "vmSize": "Standard_D2_v3",
          "distro": "aks-ubuntu-18.04",
          "etcdVersion": "3.3.15",
          "vnetSubnetId": "/subscriptions/.........../subnets/MasterSubnet",
          "firstConsecutiveStaticIP": "10.100.0.240"
        },
        "agentPoolProfiles": [
          {
            "name": "agents",
            "count": 3,
            "availabilityProfile": "VirtualMachineScaleSets",
            "vmSize": "Standard_D2_v3",
            "distro": "aks-ubuntu-18.04",
            "vnetSubnetId": "/subscriptions/........../subnets/AgentSubnet"
          }
        ],
        "linuxProfile": {
          "adminUsername": "azureuser",
          "ssh": {
            "publicKeys": [
              {
                "keyData": "ssh-rsa .........."
              }
            ]
          }
        },
        "servicePrincipalProfile": {
          "clientId": ".......................",
          "secret": "........................"
        }
      }
    }

  2. Run the command below to generate the ARM template into _output:
    aks-engine generate --api-model configdata.json

  3. Deploy the ARM template:
    az group deployment create -g aks-engine-test --template-file azuredeploy.json --parameters "@azuredeploy.parameters.json" --verbose --name "AKSEDeployment11"

  4. Wait for the master VMs to be created, then log in via the jumpbox VM and check the etcd status:

root@k8s-master-48189569-0:/var/log/azure# systemctl status etcd
● etcd.service - etcd - highly-available key value store
   Loaded: loaded (/etc/systemd/system/etcd.service; disabled; vendor preset: enabled)
   Active: activating (start) since Thu 2019-10-31 05:08:52 UTC; 19s ago
     Docs: https://github.com/coreos/etcd
           man:etcd
 Main PID: 6099 (etcd)
    Tasks: 7 (limit: 9512)
   CGroup: /system.slice/etcd.service
           └─6099 /usr/bin/etcd --name k8s-master-48189569-0 --peer-client-cert-auth --peer-trusted-ca-file=/etc/kubernetes/certs/ca.crt --peer-cert-file=/etc/kubernetes/certs/e

Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.241:36602" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.242:37456" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.241:36608" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.242:37458" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.241:36616" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.241:36618" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.241:36620" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.242:37460" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.242:37466" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.241:36626" (error "remote error: tls: bad certificate", ServerName "")

  5. etcd fails to start because of the problem described above when a self-signed certificate is used, which causes the entire deployment to fail.

I have tried etcd versions other than 3.3.15 (the default one), but that doesn't change anything.
I think we could use etcd 3.3.17 with the new argument "--experimental-peer-skip-client-san-verification" to solve the problem.
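
When triaging a log like the one in step 4, it can help to tally the rejected connections per peer and per error to see whether every peer fails the same way. The following is only a minimal sketch: the regular expression is inferred from the log lines quoted in this issue, the function name tally_rejections is hypothetical, and the pattern assumes IPv4 peer addresses.

```python
import re
from collections import Counter

# Pattern inferred from the etcd log lines quoted in this issue (assumption:
# other etcd versions may word the message differently).
REJECTED = re.compile(
    r'rejected connection from "(?P<ip>[^":]+):(?P<port>\d+)" '
    r'\(error "(?P<error>[^"]*)", ServerName "(?P<server_name>[^"]*)"\)'
)

# Three of the lines from the status output in this issue, used as sample input.
SAMPLE_LINES = [
    'Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.241:36602" (error "remote error: tls: bad certificate", ServerName "")',
    'Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.242:37456" (error "remote error: tls: bad certificate", ServerName "")',
    'Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.241:36608" (error "remote error: tls: bad certificate", ServerName "")',
]


def tally_rejections(lines):
    """Count rejected peer connections, keyed by (peer IP, error text)."""
    counts = Counter()
    for line in lines:
        match = REJECTED.search(line)
        if match:
            counts[(match.group("ip"), match.group("error"))] += 1
    return counts


if __name__ == "__main__":
    for (ip, error), count in sorted(tally_rejections(SAMPLE_LINES).items()):
        print(f"{ip}: {count} x {error}")
```

If every peer shows the same TLS error, the cause is more likely shared by all masters (for example the certificates themselves) than specific to one node.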


hydracz commented Nov 7, 2019

I have solved this issue. It was my mistake: the machine that ran the aks-engine command did not have the correct system time, so my self-signed certificate was generated with a start time in the future, which caused etcd to fail to start.

Even after I upgraded etcd to the newest version I still got the same error, which made me look at the certificate files in detail and find the root cause.

My suggestion is to always make sure the system time is synchronized via NTP before generating the certificates.
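
A quick way to catch this kind of problem is to compare a certificate's validity window with the current time on the node. The sketch below is one possible approach, not part of aks-engine: it assumes the openssl command-line tool is installed, that its month names are printed in English, and the function names (parse_openssl_dates, validity_problem, check_certificate) are made up for illustration.

```python
import subprocess
import sys
from datetime import datetime, timezone
from typing import Optional, Tuple

# `openssl x509 -noout -dates` prints lines such as
# "notBefore=Oct 31 13:08:52 2019 GMT"; the trailing zone is stripped before parsing.
OPENSSL_TIME_FORMAT = "%b %d %H:%M:%S %Y"


def parse_openssl_dates(output: str) -> Tuple[datetime, datetime]:
    """Extract (notBefore, notAfter) as UTC datetimes from openssl's output."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()

    def to_utc(text: str) -> datetime:
        stamp = text.rsplit(" ", 1)[0]  # drop the "GMT" suffix
        parsed = datetime.strptime(stamp, OPENSSL_TIME_FORMAT)
        return parsed.replace(tzinfo=timezone.utc)

    return to_utc(fields["notBefore"]), to_utc(fields["notAfter"])


def validity_problem(not_before: datetime, not_after: datetime,
                     now: datetime) -> Optional[str]:
    """Return a description of the problem, or None if `now` is in the window."""
    if now < not_before:
        return f"not yet valid: notBefore is {not_before - now} ahead of the clock"
    if now > not_after:
        return f"expired {now - not_after} ago"
    return None


def check_certificate(path: str, now: Optional[datetime] = None) -> Optional[str]:
    """Run openssl on a PEM certificate and check it against `now` (default: current UTC time)."""
    result = subprocess.run(
        ["openssl", "x509", "-noout", "-dates", "-in", path],
        check=True, capture_output=True, text=True,
    )
    not_before, not_after = parse_openssl_dates(result.stdout)
    return validity_problem(not_before, not_after, now or datetime.now(timezone.utc))


if __name__ == "__main__":
    # Usage: pass one or more certificate paths, e.g. /etc/kubernetes/certs/ca.crt
    for cert_path in sys.argv[1:]:
        problem = check_certificate(cert_path)
        print(f"{cert_path}: {problem or 'valid at the current system time'}")
```

A "not yet valid" result on a freshly generated certificate suggests that the clock of the machine that generated it was running ahead of the node's clock.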

@hydracz hydracz closed this as completed Nov 7, 2019