deploy failure: etcd failed to start when deploy private Cluster failed because of #2242

hydracz · 2019-10-31T03:15:27Z

Describe the bug

I'm deploying a private AKS cluster using the configuration below:

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.13",
      "kubernetesConfig": {
        "privateCluster": {
          "enabled": true,
          "jumpboxProfile": {
            "name": "jumpbox",
            "vmSize": "Standard_D2_v3",
            "osDiskSizeGB": 30,
            "username": "azureuser",
            "publicKey": "ssh-rsa .........."
          }
        }
      }
    },
    "masterProfile": {
      "count": 3,
      "dnsPrefix": "myakse",
      "vmSize": "Standard_D2_v3",
      "distro": "aks-ubuntu-18.04",
      "etcdVersion": "3.3.15",
      "vnetSubnetId": "/subscriptions/.........../subnets/MasterSubnet",
      "firstConsecutiveStaticIP": "10.100.0.240"
    },
    "agentPoolProfiles": [
      {
        "name": "agents",
        "count": 3,
        "availabilityProfile": "VirtualMachineScaleSets",
        "vmSize": "Standard_D2_v3",
        "distro": "aks-ubuntu-18.04",
        "vnetSubnetId": "/subscriptions/........../subnets/AgentSubnet"
      }
    ],
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "ssh-rsa .........."
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": ".......................",
      "secret": "........................"
    }
  }
}

The master VMs are created, but when the etcd service starts it reports an error, which causes the deployment to fail.

etcd.service fails to start and reports a bad certificate error during the handshake with its peers; the ServerName is empty ("").

Steps To Reproduce

aks-engine generate --api-model configdata.json
az group deployment create

Expected behavior

The installation succeeds and the etcd service on the master VMs starts.

AKS Engine version
0.4.15

Kubernetes version
1.13.2

Additional context

AKS Engine uses a self-signed certificate that doesn't contain the ServerName, only IP addresses.
In that case etcd fails to validate the certificate and fails to start.
etcd 3.3.17 adds a new argument (for DAEMON_ARGS) that skips verifying the SAN of the certificate; it looks like we need to include this change when deploying.

https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.3.md#v3317-2019-10-11
etcd-io/etcd#11196

welcome · 2019-10-31T03:15:28Z

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

hydracz · 2019-10-31T05:18:48Z

Detailed steps to reproduce

Create a configdata.json as below:
{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.13",
      "kubernetesConfig": {
        "privateCluster": {
          "enabled": true,
          "jumpboxProfile": {
            "name": "jumpbox",
            "vmSize": "Standard_D2_v3",
            "osDiskSizeGB": 30,
            "username": "azureuser",
            "publicKey": "ssh-rsa .........."
          }
        }
      }
    },
    "masterProfile": {
      "count": 3,
      "dnsPrefix": "myakse",
      "vmSize": "Standard_D2_v3",
      "distro": "aks-ubuntu-18.04",
      "etcdVersion": "3.3.15",
      "vnetSubnetId": "/subscriptions/.........../subnets/MasterSubnet",
      "firstConsecutiveStaticIP": "10.100.0.240"
    },
    "agentPoolProfiles": [
      {
        "name": "agents",
        "count": 3,
        "availabilityProfile": "VirtualMachineScaleSets",
        "vmSize": "Standard_D2_v3",
        "distro": "aks-ubuntu-18.04",
        "vnetSubnetId": "/subscriptions/........../subnets/AgentSubnet"
      }
    ],
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "ssh-rsa .........."
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": ".......................",
      "secret": "........................"
    }
  }
}
Run the command below to generate the ARM template into _output:
aks-engine generate --api-model configdata.json
Deploy the ARM template:
az group deployment create -g aks-engine-test --template-file azuredeploy.json --parameters "@azuredeploy.parameters.json" --verbose --name "AKSEDeployment11"
Wait for the master VMs to be created, then log in via the jumpbox VM and check the etcd status:

root@k8s-master-48189569-0:/var/log/azure# systemctl status etcd
● etcd.service - etcd - highly-available key value store
Loaded: loaded (/etc/systemd/system/etcd.service; disabled; vendor preset: enabled)
Active: activating (start) since Thu 2019-10-31 05:08:52 UTC; 19s ago
Docs: https://github.com/coreos/etcd
man:etcd
Main PID: 6099 (etcd)
Tasks: 7 (limit: 9512)
CGroup: /system.slice/etcd.service
└─6099 /usr/bin/etcd --name k8s-master-48189569-0 --peer-client-cert-auth --peer-trusted-ca-file=/etc/kubernetes/certs/ca.crt --peer-cert-file=/etc/kubernetes/certs/e

Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.241:36602" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.242:37456" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.241:36608" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.242:37458" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.241:36616" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.241:36618" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.241:36620" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.242:37460" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.242:37466" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.241:36626" (error "remote error: tls: bad certificate", ServerName "")
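To see at a glance which peers are being rejected and why, the journal lines above can be tallied per peer IP. The following is a minimal sketch, not part of aks-engine or etcd: the function name `tally_rejections` and the regular expression are assumptions written against the exact log format shown above, and the sample contains only three of those lines.

```python
import re
from collections import Counter

# Matches the "rejected connection" lines in the journal output above.
# The pattern is an assumption based on that output, not an etcd-defined format.
REJECTED = re.compile(
    r'rejected connection from "(?P<ip>[0-9.]+):(?P<port>\d+)" '
    r'\(error "(?P<error>[^"]*)", ServerName "(?P<server_name>[^"]*)"\)'
)


def tally_rejections(log_text):
    """Count rejected peer connections per (peer IP, error message)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = REJECTED.search(line)
        if match:
            counts[(match.group("ip"), match.group("error"))] += 1
    return counts


# Three lines copied from the systemctl output above.
sample = """\
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.241:36602" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.242:37456" (error "remote error: tls: bad certificate", ServerName "")
Oct 31 05:09:11 k8s-master-48189569-0 etcd[6099]: rejected connection from "10.100.0.241:36608" (error "remote error: tls: bad certificate", ServerName "")
"""

for (ip, error), count in sorted(tally_rejections(sample).items()):
    print(f"{ip}: {count} x {error}")
```

Both rejected addresses, 10.100.0.241 and 10.100.0.242, directly follow the configured firstConsecutiveStaticIP of 10.100.0.240, which suggests the rejected connections come from the other two masters and that every peer hits the same certificate error.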

etcd fails to start because of the self-signed certificate problem mentioned above, which causes the entire deployment to fail.

I have tried etcd versions other than 3.3.15 (the default), but that doesn't change anything.
I think we can use etcd 3.3.17 with the new argument "--experimental-peer-skip-client-san-verification" to solve the problem.

hydracz · 2019-11-07T01:24:55Z

I have solved this issue. It was my mistake: the machine that runs the aks-engine command didn't have the correct system time, so my self-signed certificate was generated with a start time in the future, which caused etcd to fail to start.

Even after I upgraded etcd to the newest version I still got the same error, which made me look at the certificate file in detail and find the root cause.

My suggestion is to always sync the system time via NTP before generating the certificates.
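A quick way to catch this kind of clock skew is to compare the certificate's validity window with the current time on the masters. The sketch below assumes the notBefore and notAfter values have been read from the certificate, for example with `openssl x509 -noout -dates` (with the `notBefore=` / `notAfter=` prefixes stripped). The function name `check_validity_window` and all dates in the example are hypothetical, except the master clock value, which is taken from the systemctl output earlier in this thread.

```python
import ssl
import time


def check_validity_window(not_before, not_after, now=None):
    """Classify `now` against a certificate's validity window.

    `not_before` and `not_after` are date strings in the format printed by
    `openssl x509 -noout -dates`, e.g. "Oct 31 05:00:00 2019 GMT".
    `now` is seconds since the epoch; it defaults to the local clock.
    """
    now = time.time() if now is None else now
    start = ssl.cert_time_to_seconds(not_before)
    end = ssl.cert_time_to_seconds(not_after)
    if now < start:
        return "not yet valid"
    if now > end:
        return "expired"
    return "valid"


# Master clock as seen in the systemctl output (Thu 2019-10-31 05:08:52 UTC).
master_now = ssl.cert_time_to_seconds("Oct 31 05:08:52 2019 GMT")

# Hypothetical certificate dates, as if generated on a machine whose clock
# was running ahead of the masters.
print(check_validity_window(
    "Nov  1 03:00:00 2019 GMT",
    "Nov  1 03:00:00 2020 GMT",
    now=master_now,
))
```

A result of "not yet valid" for a freshly generated certificate points at the clock of the machine that ran aks-engine generate rather than at etcd or the certificate's SAN entries.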

hydracz added the bug label Oct 31, 2019

hydracz closed this as completed Nov 7, 2019
