Link etcd certificates for calico-node error #3464

Closed
forkballpitch opened this issue Oct 6, 2018 · 28 comments

@forkballpitch

forkballpitch commented Oct 6, 2018

I've got an error here.

Please help me.

failed: [node5] (item={u's': u'node-node5.pem', u'd': u'cert.crt'}) => {"changed": false, "item": {"d": "cert.crt", "s": "node-node5.pem"}, "msg": "Error while linking: [Errno 2] No such file or directory", "path": "/etc/calico/certs/cert.crt", "state": "absent"}

And my host.ini file is:

[k8s-cluster:children]
kube-master
kube-node

[all]
node1 ansible_host=209.XXX.188.XX ip=209.XXX.188.XX
node2 ansible_host=209.XXX.188.XXX ip=209.XXX.188.XX
node3 ansible_host=209.XXX.188.XXX ip=209.XXX.188.XX
node4 ansible_host=209.XXX.188.XXX ip=209.XXX.188.XX
node5 ansible_host=209.XXX.188.XXX ip=209.XXX.188.XX

[kube-master]
node1
node2
node3

[kube-node]
node4
node5

[etcd]
node1
node2
node3

[calico-rr]

[vault]
node1
node2
node3

@mirwan
Contributor

mirwan commented Oct 7, 2018

@forkballpitch Could you provide the information listed in the issue template (OS, distribution, ..., command line) and the name of the task that raises the error?

mirwan added the triage/needs-information label Oct 7, 2018
@forkballpitch
Author

forkballpitch commented Oct 8, 2018

@mirwan I just cloned the source from "https://github.com/kubernetes-incubator/kubespray.git"
and added more worker servers.
If you need more information, please tell me.
Thank you!

OS: Ubuntu 16.04.4
kubespray version: latest
command line: ansible-playbook -b -v -i inventory/prod/hosts.ini cluster.yml

@bartlaarhoven
Contributor

I'm having the same problem. Ubuntu 16.04 clean installs on both kubespray host and kube nodes, kubespray pulled from git, command line:

ansible-playbook -i inventory/kube-cluster-01/hosts.ini cluster.yml

@mirwan
Contributor

mirwan commented Oct 8, 2018

First can you confirm that:

  • the (last) failed task reported is "Calico | Link etcd certificates for calico-node"
  • cert_management is set to "vault"
  • ansible-playbook has been executed with "-b"
  • the source file for the link does not actually exist (e.g. /etc/ssl/etcd/ssl/node-node5.pem)

If so, could you check whether any task failed earlier (on the etcd servers during cert generation, memory checks, ...)?
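
For anyone checking the last point of the list, here is a minimal sketch of a shell helper that reports which of the expected link sources are missing on a node. It assumes the etcd certificates live in a directory such as /etc/ssl/etcd/ssl and are named node-<hostname>.pem and node-<hostname>-key.pem, as in the file names quoted in this thread; the function name missing_node_certs is made up for illustration.

```shell
# Hypothetical helper: print the path of each expected etcd node cert
# file that does not exist in the given directory.
# $1 = certificate directory, $2 = inventory hostname
missing_node_certs() {
    dir=$1
    host=$2
    for name in "node-$host.pem" "node-$host-key.pem"; do
        # Print the path only when the file is absent
        [ -e "$dir/$name" ] || printf '%s\n' "$dir/$name"
    done
}

# Example (run on the failing node, with -b style privileges if needed):
#   missing_node_certs /etc/ssl/etcd/ssl node5
```

An empty output means both files are present; any printed path is a link source that the "Link etcd certificates" task would fail on.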

@bartlaarhoven
Contributor

For me:

  • I attached the output of the failed last task: kubespray-failed-last-task.txt
  • cert_management was unset (commented out in inventory/kube-cluster-01/group_vars/all/all.yml)
  • the command was ansible-playbook -i inventory/kube-cluster-01/hosts.ini cluster.yml, so no -b
  • on the failed nodes, the source files for the failed links indeed do not exist

Other possibly related errors or warnings are:

TASK [kubernetes/secrets : Check_certs | Set 'sync_certs' to true on nodes] ***********************************************************************************************************************
Monday 08 October 2018  17:03:51 +0200 (0:00:04.885)       0:05:34.603 ********
 [WARNING]: when statements should not include jinja2 templating delimiters such as {{ }} or {% %}. Found: inventory_hostname in groups['kube-node'] and inventory_hostname != groups['kube-master'][0] and (not item in kubecert_node.files | map(attribute='path') | map("basename") | list or kubecert_node.files | selectattr("path", "equalto", "{{ kube_cert_dir }}/{{ item }}") | map(attribute="checksum")|first|default('') != kubecert_master.files | selectattr("path", "equalto", "{{ kube_cert_dir }}/{{ item }}") | map(attribute="checksum")|first|default(''))

but also in the same task:

ok: [node5] => (item=node-node5-key.pem)

I didn't find any failed tasks.

Does this help?

@bartlaarhoven
Contributor

Additional notes:

  • node1, node2 and node3 have vault and etcd labels
  • the node-node5.pem file does exist on node1, node2 and node3 in /etc/ssl/etcd/ssl/ (and so do the other missing files)
  • on the other nodes like node5, the /etc/ssl/etcd/ssl directory contains ca.pem, node-node1-key.pem and node-node1.pem. That's it.

I'm completely new to ansible and trying kubespray for the first time, so I'd love to help out but I'm still figuring out how it works.

@mirwan
Contributor

mirwan commented Oct 8, 2018

First, I think you must use the -b flag (the documentation is being updated accordingly).
Then, if cert_management is not set in group_vars, there is no need to populate the vault group, as cert management defaults to "script".
Anyway, if the node5 cert and key do not exist, it certainly means that they were either not generated or not synced to node5. Can you look at the whole playbook output and see whether the "Gen_certs | run cert generation script", "Gen_certs | Gather etcd node certs" and "Gen_certs | Write etcd node certs" tasks ran properly?
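
One way to look at those three tasks without scrolling through the whole output is to save the run to a file and grep for the task headers. This is only a sketch: it assumes the output was saved with something like `ansible-playbook ... | tee playbook.log`, that task headers look like `TASK [role : name] ***` as in the excerpts quoted in this thread, and the helper name show_gen_certs_tasks is hypothetical.

```shell
# Hypothetical helper: print the three Gen_certs task headers from a saved
# playbook log, each followed by a few lines of per-host results.
# $1 = log file, $2 = number of result lines to show (defaults to 10)
show_gen_certs_tasks() {
    log=$1
    context=${2:-10}
    grep -n -A "$context" -E \
        'TASK \[.*Gen_certs \| (run cert generation script|Gather etcd node certs|Write etcd node certs)' \
        "$log"
}

# Example:
#   show_gen_certs_tasks playbook.log 20
```

Lines such as "failed:" or "skipping:" for the affected host right under one of these headers would point at the step where the certificate was not generated or not synced.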

@forkballpitch
Author

forkballpitch commented Oct 9, 2018

There is something I don't understand. The first ini file below produces the error and the second one does not.
The error is "can not find /etc/calico/certs/cert.crt".
I have kubespray pulled from git; command line:

ansible-playbook -b -v -i inventory/prod/hosts.ini cluster.yml


host.ini (error on node4)

[k8s-cluster:children]
kube-master
kube-node

[all]
node1 ansible_host=~ ip=~
node2 ansible_host=~ ip=~
node3 ansible_host=~ ip=~
node4 ansible_host=~ ip=~

[kube-master]
node1
node2

[kube-node]
node1
node2
node3
node4

[etcd]
node1
node2
node3

[calico-rr]


host.ini (no error; I removed node1 to node3 from the [kube-node] group)

[k8s-cluster:children]
kube-master
kube-node

[all]
node1 ansible_host=~ ip=~
node2 ansible_host=~ ip=~
node3 ansible_host=~ ip=~
node4 ansible_host=~ ip=~

[kube-master]
node1
node2

[kube-node]

node4

[etcd]
node1
node2
node3

[calico-rr]

And it works!
root@k-01:~/kubespray# kubectl get nodes
NAME    STATUS   ROLES         AGE   VERSION
node1   Ready    master,node   14m   v1.12.1
node2   Ready    master,node   14m   v1.12.1
node3   Ready    node          14m   v1.12.1
node4   Ready    node          14m   v1.12.1

@mirwan
Contributor

mirwan commented Oct 9, 2018

@forkballpitch I didn't think a server could be in kube-node and in etcd/kube-master at the same time. The doc says it can; I will inquire.
@bartlaarhoven Maybe it is the same for you?

@mirwan
Contributor

mirwan commented Oct 9, 2018

Actually, mixing masters/etcd and workload (i.e. nodes) is not a best practice in production.
As long as you have enough servers, you should keep nodes on one side and masters and/or etcd on the other.
Our current CI only covers mixing master/etcd with nodes when deploying a cluster of 3 nodes or fewer.
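
To see quickly whether an inventory mixes these roles, something like the following sketch could help. It assumes an INI inventory with [kube-master], [etcd] and [kube-node] groups laid out as in the files posted above (one host name at the start of each line); the function name overlapping_hosts is made up for illustration.

```shell
# Hypothetical helper: list hosts that appear in [kube-node] and also in
# [kube-master] or [etcd] of an INI inventory.
# $1 = path to the inventory file
overlapping_hosts() {
    awk '
        # Remember the current [group] header, without the brackets
        /^\[/ { group = $0; sub(/^\[/, "", group); sub(/\].*$/, "", group); next }
        # Skip blank lines and comments
        NF == 0 || /^#/ { next }
        group == "kube-master" || group == "etcd" { control[$1] = 1 }
        group == "kube-node" { workers[++n] = $1 }
        END {
            # Report workers in inventory order if they are also control hosts
            for (i = 1; i <= n; i++)
                if (workers[i] in control) print workers[i]
        }
    ' "$1"
}

# Example:
#   overlapping_hosts inventory/prod/hosts.ini
```

With the first inventory posted above, this would list node1, node2 and node3; with the second one it prints nothing.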

@mirwan
Contributor

mirwan commented Oct 9, 2018

@forkballpitch Btw, have you reset your servers (with the reset.yml playbook) between your deployments with the 2 inventories? kubectl get nodes should not report node1 and node2 as nodes.

@bartlaarhoven
Contributor

I've played around with Ansible and kubespray and opened #3486 as that is what fixed it for me.

@dkozlov

dkozlov commented Oct 10, 2018

I have reproduced this issue with ansible==2.7.0.
As a workaround, you can install ansible==2.6.3.

@mirwan
Contributor

mirwan commented Oct 11, 2018

@bartlaarhoven Regarding @dkozlov 's comment, what version of ansible are you using?

@tadeugr

tadeugr commented Oct 11, 2018

@bartlaarhoven Regarding @dkozlov 's comment, what version of ansible are you using?

@mirwan, I'm having the same problem and can confirm that Kubespray revision 3b750ca returns this error when using Ansible 2.7.0.

It works with Ansible 2.6.3 as dkozlov said.
It also works with Ansible 2.6.5.

@bartlaarhoven
Contributor

@dkozlov @mirwan I've used the most recent version of Ansible (fresh install)

ansible-playbook 2.7.0
  config file = None
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python2.7/dist-packages/ansible
  executable location = /usr/local/bin/ansible-playbook
  python version = 2.7.12 (default, Dec  4 2017, 14:50:18) [GCC 5.4.0 20160609]

@mirwan
Contributor

mirwan commented Oct 12, 2018

I was able to reproduce the issue with ansible 2.7.
It seems that ansible gets confused at the task "etcd : Gen_certs | Write etcd node certs" (the cert from one node is written both to that node and to another one).
Btw, the task "etcd : Gen_certs | Get etcd certificate serials" wrongly succeeds for the node with the wrong cert.
I'm looking into it.
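
A rough way to spot that symptom on a node is to list node certificates that belong to a different host. The sketch below assumes the node-<hostname>.pem and node-<hostname>-key.pem naming seen in this thread and a cert directory such as /etc/ssl/etcd/ssl; the function name foreign_node_certs is hypothetical. Note that on etcd members it is normal to find certs for other hosts, so this is only meaningful on plain worker nodes.

```shell
# Hypothetical helper: print node-*.pem files in a directory that are not
# the cert or key of the given host.
# $1 = certificate directory, $2 = inventory hostname
foreign_node_certs() {
    dir=$1
    host=$2
    for f in "$dir"/node-*.pem; do
        [ -e "$f" ] || continue          # the glob matched nothing
        name=${f##*/}                    # strip the directory part
        case "$name" in
            "node-$host.pem"|"node-$host-key.pem") ;;   # expected files
            *) printf '%s\n' "$name" ;;
        esac
    done
}

# Example (on node5):
#   foreign_node_certs /etc/ssl/etcd/ssl node5
```

On a worker like node5 that only holds ca.pem, node-node1-key.pem and node-node1.pem, this would print the two node1 files.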

@mirwan
Contributor

mirwan commented Oct 12, 2018

I think we are currently hitting this issue: ansible/ansible#46600
Maybe there is a fix that consists of using another ansible module...

@bartlaarhoven
Contributor

I have issues signing the collaboration document (as it should be from my company etc.) but I'd like to point again to my PR #3486 as that fixed it for me in Ansible 2.7 and it uses the same way of distributing certificates as in other parts of kubespray.

@mirwan
Contributor

mirwan commented Oct 12, 2018

@bartlaarhoven I'm currently testing your branch ;-)

@caruccio
Contributor

hey @mirwan any news on this topic? This is a show stopper for me...

@mirwan
Contributor

mirwan commented Oct 15, 2018

@caruccio There's only one step left before merging PR #3486 (and I guess you know what's left to be done and certainly why this step cannot be skipped). In the meantime, downgrading to ansible 2.6 could do the trick.

@caruccio
Contributor

I see... I live in Brazil and I really know what bureaucracy means for life on earth.

mirwan removed the triage/needs-information label Oct 15, 2018
@thiguetta

I'm still facing this problem on v2.7 and master.
Any updates on this?

@bartlaarhoven
Contributor

@mirwan Do you have a contact point for me at TLF to get me another agreement?

@ant31
Contributor

ant31 commented Oct 22, 2018

@thiguetta as said, it's a bug in ansible 2.7, it's not something we can fix in kubespray.
The only update we have is to use ansible 2.6.x until the ansible team fixes the issue.
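
Until then, a small guard in front of the playbook run can catch an accidental upgrade. This is a sketch only: it assumes the first line of `ansible-playbook --version` looks like "ansible-playbook 2.7.0", as in the output pasted earlier in this thread, and the function names are made up for illustration.

```shell
# Hypothetical helper: extract the version from the first line of
# `ansible-playbook --version` output, e.g. "ansible-playbook 2.7.0".
parse_ansible_version() {
    printf '%s\n' "$1" | awk 'NR == 1 { print $2 }'
}

# Hypothetical helper: succeed only for versions in the 2.6.x series,
# which were reported as working in this thread.
is_ansible_26() {
    case "$1" in
        2.6|2.6.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   v=$(parse_ansible_version "$(ansible-playbook --version)")
#   is_ansible_26 "$v" || echo "ansible $v detected; this thread suggests 2.6.x"
#   # a downgrade as suggested above: pip install 'ansible==2.6.3'
```

The check is deliberately strict (2.6.x only) because that is the only series confirmed as working here; whether later 2.7 releases fix the problem is not something this thread establishes.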

@mirwan
Contributor

mirwan commented Oct 22, 2018

@bartlaarhoven I don't have any contact point except the one mentioned by the bot (helpdesk@rt.linuxfoundation.org) :-/
