Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

auth token invalid after watch reconnects #11954

Closed
jschwinger233 opened this issue May 29, 2020 · 12 comments · Fixed by #12264
Closed

auth token invalid after watch reconnects #11954

jschwinger233 opened this issue May 29, 2020 · 12 comments · Fixed by #12264
Assignees

Comments

@jschwinger233
Copy link

jschwinger233 commented May 29, 2020

Reproduce Procedure

  1. prepare etcd cluster with 3 nodes and respective etcdctl
/go/src/github.com/coreos/etcd # etcd --version
etcd Version: 3.4.9
Git SHA: Not provided (use ./build instead of go build)
Go Version: go1.14.2
Go OS/Arch: linux/amd64

though I don't think configure matters under this issue, let me present one of them

name: etcd2
data-dir: /data/zz_6129484611666145821
listen-peer-urls: http://0.0.0.0:23800
listen-client-urls: http://0.0.0.0:23790
initial-advertise-peer-urls: http://10.213.20.39:23800
advertise-client-urls: http://10.213.20.39:23790
initial-cluster: etcd0=http://10.213.20.40:23800,etcd1=http://10.213.20.38:23800,etcd2=http://10.213.20.39:23800
initial-cluster-token: zz
initial-cluster-state: new
auto-compaction-retention: "1"
quota-backend-bytes: -1

and the etcd processes are running simply by

etcd --config-file /etc/etcd.conf

no envirnments, no command line arguments

  1. enable auth
etcdctl user add root
# then type root password, mine is 'root'
etcdctl --user root:root auth enable
  1. watch
etcdctl --user root:root watch / --prefix

we can put a key to ensure the watch is working right now

etcdctl put /a b
  1. wait for token deleting

wait for 5 min until the token is deleted by simpleTokenKeeper, then kill the etcd processes one by one and restart them immediately after kill

be noted do NOT kill the process until the cluster recovers healthy

  1. watch fails

then you'll realize the watch is down with the output permission deny

Analysis

the issue is cause by simpleTokenKeeper, here is the timeline

  1. etcdctl dials grpc and fetches an auth token, let's say TOKEN-A
  2. etcdctl dials grpc with TOKEN-A, and watch / --prefix as expected
  3. after 5 min, simpleTokenKeeper delete TOKEN-A
  4. watch continues working even if TOKEN-A has been deleted, because token is only checked upon grpc invocation
  5. killing etcd process terminates connection, and etcdv3 client will re-invoke grpc Watch with the same token TOKEN-A
  6. authStore.AuthInfoFromCtx will return ErrInvalidAuthToken due to TOKEN-A no longer exists

Impact

the experiment is conducted using v3.4.9, the good part in this version is client will raise error permission deny and terminate watching;

however in our live cluster, the etcd server is v3.4.3, etcdv3 client is v3.3.8, and there will be no error, no log, no output, no termination, everything looks good but the watch has failed in silence, this is bad.

sometimes we can barely control the client version, such as calico-felix v3.4 binds with clientv3 v3.3.8, and upgrade calico version is subtle in live.

Improvement

In my opinion there are 2 ways to improve:

  1. we improve the watch keepalive mechanism. Correct me if wrong, we already have watch control message from server to client which are sent periodically; if we can have watch client respond with a keepalive response or something like this, we can invoke simpleTokenKeeper.resetSimpleToken to renew TTL
  2. we improve client side. The etcdv3 Watcher is merely an interface that returns <-chan WatchResponse, if we could encapsulate the re-fetching token and re-dialing grpc connection after receiving transport is closing WatchResponse, the Watcher client, outside the interface, would not be influenced.

Related issues

I presume the following issues are talking the exact same thing as I talk
#11121
#11381

looking forward to your kind feedback

@tedyu
Copy link
Contributor

tedyu commented May 31, 2020

It seems a Calico issue should be logged so that client v3.4.9 or higher is used for Calico.

@jschwinger233
Copy link
Author

It seems a Calico issue should be logged so that client v3.4.9 or higher is used for Calico.

upgrading client on calico won't eradicate the issue as watching would be terminated and I don't think there is any re-watch mechanism in calico.

@cfc4n
Copy link
Contributor

cfc4n commented Jun 12, 2020

mark

@tangcong
Copy link
Contributor

tangcong commented Aug 18, 2020

The following best practices can quickly avoid this problem:

  1. please use CN based auth and JWT token in the production environment. it is more safe and reliable.
  2. you can try to use Watch ProgressNotify feature to solve error("everything looks good but the watch has failed in silence, this is bad").
    @jschwinger23

@tangcong
Copy link
Contributor

/cc @mitake

@mitake
Copy link
Contributor

mitake commented Aug 19, 2020

As @tangcong shared, you should use CN based auth or JWT token in realistic use cases. But token expire can happen for JWT token too, so #12135 is valuable, thanks a lot for rising the issue nad sending PR @jschwinger23 !

@jschwinger233
Copy link
Author

As @tangcong shared, you should use CN based auth or JWT token in realistic use cases. But token expire can happen for JWT token too, so #12135 is valuable, thanks a lot for rising the issue nad sending PR @jschwinger23 !

Sorry for late reply, I was on sick leave during last a few days.

In my first attempt, I was going to implement refresh on the grpc mechanism, PerRPCCredentials.GetRequestMetadata, until I encountered a grpc issue grpc/grpc-go#3749, which was fixed in grpc/grpc-go#3677 just two months ago, and I don't think etcd could upgrade grpc deps in such a shot time that I decided to push ahead in a detour, stream interceptor.

Even in my taste, I don't like my PR neither, it's smelly, too hard-coded, too inflexible, but we have to fix it, or it's not reliable on production.

I noted PR #12165 won't eradicate the issue, though @cfc4n mentioned that they will raise another PR for it:

I got it. I'll give a PR after PR #12165 merged.

from #12157 (comment)

so I was wondering what's the status of this issue, is it on @cfc4n 's charge (thanks~), and subsequently I close my PR?

@cfc4n
Copy link
Contributor

cfc4n commented Aug 20, 2020

Yes, It is on my todolists. I'm waiting for PR #12165 merged.

please assign to me
/cc @gyuho @mitake

@cfc4n
Copy link
Contributor

cfc4n commented Sep 1, 2020

@jschwinger23 Can you try PR #12264 , thanks.

@jschwinger233
Copy link
Author

jschwinger233 commented Sep 1, 2020 via email

@jschwinger233
Copy link
Author

@jschwinger23 Can you try PR #12264 , thanks.

working now, thanks.

@cfc4n
Copy link
Contributor

cfc4n commented Sep 7, 2020

@jschwinger23 Can you try PR #12264 , thanks.

working now, thanks.

Can you use this PR #12264 to replace PR #12135 ,and close it ?

aanm added a commit to aanm/cilium that referenced this issue Dec 2, 2020
This version fixes etcd error "auth token invalid after
watch reconnects" etcd-io/etcd#11954.

Signed-off-by: André Martins <andre@cilium.io>
aanm added a commit to cilium/cilium that referenced this issue Dec 2, 2020
This version fixes etcd error "auth token invalid after
watch reconnects" etcd-io/etcd#11954.

Signed-off-by: André Martins <andre@cilium.io>
aanm added a commit to cilium/cilium that referenced this issue Dec 3, 2020
[ upstream commit 7b0037c ]

This version fixes etcd error "auth token invalid after
watch reconnects" etcd-io/etcd#11954.

Signed-off-by: André Martins <andre@cilium.io>
joestringer pushed a commit to cilium/cilium that referenced this issue Dec 3, 2020
[ upstream commit 7b0037c ]

[ Backporter's notes: This is a complete re-vendoring of the relevant
  library, so includes a wider set of changes than the original commit. ]

This version fixes etcd error "auth token invalid after
watch reconnects" etcd-io/etcd#11954.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
pchaigno pushed a commit to cilium/cilium that referenced this issue Dec 4, 2020
[ upstream commit 7b0037c ]

This version fixes etcd error "auth token invalid after
watch reconnects" etcd-io/etcd#11954.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
aanm added a commit to cilium/cilium that referenced this issue Dec 4, 2020
[ upstream commit 7b0037c ]

This version fixes etcd error "auth token invalid after
watch reconnects" etcd-io/etcd#11954.

Signed-off-by: André Martins <andre@cilium.io>
aanm added a commit to cilium/cilium that referenced this issue Dec 4, 2020
[ upstream commit 7b0037c ]

This version fixes etcd error "auth token invalid after
watch reconnects" etcd-io/etcd#11954.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

5 participants