Watch Connection leak when HTTP_GONE happens #1943

Closed
yanzq opened this issue Jan 13, 2020 · 10 comments

Comments

@yanzq

yanzq commented Jan 13, 2020

When HTTP_GONE occurs (the "too old resource version" error), the web socket reference is set to null first, so the close function can no longer close the connection:
image

The code is in:
File : kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java
Line : 260
image

Should line 260 be deleted?
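
The pattern described above (reference cleared before close is called) can be illustrated with a small, self-contained sketch. This is not the WatchConnectionManager code: `SocketHolder` is a hypothetical class, and `AutoCloseable` stands in for the real web socket type. It only shows why clearing the reference first leaves nothing to close, and how swapping the reference out atomically avoids that.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical holder used for illustration only; not fabric8 code. */
final class SocketHolder {
    // Stand-in for the manager's web socket field.
    private final AtomicReference<AutoCloseable> socketRef = new AtomicReference<>();

    /** Stores a new socket and closes the one it replaces, if any. */
    void open(AutoCloseable socket) {
        closeQuietly(socketRef.getAndSet(socket));
    }

    /** Leaky variant: the reference is dropped first, so close has nothing to act on. */
    void leakyReset() {
        socketRef.set(null);
        closeQuietly(socketRef.get()); // always null here; the connection stays open
    }

    /** Safe variant: take the old socket out atomically, then close it. */
    void safeReset() {
        closeQuietly(socketRef.getAndSet(null));
    }

    private static void closeQuietly(AutoCloseable closeable) {
        if (closeable == null) {
            return;
        }
        try {
            closeable.close();
        } catch (Exception e) {
            // Ignored in this sketch; real code would log the failure.
        }
    }
}
```

Whether this applies directly to line 260 depends on why the reference is cleared there, so treat it as an illustration of the leak rather than a proposed patch.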

@yanzq
Author

yanzq commented Jan 15, 2020

Could anyone help with this?

@rohanKanojia
Member

Actually, we removed it in the past, but that caused some unwanted regressions (see #1800 for more details).

@kolorful
Contributor

kolorful commented Jan 20, 2020

@rohanKanojia I encountered a similar issue when using shared informers. How to reproduce:

  • start a shared informer for Job and launch a Job
  • stop the kube-apiserver process
  • restart the kube-apiserver process after a few seconds
  • the same error keeps showing up in the logs and the informer always returns true for hasSynced(), but the cache gets no updates until the next re-list

Is it possible for the shared informer to detect this condition and start a re-list? Or is there some other workaround that lets the client detect the error?

#1075 (comment) seems like a dirty workaround that's probably not desirable.

Jan 15 05:40:08: [OkHttp https://127.0.0.1:6443/...] ERROR io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager - Could not deserialize watch event: {"type":"ERROR","object":{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"too old resource version: 544015 (544921)","reason":"Gone","code":410}}
Jan 15 05:40:08: com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot construct instance of `io.fabric8.kubernetes.api.model.NodeStatus` (although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value ('Failure')
Jan 15 05:40:08: at [Source: (String)"{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"too old resource version: 544015 (544921)","reason":"Gone","code":410}"; line: 1, column: 59] (through reference chain: io.fabric8.kubernetes.api.model.Node["status"])
Jan 15 05:40:08: at com.fasterxml.jackson.databind.exc.MismatchedInputException.from(MismatchedInputException.java:63)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.DeserializationContext.reportInputMismatch(DeserializationContext.java:1429)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.DeserializationContext.handleMissingInstantiator(DeserializationContext.java:1059)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.ValueInstantiator._createFromStringFallbacks(ValueInstantiator.java:371)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.createFromString(StdValueInstantiator.java:323)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromString(BeanDeserializerBase.java:1373)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeOther(BeanDeserializer.java:171)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:161)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.impl.MethodProperty.deserializeAndSet(MethodProperty.java:129)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:288)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:151)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4202)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3205)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3173)
Jan 15 05:40:08: at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:279)
Jan 15 05:40:08: at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
Jan 15 05:40:08: at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
Jan 15 05:40:08: at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
Jan 15 05:40:08: at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
Jan 15 05:40:08: at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
Jan 15 05:40:08: at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
Jan 15 05:40:08: at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
Jan 15 05:40:08: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
Jan 15 05:40:08: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Jan 15 05:40:08: at java.lang.Thread.run(Thread.java:745)

@rohanKanojia
Member

@kolorful : Thanks for your report. I think we can add code on the SharedInformer side to handle the case where a 410 occurs.
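
As a rough illustration of what handling a 410 on the informer side could look like, the sketch below re-lists whenever the watch reports Gone. `ListWatchLoop`, `WatchResult`, and the `lister`/`watcher` callbacks are hypothetical stand-ins, not the SharedInformer implementation; the assumption is that a list call yields a fresh resource version and that a watch started from it eventually ends with some result.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical list-then-watch loop; names are illustrative, not fabric8 API. */
final class ListWatchLoop {
    /** How a watch ended: stopped normally, or rejected with HTTP 410 Gone. */
    enum WatchResult { STOPPED, GONE }

    /**
     * Lists to obtain a resource version, then watches from it.
     * When the watch ends with GONE, the version is too old, so list again.
     * Returns how many list calls were made; maxLists bounds the retries.
     */
    static int run(Supplier<String> lister,
                   Function<String, WatchResult> watcher,
                   int maxLists) {
        int lists = 0;
        while (lists < maxLists) {
            String resourceVersion = lister.get(); // fresh version from a full list
            lists++;
            if (watcher.apply(resourceVersion) != WatchResult.GONE) {
                break; // anything other than GONE ends the loop in this sketch
            }
        }
        return lists;
    }
}
```

A real implementation would also need backoff between re-lists and would replace the cache contents with the new list result; both are left out here.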

@kolorful
Contributor

kolorful commented Jan 20, 2020

Thank you! I'm using v4.6.3 and the error comes from this line:

watcher.eventReceived(Watcher.Action.valueOf(watchEventType), mapper.readValue(watchObjectAsString, baseOperation.getType()));

I think it somehow mistakes the Status object for a CRD and falls into the wrong logic block.

should in theory handle it?
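
The log above suggests that the ERROR event's Status payload ends up being deserialized as the watched resource type. A minimal sketch of checking for that case first, assuming the event has already been parsed into a type string and a generic map (`WatchErrorCheck` and `isGone` are hypothetical names, not fabric8 API):

```java
import java.util.Map;

/** Hypothetical pre-check for watch events; not part of the fabric8 client. */
final class WatchErrorCheck {
    static final int HTTP_GONE = 410;

    /**
     * Returns true when the event is an ERROR carrying a Status object
     * with code 410, i.e. the "too old resource version" case.
     */
    static boolean isGone(String eventType, Map<String, Object> object) {
        if (!"ERROR".equals(eventType) || object == null) {
            return false;
        }
        if (!"Status".equals(object.get("kind"))) {
            return false;
        }
        Object code = object.get("code");
        return code instanceof Number && ((Number) code).intValue() == HTTP_GONE;
    }
}
```

With a check like this in front of the readValue call, the Status payload would not be handed to the resource type's deserializer.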

@kolorful
Contributor

kolorful commented Jan 22, 2020

The other issue I observe is that the informer's hasSynced() always returns true, even after the re-sync period has passed and the re-sync has failed.

How to reproduce:

  • start a shared informer for Job, launch a Job, and keep checking whether the informer has synced
  • stop the kube-apiserver process
  • wait for the re-sync period to pass

What I expect: the re-list should fail because kube-apiserver is unavailable, and hasSynced() should return false.
What actually happens: hasSynced() keeps returning true without the cache being updated, and the client has no way to catch this error and restart the informers.

Is this expected behaviour?

It seems that in client-go you can pass a stop channel to the informer, and if anything goes wrong the client can catch that and try restarting the informers, but we don't have that ability in Java yet.
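
Until the informer exposes such a signal, one possible client-side approach is to track freshness separately from hasSynced(). The sketch below is an assumption-laden illustration: `SyncFreshness` is a hypothetical helper, and it assumes the application has some place to call `recordSuccess` (for example from its own periodic list call or from event handler callbacks).

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical freshness tracker kept by the application, not by the informer. */
final class SyncFreshness {
    private final Duration maxAge;
    // Time of the last successful update; null until the first one is recorded.
    private volatile Instant lastSuccess;

    SyncFreshness(Duration maxAge) {
        this.maxAge = maxAge;
    }

    /** Call whenever a list or watch event is known to have succeeded. */
    void recordSuccess(Instant when) {
        lastSuccess = when;
    }

    /** True when no success was ever recorded, or the last one is older than maxAge. */
    boolean isStale(Instant now) {
        Instant last = lastSuccess;
        return last == null || Duration.between(last, now).compareTo(maxAge) > 0;
    }
}
```

A supervisor could poll isStale and restart the informers when it returns true; choosing maxAge relative to the re-sync period is left to the application.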

@rohanKanojia
Member

@kolorful : Hmm, I see. I think we should file a separate issue for this use case. WDYT? I'll try to fix these before cutting 4.7.1

@kolorful
Contributor

Yes, I filed #1961. Thank you!

@stale

stale bot commented Apr 23, 2020

This issue has been automatically marked as stale because it has not had any activity since 90 days. It will be closed if no further activity occurs within 7 days. Thank you for your contributions!

@stale stale bot added the status/stale label Apr 23, 2020
@stale stale bot closed this as completed Apr 30, 2020
@manusa manusa reopened this Apr 30, 2020
@stale

stale bot commented Jul 29, 2020

This issue has been automatically marked as stale because it has not had any activity since 90 days. It will be closed if no further activity occurs within 7 days. Thank you for your contributions!

@stale stale bot added the status/stale label Jul 29, 2020
@stale stale bot closed this as completed Aug 6, 2020