Watch Connection leak when HTTP_GONE happens #1943

yanzq · 2020-01-13T03:22:53Z

When HTTP_GONE happens (the "too old resource version" exception), the web socket reference is set to null first, so the close function cannot close the connection.

The code is in:
File : kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java
Line : 260

Should line 260 be deleted?
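
To make the reported ordering concrete, here is a minimal sketch. It assumes the manager holds its socket in an AtomicReference; `FakeSocket` and `WatchHolder` are hypothetical stand-ins, not fabric8 or OkHttp classes, and the sketch does not show whether changing the order is safe in the real WatchConnectionManager.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch only: FakeSocket and WatchHolder are hypothetical stand-ins.
class WatchCloseSketch {

  /** Stand-in for a web socket; records whether close() was called. */
  static class FakeSocket {
    volatile boolean closed;

    void close() {
      closed = true;
    }
  }

  /** Holds the current socket the way a watch manager might. */
  static class WatchHolder {
    private final AtomicReference<FakeSocket> socketRef = new AtomicReference<>();

    void open(FakeSocket socket) {
      socketRef.set(socket);
    }

    /** Reported ordering: the reference is dropped first, so close() finds nothing. */
    void onGoneLeaky() {
      socketRef.set(null);
      close();
    }

    /** Alternative ordering: close() detaches and closes in a single step. */
    void onGoneSafe() {
      close();
    }

    void close() {
      FakeSocket socket = socketRef.getAndSet(null);
      if (socket != null) {
        socket.close();
      }
    }
  }

  public static void main(String[] args) {
    FakeSocket first = new FakeSocket();
    WatchHolder leaky = new WatchHolder();
    leaky.open(first);
    leaky.onGoneLeaky();
    System.out.println("leaky ordering closed socket: " + first.closed);

    FakeSocket second = new FakeSocket();
    WatchHolder safe = new WatchHolder();
    safe.open(second);
    safe.onGoneSafe();
    System.out.println("safe ordering closed socket: " + second.closed);
  }
}
```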

yanzq · 2020-01-15T02:24:33Z

need help?

rohanKanojia · 2020-01-15T07:52:42Z

Actually, we removed it in the past, but that caused some unwanted regressions (see #1800 for more details).

kolorful · 2020-01-20T00:44:19Z

@rohanKanojia I encountered a similar issue when using shared informers. How to reproduce:

1. Start a shared informer for Job and launch a Job.
2. Stop the kube-apiserver process.
3. Restart the kube-apiserver process after a few seconds.

You will see the same error keep showing up in the logs, and the informer always returns true for hasSynced(), but the cache won't get any updates until the next re-list.

Is it possible to let the shared informer detect such an issue and start a re-list? Or is there some other workaround that lets the client detect such an error?

#1075 (comment) seems like a dirty workaround that's probably not desirable.

Jan 15 05:40:08: [OkHttp https://127.0.0.1:6443/...] ERROR io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager - Could not deserialize watch event: {"type":"ERROR","object":{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"too old resource version: 544015 (544921)","reason":"Gone","code":410}}
Jan 15 05:40:08: com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot construct instance of `io.fabric8.kubernetes.api.model.NodeStatus` (although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value ('Failure')
Jan 15 05:40:08: at [Source: (String)"{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"too old resource version: 544015 (544921)","reason":"Gone","code":410}"; line: 1, column: 59] (through reference chain: io.fabric8.kubernetes.api.model.Node["status"])
Jan 15 05:40:08: at com.fasterxml.jackson.databind.exc.MismatchedInputException.from(MismatchedInputException.java:63)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.DeserializationContext.reportInputMismatch(DeserializationContext.java:1429)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.DeserializationContext.handleMissingInstantiator(DeserializationContext.java:1059)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.ValueInstantiator._createFromStringFallbacks(ValueInstantiator.java:371)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.createFromString(StdValueInstantiator.java:323)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromString(BeanDeserializerBase.java:1373)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeOther(BeanDeserializer.java:171)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:161)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.impl.MethodProperty.deserializeAndSet(MethodProperty.java:129)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:288)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:151)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4202)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3205)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3173)
Jan 15 05:40:08: at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:279)
Jan 15 05:40:08: at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
Jan 15 05:40:08: at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
Jan 15 05:40:08: at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
Jan 15 05:40:08: at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
Jan 15 05:40:08: at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
Jan 15 05:40:08: at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
Jan 15 05:40:08: at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
Jan 15 05:40:08: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
Jan 15 05:40:08: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Jan 15 05:40:08: at java.lang.Thread.run(Thread.java:745)

rohanKanojia · 2020-01-20T08:46:50Z

@kolorful : Thanks for your report. I think we can add code to handle it on SharedInformer side when 410 occurs.
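
One possible shape for that handling, as a rough sketch rather than the actual SharedInformer code: when a watch fails with 410, list again to obtain a fresh resource version and restart the watch from it. `Source`, `WatchFailedException`, and `run` are hypothetical names introduced only for this illustration.

```java
// Sketch only: Source, WatchFailedException and run are hypothetical names.
class RelistOnGoneSketch {
  static final int HTTP_GONE = 410;

  /** Signals that a watch ended with an HTTP-style error code. */
  static class WatchFailedException extends RuntimeException {
    final int code;

    WatchFailedException(int code) {
      super("watch failed with code " + code);
      this.code = code;
    }
  }

  /** Minimal list/watch abstraction assumed for this sketch. */
  interface Source {
    /** Refreshes the cache and returns the resource version it was read at. */
    String list();

    /** Watches from the given version; returns normally when the watch ends cleanly. */
    void watch(String resourceVersion);
  }

  /** Runs list-then-watch and re-lists on 410, at most maxRelists times. Returns the re-list count. */
  static int run(Source source, int maxRelists) {
    String version = source.list();
    int relists = 0;
    while (true) {
      try {
        source.watch(version);
        return relists;
      } catch (WatchFailedException e) {
        // Only a 410 is recoverable by re-listing; anything else is rethrown.
        if (e.code != HTTP_GONE || relists >= maxRelists) {
          throw e;
        }
        version = source.list();
        relists++;
      }
    }
  }
}
```

The cap on re-lists is a design choice for the sketch: it keeps a permanently failing server from causing an endless list loop.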

kolorful · 2020-01-20T14:11:10Z

Thank you! I'm using v4.6.3 and the error comes from this line:

kubernetes-client/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java

Line 279 in b595e42

    
           watcher.eventReceived(Watcher.Action.valueOf(watchEventType), mapper.readValue(watchObjectAsString, baseOperation.getType()));

I think it somehow mistakes the Status for a CRD and falls into the wrong logic block.

kubernetes-client/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java

Line 259 in b595e42

if (status.getCode() == HTTP_GONE) {

should in theory handle it?
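
A minimal sketch of the routing I would expect, assuming the event's `object` has already been parsed into a plain map: check whether the payload is a Status before trying to deserialize it as the watched resource type. `route` and `Route` are hypothetical names, not the library's API.

```java
import java.util.Map;

// Sketch only: route and Route are hypothetical, not part of fabric8.
class WatchEventRoutingSketch {
  static final int HTTP_GONE = 410;

  enum Route { RESOURCE, GONE, OTHER_ERROR }

  /**
   * Decides how to handle a watch event given its "type" and its parsed "object".
   * A Status payload is never treated as the watched resource type.
   */
  static Route route(String type, Map<String, Object> object) {
    boolean isStatus = "Status".equals(object.get("kind"));
    if ("ERROR".equals(type) || isStatus) {
      Object code = object.get("code");
      if (code instanceof Number && ((Number) code).intValue() == HTTP_GONE) {
        return Route.GONE;
      }
      return Route.OTHER_ERROR;
    }
    return Route.RESOURCE;
  }
}
```

With the event from the log above (type ERROR, kind Status, code 410), this routing picks the GONE branch instead of attempting to read the payload as a Node.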

kolorful · 2020-01-22T00:00:56Z

The other issue I observed is that the informer's hasSynced() always returns true, even after the re-sync period has passed and the re-sync has failed.

How to reproduce:

1. Start a shared informer for Job, launch a Job, and keep checking whether the informer reports its cache as synced.
2. Stop the kube-apiserver process.
3. Wait for the re-sync period to pass.

What I expect: the re-list should fail because kube-apiserver is unavailable, and hasSynced() should return false.
What actually happens: hasSynced() keeps returning true without the cache being updated, and the client has no way to catch this error and restart the informers.

Is this expected behaviour?

It seems that in client-go you can pass a stop channel to the informer, and if anything goes wrong the client can catch that and try restarting the informers, but we don't have that ability in Java yet.
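
Until something like that exists, one client-side workaround could be to track when the last successful list happened and treat the cache as stale once that is too old. This is only a sketch under that assumption: `SyncTracker` is a hypothetical helper, and it relies on the caller being able to observe successful lists, which the informer may not expose.

```java
import java.util.function.LongSupplier;

// Sketch only: SyncTracker is a hypothetical helper, not part of fabric8.
class InformerStalenessSketch {

  /** Remembers the time of the last successful list and reports whether it is recent enough. */
  static class SyncTracker {
    private final LongSupplier clockMillis;
    private final long maxAgeMillis;
    private volatile long lastSuccessMillis = -1;

    SyncTracker(LongSupplier clockMillis, long maxAgeMillis) {
      this.clockMillis = clockMillis;
      this.maxAgeMillis = maxAgeMillis;
    }

    /** Call after every list or re-list that completed without error. */
    void recordSuccessfulList() {
      lastSuccessMillis = clockMillis.getAsLong();
    }

    /** False before the first successful list and once the last one is older than maxAgeMillis. */
    boolean isFresh() {
      long last = lastSuccessMillis;
      return last >= 0 && clockMillis.getAsLong() - last <= maxAgeMillis;
    }
  }

  public static void main(String[] args) {
    // Allow twice a hypothetical 30-second re-sync period before calling the cache stale.
    SyncTracker tracker = new SyncTracker(System::currentTimeMillis, 60_000);
    tracker.recordSuccessfulList();
    System.out.println("fresh: " + tracker.isFresh());
  }
}
```

A client could poll isFresh() and restart its informers when it turns false. The clock is injected so that the check can be tested without waiting.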

rohanKanojia · 2020-01-22T08:15:21Z

@kolorful : Hmm, I see. I think we should file a separate issue for this use case. WDYT? I'll try to fix these before cutting 4.7.1

kolorful · 2020-01-22T13:22:41Z

yes, I filed #1961 and thank you!

stale · 2020-04-23T04:17:31Z

This issue has been automatically marked as stale because it has not had any activity since 90 days. It will be closed if no further activity occurs within 7 days. Thank you for your contributions!

stale · 2020-07-29T06:27:51Z

This issue has been automatically marked as stale because it has not had any activity since 90 days. It will be closed if no further activity occurs within 7 days. Thank you for your contributions!

kolorful mentioned this issue Jan 22, 2020

Two SharedInformer issues related to kube-apiserver unavailable and relisting #1961

Closed

rohanKanojia mentioned this issue Apr 15, 2020

io.fabric8.kubernetes.client.KubernetesClientException: too old resource version #2135

Closed

stale bot added the status/stale label Apr 23, 2020

stale bot closed this as completed Apr 30, 2020

manusa removed the status/stale label Apr 30, 2020

manusa reopened this Apr 30, 2020

stale bot added the status/stale label Jul 29, 2020

stale bot closed this as completed Aug 6, 2020
