Java based producer fails to re-connect after upgrade #229
Comments
These pod IPs no longer exist after the statefulset was replaced. Edit: I realized later that I didn't check
I scaled up the test and the new pod's producer has no issues. The old one hasn't recovered. This is the Kafka 2.1.0 image.
Unfortunately the failing pod got OOMKilled after about an hour, so I will never know whether it would have recovered. The restarted container connected successfully after a few seconds. After another 5 minutes the producer container logged the following:
That might be explained by my current 500Mi memory limit on Kafka. I see two restarts on kafka-1.
I might have some kind of name resolution problem in my test cluster. For example, zookeeper pods repeatedly log Edit: Nope, that was #231. I can't see a relation between that issue and the kafka broker problem here, but maybe there is one.
Got this error again now after testing another upgrade. Maybe it's reproducible by simply deleting and recreating the kafka statefulset.
Edit: A comment has been deleted right above this one. That's what the ping was referring to. It was interesting, suggesting that maybe brokers report their IP address to Zk and not a DNS name. I assumed that it was the client that had resolved the IP addresses and didn't refresh them when brokers restarted. @qbast That sounds like a plausible explanation for the error, but I don't know where that POD_IP is set. Where did you find the If we end up changing
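One way to check the stale-IP theory from inside the cluster might be to compare the IPs that appear in the producer's error log with what a broker's DNS name resolves to right now. The following is only a minimal sketch: `resolve_ips` and `stale_addresses` are hypothetical helper names, port 9092 is assumed, and the sample IP and host name in the usage comment are placeholders, not values from this thread.

```python
import socket


def resolve_ips(host, port=9092):
    """Return the set of IP addresses that `host` currently resolves to."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def stale_addresses(cached_ips, host, port=9092, resolver=resolve_ips):
    """Return the cached IPs that `host` no longer resolves to."""
    return set(cached_ips) - set(resolver(host, port))


# Usage (placeholder values, replace with ones from your own cluster):
#   stale_addresses({"10.0.0.1"}, "<broker DNS name>")
# A non-empty result means the client is holding an IP that DNS no longer
# returns for that broker.
```

If the result is non-empty for the IPs in the producer log while the DNS name itself resolves fine, that would point at cached or advertised IP addresses rather than at name resolution in the cluster.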
Tested #228 and #227. Kafkacat recovered nicely from downtime, but the "produce-consume" test from 4.3.0 did badly. Producer logs:
It seems to never recover.