leader query should wait until leader election finished #14898
Conversation
I don't think this change would be helpful. While an Overlord is in the process of becoming leader, ... @YongGang, could you please share some more info on the root problem that you are trying to solve with this change?
During K8s task runner testing, we saw cases where Kafka tasks were marked as failed by the supervisor even though they had succeeded. The Overlord was bounced around that time and leadership changed. When looking through the logs to understand the root cause, we saw thousands of log lines like the following from the Overlord instance, even though it had been elected as the leader (we could see log lines like "By the power of Grayskull, I have the power!").
This was due to the Overlord still being in the middle of becoming leader. Thus I propose to add a lock around the leader query, so the query result is determined: either the Overlord is the leader or it is not, never an intermediate state. This is especially relevant when the Overlord is restarting, as the becoming-leader process can take tens of seconds to finish; it's hard to reason about the return values of getTaskRunner() and getTaskQueue() during that time. I feel this is also related to the other issue I'm trying to solve: if one component depends on another, it should wait until the dependency has fully started/stopped (in the above case, ...).
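To illustrate the idea, here is a minimal, hypothetical sketch (not Druid's actual TaskMaster code; class and field names are illustrative) of a leader query that takes the same lock held during the become-leader transition, so callers block until leadership is settled instead of observing a half-initialized state:

```java
import java.util.Optional;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for an Overlord-like leader state holder.
class LeaderState {
    private final ReentrantLock giantLock = new ReentrantLock();
    private boolean leader = false;
    private Object taskRunner = null; // stand-in for a TaskRunner

    // Called by the leader-election callback; in practice this can take
    // tens of seconds, during which the lock is held.
    void becomeLeader() {
        giantLock.lock();
        try {
            taskRunner = new Object(); // expensive startup work happens here
            leader = true;
        } finally {
            giantLock.unlock();
        }
    }

    // Leader query: waits on the giant lock, so it never returns a result
    // computed from an intermediate state.
    Optional<Object> getTaskRunner() {
        giantLock.lock();
        try {
            return leader ? Optional.of(taskRunner) : Optional.empty();
        } finally {
            giantLock.unlock();
        }
    }
}

public class LeaderQueryDemo {
    public static void main(String[] args) {
        LeaderState state = new LeaderState();
        System.out.println(state.getTaskRunner().isPresent()); // false
        state.becomeLeader();
        System.out.println(state.getTaskRunner().isPresent()); // true
    }
}
```

The trade-off, as discussed below in this thread, is that queries issued mid-election now block rather than returning immediately.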
Thanks for the explanation, @YongGang.
Okay, so IIUC, the supervisor started on a node which had recently been elected leader but was not fully ready yet. I think the fix should be more along the lines of the supervisor waiting for the leader election to complete before starting its duties. Alternatively, ...
In my opinion, these side effects are only to be expected. It is better than being in a state where an Overlord starts doing things before it has properly become the leader and then inevitably fails. Let me know what you think.
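The suggestion above, that the supervisor should wait for leader election to complete before starting its duties, could be sketched roughly as follows. This is a hypothetical illustration, not Druid's supervisor code; the class name, latch, and timeout are all assumptions:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical supervisor that defers its duties until leadership is settled.
class SupervisorSketch {
    private final CountDownLatch leaderLatch = new CountDownLatch(1);
    private volatile boolean started = false;

    // Invoked by the leader-election callback once becoming leader has
    // fully finished.
    void onLeaderReady() {
        leaderLatch.countDown();
    }

    // Blocks (with a bounded wait) until leader election completes,
    // and only then starts supervising.
    boolean start(long timeoutMillis) throws InterruptedException {
        if (!leaderLatch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return false; // leadership not settled in time; do not start
        }
        started = true;
        return true;
    }

    boolean isStarted() {
        return started;
    }
}

public class SupervisorWaitDemo {
    public static void main(String[] args) throws Exception {
        SupervisorSketch s = new SupervisorSketch();
        System.out.println(s.start(10));  // false: election not finished yet
        s.onLeaderReady();                // election completes
        System.out.println(s.start(10));  // true: safe to start duties
    }
}
```

With this ordering, the supervisor never observes the intermediate state at all, which avoids needing the query itself to block.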
I think we may have found a simpler solution here: move some logic from restore() (which is called asynchronously in Task.manage) to start() (which is called during the LifecycleStart of the taskRunner, before SupervisorManager starts).
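A minimal sketch of the ordering this comment describes, with entirely hypothetical class names (the real Druid task runner and SupervisorManager are far more involved): running the restore synchronously inside the runner's lifecycle start() guarantees it has finished before anything that starts afterwards can query the tasks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical runner: restore() runs inside start() rather than being
// kicked off asynchronously later.
class RunnerSketch {
    private final List<String> restoredTasks = new ArrayList<>();

    void start() {
        restore(); // completes before start() returns
    }

    private void restore() {
        restoredTasks.add("task-1"); // stand-in for reloading persisted tasks
    }

    List<String> getRestoredTasks() {
        return restoredTasks;
    }
}

// Hypothetical supervisor manager that starts after the runner, so the
// restored tasks are already visible to it.
class SupervisorManagerSketch {
    private final RunnerSketch runner;

    SupervisorManagerSketch(RunnerSketch runner) {
        this.runner = runner;
    }

    int visibleTaskCount() {
        return runner.getRestoredTasks().size();
    }
}

public class LifecycleOrderDemo {
    public static void main(String[] args) {
        RunnerSketch runner = new RunnerSketch();
        runner.start(); // lifecycle starts the runner first
        SupervisorManagerSketch sm = new SupervisorManagerSketch(runner);
        System.out.println(sm.visibleTaskCount()); // 1
    }
}
```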
Great! I am not sure I fully understand the solution, but if it works, I would be happy to review the PR.
Hi @kfaraz, this is the PR to fix the original issue: #14909. But here I'd still like to address the leader query problem (maybe not in the way I proposed; that's why it's submitted as a draft for discussion). I will think more about it.
Description
The becomeLeader lifecycle is likely to take several seconds to finish. During the becoming-leader process, calls to methods like getTaskRunner() will return wrong results, which caused issues. This change makes leader queries wait on the giantLock instead.
Release note
Key changed/added classes in this PR