
leader query should wait until leader election finished #14898

Closed

Conversation

YongGang
Contributor

Description

The becomeLeader lifecycle is likely to take several seconds to finish. While the Overlord is still in the process of becoming leader, calls to methods like getTaskRunner() return wrong results, which has caused issues.

This change makes the leader query wait on the giantLock.
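
For illustration, here is a minimal sketch of the intended behaviour (class and member names are illustrative only, not the actual TaskMaster code): the leader query takes the same lock that guards the become-leader lifecycle, so callers never observe an intermediate state.

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: the leader query blocks on the lock that guards the
// become-leader lifecycle instead of returning a transient value.
public class LeaderStateSketch
{
  private final ReentrantLock giant = new ReentrantLock(true);
  private boolean leader = false;

  public void becomeLeader()
  {
    giant.lock();
    try {
      // ... start task runner, task queue, supervisor manager, etc. ...
      leader = true;
    }
    finally {
      giant.unlock();
    }
  }

  public boolean isLeader()
  {
    // Proposed behaviour: wait for any in-flight becomeLeader() /
    // stopBeingLeader() to finish before answering.
    giant.lock();
    try {
      return leader;
    }
    finally {
      giant.unlock();
    }
  }
}
```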

Release note


Key changed/added classes in this PR

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@YongGang changed the title from "leadership query should wait until leader elected" to "leader query should wait until leader election finished" on Aug 23, 2023
@kfaraz
Contributor

kfaraz commented Aug 23, 2023

I don't think this change would be helpful. The giant lock is used to guard several other methods and is often found to be a bottleneck. In all of those cases, we would then not be able to answer API queries about whether we are the leader or not.

While an Overlord is in the process of becoming leader, TaskMaster.getTaskRunner() returns empty, which is the preferable behaviour.
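
For context, this is roughly how a caller is expected to handle that contract today. Treat the exact types, packages, and signatures below (Guava Optional, TaskRunner.run) as my reading of the code, i.e. an assumption for the sketch, not a verbatim excerpt:

```java
import com.google.common.base.Optional;
import org.apache.druid.indexing.common.task.Task;
import org.apache.druid.indexing.overlord.TaskMaster;
import org.apache.druid.indexing.overlord.TaskRunner;

// Caller-side handling of the contract described above: while the node is
// not (yet) the leader, the getter returns an empty Optional and the caller
// simply backs off instead of blocking on the giant lock.
public class LeaderAwareCaller
{
  public void runIfLeader(TaskMaster taskMaster, Task task)
  {
    Optional<TaskRunner> taskRunner = taskMaster.getTaskRunner();
    if (!taskRunner.isPresent()) {
      // Not the leader, or leader election still in progress: skip and retry later.
      return;
    }
    taskRunner.get().run(task);
  }
}
```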

@YongGang , could you please share some more info on the root problem that you are trying to solve with this change?

@YongGang
Contributor Author

During K8s task runner testing, we saw cases where Kafka tasks were marked as failed by the supervisor even though they had succeeded.
After digging into the code, we found that the task stop was triggered from here (haven't found out why yet).

The Overlord was bounced around that time and leadership changed. When looking through the logs to understand the root cause, we saw thousands of the following log lines from the Overlord instance, even though it had been elected leader (we could see the log line "By the power of Grayskull, I have the power!").

Failed to get task runner because I'm not the leader!

Failed to get task queue because I'm not the leader!

This happened because, when SeekableStreamSupervisor started, it looped through the tasks from TaskStorage and queried TaskMaster for leader status. The query returned false because the process of becoming leader hadn't finished yet. That created confusion: some tasks could get the right TaskRunner and some couldn't, all depending on whether leader election had finished.

Thus I propose to add a lock around the leader query, so the query result is deterministic: the Overlord is either the leader or not, never in an intermediate state. This matters especially when the Overlord is restarting, as the process of becoming leader can take tens of seconds to finish, and it's hard to reason about the return values of getTaskRunner() and getTaskQueue() during that time.

I feel this is also related to the other issue I'm trying to solve: if one component depends on another, it should wait until the component it depends on has fully started/stopped (in the above case, SeekableStreamSupervisor -> TaskMaster); here we use a lock to achieve that purpose.
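
As a generic illustration of that "wait until the dependency has fully started" idea (plain Java, not Druid code; names are made up for the sketch): the dependency opens a latch only once its start-up, e.g. the become-leader process, has completed, and dependents block on it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Generic start-up gate: the dependency counts the latch down at the very end
// of its start-up; dependents wait on it before doing any work.
public class StartupGate
{
  private final CountDownLatch started = new CountDownLatch(1);

  /** Called by the dependency once its start-up has fully completed. */
  public void markStarted()
  {
    started.countDown();
  }

  /** Called by dependents; returns true if start-up finished within the timeout. */
  public boolean awaitStarted(long timeout, TimeUnit unit) throws InterruptedException
  {
    return started.await(timeout, unit);
  }
}
```

A supervisor-like component would call awaitStarted(...) before iterating over its tasks, so it either sees the fully initialized TaskRunner/TaskQueue or fails explicitly on timeout.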

@kfaraz
Contributor

kfaraz commented Aug 24, 2023

Thanks for the explanation, @YongGang.

> This happened because, when SeekableStreamSupervisor started, it looped through the tasks from TaskStorage and queried TaskMaster for leader status. The query returned false because the process of becoming leader hadn't finished yet. That created confusion: some tasks could get the right TaskRunner and some couldn't, all depending on whether leader election had finished.

Okay, so IIUC, the supervisor has started on a node which has been recently elected leader but is not fully ready yet.

I think the fix should be more along the lines of the supervisor waiting for leader election to complete before starting its duties. Alternatively, SupervisorManager itself could become active only after leader election is complete. This might have the following impact:

  • While leader election is in progress, we cannot perform any supervisor CRUD, which makes sense
  • SupervisorManager initialization would be delayed a little
  • Others?

In my opinion, these side effects are only to be expected. It is better than being in a state where an Overlord starts doing things before it has properly become the leader and fails inevitably.

Let me know what you think.
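
To make the alternative concrete, here is a rough sketch (field and method names are hypothetical, not Druid's actual SupervisorManager API): supervisor CRUD is rejected outright while leader election is still in progress, rather than partially working.

```java
// Hypothetical gate around supervisor CRUD: requests are refused until the
// become-leader lifecycle has fully completed.
public class GatedSupervisorManager
{
  // hypothetical flag, set to true only after becomeLeader() has fully completed
  private volatile boolean leaderLifecycleComplete = false;

  public void markLeaderLifecycleComplete()
  {
    leaderLifecycleComplete = true;
  }

  public void createOrUpdateSupervisor(Object supervisorSpec) // stand-in for the actual spec type
  {
    if (!leaderLifecycleComplete) {
      throw new IllegalStateException("Cannot handle supervisor request: leader election in progress");
    }
    // ... persist the spec and (re)start the supervisor ...
  }
}
```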

@georgew5656
Contributor

> Thanks for the explanation, @YongGang.
>
> > This happened because, when SeekableStreamSupervisor started, it looped through the tasks from TaskStorage and queried TaskMaster for leader status. The query returned false because the process of becoming leader hadn't finished yet. That created confusion: some tasks could get the right TaskRunner and some couldn't, all depending on whether leader election had finished.
>
> Okay, so IIUC, the supervisor has started on a node which has been recently elected leader but is not fully ready yet.
>
> I think the fix should be more along the lines of the supervisor waiting for leader election to complete before starting its duties. Alternatively, SupervisorManager itself could become active only after leader election is complete. This might have the following impact:
>
> • While leader election is in progress, we cannot perform any supervisor CRUD, which makes sense
> • SupervisorManager initialization would be delayed a little
> • Others?
>
> In my opinion, these side effects are only to be expected. It is better than being in a state where an Overlord starts doing things before it has properly become the leader and fails inevitably.
>
> Let me know what you think.

I think we may have found a simpler solution here: move some logic from restore() (which is called asynchronously in Task.manage) to start() (which is called during the LifecycleStart of the taskRunner, before SupervisorManager starts).
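
Very rough illustration of that direction (class and helper names are hypothetical, not the actual diff): work that used to happen lazily in restore() is moved into start(), which the lifecycle invokes while the task runner starts, i.e. before SupervisorManager starts querying for tasks.

```java
import org.apache.druid.java.util.common.lifecycle.LifecycleStart;

// Hypothetical task-runner skeleton: eager initialization during lifecycle start
// instead of lazy initialization during restore().
public class TaskRunnerSketch
{
  @LifecycleStart
  public void start()
  {
    // formerly deferred to restore(): rebuild in-memory state of known tasks
    loadRunningTasksFromStorage();
  }

  private void loadRunningTasksFromStorage()
  {
    // hypothetical helper: read active tasks from task storage and register them
  }
}
```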

@kfaraz
Contributor

kfaraz commented Aug 25, 2023

> I think we may have found a simpler solution here: move some logic from restore() (which is called asynchronously in Task.manage) to start() (which is called during the LifecycleStart of the taskRunner, before SupervisorManager starts).

Great! I am not sure I fully understand the solution, but if it works, I would be happy to review the PR.

@YongGang
Contributor Author

> > I think we may have found a simpler solution here: move some logic from restore() (which is called asynchronously in Task.manage) to start() (which is called during the LifecycleStart of the taskRunner, before SupervisorManager starts).
>
> Great! I am not sure I fully understand the solution, but if it works, I would be happy to review the PR.

Hi @kfaraz, this is the PR that fixes the original issue: #14909

But I would still like to address the leader query problem here (maybe not in the way I proposed, which is why this was submitted as a draft for discussion). I will think more about it.

@YongGang closed this on Sep 15, 2023