
Restore tasks when lifecycle start #14909

Merged: 7 commits into apache:master on Sep 22, 2023

Conversation

@YongGang (Contributor) commented Aug 25, 2023:

Description

Move the task restore logic to the lifecycle start method in KubernetesTaskRunner; this also aligns with what other remote task runners do.

Release note

Tasks are now restored when the runner's lifecycle starts.


Key changed/added classes in this PR
  • In KubernetesTaskRunner, move the task restoration logic from the restore method to the start method (sketched below).
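
For orientation, here is a minimal sketch of the resulting shape of the runner, using only the names visible in the review diff further down (client.getPeonJobs(), adapter.toTask(), joinAsync()). It is an illustration of the approach, not the merged source:

  // Sketch: existing peon jobs are re-attached when the lifecycle starts,
  // rather than in restore(), which TaskQueue invokes on every manage() pass.
  @LifecycleStart
  public void start()
  {
    for (Job job : client.getPeonJobs()) {
      try {
        // joinAsync() registers the task and tracks its status on a background future.
        joinAsync(adapter.toTask(job));
      }
      catch (IOException e) {
        // Hypothetical handling: skip jobs whose task payloads can no longer be read.
        log.error(e, "Could not restore task from job, skipping");
      }
    }
  }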

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@@ -263,80 +284,6 @@ public void test_shutdown_withoutExistingTask()
    runner.shutdown(task.getId(), "");
  }

  @Test
  public void test_restore_withExistingJobs() throws IOException
Contributor commented:

why do we need to delete all these tests again?

@YongGang (author) replied:

added the missing one back.

      for (Job job : client.getPeonJobs()) {
        try {
-         Task task = adapter.toTask(job);
-         restoredTasks.add(Pair.of(task, joinAsync(task)));
+         joinAsync(adapter.toTask(job));
@gianm (Contributor) commented:

Does this really fix the bug? It looks like after this change, once start() completes we can be sure that tasks has all the right task IDs in it, but we can't be sure that it has the most up-to-date statuses. The statuses are still being restored asynchronously by joinAsync, and that could be happening in the background after start() exits.

Could you please consider this and determine whether it's OK or not? If it's OK, please add a comment here explaining why, so future readers don't need to wonder whether the async restoration is OK. If it's not OK, then please update the code to wait for the most up-to-date statuses to be loaded before returning from start().
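
For reference, a sketch of the "wait before returning" option described above, assuming joinAsync() returns a Guava ListenableFuture<TaskStatus> as the old restore() signature suggests, with exception handling elided. This is not what the PR does; the merged change keeps restoration asynchronous:

  // Hypothetical variant: collect the restore futures and block until every
  // restored task has reported a status, so callers of start() never race the
  // background restoration.
  List<ListenableFuture<TaskStatus>> restoreFutures = new ArrayList<>();
  for (Job job : client.getPeonJobs()) {
    restoreFutures.add(joinAsync(adapter.toTask(job)));
  }
  Futures.allAsList(restoreFutures).get();  // propagates failures and interrupts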

Contributor commented:

The behavior we saw was that this change made task failures during rollover less common (probably because there's still a race condition here). I think we should look at how the HttpRemoteTaskRunner solves this.

Contributor commented:

@gianm

I tried to add some code in the start logic of KubernetesTaskRunner to check the tasks that had completed before the Overlord came up and wait on their futures to complete, but this didn't actually solve the problem.

I think this is because the callbacks responsible for listening to the taskRunner future and updating the task's status in TaskStorage are added by the TaskQueue manageInternalCritical block, which doesn't get run during LifecycleStart.

I think this may actually be a general race condition with supervisors that mm-less ingestion may be running into more often. I will need more time to figure out what's happening, but for now I think this PR can be merged, because the logic of "list all jobs in k8s -> add them to the tasks map" definitely belongs in the start method and not the restore method (the restore method appears to be called on every manage() run in TaskQueue).
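
To make the last point concrete, a hedged guess at what restore() reduces to once the k8s job listing lives in start(); the empty-list return is an assumption for illustration, not quoted from the patch:

  // Hypothetical post-change restore(): with job discovery moved to start(),
  // the frequent restore() calls from TaskQueue no longer re-query Kubernetes.
  @Override
  public List<Pair<Task, ListenableFuture<TaskStatus>>> restore()
  {
    return ImmutableList.of();
  }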

@YongGang (author) replied:

As George mentioned, this change has reduced the frequency of task failures during rollovers. To comprehensively address the issue, we might consider persisting the status of SeekableStreamIndexTaskRunner in the database. This would allow accurate restoration upon startup, so we wouldn't need to rely on the TaskRunner for the latest task statuses at start time. However, this enhancement will be tackled in a subsequent PR.
(We've also observed similar symptoms with Middle Manager streaming ingestion.)

Contributor commented:

I don't think this change fixes the issue described in the PR description (task failures during rollovers). It still seems like a good idea because restore() gets called a lot by TaskQueue.

@YongGang I might change the description of the PR since this isn't targeted at fixing the bugs you were seeing anymore.

@suneet-s (Contributor) left a review:

LGTM

Having the TaskRunner read the peon jobs from the k8s client in the start method seems like the correct place to do this.

@suneet-s (Contributor) commented:

@gianm I'm going to merge this change since it is in a contrib extension and makes things more stable. We will address the comment about explaining why joinAsync is ok in the start method in a future patch once some more testing is done.

@suneet-s suneet-s merged commit be3f93e into apache:master Sep 22, 2023
46 checks passed
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023