Describe the bug
When loading data with LoadV2 (alluxio fs load xxxx --submit), if the task ultimately fails, the job status is not persisted in the journal. If a master-slave switch happens afterwards, the new master reschedules the previously failed, stale load job. In many production environments, rescheduling old failed jobs loads a batch of unnecessary data, which can severely affect cluster stability.
Furthermore, judging from the original design, failed jobs were not expected to be rescheduled (see "Additional context" for details), so this should be considered a bug.
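To illustrate the missing step the bug describes, here is a minimal sketch. The names (JournalWriter, LoadJob, failAndJournal) are hypothetical and not Alluxio's actual journal API; the point is that a terminal state must be written durably, not only updated in the leader's memory, or a failover loses it.

```java
import java.util.ArrayList;
import java.util.List;

public class JobJournalSketch {
  enum JobState { RUNNING, FAILED }

  // Stand-in for a durable journal that a standby master replays on failover.
  static class JournalWriter {
    final List<String> entries = new ArrayList<>();
    void append(String entry) { entries.add(entry); }
  }

  static class LoadJob {
    final String id;
    JobState state = JobState.RUNNING;
    LoadJob(String id) { this.id = id; }

    // The fix direction: persist the FAILED state to the journal,
    // instead of only flipping it in memory on the current leader.
    void failAndJournal(JournalWriter journal) {
      state = JobState.FAILED;
      journal.append(id + ":" + state);
    }
  }
}
```

Without the journal write, the new leader only sees the job's last journaled (still-running) state and reschedules it.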
To Reproduce
First, submit a load job via LoadV2 and let the job fail.
Second, switch the master, check the status of the LoadJob, and observe that the job has been rescheduled.
Expected behavior
After a master-slave switch, failed jobs should not be rescheduled.
Urgency
Affects cluster stability after a master-slave switch.
Are you planning to fix it
Yes
Additional context
To be clear, the original design intent of this feature is that a job with a definitive success or failure status should not be rescheduled after a leadership switch.
We have also encountered a similar issue: we observed that jobs originally in the JobState.STOPPED state would be automatically rescheduled after a leader switch. For large jobs that have been explicitly cancelled, this
mechanism can lead to unnecessary resource consumption and stability risks. To mitigate this, we have explicitly prohibited jobs in JobState.STOPPED from being automatically rescheduled after a leader switch.
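The guard described above can be sketched as follows. This is illustrative only: the class and state names (JobMeta, jobsToReschedule) are hypothetical, not Alluxio's actual scheduler code. After a leader switch, only jobs still in a non-terminal state are handed back to the scheduler.

```java
import java.util.List;
import java.util.stream.Collectors;

public class FailoverReschedule {
  enum JobState { RUNNING, SUCCEEDED, FAILED, STOPPED }

  record JobMeta(String id, JobState state) {}

  // Terminal states (SUCCEEDED, FAILED, STOPPED) must never be rescheduled.
  static boolean isTerminal(JobState s) {
    return s == JobState.SUCCEEDED || s == JobState.FAILED || s == JobState.STOPPED;
  }

  // On failover, the new leader replays journaled job metadata and
  // reschedules only the jobs that were still running.
  static List<JobMeta> jobsToReschedule(List<JobMeta> journaled) {
    return journaled.stream()
        .filter(j -> !isTerminal(j.state()))
        .collect(Collectors.toList());
  }
}
```

With this filter in the recovery path, a cancelled or failed job survives the leader switch in its terminal state instead of being re-run.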
Alluxio Version:
v2.9.3