Describe the bug
When loading data with LoadV2 (alluxio fs load xxxx --submit), if the task ultimately fails, the job status is not persisted in the journal. If a master-slave switch happens afterwards, the new master reschedules the previously failed, stale load job. In many production environments, rescheduling old failed jobs loads a batch of unnecessary data, which can severely affect cluster stability.
Furthermore, judging from the original design, failed jobs were not expected to be rescheduled (see "Additional context" for details), so this should be considered a bug.
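To illustrate the missing step the bug describes, here is a minimal sketch. The names (JournalWriter, LoadJob, failAndJournal) are hypothetical and not Alluxio's actual journal API; the point is that a terminal state must be written durably, not only updated in the leader's memory, or a failover loses it.

```java
import java.util.ArrayList;
import java.util.List;

public class JobJournalSketch {
  enum JobState { RUNNING, FAILED }

  // Stand-in for a durable journal that a standby master replays on failover.
  static class JournalWriter {
    final List<String> entries = new ArrayList<>();
    void append(String entry) { entries.add(entry); }
  }

  static class LoadJob {
    final String id;
    JobState state = JobState.RUNNING;
    LoadJob(String id) { this.id = id; }

    // The fix direction: persist the FAILED state to the journal,
    // instead of only flipping it in memory on the current leader.
    void failAndJournal(JournalWriter journal) {
      state = JobState.FAILED;
      journal.append(id + ":" + state);
    }
  }
}
```

Without the journal write, the new leader only sees the job's last journaled (still-running) state and reschedules it.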
To Reproduce
First, submit a load job via LoadV2 and let the job fail.
Second, switch the master, check the status of the LoadJob, and observe that the job has been rescheduled.
Expected behavior
After a master-slave switch, failed jobs should not be rescheduled.
Urgency
Affects cluster stability after a master-slave switch.
Are you planning to fix it
Yes
Additional context
To be clear, the original design intent of this feature is that a job with a definitive success or failure status should not be rescheduled after a leadership switch.
We have also encountered a similar issue: we observed that jobs originally in the JobState.STOPPED state would be automatically rescheduled after a leader switch. For large jobs that have been explicitly cancelled, this
mechanism can lead to unnecessary resource consumption and stability risks. To mitigate this, we have explicitly prohibited jobs in JobState.STOPPED from being automatically rescheduled after a leader switch.
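The guard described above can be sketched as follows. This is illustrative only: the class and state names (JobMeta, jobsToReschedule) are hypothetical, not Alluxio's actual scheduler code. After a leader switch, only jobs still in a non-terminal state are handed back to the scheduler.

```java
import java.util.List;
import java.util.stream.Collectors;

public class FailoverReschedule {
  enum JobState { RUNNING, SUCCEEDED, FAILED, STOPPED }

  record JobMeta(String id, JobState state) {}

  // Terminal states (SUCCEEDED, FAILED, STOPPED) must never be rescheduled.
  static boolean isTerminal(JobState s) {
    return s == JobState.SUCCEEDED || s == JobState.FAILED || s == JobState.STOPPED;
  }

  // On failover, the new leader replays journaled job metadata and
  // reschedules only the jobs that were still running.
  static List<JobMeta> jobsToReschedule(List<JobMeta> journaled) {
    return journaled.stream()
        .filter(j -> !isTerminal(j.state()))
        .collect(Collectors.toList());
  }
}
```

With this filter in the recovery path, a cancelled or failed job survives the leader switch in its terminal state instead of being re-run.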
Alluxio Version:
v2.9.3