
When tasks are re-submitted with the same key, Scheduler.queued may return stale TaskState objects #7504

Open
fjetter opened this issue Jan 27, 2023 · 1 comment · May be fixed by #7528
fjetter commented Jan 27, 2023

In distributed/tests/test_client.py::test_threadsafe_get we saw an assertion error while scheduling queued tasks: a task had already been forgotten even though the scheduler was just trying to assign it.

https://github.com/dask/distributed/actions/runs/4026033623/jobs/6920058827

  File "D:\a\distributed\distributed\distributed\scheduler.py", line 4617, in stimulus_queue_slots_maybe_opened

    assert qts.state == "queued", qts.state

AssertionError: forgotten

cc @gjoseph92

@gjoseph92 gjoseph92 changed the title AssertionError in queuing scheduling logic When tasks are re-submitted with the same key, Scheduler.queued may return stale TaskState objects Jan 28, 2023
@gjoseph92 (Collaborator) commented:
TaskState objects are hashable. Their hash is currently their key. So if a key is forgotten, but its TaskState object still exists, when a task with the same key is submitted, the two TaskState objects will hash and compare as equal, even though they are logically different tasks.
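A minimal sketch of that pitfall, using a hypothetical stand-in class (the real TaskState lives in distributed/scheduler.py and is more involved):

```python
# Hypothetical stand-in for a TaskState that hashes and compares by key,
# illustrating why two logically different tasks can collide.
class FakeTaskState:
    def __init__(self, key, state="queued"):
        self.key = key
        self.state = state

    def __hash__(self):
        return hash(self.key)

    def __eq__(self, other):
        return isinstance(other, FakeTaskState) and self.key == other.key


old = FakeTaskState("x", state="forgotten")  # forgotten, but still referenced
new = FakeTaskState("x")                     # re-submitted task, same key

# The two logically different tasks hash and compare as equal...
assert old == new and hash(old) == hash(new)
assert old is not new

# ...so membership tests cannot tell them apart:
assert old in {new}
```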

So I think this is what's happening:

  1. Task x is added to the queue.

  2. Task x is removed from the queue (not popped).

    Remember that internally, a HeapSet has both a set and a heap of weakrefs. HeapSet.remove removes the object from the set, but not from the heap. So HeapSet._data doesn't contain x, but HeapSet._heap still does.

  3. Task x is forgotten (but something still has a reference to it somewhere).

  4. Task x is re-submitted. A different TaskState object is created.

  5. The new x TaskState is added to the queue.

    x in HeapSet._data (the set) is False. So x is added to _data, and x is pushed onto _heap. The heap has an insertion-order tiebreaker, so the new x comes immediately after the old x in the heap.

  6. We peek from the front of the queue.

    The old x object eventually comes off the heap first. That old TaskState object is not in the _data set, but because the new x with the same key is in the set, x in self._data is True, and we return the old, stale x TaskState.

  7. This stale TaskState is in state forgotten, causing the assertion error.
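The steps above can be reproduced with a simplified sketch of a HeapSet. `MiniHeapSet` and `Task` here are hypothetical stand-ins, not the real distributed internals, but they capture the set-plus-heap-of-weakrefs structure and the lazy removal:

```python
import heapq
import weakref
from itertools import count


class MiniHeapSet:
    """Simplified sketch of HeapSet: a set plus a heap of weakrefs,
    with lazy removal (remove() leaves the entry on the heap)."""

    def __init__(self):
        self._data = set()
        self._heap = []       # entries: (priority, insertion_order, weakref)
        self._count = count()  # insertion-order tiebreaker

    def add(self, value, priority=0):
        self._data.add(value)
        heapq.heappush(self._heap, (priority, next(self._count), weakref.ref(value)))

    def remove(self, value):
        self._data.discard(value)  # heap entry is deliberately left behind

    def peek(self):
        while self._heap:
            _, _, ref = self._heap[0]
            value = ref()
            # BUG: a stale heap entry passes this check when a
            # different-but-equal object is in _data.
            if value is not None and value in self._data:
                return value
            heapq.heappop(self._heap)  # discard dead/removed entries
        raise KeyError("empty")


class Task:
    """Stand-in TaskState, hashed and compared by key."""

    def __init__(self, key, state="queued"):
        self.key = key
        self.state = state

    def __hash__(self):
        return hash(self.key)

    def __eq__(self, other):
        return isinstance(other, Task) and self.key == other.key


q = MiniHeapSet()
old_x = Task("x")
q.add(old_x)                # 1. task x is queued
q.remove(old_x)             # 2. removed (not popped): still on the heap
old_x.state = "forgotten"   # 3. forgotten, but a reference keeps it alive

new_x = Task("x")
q.add(new_x)                # 4./5. re-submitted: new object, same key

stale = q.peek()            # 6. the old entry comes off the heap first
assert stale is old_x
assert stale.state == "forgotten"  # 7. the state the assertion trips over
```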

This explanation kinda fits with what the test actually does:

```python
def f(_):
    total = 0
    for _ in range(20):
        total += (x + random.randint(0, 20)).sum().compute()
        sleep(0.001)
    return total

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(30) as e:
    results = list(e.map(f, range(30)))
```

The test is guaranteed to submit the same graph at least a few times. And in the specific CI run you linked, a worker died partway through, which could have triggered the removal of a task from the middle of the queue.


The underlying issue is that TaskStates shouldn't be hashed or equal based on keys: #7510.
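One direction for such a fix (a sketch, not the actual change in #7510): simply not defining `__hash__`/`__eq__` makes Python fall back to object identity, so two TaskState objects for the same key no longer collide:

```python
# Hypothetical Task class relying on default identity-based hashing.
class Task:
    def __init__(self, key, state="queued"):
        self.key = key
        self.state = state


old = Task("x", state="forgotten")
new = Task("x")

assert old != new            # different objects, even with the same key
assert len({old, new}) == 2  # both can coexist in a set
assert old not in {new}      # a stale object no longer masquerades as the new one
```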

@gjoseph92 gjoseph92 linked a pull request Feb 9, 2023 that will close this issue