What happened:

We have recently started seeing our CI fail sporadically on a test that hangs forever. The stacktrace points to client.scatter with broadcast=True. The error doesn't happen every time the test runs, but often enough to be noticeable, and its sporadic nature makes a minimal repro hard to produce.

We started seeing this with version 2021.05.1, so I'm posting here as a hail-mary in case anyone can link it to a recent change in dask.
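For context, the failing pattern boils down to something like this (a minimal sketch rather than our actual test; the cluster setup and toy data here are assumptions):

```python
import pandas as pd
from distributed import Client, LocalCluster

# Assumed setup: a small local cluster standing in for our CI cluster.
cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)

X = pd.DataFrame({"feature": range(100)})
y = pd.Series(range(100))

# This is the call that sporadically never returns.
X_future, y_future = client.scatter([X, y], broadcast=True)
```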
This is the method that is hanging:
```python
def send_data_to_cluster(self, X, y):
    """Send data to the cluster.

    The implementation uses caching so the data is only sent once.
    This follows dask best practices.

    Args:
        X (pd.DataFrame): input data for modeling
        y (pd.Series): target data for modeling

    Return:
        dask.Future: the modeling data
    """
    data_hash = joblib.hash(X), joblib.hash(y)
    if data_hash in self._data_futures_cache:
        X_future, y_future = self._data_futures_cache[data_hash]
        if not (X_future.cancelled() or y_future.cancelled()):
            return X_future, y_future
    self._data_futures_cache[data_hash] = self.client.scatter(
        [X, y], broadcast=True
    )
    return self._data_futures_cache[data_hash]
```
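For completeness, the returned futures are consumed roughly like this (a sketch; train_model and fit_on_cluster are placeholders, not our real code):

```python
def train_model(X, y):
    # Hypothetical stand-in for the real training function.
    return len(X), len(y)

def fit_on_cluster(estimator, X, y):
    # The cached futures are passed by reference, so the data is not
    # re-serialized on every task submission.
    X_future, y_future = estimator.send_data_to_cluster(X, y)
    model_future = estimator.client.submit(train_model, X_future, y_future)
    return model_future.result()
```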
This is the stacktrace. Note that pytest is setting the 360-second timeout; without it, the test would hang forever.
```
test_python/lib/python3.8/site-packages/distributed/client.py:2185: in scatter
    return self.sync(
test_python/lib/python3.8/site-packages/distributed/client.py:853: in sync
    return sync(
test_python/lib/python3.8/site-packages/distributed/utils.py:351: in sync
    e.wait(10)
/opt/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/threading.py:558: in wait
    signaled = self._cond.wait(timeout)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Condition(<unlocked _thread.lock object at 0x7fcf3bf52f00>, 0)>
timeout = 10

    def wait(self, timeout=None):
        """Wait until notified or until a timeout occurs.

        If the calling thread has not acquired the lock when this method is
        called, a RuntimeError is raised.

        This method releases the underlying lock, and then blocks until it is
        awakened by a notify() or notify_all() call for the same condition
        variable in another thread, or until the optional timeout occurs.
        Once awakened or timed out, it re-acquires the lock and returns.

        When the timeout argument is present and not None, it should be a
        floating point number specifying a timeout for the operation in
        seconds (or fractions thereof).

        When the underlying lock is an RLock, it is not released using its
        release() method, since this may not actually unlock the lock when
        it was acquired multiple times recursively. Instead, an internal
        interface of the RLock class is used, which really unlocks it even
        when it has been recursively acquired several times. Another internal
        interface is then used to restore the recursion level when the lock
        is reacquired.

        """
        if not self._is_owned():
            raise RuntimeError("cannot wait on un-acquired lock")
        waiter = _allocate_lock()
        waiter.acquire()
        self._waiters.append(waiter)
        saved_state = self._release_save()
        gotit = False
        try:    # restore state no matter what (e.g., KeyboardInterrupt)
            if timeout is None:
                waiter.acquire()
                gotit = True
            else:
                if timeout > 0:
>                   gotit = waiter.acquire(True, timeout)
E               Failed: Timeout >360.0s

/opt/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/threading.py:306: Failed
```
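For reference, the 360-second cap comes from pytest rather than dask; a sketch of how we bound the test, assuming the pytest-timeout plugin is installed:

```python
import pytest

# With pytest-timeout installed, the test fails after 360 seconds
# instead of hanging the CI job forever.
@pytest.mark.timeout(360)
def test_send_data_to_cluster():
    ...  # exercises send_data_to_cluster on a LocalCluster
```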
The log is spitting this out:
```
distributed.worker - WARNING - Could not find data: {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ['tcp://127.0.0.1:41061']} on workers: [] (who_has: {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ['tcp://127.0.0.1:41061']})
distributed.scheduler - WARNING - Communication failed during replication: {'status': 'missing-data', 'keys': {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ('tcp://127.0.0.1:41061',)}}
distributed.worker - WARNING - Could not find data: {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ['tcp://127.0.0.1:41061']} on workers: [] (who_has: {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ['tcp://127.0.0.1:41061']})
distributed.scheduler - WARNING - Communication failed during replication: {'status': 'missing-data', 'keys': {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ('tcp://127.0.0.1:41061',)}}
distributed.worker - WARNING - Could not find data: {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ['tcp://127.0.0.1:41061']} on workers: [] (who_has: {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ['tcp://127.0.0.1:41061']})
```
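As a stopgap while we dig into this, we're considering bounding the scatter call so the race surfaces as an exception instead of an indefinite hang. A workaround sketch (the timeout and retry count are arbitrary choices, not recommendations):

```python
import asyncio

def scatter_with_retry(client, data, retries=3, timeout=30):
    # Bound each attempt so a hang surfaces as a timeout; retry a few
    # times in case the missing-data race is transient.
    for attempt in range(retries):
        try:
            return client.scatter(data, broadcast=True, timeout=timeout)
        except (TimeoutError, asyncio.TimeoutError):
            # On Python 3.8 asyncio.TimeoutError is distinct from the
            # builtin TimeoutError, so catch both to be safe.
            if attempt == retries - 1:
                raise
```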
This is a link to the complete stacktrace.
This is a link to the test that's being run.
Any help would be appreciated!

Environment:

@freddyaboulton I was also getting cluster hangs during computations. I was not using scatter in those cases, but clusters would hang indefinitely, and I was seeing similar missing-data issues in the logs.
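In case it helps with debugging, this is roughly how I've been snapshotting cluster state when a hang is suspected (a sketch; it assumes the client's event loop is still responsive):

```python
def dump_cluster_state(client):
    # Scheduler-side view of the workers and their metrics.
    info = client.scheduler_info()
    for addr, worker in info["workers"].items():
        print(addr, worker.get("metrics", {}))
    # who_has maps each key to the workers believed to hold it, which is
    # useful for spotting the missing-data mismatch in the warnings above.
    print(client.who_has())
```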