error: can't start new thread #41

Tangolin · 2022-01-19T05:58:33Z

During training, I frequently hit the error "can't start new thread", which occurs after <stderr>:BlockingIOError: [Errno 11] Resource temporarily unavailable. I also notice that CPU usage is extremely high throughout training.

I am currently following what zoe did in #32 and setting n_workers to 0, but this drastically increases the training time. Is there any workaround for this problem?

Here is a more complete error output:

[1,3]<stderr>:Traceback (most recent call last):
[1,3]<stderr>:  File "src/tasks/run_video_retrieval.py", line 829, in <module>
[1,3]<stderr>:  File "src/tasks/run_video_retrieval.py", line 509, in start_training
[1,3]<stderr>:    model_saver.save(step=global_step, model=model)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
[1,3]<stderr>:    return func(*args, **kwargs)
[1,3]<stderr>:  File "src/tasks/run_video_retrieval.py", line 238, in validate
[1,3]<stderr>:    for val_step, batch in enumerate(val_loader):
[1,3]<stderr>:  File "/clipbert/src/datasets/dataloader.py", line 97, in __iter__
[1,3]<stderr>:    loader_it = iter(self.loader)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__
[1,3]<stderr>:    return _MultiProcessingDataLoaderIter(self)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 719, in __init__
[1,3]<stderr>:    w.start()
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 105, in start
[1,3]<stderr>:    self._popen = self._Popen(self)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
[1,3]<stderr>:    return _default_context.get_context().Process._Popen(process_obj)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
[1,3]<stderr>:    return Popen(process_obj)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
[1,3]<stderr>:    self._launch(process_obj)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
[1,3]<stderr>:    self.pid = os.fork()
[1,3]<stderr>:BlockingIOError: [Errno 11] Resource temporarily unavailable
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "src/tasks/run_video_retrieval.py", line 829, in <module>
[1,1]<stderr>:  File "src/tasks/run_video_retrieval.py", line 509, in start_training
[1,1]<stderr>:    model_saver.save(step=global_step, model=model)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
[1,1]<stderr>:    return func(*args, **kwargs)
[1,1]<stderr>:  File "src/tasks/run_video_retrieval.py", line 238, in validate
[1,1]<stderr>:    for val_step, batch in enumerate(val_loader):
[1,1]<stderr>:  File "/clipbert/src/datasets/dataloader.py", line 97, in __iter__
[1,1]<stderr>:    loader_it = iter(self.loader)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__
[1,1]<stderr>:    return _MultiProcessingDataLoaderIter(self)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 733, in __init__
[1,1]<stderr>:    pin_memory_thread.start()
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/threading.py", line 846, in start
[1,1]<stderr>:    _start_new_thread(self._bootstrap, ())
[1,1]<stderr>:RuntimeError: can't start new thread
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[32362,1],1]
  Exit code:    1


jayleicn · 2022-01-19T15:09:12Z

Hi @Tangolin, we did not encounter this issue when developing this code, so I am not sure what causes it. Could you search around to see what might trigger the BlockingIOError?
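
For context on the BlockingIOError in the traceback: on Linux, os.fork() and thread creation can fail with errno 11 (EAGAIN) when the per-user limit on processes and threads (RLIMIT_NPROC) is reached, and a container's own process limit may have the same effect. Each training process started by mpirun forks n_workers DataLoader workers, so the total grows with the number of ranks. Whether this is the cause here is not confirmed in this thread. The sketch below only inspects the limit and suggests a per-rank worker count; workers_per_rank and its heuristic (split cores evenly, cap at 4) are hypothetical and not part of the ClipBERT code.

```python
import os
import resource


def read_nproc_limit():
    """Return the (soft, hard) per-user limit on processes/threads (Linux)."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def describe_limit(value):
    """Render an rlimit value, which may be RLIM_INFINITY."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def workers_per_rank(num_ranks, reserve_cores=1, cap=4):
    """Hypothetical heuristic: split the visible CPU cores evenly across ranks.

    os.cpu_count() reports the host's cores and may overstate what a
    container is actually allowed to use.
    """
    cores = os.cpu_count() or 1
    share = (cores - reserve_cores) // max(num_ranks, 1)
    return max(0, min(cap, share))


if __name__ == "__main__":
    soft, hard = read_nproc_limit()
    print("RLIMIT_NPROC soft:", describe_limit(soft))
    print("RLIMIT_NPROC hard:", describe_limit(hard))
    # Assumption for illustration: 4 ranks, as a multi-GPU mpirun job might use.
    print("suggested n_workers per rank:", workers_per_rank(num_ranks=4))
```

If the soft limit turns out to be low, a small non-zero n_workers may be a middle ground between the default and 0, though that has not been tested against this error.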

Tangolin · 2022-01-20T08:50:33Z

Okay! Also for the high CPU usage, is there any way to lower it?

jayleicn · 2022-01-20T15:11:38Z

Hi @Tangolin, the high CPU usage is mostly caused by video decoding. You can resize and downsample the videos according to the needs of your downstream task using the script here: https://github.com/ArrowLuo/CLIP4Clip#compress-video-for-speed-up-optional
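
As a rough illustration of what "resize and downsample" can look like, the sketch below builds an ffmpeg command that rescales a video to a fixed height and lowers its frame rate. It is not the linked CLIP4Clip script; build_compress_cmd, the 224-pixel height, and the 3 fps rate are assumptions chosen for the example, and ffmpeg must be installed for compress_video to run.

```python
import subprocess


def build_compress_cmd(src, dst, height=224, fps=3):
    """Build an ffmpeg command that shrinks and downsamples one video.

    The height and fps defaults are illustrative assumptions, not values
    taken from this thread.
    """
    return [
        "ffmpeg",
        "-y",                          # overwrite dst if it already exists
        "-i", src,                     # input video
        "-vf", f"scale=-2:{height}",   # fix the height, keep aspect ratio (even width)
        "-r", str(fps),                # output frame rate
        "-an",                         # drop the audio stream
        dst,
    ]


def compress_video(src, dst, height=224, fps=3):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_compress_cmd(src, dst, height, fps), check=True)


if __name__ == "__main__":
    # Print the command rather than running it, so this works without ffmpeg.
    print(" ".join(build_compress_cmd("input.mp4", "output.mp4")))
```

Smaller frames and fewer frames per second mean less decoding work per sample, which is where the CPU savings would come from. The resolution and frame rate should be chosen to match what the downstream task actually samples.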

Tangolin · 2022-01-24T10:18:40Z

I see, thanks a lot!

Tangolin · 2022-01-26T07:23:42Z

Hi @jayleicn, sorry for reopening this issue, but I would like to know whether running the script you provided will reduce the overall performance of the model.

Tangolin closed this as completed Jan 24, 2022

Tangolin reopened this Jan 26, 2022

Tangolin closed this as completed Mar 17, 2022
