error: can't start new thread #41

Closed
Tangolin opened this issue Jan 19, 2022 · 5 comments

Comments

@Tangolin

While training the model, I frequently hit the error "RuntimeError: can't start new thread", which occurs after "BlockingIOError: [Errno 11] Resource temporarily unavailable". CPU usage is also extremely high during training.

For now I am following what zoe did in #32 and setting n_workers to 0, but this drastically increases the training time. Is there any workaround for this problem?

Here is a more complete error output:

[1,3]<stderr>:Traceback (most recent call last):
[1,3]<stderr>:  File "src/tasks/run_video_retrieval.py", line 829, in <module>
[1,3]<stderr>:  File "src/tasks/run_video_retrieval.py", line 509, in start_training
[1,3]<stderr>:    model_saver.save(step=global_step, model=model)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
[1,3]<stderr>:    return func(*args, **kwargs)
[1,3]<stderr>:  File "src/tasks/run_video_retrieval.py", line 238, in validate
[1,3]<stderr>:    for val_step, batch in enumerate(val_loader):
[1,3]<stderr>:  File "/clipbert/src/datasets/dataloader.py", line 97, in __iter__
[1,3]<stderr>:    loader_it = iter(self.loader)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__
[1,3]<stderr>:    return _MultiProcessingDataLoaderIter(self)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 719, in __init__
[1,3]<stderr>:    w.start()
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 105, in start
[1,3]<stderr>:    self._popen = self._Popen(self)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
[1,3]<stderr>:    return _default_context.get_context().Process._Popen(process_obj)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
[1,3]<stderr>:    return Popen(process_obj)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
[1,3]<stderr>:    self._launch(process_obj)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
[1,3]<stderr>:    self.pid = os.fork()
[1,3]<stderr>:BlockingIOError: [Errno 11] Resource temporarily unavailable
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "src/tasks/run_video_retrieval.py", line 829, in <module>
[1,1]<stderr>:  File "src/tasks/run_video_retrieval.py", line 509, in start_training
[1,1]<stderr>:    model_saver.save(step=global_step, model=model)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
[1,1]<stderr>:    return func(*args, **kwargs)
[1,1]<stderr>:  File "src/tasks/run_video_retrieval.py", line 238, in validate
[1,1]<stderr>:    for val_step, batch in enumerate(val_loader):
[1,1]<stderr>:  File "/clipbert/src/datasets/dataloader.py", line 97, in __iter__
[1,1]<stderr>:    loader_it = iter(self.loader)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__
[1,1]<stderr>:    return _MultiProcessingDataLoaderIter(self)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 733, in __init__
[1,1]<stderr>:    pin_memory_thread.start()
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/threading.py", line 846, in start
[1,1]<stderr>:    _start_new_thread(self._bootstrap, ())
[1,1]<stderr>:RuntimeError: can't start new thread
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[32362,1],1]
  Exit code:    1
@jayleicn
Owner

Hi @Tangolin, we did not run into this issue when we developed this code, so we are not sure what causes it. You could search around to see what can trigger a BlockingIOError.
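
One possible starting point, offered as a sketch rather than a confirmed diagnosis: on Linux, os.fork() returning errno 11 (EAGAIN) usually means a process/thread limit was reached, either the per-user RLIMIT_NPROC or, inside a container, a cgroup pids limit. The snippet below only reports the per-user limit and the current thread count. The `workers_per_rank` helper, its `reserve` and `cap` defaults, and the value `ranks_on_node=4` are hypothetical illustrations and are not part of this repository.

```python
import os
import resource
import threading


def describe_process_limits():
    """Report limits that commonly make os.fork() fail with EAGAIN."""
    # On Linux, RLIMIT_NPROC counts threads as well as processes for the user.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "nproc_soft": soft,  # resource.RLIM_INFINITY means "unlimited"
        "nproc_hard": hard,
        "cpu_count": os.cpu_count(),
        "threads_in_this_process": threading.active_count(),
    }


def workers_per_rank(cpu_count, ranks_on_node, reserve=1, cap=4):
    """Hypothetical heuristic: share CPUs across ranks, keep at least 1 worker."""
    per_rank = (cpu_count - reserve) // max(ranks_on_node, 1)
    return max(1, min(cap, per_rank))


if __name__ == "__main__":
    print(describe_process_limits())
    # ranks_on_node=4 is an assumption about the mpirun launch, not a known value.
    print(workers_per_rank(os.cpu_count() or 1, ranks_on_node=4))
```

If the reported soft limit is low relative to (number of ranks) x (workers per rank), a small non-zero n_workers may be a middle ground between the crash and the slowdown seen with n_workers set to 0.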

@Tangolin
Author

Okay! Also for the high CPU usage, is there any way to lower it?

@jayleicn
Owner

Hi @Tangolin, the high CPU usage is mostly caused by video decoding. You can resize and downsample the videos to suit your downstream task using the script here: https://github.com/ArrowLuo/CLIP4Clip#compress-video-for-speed-up-optional
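
As a rough illustration of what such a compression step can look like (a sketch, not the linked script itself), the snippet below builds an ffmpeg command that rescales a video and lowers its frame rate. The function name `build_compress_command`, the 224-pixel height, and the 3 fps rate are assumptions chosen for the example; pick values that match your task.

```python
import shlex


def build_compress_command(src, dst, height=224, fps=3):
    """Build an ffmpeg command that downsizes and downsamples one video."""
    return [
        "ffmpeg",
        "-y",                          # overwrite dst if it already exists
        "-i", src,                     # input video
        "-vf", f"scale=-2:{height}",   # fixed height; width keeps aspect ratio, rounded to even
        "-r", str(fps),                # output frame rate
        "-an",                         # drop the audio stream
        dst,
    ]


if __name__ == "__main__":
    cmd = build_compress_command("input.mp4", "output.mp4")
    # Print the command for inspection; run it with subprocess.run(cmd, check=True).
    print(" ".join(shlex.quote(part) for part in cmd))
```

Compressing the videos once, offline, moves most of the decoding cost out of the training loop, since each data-loading worker then decodes fewer and smaller frames.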

@Tangolin
Author

I see, thanks a lot!

@Tangolin
Author

Hi @jayleicn, sorry for reopening this issue. I would like to know whether running the script you provided will reduce the overall performance of the model.

@Tangolin reopened this Jan 26, 2022