
Evolve is leaking files #2142

Closed
thhart opened this issue Feb 5, 2021 · 4 comments
Labels
bug (Something isn't working) · Stale · TODO

Comments


thhart commented Feb 5, 2021

🐛 Bug

Evolve is not closing its resources; it leaves a large number of files open. In evolve.txt I can see 348 entries (lines), and then it aborts with the error below.
The code is checked out at the latest revision from the repo.

To Reproduce (REQUIRED)

Evolve is called like this:

python ./train.py --img 256 --batch 4 --epochs 8 --data $DATA_BASE/data.yaml --cfg $DATA_BASE/yolov5s.yaml --weights '' --cache --evolve

Output:

Image sizes 256 train, 256 test
Using 4 dataloader workers
Logging results to runs/train/evolve
Starting training for 8 epochs...

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
  0%|          | 0/282 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/resource_sharer.py", line 142, in _serve
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 453, in accept
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 599, in accept
  File "/usr/lib/python3.7/socket.py", line 212, in accept
OSError: [Errno 24] Too many open files
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/reductions.py", line 322, in reduce_storage
  File "/usr/lib/python3.7/multiprocessing/reduction.py", line 194, in DupFd
  File "/usr/lib/python3.7/multiprocessing/resource_sharer.py", line 48, in __init__
OSError: [Errno 24] Too many open files
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/resource_sharer.py", line 142, in _serve
    with self._listener.accept() as conn:
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 453, in accept
    c = self._listener.accept()
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 599, in accept
    s, self._last_accepted = self._socket.accept()
  File "/usr/lib/python3.7/socket.py", line 212, in accept
    fd, addr = self._accept()
OSError: [Errno 24] Too many open files
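The `OSError: [Errno 24] Too many open files` above means the process exhausted its file-descriptor soft limit. As a stopgap while the leak itself is tracked down (this is a workaround, not a fix), the soft limit can be raised toward the hard limit from within Python using the standard-library `resource` module:

```python
import resource

# Query the current per-process file-descriptor limits (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# An unprivileged process may raise its soft limit up to the hard limit.
if hard != resource.RLIM_INFINITY and soft < hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

The same effect can be had from the shell with `ulimit -n` before launching train.py; either way this only buys headroom, so a genuine leak will still hit the ceiling eventually.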

Environment


  • OS: Ubuntu
  • GPU: Nvidia 1070
thhart added the bug label Feb 5, 2021
glenn-jocher (Member) commented

@thhart thanks for the bug report! Do you know what section of the code this problem may be originating from?


thhart commented Feb 6, 2021

From the list of open files I can see that many complete Python environments are kept active. It looks to me as if train.py spawns a new environment for each evolution cycle without closing it after storing a new generation of the hyperparameter set.
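One way to check this hypothesis is to watch the process's open-file-descriptor count across evolution cycles: a steady rise per cycle would confirm the leak. A minimal Linux-only sketch (the helper name `open_fd_count` is ours, not part of the repo):

```python
import os

def open_fd_count(pid=None):
    """Count the file descriptors currently open in a process (via Linux /proc)."""
    pid = pid if pid is not None else os.getpid()
    return len(os.listdir(f"/proc/{pid}/fd"))

# Example: the count rises while a file is held open and falls when it is closed.
before = open_fd_count()
f = open("/dev/null")
assert open_fd_count() == before + 1
f.close()
assert open_fd_count() == before
```

Calling this (or `lsof -p <pid> | wc -l`) once per evolution cycle would show whether descriptors accumulate in the parent process or in the spawned workers.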

glenn-jocher (Member) commented

@thhart thanks! Is this reproducible in a common environment, i.e. evolving COCO128 in the colab notebook?

Are there any fixes you had in mind?
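One commonly suggested workaround for fd exhaustion with PyTorch DataLoader workers (not confirmed in this thread as the root cause here) is to switch PyTorch's tensor-sharing strategy, since the default `file_descriptor` strategy keeps one descriptor per shared tensor:

```python
import torch.multiprocessing as mp

# 'file_system' shares tensors through files in /dev/shm instead of holding
# one fd per tensor; the trade-off is possible leftover shm files on a crash.
mp.set_sharing_strategy("file_system")
```

This would have to run early in train.py, before any DataLoader workers are started, to take effect.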


github-actions bot commented Mar 9, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
