RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation when running on Docker #1552
Comments
@NanoCode012 thanks for the bug report. I'll try to reproduce with yolov5:latest on a GCP instance. I've seen this error in the past when running in-place ops like L150 in your error message with autograd on, but that line has not changed in a long time. PyTorch versions are changing though, so perhaps this is handled differently now. |
Yeah I get the same result. I think the issue is that nvidia seems to prefer pytorch nightly for their FROM images rather than the last stable release, so I can't tell if this is a nightly instability or there's some 1.8 update set to cause errors on this in the future. If I pull latest and then run this line, everything trains fine.
I guess for now I'll simply reset the image FROM tag to 20.10, which I think was working well. |
Or wait, I just had a great idea! I think if I start from a different base image, such as pytorch/pytorch:latest, then this seems to point to the last stable release, and perhaps eliminates maintenance also as the tag never changes. I will try an experiment and see if it works.
|
I tried to create a pytorch:latest image here with this Dockerfile, but the image lacks some dependencies like cv2, which are causing problems on pip install, so I gave up on it. The Dockerfile is here in case anyone can debug this. In the meantime I think a rollback to 20.10 will fix this, I'll get that done.
docker pull ultralytics/yolov5:pytorch_latest
|
Verified new image works, problem should be resolved now in PR #1553 |
Thanks glenn! I will wait for the image to build on Docker Hub and test it! Regarding pytorch:latest, I think it could be dangerous to use it in the Dockerfile, because if there is a breaking change you may not know until someone reports it. Edit: This would mean that this repo will not be able to use later versions of Nvidia's package until this bug is fixed somehow. |
@NanoCode012 yes that's true. The docker images don't actually have any CI tests, they just build on every commit under the assumption that the github CI tests would mostly apply to docker as well, but it is true that they often may use different PyTorch versions. GitHub also updates their dependencies on their own schedule, so when 1.6 came out for example the next day we had the daily CI test failing. |
@glenn-jocher correct me if I am wrong, but both nvcr.io/nvidia/pytorch:20.11-py3 and nvcr.io/nvidia/pytorch:20.10-py3 seem to use Python 3.6. This project requires 3.8 or above. Will this be a problem? |
@cesarandreslopez yes I noticed that as well. I'm not sure if 3.6.0 is compatible with this repo, I think the last one I checked was using 3.6.9. I'm doing all development in 3.8.0, but in general backwards compatibility is something I don't have lots of time to maintain and verify, which is the reason I've simply put 3.8 down as the requirement. But as you're seeing, 3.7 appears compatible, as well as possibly much of 3.6. |
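For anyone unsure which interpreter a given base image ships, here is a minimal sketch of a startup check against the 3.8 requirement mentioned above. The helper name `python_is_supported` is hypothetical and not part of the repo; it only compares version tuples and says nothing about whether 3.6 or 3.7 actually work.

```python
import sys

# Minimum version stated as the project requirement in the comment above.
REQUIRED = (3, 8)


def python_is_supported(version_info=sys.version_info, required=REQUIRED):
    """Return True if the (major, minor) version meets the stated minimum."""
    return tuple(version_info[:2]) >= tuple(required)


if not python_is_supported():
    # Older interpreters may still work, but they are not the stated requirement.
    print(
        f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} "
        f"is below {REQUIRED[0]}.{REQUIRED[1]}; this setup is untested."
    )
```

Running this inside the container before training shows quickly whether the image's Python matches the stated requirement.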
Hi, guys. @NanoCode012 @glenn-jocher The following code works for me: |
@MingcongCao ah, I've resolved the original issue by resetting the base image to Nvidia 20.10, so all docker operations should be operating correctly now. |
I hit this issue with an RTX 3090 and CUDA 11.1.0. Is there a solution for this configuration?
Using torch 1.8.0.dev20201117 CUDA:0 (GeForce RTX 3090, 24265MB) Traceback (most recent call last): |
@hcodee , could you try stable torch 1.7? |
Torch 1.7 does not work with the RTX 3090. It took a long time to figure out how to run on the nightly build of Torch 1.8. |
You need to compile PyTorch yourself with CUDA 11.1 installed. It is doable; I did it from master without any hassle (surprisingly). Unfortunately, I need to do it again for 1.7. |
@batrlatom Cool, thanks for the reminder. I will try it out. |
Same problem with the nightly PyTorch version here. Any luck with the self-compiled PyTorch 1.8? |
I met the same issue with pytorch 1.8, and the following code works for me:
|
I just ran into this issue myself, so it's time for a fix :) Will add a TODO and prioritize this for a fix ASAP. |
@DoctorKey can confirm your solution works correctly. I will submit a PR for this to master. |
@NanoCode012 @DoctorKey @batrlatom @hcodee this problem should be resolved now by implementing @DoctorKey's fix in PR #1759. The Docker image for ultralytics/yolov5:latest should be updated in a few minutes with this fix. Let me know if any other issues pop up, and thank you for your contributions! |
Hi, I am new to this. I just encountered a runtime problem. I am using torch 1.8.0+cu101 and I really don't know what to do. |
@Nytsirch this error is likely generated by an unsupported 3rd party notebook. Please see the official YOLOv5 Colab Notebook below, and visit the Train Custom Data Tutorial to get started with YOLOv5.
Tutorials
Requirements
Python 3.8 or later with all requirements.txt dependencies installed, including
$ pip install -r requirements.txt
Environments
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
Status
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu. |
Change the two old lines in 'yolo.py':
b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
b[:, 5:] += math.log(0.6 / (m.nc - 0.99)) if cf is None else torch.log(cf / cf.sum())  # cls
to the new lines:
b.data[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
b.data[:, 5:] += math.log(0.6 / (m.nc - 0.99)) if cf is None else torch.log(cf / cf.sum())  # cls |
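To make the change above easier to follow outside of yolo.py, here is a small standalone sketch. It assumes a bias shaped like a detection-layer bias (3 anchors, 6 outputs, stride 8; all of these values and the `make_bias` helper are made up for illustration). The first attempt mirrors the old pattern and may or may not raise depending on the PyTorch version, so the script only records what happened. The `torch.no_grad()` variant is not from this thread; it is a commonly used alternative that I believe avoids the same check.

```python
import math

import torch

NA, NO = 3, 6   # hypothetical: 3 anchors, 6 outputs per anchor
STRIDE = 8      # hypothetical stride, standing in for `s` in yolo.py
OBJ_SHIFT = math.log(8 / (640 / STRIDE) ** 2)  # obj (8 objects per 640 image)


def make_bias():
    """Stand-in for a layer bias: a leaf tensor with requires_grad=True."""
    return torch.nn.Parameter(torch.zeros(NA * NO))


# Old pattern: in-place update through a view of the leaf with autograd on.
# Whether this raises depends on the installed PyTorch version.
bias_old = make_bias()
try:
    b = bias_old.view(NA, -1)
    b[:, 4] += OBJ_SHIFT
    old_pattern_raised = False
except RuntimeError as err:
    old_pattern_raised = True
    print("old pattern raised:", err)

# Fix from the comment above: write through .data, which bypasses autograd
# tracking while sharing storage with the parameter.
bias_fix = make_bias()
b = bias_fix.view(NA, -1)
b.data[:, 4] += OBJ_SHIFT

# Alternative (an assumption, not from this thread): do the in-place update
# with gradient tracking switched off.
bias_alt = make_bias()
with torch.no_grad():
    b = bias_alt.view(NA, -1)
    b[:, 4] += OBJ_SHIFT

print("old pattern raised:", old_pattern_raised)
print("fixed bias, column 4:", bias_fix.detach().view(NA, -1)[:, 4])
```

Both variants leave the parameter a leaf that still requires grad, so training proceeds as before; only the initialization step is hidden from autograd.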
🐛 Bug
I got the below error message when I try to test out the latest commit cff9263 on a new docker image. I haven't pulled recently, so I'm not sure which commit made this error.
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.
To Reproduce (REQUIRED)
python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --nosave --cache
Output:
Expected behavior
Run normally
Environment
Additional context
It seems to run fine when I'm running from an old conda py37 environment with torch 1.6.
I cannot reproduce this error on Google Colab.
Could there be something wrong with Docker dependencies?
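Since the question comes down to which versions each environment actually ships, a short report like the sketch below can be run in both the Docker image and the conda environment and the outputs compared. The `describe_environment` helper is hypothetical and not part of the repo; it only reads version attributes that PyTorch exposes.

```python
import platform

import torch


def describe_environment():
    """Collect the version details that differ between the environments."""
    info = {
        "python": platform.python_version(),
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


env = describe_environment()
for key, value in env.items():
    print(f"{key}: {value}")
```

A nightly build shows up in the `torch` field as a `.dev` version string, like the `1.8.0.dev20201117` reported elsewhere in this thread.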