
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation when running on Docker #1552

Closed
NanoCode012 opened this issue Nov 29, 2020 · 24 comments · Fixed by #1553
Labels
bug Something isn't working

Comments

@NanoCode012
Contributor

🐛 Bug

I got the error message below when trying to test the latest commit cff9263 on a new Docker image. I hadn't pulled recently, so I'm not sure which commit introduced this error.

RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

To Reproduce (REQUIRED)

  1. Pull the Docker image and run it
  2. Run python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --nosave --cache

Output:

 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1248768  models.common.BottleneckCSP             [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Traceback (most recent call last):
  File "train.py", line 492, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 83, in train
    model = Model(opt.cfg or ckpt['model'].yaml, ch=3, nc=nc).to(device)  # create
  File "/usr/src/app/models/yolo.py", line 95, in __init__
    self._initialize_biases()  # only run once
  File "/usr/src/app/models/yolo.py", line 150, in _initialize_biases
    b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

Expected behavior

Run normally

Environment

  • Docker + JupyterLab (from my repo)
  • CPU, 1 GPU, Multi-GPU

Additional context

It seems to run fine when I'm running from an old conda py37 environment with torch 1.6.
I cannot reproduce this error on Google Colab.
Could there be something wrong with Docker dependencies?

NanoCode012 added the bug label on Nov 29, 2020
@glenn-jocher
Member

@NanoCode012 thanks for the bug report. I'll try to reproduce with yolov5:latest on a GCP instance.

I've seen this error in the past when running in-place ops with autograd enabled, like the one at L150 in your error message, but that line has not changed in a long time. PyTorch versions are changing though, so perhaps this is handled differently now.
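
To illustrate what the check complains about, here is a minimal standalone sketch (not YOLOv5 code; the tensor shapes are just stand-ins) that reproduces the same RuntimeError on the newer PyTorch builds discussed in this thread:

import math
import torch

# Minimal sketch of the failure mode: an in-place op on a view of a leaf
# tensor that requires grad, performed with autograd enabled.
bias = torch.zeros(255, requires_grad=True)  # leaf tensor, stands in for the Detect() conv bias
b = bias.view(3, -1)                         # a view of the leaf, as in _initialize_biases()
try:
    b[:, 4] += math.log(8 / (640 / 8) ** 2)  # in-place op on the view
except RuntimeError as e:
    print(e)  # a view of a leaf Variable that requires grad is being used in an in-place operation.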

@glenn-jocher
Member

Yeah, I get the same result. I think the issue is that NVIDIA seems to prefer PyTorch nightly builds for their FROM images rather than the latest stable release, so I can't tell whether this is a nightly instability or a 1.8 change that will cause errors on this line in the future.

If I pull latest and then run this line, everything trains fine.

pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html

I guess for now I'll simply reset the image FROM tag to 20.10, which I think was working well.
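
For reference, that rollback is just the base-image line of the Dockerfile (the 20.10 tag is the one mentioned later in this thread):

FROM nvcr.io/nvidia/pytorch:20.10-py3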


@glenn-jocher
Member

Or wait, I just had a great idea! If I start from a different base image, such as pytorch/pytorch:latest, it should point to the latest stable release, and perhaps it also eliminates maintenance since the tag never changes. I will try an experiment and see if it works.

FROM nvcr.io/nvidia/pytorch:20.11-py3
FROM pytorch/pytorch:latest

@glenn-jocher
Copy link
Member

glenn-jocher commented Nov 29, 2020

I tried to create a pytorch:latest image with the Dockerfile below, but the image lacks some dependencies like cv2, which cause problems during pip install, so I gave up on it. The Dockerfile is included in case anyone can debug this. In the meantime, a rollback to 20.10 should fix this; I'll get that done.

docker pull ultralytics/yolov5:pytorch_latest

FROM pytorch/pytorch:latest

# Install dependencies
RUN pip install --upgrade pip
# COPY requirements.txt .
# RUN pip install -r requirements.txt
RUN pip install gsutil

# Create working directory
RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app

# Copy contents
COPY . /usr/src/app
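
If anyone wants to try it, a minimal build-and-run sketch for this Dockerfile might look like the following (the local image tag is a placeholder; the training command is the one from the original report):

docker build -t yolov5:pytorch_latest .   # placeholder tag for a local build
docker run --ipc=host -it yolov5:pytorch_latest \
    python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --nosave --cache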

@glenn-jocher
Member

Verified the new image works; the problem should be resolved now by PR #1553.

@NanoCode012
Contributor Author

NanoCode012 commented Nov 29, 2020

Thanks Glenn! I will wait for the image to build on Docker Hub and test it!

Regarding pytorch:latest, I think it could be risky to use it in the Dockerfile, because if there is a breaking change you may not know until someone reports it.

Edit: This would also mean that this repo cannot use later versions of NVIDIA's image until this bug is fixed.

@glenn-jocher
Member

@NanoCode012 yes, that's true. The Docker images don't actually have any CI tests; they just build on every commit under the assumption that the GitHub CI tests mostly apply to Docker as well, even though the two often use different PyTorch versions. GitHub also updates its dependencies on its own schedule, so when 1.6 came out, for example, the daily CI test started failing the next day.

@cesarandreslopez
Copy link

@glenn-jocher correct me if I am wrong, but both nvcr.io/nvidia/pytorch:20.11-py3 and nvcr.io/nvidia/pytorch:20.10-py3 seem to use Python 3.6.

This project requires Python 3.8 or above.

Will this be a problem?

@glenn-jocher
Member

@cesarandreslopez yes, I noticed that as well. I'm not sure if 3.6.0 is compatible with this repo; I think the last version I checked was 3.6.9. I'm doing all development in 3.8.0, and in general backwards compatibility is something I don't have much time to maintain and verify, which is why I've simply put 3.8 down as the requirement.

But as you're seeing, 3.7 appears compatible, and possibly much of 3.6 as well.
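
As a side note, a small hypothetical guard (not something in the repo) could make the version mismatch visible at startup rather than failing later:

import sys

# Hypothetical check, not YOLOv5 code: warn when the interpreter is older than
# the documented requirement, since the NVIDIA 20.x images ship Python 3.6.
REQUIRED = (3, 8)
if sys.version_info[:2] < REQUIRED:
    print(f"Warning: Python {sys.version.split()[0]} detected; "
          f"this repo targets Python {'.'.join(map(str, REQUIRED))} or later.")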

@MingcongCao
Copy link

MingcongCao commented Dec 2, 2020

Hi, guys. @NanoCode012 @glenn-jocher The following code works for me:

with torch.no_grad():
    b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
    b[:, 5:] += math.log(0.6 / (m.nc - 0.99)) if cf is None else torch.log(cf / cf.sum())  # cls

@glenn-jocher
Member

@MingcongCao ah, I've resolved the original issue by resetting the base image to NVIDIA 20.10, so all Docker operations should be working correctly now.

@hcodee
Copy link

hcodee commented Dec 7, 2020

I have encountered this issue with an RTX 3090 and CUDA 11.1.0. Is there any solution for this configuration?

python train.py --batch-size 64 --data ./data/coco128.yaml --cfg ./models/yolov5s.yaml --weights ''

Using torch 1.8.0.dev20201117 CUDA:0 (GeForce RTX 3090, 24265MB)

Traceback (most recent call last):
  File "train.py", line 492, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 91, in train
    model = Model(opt.cfg, ch=3, nc=nc).to(device)  # create
  File "/home/yons/work/yolov5/models/yolo.py", line 95, in __init__
    self._initialize_biases()  # only run once
  File "/home/yons/work/yolov5/models/yolo.py", line 150, in _initialize_biases
    b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

@NanoCode012
Contributor Author

@hcodee, could you try stable torch 1.7?

@hcodee
Copy link

hcodee commented Dec 7, 2020

Torch 1.7 does not work with the RTX 3090. It took a long time to figure out that it has to run on the nightly build of Torch 1.8.

@batrlatom
Contributor

batrlatom commented Dec 7, 2020

Torch 1.7 does not work with the RTX 3090. It took a long time to figure out that it has to run on the nightly build of Torch 1.8.

You need to compile PyTorch yourself with CUDA 11.1 installed. It is doable; I did it without any hassle (surprisingly) from master. Unfortunately, I need to do it again for 1.7.
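
For anyone attempting that source build, a rough sketch of the usual upstream steps (the exact commands and the compute-capability value 8.6 for the RTX 3090 follow PyTorch's general build instructions and are assumptions, not something from this thread):

# Rough sketch: build PyTorch from source against a locally installed CUDA 11.1.
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
pip install -r requirements.txt
# Restrict the build to the RTX 3090 architecture (compute capability 8.6) to save time.
TORCH_CUDA_ARCH_LIST="8.6" python setup.py install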

@hcodee
Copy link

hcodee commented Dec 7, 2020

@batrlatom Cool, thanks for the reminder. I will try it out.

@dnth
Copy link

dnth commented Dec 17, 2020

I have encountered this issue with an RTX 3090 and CUDA 11.1.0. Is there any solution for this configuration?

python train.py --batch-size 64 --data ./data/coco128.yaml --cfg ./models/yolov5s.yaml --weights ''

Using torch 1.8.0.dev20201117 CUDA:0 (GeForce RTX 3090, 24265MB)

Traceback (most recent call last):
  File "train.py", line 492, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 91, in train
    model = Model(opt.cfg, ch=3, nc=nc).to(device)  # create
  File "/home/yons/work/yolov5/models/yolo.py", line 95, in __init__
    self._initialize_biases()  # only run once
  File "/home/yons/work/yolov5/models/yolo.py", line 150, in _initialize_biases
    b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

Same problem with the nightly PyTorch version here. Any luck with the self-compiled PyTorch 1.8?

@DoctorKey
Copy link

I met the same issue with pytorch 1.8, and the following code works for me:

b.data[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
b.data[:, 5:] += math.log(0.6 / (m.nc - 0.99)) if cf is None else torch.log(cf / cf.sum())  # cls

@glenn-jocher
Member

I just ran into this issue myself, so it's time for a fix :) I'll add a TODO and prioritize it ASAP.

@glenn-jocher
Member

@DoctorKey I can confirm your solution works correctly. I will submit a PR for this to master.

@glenn-jocher
Member

@NanoCode012 @DoctorKey @batrlatom @hcodee this problem should be resolved now by implementing @DoctorKey's fix in PR #1759. The Docker image for ultralytics/yolov5:latest should be updated with this fix in a few minutes.

Let me know if any other issues pop up, and thank you for your contributions!

@Nytsirch
Copy link

Hi, I am new to this and just encountered a runtime problem:

Traceback (most recent call last):
  File "train.py", line 492, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 91, in train
    model = Model(opt.cfg, ch=3, nc=nc).to(device)  # create
  File "/content/yolov5/models/yolo.py", line 95, in __init__
    self._initialize_biases()  # only run once
  File "/content/yolov5/models/yolo.py", line 150, in _initialize_biases
    b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

I am using torch 1.8.0+cu101.

I really don't know what to do. Any help would be appreciated.

@glenn-jocher
Member

glenn-jocher commented Mar 15, 2021

@Nytsirch this error is likely generated by an unsupported 3rd party notebook. Please see the official YOLOv5 Colab Notebook below, and visit the Train Custom Data Tutorial to get started with YOLOv5.
https://colab.research.google.com/github/ultralytics/yolov5/blob/master/tutorial.ipynb
YOLOv5 Colab Notebook

Tutorials

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

If the CI CPU testing badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests verify correct operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu.

@NingAnMe
Copy link

Hi, I am new to this and just encountered a runtime problem:

Traceback (most recent call last):
  File "train.py", line 492, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 91, in train
    model = Model(opt.cfg, ch=3, nc=nc).to(device)  # create
  File "/content/yolov5/models/yolo.py", line 95, in __init__
    self._initialize_biases()  # only run once
  File "/content/yolov5/models/yolo.py", line 150, in _initialize_biases
    b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

I am using torch 1.8.0+cu101.

I really don't know what to do. Any help would be appreciated.

Change the two old lines in yolo.py:

b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
b[:, 5:] += math.log(0.6 / (m.nc - 0.99)) if cf is None else torch.log(cf / cf.sum())  # cls

to the new ones:

b.data[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
b.data[:, 5:] += math.log(0.6 / (m.nc - 0.99)) if cf is None else torch.log(cf / cf.sum())  # cls
