
TypeError: forward() missing 1 required positional argument: 'input' When training on CityScape DataSet #2

Open
HyuanTan opened this issue Nov 21, 2017 · 18 comments


@HyuanTan

HyuanTan commented Nov 21, 2017

Hello, thanks for sharing your work.
I want to train on the Cityscapes dataset using /train/main.py, but I keep hitting an error in the encoder stage during training and validation:

Traceback (most recent call last):
  File "main.py", line 538, in <module>
    main(parser.parse_args())
  File "main.py", line 492, in main
    model = train(args, model, True) #Train encoder
  File "main.py", line 251, in train
    outputs = model(inputs, only_encode=enc)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 68, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 78, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
    raise output
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 42, in _worker
    output = module(*input, **kwargs)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'input'

I debugged in PyCharm and found that the images and labels were loaded correctly, but at inputs = Variable(images) I got another error: cannot call .data on torch.Tensor. Did I really load the data correctly, or did I make a mistake somewhere else?

Besides, NUM_CLASSES = 20 for the Cityscapes dataset, but during training I also got an error in validation:

----- VALIDATING - EPOCH 1 -----
VAL loss: 0.6922 (epoch: 1, step: 0) // Avg time/img: 0.2710 s
ERROR: Unknown label with id 19

So, should the labels range from 0 to 19, or should they use the trainId values from labels.py?

I use Ubuntu 16.04, Python 3.6.3 and CUDA 9.0.
Thanks!

@Eromera
Owner

Eromera commented Nov 21, 2017

Hi, thanks for the message.

Could you tell me a bit more about the context in which you are getting these errors? Are you getting them after modifying the code, or with the unmodified code from GitHub?

The second error you mention, which appears when validation starts, seems related to a change in the labels. Are you using 0-19 as in Cityscapes, or something different? The previous code used labels.py, but I recently uploaded new custom code (iouEval.py) to evaluate IoU without relying on the Cityscapes scripts. In the new code the default ignore label is 19, so you need to change this if you use different labels. Are you using the new iouEval.py or the previous code?
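To illustrate the ignore-label convention described above, here is a minimal sketch (the names valid_pixels etc. are illustrative, not the actual iouEval.py code):

```python
import numpy as np

NUM_CLASSES = 20   # 19 Cityscapes train classes plus one "void" class
IGNORE_LABEL = 19  # the default ignore id mentioned above

def valid_pixels(label_img):
    """Boolean mask of the pixels that should count toward the IoU."""
    return label_img != IGNORE_LABEL

labels = np.array([0, 5, 19, 18, 19])
print(valid_pixels(labels).tolist())  # [True, True, False, True, False]
```

If your dataset uses a different id for void pixels, this constant is the one that would need to change.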

@HyuanTan
Author

@Eromera Thanks for your response. I cloned the latest code from GitHub again and retrained without any changes except adding some print statements, but I hit the same error:
labels: [ 0 1 2 5 7 8 10 11 13 17 18 19]
filename: /DataSet/DSNeo/PublicDataSet/CITYSCAPES_DATASET/StandardStructure/leftImg8bit/train/dusseldorf/dusseldorf_000106_000019_leftImg8bit.png
filenameGt: /DataSet/DSNeo/PublicDataSet/CITYSCAPES_DATASET/StandardStructure/gtFine/train/dusseldorf/dusseldorf_000106_000019_gtFine_labelTrainIds.png
filename: /DataSet/DSNeo/PublicDataSet/CITYSCAPES_DATASET/StandardStructure/leftImg8bit/train/tubingen/tubingen_000037_000019_leftImg8bit.png
filenameGt: /DataSet/DSNeo/PublicDataSet/CITYSCAPES_DATASET/StandardStructure/gtFine/train/tubingen/tubingen_000037_000019_gtFine_labelTrainIds.png
labels: [ 0 1 2 4 5 6 7 8 10 11 13 14 19]
filename: /DataSet/DSNeo/PublicDataSet/CITYSCAPES_DATASET/StandardStructure/leftImg8bit/train/bremen/bremen_000106_000019_leftImg8bit.png
filenameGt: /DataSet/DSNeo/PublicDataSet/CITYSCAPES_DATASET/StandardStructure/gtFine/train/bremen/bremen_000106_000019_gtFine_labelTrainIds.png
filename: /DataSet/DSNeo/PublicDataSet/CITYSCAPES_DATASET/StandardStructure/leftImg8bit/train/bremen/bremen_000023_000019_leftImg8bit.png
filenameGt: /DataSet/DSNeo/PublicDataSet/CITYSCAPES_DATASET/StandardStructure/gtFine/train/bremen/bremen_000023_000019_gtFine_labelTrainIds.png
labels: [ 0 1 2 5 7 8 9 10 11 19]
labels: [ 0 1 2 5 6 7 8 10 13 19]
labels: [ 0 1 2 5 6 7 8 10 11 12 13 15 17 18 19]
labels: [ 0 1 2 3 4 5 6 7 8 9 10 11 13 19]
labels: [ 0 1 2 4 5 7 8 9 10 13 19]
labels: [ 0 1 2 5 7 8 10 11 12 13 17 18 19]
labels: [ 0 1 2 3 4 5 8 9 10 11 12 13 18 19]
labels: [ 0 1 2 3 5 7 8 9 10 11 12 13 18 19]
labels: [ 0 1 2 5 7 8 11 19]
Traceback (most recent call last):
  File "main.py", line 538, in <module>
    main(parser.parse_args())
  File "main.py", line 492, in main
    model = train(args, model, True) #Train encoder
  File "main.py", line 251, in train
    outputs = model(inputs, only_encode=enc)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 68, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 78, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
    raise output
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 42, in _worker
    output = module(*input, **kwargs)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)

The error occurs in:

    model.train()
    for step, (images, labels) in enumerate(loader):

        start_time = time.time()
        #print (labels.size())
        #print (np.unique(labels.numpy()))
        print("labels: ", np.unique(labels[0].numpy()))
        #labels = torch.ones(4, 1, 512, 1024).long()

        if args.cuda:
            images = images.cuda()
            labels = labels.cuda()

        inputs = Variable(images)
        targets = Variable(labels)
        outputs = model(inputs, only_encode=enc)

I get the same error when training on my own data.
Thanks!

@Eromera
Owner

Eromera commented Nov 22, 2017

Hi,

I am not able to reproduce this error on my end, so I think it must somehow be related to your version of PyTorch. Which version are you using? I'm using the latest, 0.2.

Another difference is that you are using CUDA 9 and I am using CUDA 8. I don't think PyTorch has full support for CUDA 9 just yet; maybe this is causing a bug in PyTorch? The problem seems to be in parallel_apply, which is related to DataParallel when using the GPU. Also, are you using 1 GPU or multiple?

Thanks

@HyuanTan
Author

Hi,
I also use the latest PyTorch version 0.2, which I installed from source with python3.6 setup.py install, but during installation it asked me to install cuDNN 6:

In file included from torch/csrc/cudnn/GridSampler.h:7:0,
                 from torch/csrc/cudnn/GridSampler.cpp:1:
torch/csrc/cudnn/cudnn-wrapper.h:10:2: error: #error "CuDNN version not supported"
 #error "CuDNN version not supported"
  ^
torch/csrc/cudnn/cudnn-wrapper.h:9:198: note: #pragma message: CuDNN v5 found, but need at least CuDNN v6. You can get the latest version of CuDNN from https://developer.nvidia.com/cudnn or disable CuDNN with NO_CUDNN=1

That is why I used CUDA 9.0.
Now I have changed back to CUDA 8.0 and reinstalled PyTorch with pip3 install http://download.pytorch.org/whl/cu80/torch-0.2.0.post3-cp36-cp36m-manylinux1_x86_64.whl and pip3 install torchvision (the PyTorch build matching my environment), and all the errors are gone. Everything runs well on multiple GPUs.


Maybe you are right: there is not full support for CUDA 9 in PyTorch just yet, and this could be causing a bug in PyTorch.

@Eromera Thanks!!

@Eromera
Owner

Eromera commented Nov 22, 2017

@HyuanTan Great, I'm glad that the problem was found and solved by downgrading to CUDA 8.

According to the latest posts in the PyTorch issues like this one, I think that compiling with CUDA 9.0 and cuDNN 7 has recently been fixed and should be possible by now, but maybe your problem would only be fixed by applying a workaround mentioned in those posts: installing NCCL as well (the NVIDIA Collective Communications Library for multi-GPU). If you did not have NCCL installed, that would be consistent with your error in DataParallel, so if you prefer CUDA 9 you could still try that. Otherwise, sticking to CUDA 8.0 should be fine for some time; you will not gain much unless you have one of the newest GPUs.

@HyuanTan
Author

@Eromera Thanks, I found that I may not need to install NCCL when using CUDA 8.0:


But with CUDA 9.0 I will try your advice.

Thanks!!

@Eromera Eromera closed this as completed Nov 22, 2017
@ChengshuLi

I got the same issue with CUDA 9.0 and PyTorch 0.4. I realized the reason is that my batch size is not divisible by the number of GPUs I have. After I fixed it, the error disappeared. Hope this is helpful for someone.
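For anyone wondering why divisibility matters: DataParallel scatters the batch across devices in chunks of roughly ceil(batch / num_gpus), so when the batch is too small some replicas receive no input at all and their forward() is called without arguments. A rough sketch of the chunking arithmetic (illustrative only, not the actual torch internals):

```python
import math

def chunk_sizes(batch, n_gpus):
    """Approximate how a DataParallel-style scatter splits a batch across GPUs."""
    per = math.ceil(batch / n_gpus)  # size of each chunk
    sizes = []
    remaining = batch
    while remaining > 0:
        sizes.append(min(per, remaining))
        remaining -= per
    return sizes

print(chunk_sizes(6, 4))  # [2, 2, 2] -> only 3 of 4 replicas receive data
print(chunk_sizes(8, 4))  # [2, 2, 2, 2] -> every replica receives data
```

When the list is shorter than the number of GPUs, the "empty" replicas are the ones that raise forward() missing 1 required positional argument.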

@zhhtu

zhhtu commented Jul 13, 2018

One simple solution may be to just use one GPU: CUDA_VISIBLE_DEVICES=0 python main.py

@Saif-03

Saif-03 commented Aug 28, 2018

@ChengshuLi can you please share your updated version of the code? You use PyTorch 0.4, and some of the functions used here have been deprecated in PyTorch 0.4.

Thanks!

@dmenig

dmenig commented Dec 13, 2018

I got the same error when feeding a batch smaller than the number of gpus on my machine.

@vsahil

vsahil commented Jan 3, 2019

I got the same issue with CUDA 9.0 and PyTorch 0.4. I realized the reason is that my batch size is not divisible by the number of GPUs I have. After I fixed it, the error disappeared. Hope this is helpful for someone.

This did not solve my problem. I have 4 GPUs and a batch size of 64. My PyTorch version is 0.4 and my CUDA version is 9.0. It was still crashing with this error trace:

    return f(x) if callable(f) else model(x)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 123, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 133, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
    raise output
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'

@phdsky

phdsky commented Jan 24, 2019

If you have 4 GPUs on the machine, just change the batch size to 8, 16, 32... (whatever CUDA memory allows), not the 6 from the tutorial: 6 is not divisible by 4.

@Eromera Eromera reopened this Jan 24, 2019
@Eromera
Owner

Eromera commented Jan 24, 2019

I'm reopening the issue since people are still having trouble, please @vsahil confirm if last suggestion by @phdsky worked. Thanks!

@vsahil

vsahil commented Jan 24, 2019 via email

@intersun

intersun commented Feb 4, 2019

Hi @vsahil, can you check the shape of your dataloader's output? The problem might come from the dataloader as well. For example, if you have 5 samples and set batch_size to 4, your second batch will contain only 1 sample instead of 4, which will also cause the parallel error.

@ShreyasSkandan

I believe @siqims is right. That was my issue as well. I rounded my total dataset size to a multiple of my batch_size and everything seems to work smoothly. It's a quick enough test to try for yourself.

@phdsky

phdsky commented Apr 24, 2019

@vsahil Sorry to say that, I tested again and think @siqims's answer is right.

Assume you have N samples and the batch size is b. Make sure the size of the last batch (N mod b, or b itself when N divides evenly) is divisible by the number of GPUs. If it is not, the problem occurs.

@huangwenyi1991

huangwenyi1991 commented Jul 13, 2020

I met the same problem when training YOLOv3. Actually, the problem is that the remainder of the test-set size divided by the batch size is not divisible by the GPU count. For instance, the initial test-set size for YOLOv3 is 450 and the initial batch size is 16, so the remainder is 2, which is not divisible by 4 GPUs, and the problem appears. So the best way to solve this issue is to change the test-set size. In the above instance, the problem no longer appears when the test-set size is changed to 452.
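The arithmetic above can be checked directly (a small sketch; last_batch_size is an illustrative helper, not from any repo):

```python
def last_batch_size(n_samples, batch_size):
    """Size of the final batch when iterating n_samples in chunks of batch_size."""
    rem = n_samples % batch_size
    return rem if rem else batch_size

print(last_batch_size(450, 16))  # 2 -> not divisible by 4 GPUs, scatter fails
print(last_batch_size(452, 16))  # 4 -> divisible by 4 GPUs, OK
```

Divisibility by the GPU count is a sufficient condition here; it guarantees every replica gets an equal, non-empty share of the final batch.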
