Problem with inplace operation when training with Sketchy dataset #30

Open
minhkhoi1026 opened this issue Nov 29, 2021 · 2 comments

minhkhoi1026 commented Nov 29, 2021

Hi sir,

While searching for interesting image retrieval ideas, I came across your project. It is wonderful; I tested your model and it works like a charm!

However, a problem came up when I tried to train on the Sketchy dataset. Based on the instructions in the README, I ran the following command:
>>> python3 train.py --dataset Sketchy --dim-out 64 --semantic-models word2vec-google-news --epochs 1 --early-stop 10 --lr 0.0001

This failed with a strange error; the message is below:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [288, 64]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient

I tried to fix it myself, but it didn't work. Can you help me with this problem? Thank you so much!!!

My workspace is Colab, with PyTorch 1.10.0+cu111.

The detailed error message (with torch.autograd.set_detect_anomaly(True)):

Parameters:	Namespace(batch_size=128, dataset='Sketchy', dim_out=64, early_stop=10, epoch_size=100, epochs=1, filter_sketch=False, gamma=0.1, gzs_sbir=False, im_sz=224, lambda_disc_im=0.5, lambda_disc_se=0.25, lambda_disc_sk=0.5, lambda_gen_adv=1.0, lambda_gen_cls=1.0, lambda_gen_cyc=1.0, lambda_gen_reg=0.1, lambda_im=10.0, lambda_regular=0.001, lambda_se=10.0, lambda_sk=10.0, log_interval=1, lr=0.0001, milestones=[], momentum=0.9, ngpu=1, num_workers=4, number_qualit_results=200, save_best_results=False, save_image_results=False, semantic_models=['word2vec-google-news'], sk_sz=224, split_eccv_2018=False, test=False)
Checkpoint path: /content/drive/MyDrive/sem-pcyc/auxs/CheckPoints/Sketchy/word2vec-google-news/64
Logger path: /content/drive/MyDrive/sem-pcyc/auxs/LogFiles/Sketchy/word2vec-google-news/64
Result path: /content/drive/MyDrive/sem-pcyc/auxs/Results/Sketchy/word2vec-google-news/64
Loading data...Done
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
Initializing model variables...Done
Initializing trainable models...Done
Defining optimizers...Done
Defining losses...Done
Initializing variables...Done
Setting logger...Done
Checking cuda...*Cuda exists*...Done
***Train***
/usr/local/lib/python3.7/dist-packages/torch/optim/lr_scheduler.py:134: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
[W python_anomaly_mode.cpp:104] Warning: Error detected in AddmmBackward0. Traceback of forward call that caused the error:
  File "src/train.py", line 358, in <module>
    main()
  File "src/train.py", line 230, in main
    losses = train(train_loader, sem_pcyc_model, epoch, args)
  File "src/train.py", line 323, in train
    loss = sem_pcyc_model.optimize_params(sk, im, cl)
  File "/content/drive/My Drive/sem-pcyc/src/models.py", line 368, in optimize_params
    self.forward(sk, im, se)
  File "/content/drive/My Drive/sem-pcyc/src/models.py", line 259, in forward
    self.sk2se_em = self.gen_sk2se(self.sk_fe)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/My Drive/sem-pcyc/src/models.py", line 64, in forward
    return self.gen(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
 (function _print_stack)
Traceback (most recent call last):
  File "src/train.py", line 358, in <module>
    main()
  File "src/train.py", line 230, in main
    losses = train(train_loader, sem_pcyc_model, epoch, args)
  File "src/train.py", line 323, in train
    loss = sem_pcyc_model.optimize_params(sk, im, cl)
  File "/content/drive/My Drive/sem-pcyc/src/models.py", line 371, in optimize_params
    loss = self.backward(se, num_cls)
  File "/content/drive/My Drive/sem-pcyc/src/models.py", line 325, in backward
    loss_disc_se.backward(retain_graph=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [288, 64]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
@jabader97

If anyone is still struggling with this issue, the cause seems to be the optimizer: optimizer.step() uses in-place operations, so when it is called for one loss before .backward() is called for another loss that shares some models, it causes the error (see pytorch/pytorch#39141). One fix is to make all the step() calls happen only after .backward() has been called for each loss. You should then also move all the zero_grad() calls to the beginning, so that the gradients of the re-used models are not zeroed in the middle. That being said, setting inplace=True in ReLU and LeakyReLU actually seems to be fine.
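
For anyone who wants to see the ordering described above in code, here is a minimal sketch. It is not the repository's actual `optimize_params`/`backward` code: `gen`, `disc`, the two optimizers, the layer sizes and the losses are hypothetical stand-ins, chosen only so that two losses share the same models. Note that with this ordering the gradients of both losses accumulate in every shared parameter before any step is taken, which may not match the original training objective exactly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for two models that are shared by two losses.
gen = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
disc = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
opt_gen = torch.optim.SGD(gen.parameters(), lr=0.1)
opt_disc = torch.optim.SGD(disc.parameters(), lr=0.1)
bce = nn.BCEWithLogitsLoss()


def train_step(x):
    # 1. Zero all gradients at the beginning, not between backward calls.
    opt_gen.zero_grad()
    opt_disc.zero_grad()

    # Both losses are built on the same graph (gen -> disc).
    logits = disc(gen(x))
    loss_gen = bce(logits, torch.ones_like(logits))
    loss_disc = bce(logits, torch.zeros_like(logits))

    # 2. Run backward for every loss while the weights are still unmodified.
    loss_gen.backward(retain_graph=True)
    loss_disc.backward()

    # 3. Only now let the optimizers update the weights in place.
    opt_gen.step()
    opt_disc.step()
    return loss_gen.item(), loss_disc.item()
```

Calling `opt_gen.step()` between the two `backward()` calls instead would modify weights that the second `backward()` still needs, which is the situation the error message complains about.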

pSGAme commented Jun 9, 2024

Hi, sir. Could you please share the link to download the Sketchy Extended dataset and TU-Berlin Image dataset?
