0.8.2 calls backward on '_GeneratorContextManager' #2411

Closed
s-rog opened this issue Jun 29, 2020 · 15 comments · Fixed by #2433
Labels
bug (Something isn't working), help wanted (Open to be worked on)

Comments

@s-rog
Contributor

s-rog commented Jun 29, 2020

🐛 Bug

0.8.2 calls backward on a '_GeneratorContextManager' and crashes training; 0.8.1 works correctly.
My training_step returns {'loss': loss, 'log': {'learn_rate': self.lr}}
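
For reference, a minimal sketch of the kind of training_step described above (a hypothetical reconstruction; `self.criterion` and `self.lr` are placeholder names, not the reporter's actual code):

```python
# Hypothetical minimal training_step matching the description above
# (PyTorch Lightning 0.8.x dict-return API).
def training_step(self, batch, batch_idx):
    x, y = batch
    loss = self.criterion(self(x), y)
    return {'loss': loss, 'log': {'learn_rate': self.lr}}
```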

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 538, in ddp_train
    self.run_pretrain_routine(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1100, in run_pretrain_routine
    self.train()
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 370, in train
    self.run_training_epoch()
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 452, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 630, in run_training_batch
    self.hiddens
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 804, in optimizer_closure
    model_ref.backward(self, closure_loss, optimizer, opt_idx)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/core/hooks.py", line 189, in backward
    loss.backward()
AttributeError: '_GeneratorContextManager' object has no attribute 'backward'

Expected behavior

backward is called on the loss and training runs correctly

s-rog added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Jun 29, 2020
@williamFalcon
Contributor

Did you override optimizer_step?
Could you try master? We just pushed a fix for a typo we had.

@Anjum48

Anjum48 commented Jun 29, 2020

Can confirm this happens on 0.8.3

@williamFalcon
Contributor

ok. Can you post a colab example that replicates this?

@williamFalcon
Contributor

@Anjum48 @s-rog
colab please

@s-rog
Contributor Author

s-rog commented Jun 30, 2020

@williamFalcon My optimizer_step was untouched. I can't run more testing atm, but I'll get to it as soon as I can.

@aeryen

aeryen commented Jun 30, 2020

@williamFalcon Hi, I also encountered this with a normal Adam optimizer. I don't have a Colab to replicate it atm, but from what I saw earlier it can be reproduced with any setup as long as the Trainer is set to precision=16 and Apex is used. Under this condition, the following lines from training_loop.py and hooks.py run:

if self.precision == 16 and not self.on_tpu:
    closure_loss = model_ref.amp_scale_loss(closure_loss, optimizer, opt_idx)

scaled_loss = amp.scale_loss(unscaled_loss, optimizer)

This causes closure_loss to become a _GeneratorContextManager object, which has no backward() method.

It seems that under the current design, PyTorch Lightning's scale_loss function can only be used as a context manager?
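
For context, Apex's amp.scale_loss is meant to be used as a context manager, not assigned directly. A minimal sketch of the failing pattern versus the intended usage (assumes apex is installed; `loss` and `optimizer` come from the surrounding training loop):

```python
from apex import amp

# Failing pattern (what the Apex path in 0.8.2/0.8.3 effectively did):
# scaled = amp.scale_loss(loss, optimizer)  # returns a _GeneratorContextManager
# scaled.backward()                         # AttributeError: no 'backward'

# Intended Apex usage: enter the context, backprop on the scaled loss.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
```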

@Anjum48

Anjum48 commented Jun 30, 2020

@williamFalcon Here's a colab example (my first time using colab so let me know if you have issues seeing it) https://colab.research.google.com/drive/1G08jVDpx-T-5HE2c89RLJdq4u67mM2-o?usp=sharing

I suspect the issue lies with Apex AMP as suggested above by @aeryen
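
For what it's worth, a minimal sketch of the Trainer settings that send you down the Apex path (exact model/data omitted; `model` stands in for any LightningModule returning a 'loss' key):

```python
import pytorch_lightning as pl

# precision=16 falls back to Apex AMP when native torch.cuda.amp is
# unavailable, which is the code path that crashes in 0.8.2/0.8.3.
trainer = pl.Trainer(gpus=1, precision=16, max_epochs=1)
trainer.fit(model)
```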

@williamFalcon
Contributor

ummm. I think this is an apex issue. I can't replicate it with 16-bit native.


@Borda
Member

Borda commented Jun 30, 2020

@aeryen mind sharing a minimal example to reproduce?

@aeryen

aeryen commented Jun 30, 2020

Hi, sorry for the delay: https://colab.research.google.com/drive/1rjaRRwgBTm4CKPfe9po_WSxnKqY4jDRv?usp=sharing
I agree this is an apex issue, i.e. it only occurs when NATIVE_AMP_AVALAIBLE is False in hooks.py.
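
A quick way to check which path you would hit (assumption: PL 0.8.x falls back to Apex whenever torch.cuda.amp is missing, i.e. torch < 1.6):

```python
import torch

# If this prints False, precision=16 goes through the Apex branch in hooks.py,
# which is where the _GeneratorContextManager error comes from.
native_amp_available = hasattr(torch.cuda, "amp") and hasattr(torch.cuda.amp, "autocast")
print("native amp available:", native_amp_available)
```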

@williamFalcon
Contributor

@aeryen, @Anjum48, @s-rog this is fixed on master. Give it a try?

@aeryen

aeryen commented Jun 30, 2020

@williamFalcon yes, the master version works for me now. Thanks!

@s-rog
Contributor Author

s-rog commented Jul 1, 2020

@williamFalcon Can confirm as well! And sorry I couldn't be more helpful earlier.

@Anjum48

Anjum48 commented Jul 1, 2020

Hi @williamFalcon thanks for the quick fix. I just upgraded but am now seeing a different error:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0,1]
Using APEX 16bit precision.
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Loaded pretrained weights for efficientnet-b0
/home/anjum/PycharmProjects/kaggle/siim_isic_melanoma_classification/train.py:140: DtypeWarning: Columns (5) have mixed types.Specify dtype option on import or set low_memory=False.
  train_single_fold(args)
Using APEX 16bit precision.
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=ddp
All DDP processes registered. Starting ddp with 2 processes
----------------------------------------------------------------------------------------------------
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic

  | Name      | Type             | Params
-----------------------------------------------
0 | critereon | CrossEntropyLoss | 0     
1 | net       | EfficientNet     | 4 M   
Validation sanity check:  50%|███████████████████████▌                       | 1/2 [00:00<00:00,  1.01it/s]Traceback (most recent call last):
  File "/home/anjum/PycharmProjects/kaggle/siim_isic_melanoma_classification/train.py", line 140, in <module>
    train_single_fold(args)
  File "/home/anjum/PycharmProjects/kaggle/siim_isic_melanoma_classification/train.py", line 64, in train_single_fold
    trainer.fit(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 957, in fit
    self.ddp_train(task, model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 538, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1141, in run_pretrain_routine
    eval_results = self._evaluate(model,
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 346, in _evaluate
    self.reduce_eval_ddp(eval_results)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 363, in reduce_eval_ddp
    self.reduce_eval_ddp(v)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 365, in reduce_eval_ddp
    dist.all_reduce(v, op=dist.reduce_op.SUM)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 898, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense
Traceback (most recent call last):
  File "train.py", line 140, in <module>
    train_single_fold(args)
  File "train.py", line 64, in train_single_fold
    trainer.fit(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 973, in fit
    self.spawn_ddp_children(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 449, in spawn_ddp_children
    self.ddp_train(local_rank, model, is_master=True)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 538, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1141, in run_pretrain_routine
    eval_results = self._evaluate(model,
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 346, in _evaluate
    self.reduce_eval_ddp(eval_results)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 363, in reduce_eval_ddp
    self.reduce_eval_ddp(v)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 365, in reduce_eval_ddp
    dist.all_reduce(v, op=dist.reduce_op.SUM)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 898, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense

I'm not manually assigning tensors to a device (i.e. PL should be assigning all tensors as CUDA tensors) and I am not using sparse tensors (at least not that I am aware of).

EDIT: I found the issue. I guess metrics need to be CUDA tensors now. Thanks again :)
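
For anyone hitting the same thing, a hedged sketch of the kind of fix meant by the EDIT above: any metric that gets reduced across DDP processes should already be a CUDA tensor on the right device (names such as `self.some_accuracy` are made up for illustration):

```python
import torch

def validation_epoch_end(self, outputs):
    # Per-batch losses are already CUDA tensors from validation_step.
    avg_loss = torch.stack([o['val_loss'] for o in outputs]).mean()
    # A plain Python float wrapped on CPU would trip all_reduce
    # ("Tensors must be CUDA and dense"); keep it on the same device.
    val_acc = torch.tensor(self.some_accuracy, device=avg_loss.device)
    return {'val_loss': avg_loss, 'log': {'val_acc': val_acc}}
```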

@Borda
Member

Borda commented Jul 1, 2020

@Anjum48 mind opening a new issue?
