0.8.2 calls backward on '_GeneratorContextManager' #2411

Closed
s-rog opened this issue Jun 29, 2020 · 15 comments · Fixed by #2433
Labels
bug (Something isn't working), help wanted (Open to be worked on)

Comments

@s-rog
Contributor

s-rog commented Jun 29, 2020

🐛 Bug

0.8.2 calls backward on a '_GeneratorContextManager' and crashes training; 0.8.1 works correctly.
My training_step returns {'loss': loss, 'log': {'learn_rate': self.lr}}
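
For reference, a minimal sketch of the kind of training_step described above (a hypothetical reconstruction; `self.criterion` and `self.lr` are placeholder names, not the reporter's actual code):

```python
# Hypothetical minimal training_step matching the description above
# (PyTorch Lightning 0.8.x dict-return API).
def training_step(self, batch, batch_idx):
    x, y = batch
    loss = self.criterion(self(x), y)
    return {'loss': loss, 'log': {'learn_rate': self.lr}}
```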

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 538, in ddp_train
    self.run_pretrain_routine(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1100, in run_pretrain_routine
    self.train()
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 370, in train
    self.run_training_epoch()
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 452, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 630, in run_training_batch
    self.hiddens
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 804, in optimizer_closure
    model_ref.backward(self, closure_loss, optimizer, opt_idx)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/core/hooks.py", line 189, in backward
    loss.backward()
AttributeError: '_GeneratorContextManager' object has no attribute 'backward'

Expected behavior

backward is called on the loss and training runs correctly

s-rog added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Jun 29, 2020
@williamFalcon
Contributor

Did you override optimizer_step?
Could you try master? We just pushed a fix for a typo we had.

@Anjum48

Anjum48 commented Jun 29, 2020

Can confirm this happens on 0.8.3

@williamFalcon
Contributor

ok. Can you post a colab example that replicates this?

@williamFalcon
Contributor

@Anjum48 @s-rog
colab please

@s-rog
Contributor Author

s-rog commented Jun 30, 2020

@williamFalcon My optimizer_step was untouched. I can't run more testing atm, but I'll get to it as soon as I can.

@aeryen

aeryen commented Jun 30, 2020

@williamFalcon Hi, I also encountered this with a normal Adam optimizer. I don't have a Colab to replicate it atm, but from what I saw earlier it can be reproduced with any setup as long as the Trainer is set to precision=16 and Apex is used. Under this condition, the following lines from training_loop.py and hooks.py run:

if self.precision == 16 and not self.on_tpu:
    closure_loss = model_ref.amp_scale_loss(closure_loss, optimizer, opt_idx)

scaled_loss = amp.scale_loss(unscaled_loss, optimizer)

This causes closure_loss to become a _GeneratorContextManager object, which has no backward() method.

It seems that under the current design, PyTorch Lightning's scale_loss function can only be used as a context manager?
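
For context, Apex's amp.scale_loss is meant to be used as a context manager, not assigned directly. A minimal sketch of the failing pattern versus the intended usage (assumes apex is installed; `loss` and `optimizer` come from the surrounding training loop):

```python
from apex import amp

# Failing pattern (what the Apex path in 0.8.2/0.8.3 effectively did):
# scaled = amp.scale_loss(loss, optimizer)  # returns a _GeneratorContextManager
# scaled.backward()                         # AttributeError: no 'backward'

# Intended Apex usage: enter the context, backprop on the scaled loss.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
```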

@Anjum48

Anjum48 commented Jun 30, 2020

@williamFalcon Here's a colab example (my first time using colab so let me know if you have issues seeing it) https://colab.research.google.com/drive/1G08jVDpx-T-5HE2c89RLJdq4u67mM2-o?usp=sharing

I suspect the issue lies with Apex AMP as suggested above by @aeryen
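
For what it's worth, a minimal sketch of the Trainer settings that send you down the Apex path (exact model/data omitted; `model` stands in for any LightningModule returning a 'loss' key):

```python
import pytorch_lightning as pl

# precision=16 falls back to Apex AMP when native torch.cuda.amp is
# unavailable, which is the code path that crashes in 0.8.2/0.8.3.
trainer = pl.Trainer(gpus=1, precision=16, max_epochs=1)
trainer.fit(model)
```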

@williamFalcon
Contributor

ummm. I think this is an apex issue. I can't replicate it with 16-bit native.


@Borda
Member

Borda commented Jun 30, 2020

@aeryen mind sharing a minimal example to reproduce?

@aeryen

aeryen commented Jun 30, 2020

Hi, sorry for the delay: https://colab.research.google.com/drive/1rjaRRwgBTm4CKPfe9po_WSxnKqY4jDRv?usp=sharing
I agree this is an apex issue, i.e. it only occurs when NATIVE_AMP_AVALAIBLE is False in hooks.py.
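
A quick way to check which path you would hit (assumption: PL 0.8.x falls back to Apex whenever torch.cuda.amp is missing, i.e. torch < 1.6):

```python
import torch

# If this prints False, precision=16 goes through the Apex branch in hooks.py,
# which is where the _GeneratorContextManager error comes from.
native_amp_available = hasattr(torch.cuda, "amp") and hasattr(torch.cuda.amp, "autocast")
print("native amp available:", native_amp_available)
```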

@williamFalcon
Contributor

@aeryen, @Anjum48, @s-rog this is fixed on master. Give it a try?

@aeryen

aeryen commented Jun 30, 2020

@williamFalcon yes, the master version works for me now. Thanks!

@s-rog
Contributor Author

s-rog commented Jul 1, 2020

@williamFalcon Can confirm as well! And sorry I couldn't be more helpful earlier.

@Anjum48

Anjum48 commented Jul 1, 2020

Hi @williamFalcon thanks for the quick fix. I just upgraded but am now seeing a different error:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0,1]
Using APEX 16bit precision.
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Loaded pretrained weights for efficientnet-b0
/home/anjum/PycharmProjects/kaggle/siim_isic_melanoma_classification/train.py:140: DtypeWarning: Columns (5) have mixed types.Specify dtype option on import or set low_memory=False.
  train_single_fold(args)
Using APEX 16bit precision.
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=ddp
All DDP processes registered. Starting ddp with 2 processes
----------------------------------------------------------------------------------------------------
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic

  | Name      | Type             | Params
-----------------------------------------------
0 | critereon | CrossEntropyLoss | 0     
1 | net       | EfficientNet     | 4 M   
Validation sanity check:  50%|███████████████████████▌                       | 1/2 [00:00<00:00,  1.01it/s]Traceback (most recent call last):
  File "/home/anjum/PycharmProjects/kaggle/siim_isic_melanoma_classification/train.py", line 140, in <module>
    train_single_fold(args)
  File "/home/anjum/PycharmProjects/kaggle/siim_isic_melanoma_classification/train.py", line 64, in train_single_fold
    trainer.fit(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 957, in fit
    self.ddp_train(task, model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 538, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1141, in run_pretrain_routine
    eval_results = self._evaluate(model,
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 346, in _evaluate
    self.reduce_eval_ddp(eval_results)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 363, in reduce_eval_ddp
    self.reduce_eval_ddp(v)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 365, in reduce_eval_ddp
    dist.all_reduce(v, op=dist.reduce_op.SUM)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 898, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense
Traceback (most recent call last):
  File "train.py", line 140, in <module>
    train_single_fold(args)
  File "train.py", line 64, in train_single_fold
    trainer.fit(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 973, in fit
    self.spawn_ddp_children(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 449, in spawn_ddp_children
    self.ddp_train(local_rank, model, is_master=True)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 538, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1141, in run_pretrain_routine
    eval_results = self._evaluate(model,
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 346, in _evaluate
    self.reduce_eval_ddp(eval_results)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 363, in reduce_eval_ddp
    self.reduce_eval_ddp(v)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 365, in reduce_eval_ddp
    dist.all_reduce(v, op=dist.reduce_op.SUM)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 898, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense

I'm not manually assigning tensors to a device (i.e. PL should be assigning all tensors as CUDA tensors) and I am not using sparse tensors (at least not that I am aware of).

EDIT: I found the issue. I guess metrics need to be CUDA tensors now. Thanks again :)
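
For anyone hitting the same thing, a hedged sketch of the kind of fix meant by the EDIT above: any metric that gets reduced across DDP processes should already be a CUDA tensor on the right device (names such as `self.some_accuracy` are made up for illustration):

```python
import torch

def validation_epoch_end(self, outputs):
    # Per-batch losses are already CUDA tensors from validation_step.
    avg_loss = torch.stack([o['val_loss'] for o in outputs]).mean()
    # A plain Python float wrapped on CPU would trip all_reduce
    # ("Tensors must be CUDA and dense"); keep it on the same device.
    val_acc = torch.tensor(self.some_accuracy, device=avg_loss.device)
    return {'val_loss': avg_loss, 'log': {'val_acc': val_acc}}
```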

@Borda
Member

Borda commented Jul 1, 2020

@Anjum48 mind opening a new issue?
