
RuntimeError: OrderedDict mutated during iteration #2281

Closed
Kshitij09 opened this issue Jun 19, 2020 · 4 comments · Fixed by #2298
Labels: bug (Something isn't working), help wanted (Open to be worked on)

@Kshitij09 (Contributor)

🐛 Bug

I was getting RuntimeError: OrderedDict mutated during iteration.
It seems that passing the same LightningModule object to both ModelSummary and Trainer causes this error.

To Reproduce

from pytorch_lightning import Trainer
from pytorch_lightning.core.memory import ModelSummary

model = CifarNet()  # any pl module would work here
ModelSummary(model, mode='full')  # registers forward hooks on the model's layers

trainer = Trainer(fast_dev_run=True, gpus=1)
trainer.fit(model)  # fails during the first training forward pass

Steps to reproduce the behavior:

  1. View the model summary using the ModelSummary class.
  2. Call trainer.fit with the same model object.

Stacktrace

RuntimeError                              Traceback (most recent call last)
<ipython-input-20-8badc092c0ba> in <module>()
      1 # Checking for errors
      2 trainer = Trainer(fast_dev_run=True,gpus=1)
----> 3 trainer.fit(model)
11 frames
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders)
    916 
    917         elif self.single_gpu:
--> 918             self.single_gpu_train(model)
    919 
    920         elif self.use_tpu:  # pragma: no-cover
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py in single_gpu_train(self, model)
    174             self.reinit_scheduler_properties(self.optimizers, self.lr_schedulers)
    175 
--> 176         self.run_pretrain_routine(model)
    177 
    178     def tpu_train(self, tpu_core_idx, model):
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py in run_pretrain_routine(self, model)
   1091 
   1092         # CORE TRAINING LOOP
-> 1093         self.train()
   1094 
   1095     def test(
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py in train(self)
    373                 # RUN TNG EPOCH
    374                 # -----------------
--> 375                 self.run_training_epoch()
    376 
    377                 if self.max_steps and self.max_steps == self.global_step:
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
    456             # RUN TRAIN STEP
    457             # ---------------
--> 458             _outputs = self.run_training_batch(batch, batch_idx)
    459             batch_result, grad_norm_dic, batch_step_metrics, batch_output = _outputs
    460 
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py in run_training_batch(self, batch, batch_idx)
    632 
    633                 # calculate loss
--> 634                 loss, batch_output = optimizer_closure()
    635 
    636                 # check if loss or model weights are nan
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py in optimizer_closure()
    596                                                                     opt_idx, self.hiddens)
    597                         else:
--> 598                             output_dict = self.training_forward(split_batch, batch_idx, opt_idx, self.hiddens)
    599 
    600                         # format and reduce outputs accordingly
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py in training_forward(self, batch, batch_idx, opt_idx, hiddens)
    771             batch = self.transfer_batch_to_gpu(batch, gpu_id)
    772             args[0] = batch
--> 773             output = self.model.training_step(*args)
    774 
    775         # TPU support
<ipython-input-11-2482ebcf9d12> in training_step(self, batch, batch_idx)
     55   def training_step(self,batch,batch_idx):
     56     x, y = batch
---> 57     y_hat = self(x)
     58 
     59     return {'loss': F.cross_entropy(y_hat, y)}
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)
<ipython-input-11-2482ebcf9d12> in forward(self, x)
     11 
     12   def forward(self,x):
---> 13     return self.model(x)
     14 
     15   def prepare_data(self):
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    549         else:
    550             result = self.forward(*input, **kwargs)
--> 551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)
    553             if hook_result is not None:
RuntimeError: OrderedDict mutated during iteration

Expected behavior

We should be able to use the same object with both classes.
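
For context: the failing frame is PyTorch's forward-hook dispatch loop, so the likely mechanism (consistent with the traceback, though I haven't confirmed the ModelSummary internals) is that ModelSummary leaves forward hooks attached, and a hook then removes itself during the training forward pass. A standalone sketch in plain PyTorch that reproduces the same error on torch 1.5, with no Lightning involved:

import torch
import torch.nn as nn

layer = nn.Linear(4, 2)

def self_removing_hook(module, inputs, output):
    # Removing a hook while nn.Module.__call__ is still iterating
    # module._forward_hooks.values() mutates the OrderedDict mid-iteration.
    # (PyTorch 1.5 iterates the dict directly; newer versions copy it first.)
    handle.remove()

handle = layer.register_forward_hook(self_removing_hook)
layer(torch.randn(1, 4))  # RuntimeError: OrderedDict mutated during iteration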

Environment

* CUDA:
	- GPU:
		- Tesla T4
	- available:         True
	- version:           10.1
* Packages:
	- numpy:             1.18.5
	- pyTorch_debug:     False
	- pyTorch_version:   1.5.0+cu101
	- pytorch-lightning: 0.8.1
	- tensorboard:       2.2.2
	- tqdm:              4.41.1
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- 
	- processor:         x86_64
	- python:            3.6.9
	- version:           #1 SMP Wed Feb 19 05:26:34 PST 2020
@Kshitij09 added the bug and help wanted labels on Jun 19, 2020
@github-actions (Contributor)

Hi! Thanks for your contribution, great first issue!

@awaelchli (Member)

model = CifarNet() # any pl module would work here

Could you paste the minimal code for CifarNet? I cannot reproduce it with PL examples, sorry.

@Kshitij09 (Contributor, Author)

Okay! I'm not sure which part pertains to this issue, so here is the link to my Colab notebook.
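
For reference, the parts of CifarNet visible in the traceback look roughly like this (a hypothetical reconstruction, not the actual notebook code; the timm backbone and optimizer are guesses, based only on the notebook installing timm):

import torch
import torch.nn.functional as F
import pytorch_lightning as pl
import timm  # assumed: the notebook installs timm, so the backbone likely comes from it

class CifarNet(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Hypothetical backbone; only self.model(x) is visible in the traceback.
        self.model = timm.create_model('resnet18', pretrained=False, num_classes=10)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        return {'loss': F.cross_entropy(y_hat, y)}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)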

@awaelchli (Member)

awaelchli commented Jun 20, 2020

Thanks, your notebook was very helpful. I fixed the bug in #2298.
You can verify that it works by installing from

!pip install --upgrade git+https://github.com/awaelchli/pytorch-lightning@bugfix/summary_hook_handles timm wandb

in the first cell of your notebook.
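
For anyone curious about the fix itself: judging by the branch name (summary_hook_handles), the idea is to keep the RemovableHandle objects returned by register_forward_hook and detach them after the summary's forward pass, instead of letting each hook remove itself mid-dispatch. A minimal sketch of that pattern, not the actual PR code:

import torch
import torch.nn as nn

def summarize_shapes(model: nn.Module, example_input: torch.Tensor):
    # Sketch only: record output shapes via hooks, then detach the hooks
    # outside the forward pass, so no hook dict is mutated during iteration.
    shapes, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            shapes[name] = tuple(output.shape)  # record only; never remove here
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(example_input)
    finally:
        for handle in handles:
            handle.remove()  # safe: the forward pass has already finished
    return shapes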
