After update from 0.5.x to 0.7.3 merge_dicts #1278 sometimes breaks training #1507

Closed
fellnerse opened this issue Apr 16, 2020 · 9 comments · Fixed by #1582
Labels
bug (Something isn't working) · help wanted (Open to be worked on)

Comments

@fellnerse

fellnerse commented Apr 16, 2020

🐛 Bug

After updating from a fairly old Lightning version to the newest one, I sometimes get a TypeError from merge_dicts. I guess it's related to PR #1278. The TypeError is deterministic, meaning it always occurs at the same global step during training. It also seems to be related to val_check_interval: for some datasets, changing this value makes the error disappear, but for other datasets it does not. It only happens during a training step, and I suspect it is the first training step after validation.

To Reproduce

Steps to reproduce the behavior:

I have no idea.

File "/home/sebastian/.cache/pypoetry/virtualenvs/forgerydetection-iC5ox0X1-py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 363, in train
    self.run_training_epoch()
  File "/home/sebastian/.cache/pypoetry/virtualenvs/forgerydetection-iC5ox0X1-py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 470, in run_training_epoch
    self.log_metrics(batch_step_metrics, grad_norm_dic)
  File "/home/sebastian/.cache/pypoetry/virtualenvs/forgerydetection-iC5ox0X1-py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/logging.py", line 74, in log_metrics
    self.logger.agg_and_log_metrics(scalar_metrics, step=step)
  File "/home/sebastian/.cache/pypoetry/virtualenvs/forgerydetection-iC5ox0X1-py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 128, in agg_and_log_metrics
    agg_step, metrics_to_log = self._aggregate_metrics(metrics=metrics, step=step)
  File "/home/sebastian/.cache/pypoetry/virtualenvs/forgerydetection-iC5ox0X1-py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 101, in _aggregate_metrics
    agg_step, agg_mets = self._finalize_agg_metrics()
  File "/home/sebastian/.cache/pypoetry/virtualenvs/forgerydetection-iC5ox0X1-py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 116, in _finalize_agg_metrics
    agg_mets = merge_dicts(self._metrics_to_agg, self._agg_key_funcs, self._agg_default_func)
  File "/home/sebastian/.cache/pypoetry/virtualenvs/forgerydetection-iC5ox0X1-py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 347, in merge_dicts
    agg_val = fn([v for v in [d_in.get(k) for d_in in dicts] if v is not None])
  File "/home/sebastian/.cache/pypoetry/virtualenvs/forgerydetection-iC5ox0X1-py3.7/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 3118, in mean
    out=out, **kwargs)
  File "/home/sebastian/.cache/pypoetry/virtualenvs/forgerydetection-iC5ox0X1-py3.7/lib/python3.7/site-packages/numpy/core/_methods.py", line 75, in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims)
TypeError: unsupported operand type(s) for +: 'dict' and 'dict'

Sometimes it's also 'dict' and 'int'.

Expected behavior

At the very least it should not break training, but a more verbose message about what is wrong would also help. It's quite hard for me to debug, since the structure of the logs I'm returning to Lightning does not change.

Environment

cuda:
        GPU:
                GeForce RTX 2080 Ti
                GeForce RTX 2080 Ti
                GeForce RTX 2080 Ti
                GeForce RTX 2080 Ti
                GeForce RTX 2080 Ti
                GeForce RTX 2080 Ti
                GeForce RTX 2080 Ti
                GeForce RTX 2080 Ti
        available:           True
        version:             10.1.243
packages:
        numpy:               1.16.4
        pyTorch_debug:       False
        pyTorch_version:     1.3.0
        pytorch-lightning:   0.7.3
        tensorboard:         2.2.0
        tqdm:                4.45.0
system:
        OS:                  Linux
        architecture:
                64bit
                ELF
        processor:           x86_64
        python:              3.7.7
        version:             #97~16.04.1-Ubuntu SMP Wed Apr 1 03:03:31 UTC 2020

Additional context

Also for some reason some runs have an issue with multiprocessing, but it does not break the training:

Traceback (most recent call last):
  File "/home/sebastian/.pyenv/versions/3.7.7/lib/python3.7/multiprocessing/util.py", line 277, in _run_finalizers
    finalizer()
  File "/home/sebastian/.pyenv/versions/3.7.7/lib/python3.7/multiprocessing/util.py", line 201, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/sebastian/.pyenv/versions/3.7.7/lib/python3.7/multiprocessing/util.py", line 110, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/sebastian/.pyenv/versions/3.7.7/lib/python3.7/shutil.py", line 498, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/home/sebastian/.pyenv/versions/3.7.7/lib/python3.7/shutil.py", line 496, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/tmp/pymp-jcqai2xr'
@fellnerse fellnerse added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Apr 16, 2020
@Borda Borda added this to the 0.7.4 milestone Apr 16, 2020
@olineumann
Contributor

Did you pass any agg_key_funcs to the logger class? If I understand the code correctly, np.mean is used by default to aggregate the dict values returned during training. Maybe numpy, inside the mean function, tries to add (+) values that can't be summed?

Could you post the code snippets where you return the metrics to log in the LightningModule, and the initialization of the logger if you use one? If you don't use a logger, you can disable it by passing logger=False to the trainer (I don't know whether your previous version had the logger on by default).

Hope I can help :)
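
To illustrate what I mean, here is a minimal sketch of per-key aggregation with np.mean as the default (a simplification of merge_dicts from pytorch_lightning/loggers/base.py; the helper name aggregate_step is just for illustration):

    import numpy as np

    def aggregate_step(dicts, default_func=np.mean):
        # Rough idea of what the logger does when several metric dicts are
        # collected for the same step: gather each key's values and reduce them.
        keys = {k for d in dicts for k in d}
        return {
            k: default_func([d[k] for d in dicts if k in d])
            for k in keys
        }

    aggregate_step([{"loss": 1.0}, {"loss": 3.0}])
    # {'loss': 2.0}

    aggregate_step([{"loss": {"train": 1.0}}, {"loss": {"train": 3.0}}])
    # TypeError: unsupported operand type(s) for +: 'dict' and 'dict'

Flat float metrics average fine, but nested dicts make np.mean try to add dicts, which matches the error in your traceback.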

@fellnerse
Author

fellnerse commented Apr 16, 2020

Thanks for the quick reply!
No, I'm not using any agg_key_funcs that I know of.

If I understand the code correctly, by default np.mean is used to aggregate the dict values returned during training.

This only happens when there is a step for which metrics are logged twice, right? So my guess is that at some point two logs have to be merged, and this fails because I'm using dicts inside dicts. I need this though, because I want to have e.g. the train and val loss in the same graph.

I'm using the TestTubeLogger:
logger = TestTubeLogger(save_dir=log_dir, name=name, description=description)
and just pass this to the Trainer.

The metric logging to lightning is a bit scattered:

  1. training_step in the model:
       x, target = batch
       pred = self.forward(x)
       loss = self.loss(pred, target)
       lightning_log = {"loss": loss}

       with torch.no_grad():
           train_acc = self.calculate_accuracy(pred, target)
           tensorboard_log = {"loss": loss, "acc": train_acc}

       return tensorboard_log, lightning_log
  2. This is passed to a function that lets me add train and val to the same graph:
    def _construct_lightning_log(
        self,
        tensorboard_log: dict,
        lightning_log: dict = None,
        suffix: str = "train",
        prefix: str = "metrics",
    ):
        lightning_log = lightning_log or {}
        fixed_log = {}

        for metric, value in tensorboard_log.items():
            # keep dict values as-is; wrap scalars in {suffix: value} so e.g.
            # train and val show up as separate lines in the same graph
            if isinstance(value, dict):
                fixed_log[f"{prefix}/{metric}"] = value
            else:
                fixed_log[f"{prefix}/{metric}"] = {suffix: value}
        return {"log": fixed_log, **lightning_log}

@olineumann
Contributor

Do you return it from training_step or training_epoch_end? I think Lightning collects your logs and tries to aggregate them into one value. I can't test it right now, maybe tomorrow.

But when I quickly type this into the Python interpreter:

>>> import numpy as np
>>> d={}
>>> np.mean([d,d])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 5, in mean
  File "/usr/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 3334, in mean
    return _methods._mean(a, axis=axis, dtype=dtype,
  File "/usr/lib/python3.8/site-packages/numpy/core/_methods.py", line 151, in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims)
TypeError: unsupported operand type(s) for +: 'dict' and 'dict'

Seems like I'm getting your error.

Maybe print exactly what you return and note when it crashes. When I have time tomorrow, I will also run some tests.

@fellnerse
Author

After training_step. I don't have a training_epoch_end or training_end method defined.

I think lightning collects your logs and tries to aggregate it to one value.

Yes I think so as well.

Ok I return something like this:

{'metrics/aud_std': {'test': tensor(1.6337, device='cuda:0')}, 'metrics/class_loss_diff': {'test': tensor(nan)}, 'metrics/class_loss_val': {'0': tensor(nan), '1': tensor(91.5485)}, 'metrics/loss': {'test': tensor(45.7742, device='cuda:0')}, 'metrics/vid_std': {'test': tensor(1.6506, device='cuda:0')}}

What do you mean by when it crashes exactly? I think when it crashes it's always the first training step after a validation step (keep in mind I'm validating several times during one epoch). If I change the val_check_interval, the error either disappears or happens at a different batch number.

@alexeykarnachev
Contributor

alexeykarnachev commented Apr 17, 2020

Hello.
I think the problem is in your metrics type. Metrics must have the type Dict[str, float], but in your case the metrics form a nested dict, which is why the values fail to aggregate.

Is it possible for you to flatten the dictionary?
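
Something like this should be enough for one level of nesting (the helper name flatten_metrics is just for illustration):

    def flatten_metrics(metrics: dict, sep: str = "/") -> dict:
        # Turn {'metrics/loss': {'train': 1.0}} into {'metrics/loss/train': 1.0}
        flat = {}
        for key, value in metrics.items():
            if isinstance(value, dict):
                for sub_key, sub_value in value.items():
                    flat[f"{key}{sep}{sub_key}"] = sub_value
            else:
                flat[key] = value
        return flat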

@fellnerse
Author

@alexeykarnachev Hey! Ah yes, that's what I thought. Do you know why the metrics dict is required to have this type? In 0.5.x this was not an issue, as far as I know.

I mean, yes, I can flatten it, but I want to have e.g. val/loss and train/loss in the same graph. It's basically this: https://pytorch.org/docs/stable/tensorboard.html#torch.utils.tensorboard.writer.SummaryWriter.add_scalars

I know it was said in #1144 (comment) that this should not be done, but for me this is essential.

Is there a way that I can override the merge_dicts function? If so, how would I do that?

@alexeykarnachev
Contributor

alexeykarnachev commented Apr 20, 2020

@fellnerse Okay, I got your point, let's ask Borda for advice :)
@Borda, what do you think? Is it possible to combine nested metrics dictionaries with the metrics aggregation logic? At first sight it doesn't look like a big problem. Do you see any side effects of tracking aggregated metrics with nested dictionaries? If not, I can try to fix this issue.

@Borda
Member

Borda commented Apr 20, 2020

I guess it can be used, we just need to take care of the depth, and the aggregation will be a bit more complicated...
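
For illustration, a rough sketch of a depth-aware default aggregation function; whether it can simply be registered through the update_agg_funcs hook from #1278, and whether the downstream loggers then accept nested aggregated values, is an assumption that would still need checking:

    import numpy as np

    def nested_mean(values):
        # Recursively average metric entries that may themselves be dicts,
        # e.g. [{'train': 1.0}, {'train': 3.0}] -> {'train': 2.0}
        if values and all(isinstance(v, dict) for v in values):
            keys = {k for v in values for k in v}
            return {k: nested_mean([v[k] for v in values if k in v]) for k in keys}
        return np.mean(values)

    # hypothetical wiring, assuming the logger exposes update_agg_funcs:
    # logger.update_agg_funcs(agg_default_func=nested_mean)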

@fellnerse
Author

Cool, thanks for implementing this so fast!

@Borda Borda modified the milestones: 0.7.4, v0.7.x Apr 18, 2021