
distributed training crashes with dp (list comprehension issue from torch?) #1861

Closed
Data-drone opened this issue May 17, 2020 · 13 comments · Fixed by #1935
Labels
bug Something isn't working help wanted Open to be worked on won't fix This will not be worked on

Comments

@Data-drone

🐛 Bug

I ran the distributed GPU template and got an error with data parallel (dp), specifically from scatter_gather in torch.nn.parallel.

To Reproduce

Steps to reproduce the behavior:

1. install the packages
2. git clone from master
3. run the basic GPU example job with a distributed backend

Validation sanity check: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 853, in fit
    self.dp_train(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 578, in dp_train
    self.run_pretrain_routine(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1001, in run_pretrain_routine
    False)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 277, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 424, in evaluation_forward
    output = model(*args)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 66, in forward
    return self.gather(outputs, self.output_device)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration

Code sample

run python3 gpu_template.py --gpus 2 --distributed_backend dp

Expected behavior

The distributed demo job should run without errors.

Environment

  • CUDA:
    - GPU:
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - available: True
    - version: 10.2
  • Packages:
    - numpy: 1.18.4
    - pyTorch_debug: False
    - pyTorch_version: 1.5.0
    - pytorch-lightning: 0.7.6
    - tensorboard: 2.2.1
    - tqdm: 4.46.0
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    -
    - processor: x86_64
    - python: 3.7.6
    - version: #201812030624 SMP Mon Dec 3 11:25:55 UTC 2018

Additional context

python3 gpu_template.py --gpus 2 --distributed_backend ddp works

@Data-drone Data-drone added bug Something isn't working help wanted Open to be worked on labels May 17, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@nsarang
Contributor

nsarang commented May 17, 2020

I was experiencing this problem the other day. It's somewhat related to PyTorch.

If you look at the function that gathers the outputs from the GPU devices: https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/scatter_gather.py#L47

def gather_map(outputs):
    out = outputs[0]
    if isinstance(out, torch.Tensor):
        return Gather.apply(target_device, dim, *outputs)
    if out is None:
        return None
    if isinstance(out, dict):
        if not all((len(out) == len(d) for d in outputs)):
            raise ValueError('All dicts must have the same number of keys')
        return type(out)(((k, gather_map([d[k] for d in outputs]))
                          for k in out))
    return type(out)(map(gather_map, zip(*outputs)))

You'll see that it only supports tensors or dictionaries that contain tensors.
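To illustrate, here is my own minimal CPU-only reduction (not code from the template or from Lightning) that reproduces the same error by feeding gather() a dict whose leaf values are plain floats; tensor values are left out so the snippet doesn't need a GPU:

from torch.nn.parallel.scatter_gather import gather

# Simulated per-device outputs where the "progress_bar" leaves are plain
# Python floats rather than tensors. gather_map() has no branch for numbers,
# so it falls through to zip(*outputs) and fails.
outputs = [
    {"progress_bar": {"avg_loss": 0.5}},
    {"progress_bar": {"avg_loss": 0.7}},
]

# Raises: TypeError: zip argument #1 must support iteration
gather(outputs, target_device=0)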

The problem for me was that my training_step function returned something like this:

results = {
    "loss": loss,
    "log": all_logs,
    "progress_bar": progress_logs,
}

progress_logs was a dictionary that contained plain numbers, since I wanted the progress bar to show a moving average instead of the exact values. So I came up with a hacky function like the one below to convert the numbers to tensors and move them to the appropriate device.

def _fix_dp_return_type(self, result, device):
    # Tensors: just move them to the target device.
    if isinstance(result, torch.Tensor):
        return result.to(device)
    # Dicts: fix each value recursively.
    if isinstance(result, dict):
        return {k: self._fix_dp_return_type(v, device) for k, v in result.items()}
    # Anything else must be a plain number: wrap it in a tensor.
    return torch.Tensor([result]).to(device)

I hope there's a better fix for this :)
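For reference, one place such a conversion can be hooked is at the end of training_step. A toy sketch (my own example module using the old-style 0.7.x "log"/"progress_bar" keys, not the gpu_template code):

import torch
import pytorch_lightning as pl


class LitExample(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def _fix_dp_return_type(self, result, device):
        # Move tensors to `device`; recurse into dicts; wrap bare numbers.
        if isinstance(result, torch.Tensor):
            return result.to(device)
        if isinstance(result, dict):
            return {k: self._fix_dp_return_type(v, device) for k, v in result.items()}
        return torch.Tensor([result]).to(device)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).mean()
        results = {
            "loss": loss,
            "log": {"train_loss": loss},
            "progress_bar": {"running_avg": 0.123},  # a bare float breaks dp gathering
        }
        # Convert everything before Lightning hands the dict to gather().
        return self._fix_dp_return_type(results, loss.device)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)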

@Data-drone
Author

Hmmm, I am just returning loss and log, so do I have to convert the loss to a tensor and move it to the device?

@Data-drone
Author

Data-drone commented May 21, 2020

Feels like this is something that should be covered by the parallel tools in Lightning...

though I guess ddp is the recommended backend.

@nsarang
Contributor

nsarang commented May 22, 2020

I agree. One way to fix it is to override the default gather function in pytorch_lightning.overrides.data_parallel.LightningDataParallel.
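Roughly something like this (just a sketch assuming the 0.7.x LightningDataParallel; it mirrors the type-fixing idea above and is not necessarily the actual patch that landed in #1935):

import torch
from pytorch_lightning.overrides.data_parallel import LightningDataParallel


class PatchedDataParallel(LightningDataParallel):
    def gather(self, outputs, output_device):
        # Convert bare Python numbers inside the per-device outputs to tensors
        # before delegating to the stock DataParallel.gather().
        def to_tensors(obj):
            if isinstance(obj, torch.Tensor):
                return obj
            if isinstance(obj, dict):
                return {k: to_tensors(v) for k, v in obj.items()}
            if isinstance(obj, (int, float)):
                # `output_device` may be a device index under dp.
                return torch.tensor([obj], device=output_device)
            return obj

        outputs = [to_tensors(o) for o in outputs]
        return super().gather(outputs, output_device)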

@williamFalcon
Contributor

@nsarang maybe submit a PR with this patch? @ananyahjha93

@nsarang
Contributor

nsarang commented May 22, 2020

@williamFalcon Alright. Are you referring to #1895?
I'm not sure how I can work on an existing PR :)

@ananyahjha93
Contributor

@nsarang you can override the gather function and create a separate PR for it.

@stale

stale bot commented Jul 22, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@Skylixia

[quotes @nsarang's earlier comment describing the _fix_dp_return_type workaround]

Where do you call _fix_dp_return_type to fix the issue?

@nsarang
Contributor

nsarang commented Jan 30, 2021

@Skylixia
Are you using an up-to-date version? This is supposed to be fixed now.

@Skylixia

Yes, I am on the latest version.
I don't have a stack trace, but seeing this issue I thought it might explain the validation problem I have.

I also get the output "Validation: 0it [00:00, ?it/s]" and used:

results = {
    "loss": loss,
    "log": all_logs,
    "progress_bar": progress_logs,
}

I don't know what it could be then :/

@nsarang
Contributor

nsarang commented Jan 30, 2021

@Skylixia
I see that you're doing old-style logging.
This might help: https://pytorch-lightning.readthedocs.io/en/latest/logging.html
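For example, with the current API you would log through self.log() instead of returning "log"/"progress_bar" dicts. A minimal sketch, not your module (compute_loss is a hypothetical helper):

def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical helper for illustration
    # prog_bar=True shows the value in the progress bar; no "log" or
    # "progress_bar" keys are returned anymore.
    self.log("train_loss", loss, prog_bar=True)
    return loss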
