
distributed training crashes with dp (list comprehension issue from torch?) #1861

Closed
Data-drone opened this issue May 17, 2020 · 13 comments · Fixed by #1935
Labels
bug Something isn't working help wanted Open to be worked on won't fix This will not be worked on

Comments

@Data-drone

🐛 Bug

I ran the distributed GPU template and got an error with data parallel (dp), specifically from scatter_gather in torch.nn.parallel.

To Reproduce

Steps to reproduce the behavior:

1. install the packages
2. git clone from master
3. run the basic GPU example job with a distributed backend

Validation sanity check: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 853, in fit
    self.dp_train(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 578, in dp_train
    self.run_pretrain_routine(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1001, in run_pretrain_routine
    False)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 277, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 424, in evaluation_forward
    output = model(*args)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 66, in forward
    return self.gather(outputs, self.output_device)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration

Code sample

run python3 gpu_template.py --gpus 2 --distributed_backend dp

Expected behavior

The distributed demo job should run without errors.

Environment

  • CUDA:
    - GPU:
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - available: True
    - version: 10.2
  • Packages:
    - numpy: 1.18.4
    - pyTorch_debug: False
    - pyTorch_version: 1.5.0
    - pytorch-lightning: 0.7.6
    - tensorboard: 2.2.1
    - tqdm: 4.46.0
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    -
    - processor: x86_64
    - python: 3.7.6
    - version: #201812030624 SMP Mon Dec 3 11:25:55 UTC 2018

Additional context

python3 gpu_template.py --gpus 2 --distributed_backend ddp works

@Data-drone Data-drone added bug Something isn't working help wanted Open to be worked on labels May 17, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@nsarang
Contributor

nsarang commented May 17, 2020

I was experiencing this problem the other day. It's somewhat related to PyTorch.

If you look at the function that gathers the outputs from the GPU devices: https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/scatter_gather.py#L47

def gather_map(outputs):
    out = outputs[0]
    if isinstance(out, torch.Tensor):
        return Gather.apply(target_device, dim, *outputs)
    if out is None:
        return None
    if isinstance(out, dict):
        if not all((len(out) == len(d) for d in outputs)):
            raise ValueError('All dicts must have the same number of keys')
        return type(out)(((k, gather_map([d[k] for d in outputs]))
                          for k in out))
    return type(out)(map(gather_map, zip(*outputs)))

You'll see that it only supports tensors or dictionaries that contain tensors.
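To illustrate, here is my own minimal CPU-only reduction (not code from the template or from Lightning) that reproduces the same error by feeding gather() a dict whose leaf values are plain floats; tensor values are left out so the snippet doesn't need a GPU:

from torch.nn.parallel.scatter_gather import gather

# Simulated per-device outputs where the "progress_bar" leaves are plain
# Python floats rather than tensors. gather_map() has no branch for numbers,
# so it falls through to zip(*outputs) and fails.
outputs = [
    {"progress_bar": {"avg_loss": 0.5}},
    {"progress_bar": {"avg_loss": 0.7}},
]

# Raises: TypeError: zip argument #1 must support iteration
gather(outputs, target_device=0)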

The problem for me was that my training_step function returned something like this:

results = {
    "loss": loss,
    "log": all_logs,
    "progress_bar": progress_logs,
}

progress_logs was a dictionary that contained plain numbers, since I wanted the progress bar to show a moving average instead of the exact values. So I came up with a hacky function like the one below to convert the numbers to tensors and move them to the appropriate device.

def _fix_dp_return_type(self, result, device):
    # Tensors: just move them to the target device.
    if isinstance(result, torch.Tensor):
        return result.to(device)
    # Dicts: fix each value recursively.
    if isinstance(result, dict):
        return {k: self._fix_dp_return_type(v, device) for k, v in result.items()}
    # Anything else must be a plain number: wrap it in a tensor.
    return torch.Tensor([result]).to(device)

I hope there's a better fix for this :)
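For reference, one place such a conversion can be hooked is at the end of training_step. A toy sketch (my own example module using the old-style 0.7.x "log"/"progress_bar" keys, not the gpu_template code):

import torch
import pytorch_lightning as pl


class LitExample(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def _fix_dp_return_type(self, result, device):
        # Move tensors to `device`; recurse into dicts; wrap bare numbers.
        if isinstance(result, torch.Tensor):
            return result.to(device)
        if isinstance(result, dict):
            return {k: self._fix_dp_return_type(v, device) for k, v in result.items()}
        return torch.Tensor([result]).to(device)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).mean()
        results = {
            "loss": loss,
            "log": {"train_loss": loss},
            "progress_bar": {"running_avg": 0.123},  # a bare float breaks dp gathering
        }
        # Convert everything before Lightning hands the dict to gather().
        return self._fix_dp_return_type(results, loss.device)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)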

@Data-drone
Author

Hmmm, I am just returning loss and log, so do I have to convert the loss to a tensor and move it to the device?

@Data-drone
Author

Data-drone commented May 21, 2020

Feels like this is something that should be covered by the parallel tools in Lightning...

though I guess ddp is the recommended backend.

@nsarang
Contributor

nsarang commented May 22, 2020

I agree. One way to fix it is to override the default gather function in pytorch_lightning.overrides.data_parallel.LightningDataParallel.
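Roughly something like this (just a sketch assuming the 0.7.x LightningDataParallel; it mirrors the type-fixing idea above and is not necessarily the actual patch that landed in #1935):

import torch
from pytorch_lightning.overrides.data_parallel import LightningDataParallel


class PatchedDataParallel(LightningDataParallel):
    def gather(self, outputs, output_device):
        # Convert bare Python numbers inside the per-device outputs to tensors
        # before delegating to the stock DataParallel.gather().
        def to_tensors(obj):
            if isinstance(obj, torch.Tensor):
                return obj
            if isinstance(obj, dict):
                return {k: to_tensors(v) for k, v in obj.items()}
            if isinstance(obj, (int, float)):
                # `output_device` may be a device index under dp.
                return torch.tensor([obj], device=output_device)
            return obj

        outputs = [to_tensors(o) for o in outputs]
        return super().gather(outputs, output_device)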

@williamFalcon
Contributor

@nsarang maybe submit a PR with this patch? @ananyahjha93

@nsarang
Contributor

nsarang commented May 22, 2020

@williamFalcon Alright. Are you referring to #1895?
I'm not sure how I can work on an existing PR :)

@ananyahjha93
Contributor

@nsarang you can override the gather function and create a separate PR for it.

@stale

stale bot commented Jul 22, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@Skylixia

[quotes @nsarang's earlier comment describing the _fix_dp_return_type workaround]

Where do you call _fix_dp_return_type to fix the issue?

@nsarang
Contributor

nsarang commented Jan 30, 2021

@Skylixia
Are you using an up-to-date version? This is supposed to be fixed now.

@Skylixia

Yes, I am on the latest version.
I don't have a stack trace, but seeing this issue I thought it might explain the validation problem I have.

I also get the output "Validation: 0it [00:00, ?it/s]" and used:

results = {
    "loss": loss,
    "log": all_logs,
    "progress_bar": progress_logs,
}

I don't know what it could be then :/

@nsarang
Contributor

nsarang commented Jan 30, 2021

@Skylixia
I see that you're doing old-style logging.
This might help: https://pytorch-lightning.readthedocs.io/en/latest/logging.html
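For example, with the current API you would log through self.log() instead of returning "log"/"progress_bar" dicts. A minimal sketch, not your module (compute_loss is a hypothetical helper):

def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical helper for illustration
    # prog_bar=True shows the value in the progress bar; no "log" or
    # "progress_bar" keys are returned anymore.
    self.log("train_loss", loss, prog_bar=True)
    return loss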
