
When the Trainer parameter gpus > 1: _pickle.PicklingError #2883

Closed
guangmingjian opened this issue Aug 8, 2020 · 11 comments

@guangmingjian

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I have 4 GPUs, but only Trainer(gpus=1,row_log_interval=10,max_epochs=100) runs normally. Even then, CPU utilization is extremely low: of 40 cores, only 1 is used. When I set the parameter gpus=2, the following error occurs.

Traceback (most recent call last):
  File "/home/mingjian/pythoncode/GNNPyg/test/testskorch.py", line 51, in <module>
    trainer.fit(model, train_loader)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 988, in fit
    results = self.__run_ddp_spawn(model, nprocs=self.num_processes)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1068, in __run_ddp_spawn
    mp.spawn(self.ddp_train, nprocs=nprocs, args=(q, model, ))
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 149, in start_processes
    process.start()
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle typing.Union[torch.Tensor, NoneType]: it's not the same object as typing.Union

Code

trainer = Trainer(gpus=2, row_log_interval=10, max_epochs=100)
trainer.fit(model, train_loader)

What have you tried?

What's your environment?

  • OS: Linux (Ubuntu 16)
  • Packaging: pip
  • Version: 0.8.5
@guangmingjian added the question (Further information is requested) label on Aug 8, 2020
@github-actions
Contributor

github-actions bot commented Aug 8, 2020

Hi! Thanks for your contribution, great first issue!

@awaelchli
Member

awaelchli commented Aug 8, 2020

ddp_spawn requires everything to be pickleable (see https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#make-models-pickleable).

Do you have typing.Union[torch.Tensor, NoneType] anywhere in your codebase (it's what the traceback message tells us)?

I believe you are running into this issue because you store something in your training script or LightningModule that is not pickleable. You don't see this error with gpus=1 because nothing needs to be pickled when there is only a single process.

If you can't figure out what it is, try setting distributed_backend="ddp" in the Trainer.
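
If it helps, here is a minimal sketch for locating an attribute that cannot be pickled (the find_unpicklable helper below is just an illustration, not a Lightning API):

import pickle

def find_unpicklable(obj):
    # try to pickle every attribute of `obj` and report the ones that fail
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as err:
            print(f"cannot pickle attribute {name!r}: {err}")

find_unpicklable(model)  # e.g. your LightningModule, or anything you attach to the Trainer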

@guangmingjian
Author

guangmingjian commented Aug 8, 2020

Thank you so much! A new problem has come up, and the frustrating thing is that I have no idea how to solve it. The new error is as follows:


(The tracebacks of the two DDP processes are interleaved in the console output; untangled, one of them reads:)

Traceback (most recent call last):
  File "/home/mingjian/pythoncode/GNNPyg/test/testskorch.py", line 51, in <module>
    trainer.fit(model, train_loader)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 992, in fit
    results = self.spawn_ddp_children(model)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 462, in spawn_ddp_children
    results = self.ddp_train(local_rank, q=None, model=model, is_master=True)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 560, in ddp_train
    results = self.run_pretrain_routine(model)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1213, in run_pretrain_routine
    self.train()
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 333, in train
    self.reset_train_dataloader(model)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py", line 211, in reset_train_dataloader
    self.train_dataloader = self.auto_add_sampler(self.train_dataloader, train=True)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py", line 167, in auto_add_sampler
    dataloader = self.replace_sampler(dataloader, sampler)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py", line 179, in replace_sampler
    dataloader = type(dataloader)(**dl_args)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/torch_geometric/data/dataloader.py", line 55, in __init__
    collate_fn=Collater(follow_batch), **kwargs)
TypeError: __init__() got multiple values for keyword argument 'collate_fn'

(The second process fails with the same TypeError.)

and my code:

from torch_geometric.datasets import GNNBenchmarkDataset
from torch_geometric.data import DataLoader

dataset = GNNBenchmarkDataset(root=data_root, name=name, split="train")
train_loader = DataLoader(dataset, batch_size=256, pin_memory=True, num_workers=8)
trainer = Trainer(gpus=2, row_log_interval=10, max_epochs=100, distributed_backend="ddp")
trainer.fit(model, train_loader)

Is the error due to a problem with how the PyTorch Geometric library loads the dataset? Is there a good solution?

@awaelchli
Member

A quick search on their github reveals several issues with pickling. It looks like they still need to work on DDP compatibility. https://github.com/rusty1s/pytorch_geometric/issues?q=is%3Aissue+is%3Aopen+pickle

"CPU utilization is extremely low: of 40 cores, only 1 is used"

Probably just increase num_workers to a higher number :)
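
For example (just a sketch that reuses the DataLoader call from your snippet; the worker count is an arbitrary value to tune for your machine):

train_loader = DataLoader(dataset, batch_size=256, pin_memory=True, num_workers=16)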

Now back to your new problem: it looks like torch_geometric uses a different type of dataloader, so it fails when we try to replace the DDP sampler. Do you have minimal code you can share in a Google Colab notebook? It would save me some time reproducing, but if not, that's okay and I'll try it myself.

@guangmingjian
Author

Thank you for your patience. I have not used a Google Colab notebook before, so I will learn it as soon as possible and hope to have it running on Colab tomorrow. Please wait for me, thank you!

@guangmingjian
Author

guangmingjian commented Aug 8, 2020

The code has been reproduced in the Google Colab notebook. However, only one GPU is available on Colab.

https://colab.research.google.com/drive/1yKg_Kb9gydFx1tWkRmOgoa2qJVp01AJP?usp=sharing

trainer = Trainer(gpus=2, max_epochs=100, distributed_backend="ddp")

The error is:

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py in sanitize_gpu_ids(gpus)
    408                 You requested GPUs: {gpus}
    409                 But your machine only has: {all_available_gpus}
--> 410             """)
    411     return gpus
    412 

MisconfigurationException: 
                You requested GPUs: [0, 1]
                But your machine only has: [0]

When gpus=1 is tested, it runs normally.

With gpus=1, I use the top command to observe CPU usage on Linux. The CPU utilization never exceeds 100%, but I have 40 cores, so in theory it could reach 4000%. I suspect this is related to the implementation of the PyTorch Geometric library. Meanwhile, when I train on the Cora dataset with plain PyTorch training code, the CPU reaches 3000%, but with Trainer(gpus=1,max_epochs=100) it never exceeds 100%. I really like pytorch-lightning and look forward to solving this problem. Can you give me some guidance and suggestions?

@ananyahjha93 self-assigned this Aug 8, 2020
@awaelchli
Member

awaelchli commented Aug 9, 2020

Hi @guangmingjian
Thanks for the code, I was able to run it and investigate the dataloader issue. The solution to the reported error
TypeError: __init__() got multiple values for keyword argument 'collate_fn'
is to disable the automatic replacement of the DistributedSampler and create it yourself:

trainer = Trainer(..., replace_sampler_ddp=False)
dataset = ...
sampler = torch.utils.data.DistributedSampler(dataset, num_replicas=num_gpus, rank=trainer.global_rank)
train_loader = torch_geometric.data.DataLoader(..., sampler=sampler)

The reason we have to do this manually here is that torch_geometric has a custom dataloader, and it is impossible for us to know which args we need to pass in. They use a custom collate_fn and set it internally. As far as I know, we cannot automate this.
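
For reference, a slightly fuller sketch of this manual setup, reusing the placeholders (data_root, name, model) from your earlier snippet and assuming 2 GPUs:

import torch
from pytorch_lightning import Trainer
from torch_geometric.datasets import GNNBenchmarkDataset
from torch_geometric.data import DataLoader

dataset = GNNBenchmarkDataset(root=data_root, name=name, split="train")

# disable Lightning's automatic sampler replacement and attach the sampler ourselves
trainer = Trainer(gpus=2, max_epochs=100, distributed_backend="ddp", replace_sampler_ddp=False)
sampler = torch.utils.data.DistributedSampler(dataset, num_replicas=2, rank=trainer.global_rank)
train_loader = DataLoader(dataset, batch_size=256, num_workers=8, sampler=sampler)

trainer.fit(model, train_loader)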

This will solve the reported error message, but you will then run into a new problem: an error saying that your batch is not on the correct device. I found out that torch_geometric uses a custom DataParallel module to move its Batch data to the device, but it does not implement a counterpart for DistributedDataParallel.

It's a problem very similar to the one with torchtext. These datasets/dataloaders return objects that are not simple dicts, so we don't have a generic way of moving the data to the device or scattering and gathering it.

Maybe the best option is to open an issue on their GitHub, although I already see some related ones about distributed training.
cc @PyTorchLightning/core-contributors

@guangmingjian
Author

Thanks for your answers!!!

@rjanovski

You can try the latest version (not the stable one), e.g.:
pip install --upgrade pytorch-lightning==0.9.0rc11
See the related issues that were fixed by replacing the distributed backend implementation.

@ananyahjha93
Contributor

@guangmingjian based on @awaelchli's answer, I am closing this issue for now. Re-open if you think there is a need for further discussion.

@zhong-yy

zhong-yy commented Apr 3, 2023

Hi, I encountered a similar problem. When I set devices=2, accelerator="gpu", I got a _pickle.UnpicklingError: pickle data was truncated error. I can't find distributed_backend in the documentation.
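
(In recent pytorch-lightning releases the distributed_backend argument no longer exists; the strategy argument selects the DDP implementation instead. A minimal sketch, assuming a current version of the library:)

from pytorch_lightning import Trainer

# plain "ddp" launches worker processes by re-running the script, so it avoids
# the pickling that "ddp_spawn" requires
trainer = Trainer(devices=2, accelerator="gpu", strategy="ddp")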
