
When the Trainer parameter gpus > 1: _pickle.PicklingError #2883

Closed
guangmingjian opened this issue Aug 8, 2020 · 11 comments

@guangmingjian

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I have 4 GPUs, but only Trainer(gpus=1,row_log_interval=10,max_epochs=100) runs normally. Even then, CPU utilization is extremely low: of 40 cores, only 1 is used. When I set the parameter gpus=2, the following error occurs.

Traceback (most recent call last):
  File "/home/mingjian/pythoncode/GNNPyg/test/testskorch.py", line 51, in <module>
    trainer.fit(model, train_loader)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 988, in fit
    results = self.__run_ddp_spawn(model, nprocs=self.num_processes)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1068, in __run_ddp_spawn
    mp.spawn(self.ddp_train, nprocs=nprocs, args=(q, model, ))
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 149, in start_processes
    process.start()
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle typing.Union[torch.Tensor, NoneType]: it's not the same object as typing.Union

Code

trainer = Trainer(gpus=2, row_log_interval=10, max_epochs=100)
trainer.fit(model, train_loader)

What have you tried?

What's your environment?

  • OS: Linux (Ubuntu 16)
  • Packaging: pip
  • Version: 0.8.5
@guangmingjian added the question (Further information is requested) label on Aug 8, 2020
@github-actions
Contributor

github-actions bot commented Aug 8, 2020

Hi! Thanks for your contribution, great first issue!

@awaelchli
Member

awaelchli commented Aug 8, 2020

ddp_spawn requires everything to be pickleable (see https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#make-models-pickleable).

Do you have typing.Union[torch.Tensor, NoneType] anywhere in your codebase (it's what the traceback message tells us)?

I believe you are running into this issue because you store something in your training script or LightningModule that is not pickleable. You don't see this error with gpus=1 because nothing needs to be pickled when there is only a single process.

If you can't figure out what it is, try setting distributed_backend="ddp" in the Trainer.
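
If it helps, here is a minimal sketch for locating an attribute that cannot be pickled (the find_unpicklable helper below is just an illustration, not a Lightning API):

import pickle

def find_unpicklable(obj):
    # try to pickle every attribute of `obj` and report the ones that fail
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as err:
            print(f"cannot pickle attribute {name!r}: {err}")

find_unpicklable(model)  # e.g. your LightningModule, or anything you attach to the Trainer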

@guangmingjian
Author

guangmingjian commented Aug 8, 2020

Thank you so much! A new problem has come up, and the frustrating thing is that I have no idea how to solve it. The new error is as follows:


(The tracebacks of the two DDP processes are interleaved in the console output; untangled, one of them reads:)

Traceback (most recent call last):
  File "/home/mingjian/pythoncode/GNNPyg/test/testskorch.py", line 51, in <module>
    trainer.fit(model, train_loader)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 992, in fit
    results = self.spawn_ddp_children(model)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 462, in spawn_ddp_children
    results = self.ddp_train(local_rank, q=None, model=model, is_master=True)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 560, in ddp_train
    results = self.run_pretrain_routine(model)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1213, in run_pretrain_routine
    self.train()
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 333, in train
    self.reset_train_dataloader(model)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py", line 211, in reset_train_dataloader
    self.train_dataloader = self.auto_add_sampler(self.train_dataloader, train=True)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py", line 167, in auto_add_sampler
    dataloader = self.replace_sampler(dataloader, sampler)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py", line 179, in replace_sampler
    dataloader = type(dataloader)(**dl_args)
  File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/torch_geometric/data/dataloader.py", line 55, in __init__
    collate_fn=Collater(follow_batch), **kwargs)
TypeError: __init__() got multiple values for keyword argument 'collate_fn'

(The second process fails with the same TypeError.)

and my code:

from torch_geometric.datasets import GNNBenchmarkDataset
from torch_geometric.data import DataLoader

dataset = GNNBenchmarkDataset(root=data_root, name=name, split="train")
train_loader = DataLoader(dataset, batch_size=256, pin_memory=True, num_workers=8)
trainer = Trainer(gpus=2, row_log_interval=10, max_epochs=100, distributed_backend="ddp")
trainer.fit(model, train_loader)

Is the error due to a problem with how the PyTorch Geometric library loads the dataset? Is there a good solution?

@awaelchli
Member

A quick search on their github reveals several issues with pickling. It looks like they still need to work on DDP compatibility. https://github.com/rusty1s/pytorch_geometric/issues?q=is%3Aissue+is%3Aopen+pickle

"CPU utilization is extremely low: of 40 cores, only 1 is used"

Probably just increase num_workers to a higher number :)
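
For example (just a sketch that reuses the DataLoader call from your snippet; the worker count is an arbitrary value to tune for your machine):

train_loader = DataLoader(dataset, batch_size=256, pin_memory=True, num_workers=16)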

Now back to your new problem: it looks like torch_geometric uses a different type of dataloader, so it fails when we try to replace the DDP sampler. Do you have minimal code you can share in a Google Colab notebook? It would save me some time reproducing, but if not, that's okay and I'll try it myself.

@guangmingjian
Author

Thank you for your patience. I have not used a Google Colab notebook before, so I will learn it as soon as possible and hope to have it running on Colab tomorrow. Please wait for me, thank you!

@guangmingjian
Author

guangmingjian commented Aug 8, 2020

The code has been reproduced in the Google Colab notebook. However, only one GPU is available on Colab.

https://colab.research.google.com/drive/1yKg_Kb9gydFx1tWkRmOgoa2qJVp01AJP?usp=sharing

trainer = Trainer(gpus=2, max_epochs=100, distributed_backend="ddp")

The error is:

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py in sanitize_gpu_ids(gpus)
    408                 You requested GPUs: {gpus}
    409                 But your machine only has: {all_available_gpus}
--> 410             """)
    411     return gpus
    412 

MisconfigurationException: 
                You requested GPUs: [0, 1]
                But your machine only has: [0]

When gpus=1 is tested, it runs normally.

With gpus=1, I use the top command to observe CPU usage on Linux. The CPU utilization never exceeds 100%, but I have 40 cores, so in theory it could reach 4000%. I suspect this is related to the implementation of the PyTorch Geometric library. Meanwhile, when I train on the Cora dataset with plain PyTorch training code, the CPU reaches 3000%, but with Trainer(gpus=1,max_epochs=100) it never exceeds 100%. I really like pytorch-lightning and look forward to solving this problem. Can you give me some guidance and suggestions?

@ananyahjha93 self-assigned this Aug 8, 2020
@awaelchli
Member

awaelchli commented Aug 9, 2020

Hi @guangmingjian
Thanks for the code, I was able to run it and investigate the dataloader issue. The solution to the reported error
TypeError: __init__() got multiple values for keyword argument 'collate_fn'
is to disable the automatic replacement of the DistributedSampler and create it yourself:

trainer = Trainer(..., replace_sampler_ddp=False)
dataset = ...
sampler = torch.utils.data.DistributedSampler(dataset, num_replicas=num_gpus, rank=trainer.global_rank)
train_loader = torch_geometric.data.DataLoader(..., sampler=sampler)

The reason we have to do this manually here is that torch_geometric has a custom dataloader, and it is impossible for us to know which args we need to pass in. They use a custom collate_fn and set it internally. As far as I know, we cannot automate this.
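
For reference, a slightly fuller sketch of this manual setup, reusing the placeholders (data_root, name, model) from your earlier snippet and assuming 2 GPUs:

import torch
from pytorch_lightning import Trainer
from torch_geometric.datasets import GNNBenchmarkDataset
from torch_geometric.data import DataLoader

dataset = GNNBenchmarkDataset(root=data_root, name=name, split="train")

# disable Lightning's automatic sampler replacement and attach the sampler ourselves
trainer = Trainer(gpus=2, max_epochs=100, distributed_backend="ddp", replace_sampler_ddp=False)
sampler = torch.utils.data.DistributedSampler(dataset, num_replicas=2, rank=trainer.global_rank)
train_loader = DataLoader(dataset, batch_size=256, num_workers=8, sampler=sampler)

trainer.fit(model, train_loader)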

This will solve the reported error message, but you will then run into a new problem: an error saying that your batch is not on the correct device. I found out that torch_geometric uses a custom DataParallel module to move its Batch data to the device, but it does not implement a counterpart for DistributedDataParallel.

It's a problem very similar to the one with torchtext. These datasets/dataloaders return objects that are not simple dicts, so we don't have a generic way of moving the data to the device or scattering and gathering it.

Maybe the best option is to open an issue on their GitHub, although I already see some related ones about distributed training.
cc @PyTorchLightning/core-contributors

@guangmingjian
Author

Thanks for your answers!!!

@rjanovski

You can try the latest version (not the stable one), e.g.:
pip install --upgrade pytorch-lightning==0.9.0rc11
See the related issues that were fixed by replacing the distributed backend implementation.

@ananyahjha93
Contributor

@guangmingjian based on @awaelchli's answer, I am closing this issue for now. Re-open if you think there is a need for further discussion.

@zhong-yy

zhong-yy commented Apr 3, 2023

Hi, I encountered a similar problem. When I set devices=2, accelerator="gpu", I got a _pickle.UnpicklingError: pickle data was truncated error. I can't find distributed_backend in the documentation.
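
(In recent pytorch-lightning releases the distributed_backend argument no longer exists; the strategy argument selects the DDP implementation instead. A minimal sketch, assuming a current version of the library:)

from pytorch_lightning import Trainer

# plain "ddp" launches worker processes by re-running the script, so it avoids
# the pickling that "ddp_spawn" requires
trainer = Trainer(devices=2, accelerator="gpu", strategy="ddp")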
