RuntimeError: expected scalar type Float but found Half .. : when training image-gpt model with deepspeed #8125

GeorgeQ-Q · 2021-06-25T10:14:46Z

🐛 Bug

To Reproduce

#### 1. Clone the repository:

git clone https://github.com/teddykoker/image-gpt.git

#### 2. Add the deepspeed_stage_2 plugin to pl.Trainer, i.e. change:

trainer = pl.Trainer(
            max_steps=config["steps"],
            gpus=config["gpus"],
            precision=config["precision"],
            accumulate_grad_batches=config["accumulate_grad_batches"],
            checkpoint_callback=checkpoint,
            logger=logger,
        )

into:

trainer = pl.Trainer(
            max_steps=config["steps"],
            gpus=config["gpus"],
            precision=config["precision"],
            accumulate_grad_batches=config["accumulate_grad_batches"],
            checkpoint_callback=checkpoint,
            logger=logger,
            plugins='deepspeed_stage_2',
        )

#### 3. Run:

python src/run.py --dataset mnist train configs/s_gen.yml

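Expected behavior

Training should start without a dtype error. Instead, the run fails during the validation sanity check with: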

Traceback (most recent call last):
  File "src/run.py", line 96, in <module>
    args.func(args)
  File "src/run.py", line 65, in train
    trainer.fit(model, train_dl, valid_dl)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 842, in run_train
    self.run_sanity_check(self.lightning_module)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1107, in run_sanity_check
    self.run_evaluation()
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 962, in run_evaluation
    output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 174, in evaluation_step
    output = self.trainer.accelerator.validation_step(args)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 226, in validation_step
    return self.training_type_plugin.validation_step(*args)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 326, in validation_step
    return self.model(*args, **kwargs)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 1098, in forward
    loss = self.module(*inputs, **kwargs)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 62, in forward
    return super().forward(*inputs, **kwargs)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/overrides/base.py", line 57, in forward
    output = self.module.validation_step(*inputs, **kwargs)
  File "/qiuzihan/image-gpt/src/image_gpt.py", line 125, in validation_step
    logits = self.gpt(x)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/qiuzihan/image-gpt/src/gpt2.py", line 74, in forward
    h = layer(h)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/qiuzihan/image-gpt/src/gpt2.py", line 24, in forward
    x = self.ln_1(x)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/torch/nn/modules/normalization.py", line 171, in forward
    input, self.normalized_shape, self.weight, self.bias, self.eps)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/torch/nn/functional.py", line 2202, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: expected scalar type Float but found Half

Environment

PyTorch version: 1.8.0
OS: Linux
How PyTorch was installed: pip3
Python version: 3.6
CUDA/cuDNN version: 11.1
GPU models and configuration: V100-32GB
Any other relevant information:

 Package                Version
---------------------- ------------
absl-py                0.13.0
aiohttp                3.7.4.post0
async-timeout          3.0.1
attrs                  21.2.0
cachetools             4.2.2
certifi                2021.5.30
chardet                4.0.0
dataclasses            0.8
deepspeed              0.4.1
fairscale              0.3.7
fsspec                 2021.6.1
future                 0.18.2
google-auth            1.32.0
google-auth-oauthlib   0.4.4
grpcio                 1.38.1
idna                   2.10
idna-ssl               1.1.0
importlib-metadata     4.5.0
Markdown               3.3.4
multidict              5.1.0
ninja                  1.10.0.post2
numpy                  1.19.5
oauthlib               3.1.1
packaging              20.9
Pillow                 8.2.0
pip                    21.1.2
protobuf               3.17.3
psutil                 5.8.0
pyasn1                 0.4.8
pyasn1-modules         0.2.8
pyDeprecate            0.3.0
pyparsing              2.4.7
pytorch-lightning      1.3.7.post0
PyYAML                 5.4.1
requests               2.25.1
requests-oauthlib      1.3.0
rsa                    4.7.2
setuptools             57.0.0
six                    1.16.0
tensorboard            2.4.1
tensorboard-plugin-wit 1.8.0
tensorboardX           1.8
torch                  1.8.0+cu111
torchmetrics           0.3.2
torchvision            0.9.0+cu111
tqdm                   4.61.1
triton                 0.4.2
typing-extensions      3.10.0.0
urllib3                1.26.5
Werkzeug               2.0.1
wheel                  0.36.2
yarl                   1.6.3
zipp                   3.4.1


tchaton · 2021-06-28T12:28:08Z

Dear @GeorgeQ-Q,

Would it be possible for you to reproduce this behaviour with the BoringModel?

Best,
T.C

GeorgeQ-Q · 2021-06-30T08:18:14Z

boringModel

This is the best I can do, since Colab does not support DDP.

Best,
G

griff4692 · 2021-07-01T19:13:08Z

Following this, as I am having the same issue :(

I didn't see any memory improvement with FairScale, so I am hoping DeepSpeed offers some.

SeanNaren · 2021-07-26T11:10:32Z

Thanks for your reproducible sample, @GeorgeQ-Q.

I've made a fix in Lightning Bolts here: Lightning-Universe/lightning-bolts#694. With the latest DeepSpeed this works, as they've fixed the underlying issue with GPT vision models as well :)

For anyone writing custom code: make sure that any tensors you create within the forward pass of your module have the correct dtype.
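
To illustrate the advice above, here is a minimal sketch (not the code from the linked PR). A tensor created inside `forward()` with `torch.ones(...)` or `torch.zeros(...)` defaults to float32, so under 16-bit training it can end up with a different dtype than the rest of the activations. The `ToyBlock` module and its `sos` parameter are hypothetical names chosen for this example. The idea is to take `dtype` and `device` from a tensor that is already in the right precision.

```python
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Hypothetical module that builds a tensor inside forward()."""

    def __init__(self, dim: int):
        super().__init__()
        self.sos = nn.Parameter(torch.zeros(dim))  # learned start-of-sequence vector
        self.ln = nn.LayerNorm(dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h has shape (length, batch, dim).
        # torch.ones(...) without dtype= would be float32 regardless of h;
        # taking dtype/device from h keeps everything in one precision.
        ones = torch.ones(1, h.size(1), h.size(2), dtype=h.dtype, device=h.device)
        sos = ones * self.sos
        # Shift the sequence right by one step and prepend the start vector.
        h = torch.cat([sos, h[:-1]], dim=0)
        return self.ln(h)


# float64 stands in for "not the default dtype" so the check also runs on CPU.
block = ToyBlock(dim=4).double()
out = block(torch.randn(3, 2, 4, dtype=torch.float64))
print(out.dtype, tuple(out.shape))
```

The same pattern applies to attention masks or any other tensor created on the fly inside the forward pass.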

GeorgeQ-Q added the bug (Something isn't working) and help wanted (Open to be worked on) labels Jun 25, 2021

edenlightning assigned SeanNaren Jul 6, 2021

edenlightning added the priority: 0 (High priority task) label Jul 6, 2021

edenlightning added this to the v1.3.x milestone Jul 6, 2021

Borda modified the milestones: v1.3.x, v1.4 Jul 6, 2021

edenlightning modified the milestones: v1.4, v1.3.x Jul 6, 2021

SeanNaren mentioned this issue Jul 26, 2021 in Lightning-Universe/lightning-bolts#694, "Set the dtype correctly for vision GPT model" (merged)

Borda closed this as completed in Lightning-Universe/lightning-bolts#694 Jul 26, 2021
