
Megatron 11b training: CUDA error: invalid device function #3681

Closed

apeterswu opened this issue Jul 3, 2021 · 2 comments

@apeterswu

What is your question?

I can't train the megatron-11b model for finetuning; it fails with CUDA error: invalid device function. I suspect this is related to the PyTorch, CUDA, or apex versions, but I'm not sure what the correct combination is, since getting apex to run is also difficult.
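
An "invalid device function" error typically means a CUDA kernel was compiled for a different GPU architecture than the one it runs on. As a first diagnostic (a minimal sketch, not from the original report), the following checks whether the installed PyTorch binary's compiled kernel architectures cover the local GPU:

import torch

# Versions the PyTorch binary was built against vs. what the system provides
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())

# Compute capability of the local GPU (a V100 reports (7, 0), i.e. sm_70)
print("device capability:", torch.cuda.get_device_capability(0))

# Architectures the binary's kernels were compiled for; get_arch_list() is
# only available in newer PyTorch releases, hence the guard. If sm_70 is
# missing, even plain ops like F.linear can hit "invalid device function".
if hasattr(torch.cuda, "get_arch_list"):
    print("compiled arch list:", torch.cuda.get_arch_list())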

Code

Traceback (most recent call last):
  File "/opt/miniconda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/tmp/fairseq/fairseq/distributed/utils.py", line 328, in distributed_main
    main(cfg, **kwargs)
  File "/tmp/fairseq/fairseq_cli/train.py", line 173, in main

PREFIX=/blob2/v-lijuwu/fairseq/examples/megatron_11b
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/opt/miniconda/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/tmp/fairseq/fairseq_cli/train.py", line 284, in train
    log_output = trainer.train_step(samples)
  File "/opt/miniconda/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/tmp/fairseq/fairseq/trainer.py", line 701, in train_step
    raise e
  File "/tmp/fairseq/fairseq/trainer.py", line 675, in train_step
    ignore_grad=is_dummy_batch,
  File "/tmp/fairseq/fairseq/tasks/fairseq_task.py", line 475, in train_step
    loss, sample_size, logging_output = criterion(model, sample)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/tmp/fairseq/fairseq/model_parallel/criterions/vocab_parallel_cross_entropy.py", line 42, in forward
    net_output = model(**sample["net_input"])
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/miniconda/lib/python3.7/site-packages/training_daemon/utils/hook.py", line 170, in wrapper
    return func(*args, **kwargs)
  File "/tmp/fairseq/fairseq/models/fairseq_model.py", line 496, in forward
    return self.decoder(src_tokens, **kwargs)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/tmp/fairseq/fairseq/models/transformer.py", line 825, in forward
    alignment_heads=alignment_heads,
  File "/tmp/fairseq/fairseq/models/transformer.py", line 847, in extract_features
    alignment_heads,
  File "/tmp/fairseq/fairseq/models/transformer.py", line 951, in extract_features_scriptable
    need_head_weights=bool((idx == alignment_layer)),
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/tmp/fairseq/fairseq/modules/transformer_layer.py", line 353, in forward
    attn_mask=self_attn_mask,
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/tmp/fairseq/fairseq/model_parallel/modules/multihead_attention.py", line 135, in forward
    q = self.q_proj(query)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/tmp/fairseq/fairseq/model_parallel/megatron/mpu/layers.py", line 243, in forward
    output_parallel = F.linear(input_parallel, self.weight, self.bias)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/functional.py", line 1678, in linear
    output += bias
RuntimeError: CUDA error: invalid device function
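
The failure is raised inside a plain F.linear call in the model-parallel column linear layer, so it is not specific to Megatron's custom kernels. A standalone repro (a sketch assuming a single GPU and half precision, to match --memory-efficient-fp16) can confirm whether ordinary fp16 CUDA kernels work at all in this environment:

import torch
import torch.nn.functional as F

# Reproduce the failing op in isolation: a half-precision linear layer with
# bias on the GPU. If this also raises "invalid device function", the
# PyTorch/CUDA build does not match the GPU, independent of fairseq or apex.
x = torch.randn(4, 1024, device="cuda", dtype=torch.half)
w = torch.randn(1024, 1024, device="cuda", dtype=torch.half)
b = torch.randn(1024, device="cuda", dtype=torch.half)

out = F.linear(x, w, b)
torch.cuda.synchronize()  # force kernel execution so errors surface here
print("F.linear OK:", tuple(out.shape))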

Training code:

PREFIX=~/fairseq/examples/megatron_11b
DATA_PATH=$PREFIX/wiki103_bin

WARM_MODEL_PATH=$PREFIX/megatron_11b_model
MODEL_NAME=wiki_tune_base

fairseq-train $DATA_PATH \
  --distributed-world-size 8  \
  --memory-efficient-fp16 \
  --num-workers 2 \
  --model-parallel-size 8 \
  --criterion vocab_parallel_cross_entropy \
  --task language_modeling \
  --sample-break-mode none \
  --tokens-per-sample 1024 \
  --arch transformer_lm_megatron_11b \
  --restore-file $WARM_MODEL_PATH/model.pt \
  --save-dir $PREFIX/models/$MODEL_NAME \
  --share-decoder-input-output-embed \
  --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-08 --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt --lr 0.0001 \
  --warmup-updates 3000 --weight-decay 0.01 \
  --dropout 0.1 --attention-dropout 0.1 \
  --batch-size 1 \
  --max-update 50000

What have you tried?

I followed https://github.com/pytorch/fairseq/tree/v0.10.2/examples/megatron_11b to set up the data, model, training code, and environment. Since apex is required, I installed it accordingly, with one modification: following NVIDIA/apex#323 (comment), I removed the version-checking code.

After installing apex, I tried to run training with the pre-trained model reloaded, and the error above appeared.

I searched but didn't find a clear answer or solution.
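
Since the version check was removed when installing apex, its CUDA extensions may have been compiled against a mismatched toolkit without complaint. A quick isolation test (a hedged sketch; fused_layer_norm_cuda and FusedLayerNorm are what apex's --cuda_ext build normally provides) would be:

import torch

# apex's --cuda_ext install builds extension modules such as
# fused_layer_norm_cuda; importing can succeed even if they were compiled
# for the wrong GPU arch, so also run one kernel to force execution.
try:
    import fused_layer_norm_cuda  # noqa: F401
    from apex.normalization import FusedLayerNorm

    ln = FusedLayerNorm(1024).cuda().half()
    y = ln(torch.randn(2, 1024, device="cuda", dtype=torch.half))
    torch.cuda.synchronize()
    print("apex fused layer norm OK:", tuple(y.shape))
except (ImportError, RuntimeError) as e:
    print("apex CUDA extension problem:", e)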

What's your environment?

  • fairseq Version (e.g., 1.0 or master): 1.0.0
  • PyTorch Version (e.g., 1.0): 1.6.0
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed fairseq (pip, source): pip install
  • Build command you used (if compiling from source):
  • Python version: 3.7.0
  • CUDA/cuDNN version: CUDA 10.0, cuDNN 7
  • GPU models and configuration: V100 (32 GB)
  • Any other relevant information:
@stale

stale bot commented Apr 17, 2022

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale bot added the stale label on Apr 17, 2022
@stale

stale bot commented Apr 30, 2022

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!

stale bot closed this as completed on Apr 30, 2022