Distributed multi-node multi-GPU training hangs, then errors out after a timeout #123

Closed
koking0 opened this issue Aug 19, 2022 · 14 comments

@koking0
Contributor

koking0 commented Aug 19, 2022

Following #111, we set up two servers with identical environments (500 GB of RAM, 8x 1080Ti 11 GB GPUs each) to try a multi-node multi-GPU training scheme. The model loads successfully, but training never starts, and after a while the job apparently hits a timeout and exits.

#!/bin/bash

set -x -e

echo "START TIME: $(date)"
MICRO_BATCH_SIZE=1
ROOT_DIR=$(pwd)

ZERO_STAGE=3

config_json="$ROOT_DIR/training_config.json"
export MASTER_PORT=$((RANDOM % 10000 + 30000))

cat <<EOT >$config_json
{
  "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
  "steps_per_print": 1000,
  "gradient_clipping": 1,
  "zero_optimization": {
    "stage": ${ZERO_STAGE},
    "allgather_partitions": false,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "stage3_max_live_parameters" : 2e8,
    "stage3_max_reuse_distance" : 2e8,
    "stage3_prefetch_bucket_size": 2e8,
    "stage3_param_persistence_threshold": 2e8,
    "sub_group_size" : 2e8,
    "round_robin_gradients": true
  },
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 1e-5,
      "betas": [0.9,0.95],
      "eps": 1e-8,
      "weight_decay": 1e-2
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params":{
      "warmup_min_lr": 5e-6,
      "warmup_max_lr": 1e-5
    }
  }
}
EOT

export PL_DEEPSPEED_CONFIG_PATH=$config_json
TRAINER_ARGS="
    --max_epochs 1 \
    --num_nodes 2 \
    --gpus 8 \
    --strategy deepspeed_stage_${ZERO_STAGE}_offload \
    --default_root_dir $ROOT_DIR \
    --dirpath $ROOT_DIR/ckpt \
    --save_top_k 3 \
    --monitor train_loss \
    --mode min \
    --save_last \
"

DATA_DIR=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets
DATA_ARGS="
    --data_dir $DATA_DIR \
    --max_seq_length 64 \
    --train_batchsize $MICRO_BATCH_SIZE \
    --valid_batchsize $MICRO_BATCH_SIZE \
    --train_data test_train.txt \
    --valid_data test.txt \
    --test_data  test.txt
"

PRETRAINED_MODEL_PATH="IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese"
MODEL_ARGS="
    --pretrained_model_path ${PRETRAINED_MODEL_PATH} \
    --output_save_path $ROOT_DIR/predict.json \
    --learning_rate 1e-4 \
    --weight_decay 0.1 \
    --warmup 0.01 \
"

DISTRIBUTED_ARGS="
    --nnodes 2 \
    --nproc_per_node=8 \
    --master_addr 192.168.1.14 \
    --master_port 9005 \
    --node_rank 0 \
    --max_restarts=1
"

SCRIPTS_PATH=${ROOT_DIR}/finetune_gpt2.py

export CMD=" \
    $DISTRIBUTED_ARGS \
    $SCRIPTS_PATH \
    $TRAINER_ARGS \
    $MODEL_ARGS \
    $DATA_ARGS \
"

export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=enp129s0f0

#python ${CMD}
torchrun ${CMD}


Node0's error output:

[E ProcessGroupNCCL.cpp:737] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=749, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800258 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:737] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=749, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801313 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=749, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800258 milliseconds before timing out.
Fatal Python error: Aborted

Thread 0x00007fa5abfff700 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 312 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/multiprocessing/queues.py", line 233 in _feed
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 910 in run
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007fa431fff700 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 312 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/multiprocessing/queues.py", line 233 in _feed
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 910 in run
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007fa5e75ff700 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 316 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 574 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007fa6dc4a6340 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 738 in <lambda>
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 625 in _apply
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579 in _apply
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579 in _apply
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579 in _apply
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579 in _apply
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579 in _apply
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 738 in cpu
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 147 in cpu
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 474 in teardown
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1298 in _teardown
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 736 in _call_and_handle_interrupt
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 768 in fit
  File "/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py", line 216 in train
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
  File "/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py", line 224 in <module>
[E ProcessGroupNCCL.cpp:737] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=749, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801477 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=749, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801477 milliseconds before timing out.
Fatal Python error: Aborted

Thread 0x00007f3c0ffff700 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 312 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/multiprocessing/queues.py", line 233 in _feed
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 910 in run
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007f3c167fc700 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 312 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/multiprocessing/queues.py", line 233 in _feed
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 910 in run
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007f3c4b4bf700 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 316 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 574 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007f3d40350340 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 738 in <lambda>
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 602 in _apply
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579 in _apply
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579 in _apply
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579 in _apply
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 738 in cpu
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 147 in cpu
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 474 in teardown
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1298 in _teardown
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 736 in _call_and_handle_interrupt
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 768 in fit
  File "/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py", line 216 in train
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
  File "/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py", line 224 in <module>
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=749, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801313 milliseconds before timing out.
Fatal Python error: Aborted

Thread 0x00007fa897fff700 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 312 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/multiprocessing/queues.py", line 233 in _feed
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 910 in run
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007fa89effd700 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 312 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/multiprocessing/queues.py", line 233 in _feed
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 910 in run
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007fa8d51ec700 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 316 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 574 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007fa9ca07d340 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 738 in <lambda>
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 602 in _apply
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579 in _apply
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579 in _apply
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579 in _apply
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579 in _apply
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579 in _apply
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579 in _apply
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 738 in cpu
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 147 in cpu
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 474 in teardown
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1298 in _teardown
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 736 in _call_and_handle_interrupt
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 768 in fit
  File "/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py", line 216 in train
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
  File "/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py", line 224 in <module>
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 59889 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 59890 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 59891 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 59892 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 59893 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 59894 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 59895 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 7 (pid: 59896) of binary: /home/liuzhaofeng/anaconda3/bin/python

An interesting detail: after the error, Node0 exits the program, while Node1's SSH session simply drops.

client_loop: send disconnect: Broken pipe

I looked through some references and added a few settings to finetune_gpt2.py, but they had no effect.

References consulted (a rough sketch of the settings they typically suggest follows the links):
ultralytics/yolov5#7481
https://www.zhihu.com/question/512132168
https://discuss.pytorch.org/t/nccl-timed-out-when-using-the-torch-distributed-run/153276
https://stackoverflow.com/questions/69693950/error-some-nccl-operations-have-failed-or-timed-out
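
For reference, the settings those links typically point at boil down to verbose NCCL logging and a longer collective timeout. A rough, illustrative sketch only (not the exact code added here; Lightning/DeepSpeed normally call init_process_group themselves, so the timeout would have to be routed through the strategy):

import datetime
import os

import torch.distributed as dist

# Verbose NCCL logging shows which collective stalls and on which interface.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_SOCKET_IFNAME"] = "enp129s0f0"  # the NIC the two nodes can actually reach each other on
os.environ["NCCL_IB_DISABLE"] = "1"              # no InfiniBand on these machines

# Raise the collective timeout above the 30-minute default that appears in the
# watchdog error (Timeout(ms)=1800000), to rule out a merely slow first step.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(hours=2))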

@koking0
Contributor Author

koking0 commented Sep 1, 2022

After adding various settings, it now hangs right at the start of training: Epoch 0 never begins, yet GPU memory is allocated on both machines and GPU utilization sits at 100%. Training just never starts.

References consulted:
A roundup of PyTorch multi-node multi-GPU hang issues
Script freezes with no output when using DistributedDataParallel
Hangs and stalls encountered during PyTorch training
PyTorch training: DataLoader deadlocks/hangs, stops after one epoch, and how to fix them
Training starts, then hangs for half an hour without moving
Do you know these little details of model training?

The main changes:

  1. Prefixed the launch command with OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 to avoid deadlocks caused by multithreading;
  2. Removed the tqdm progress bar from data loading;
  3. In the data-loading DataLoader, set drop_last=True, pin_memory=True, and num_workers=0.

Updated launch script (shown here with node_rank 1; Node0 uses node_rank 0, as in the logs below):

#!/bin/bash

set -x -e

echo "START TIME: $(date)"
MICRO_BATCH_SIZE=1
ROOT_DIR=$(pwd)

ZERO_STAGE=3

config_json="$ROOT_DIR/training_config.json"

cat <<EOT >$config_json
{
  "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
  "steps_per_print": 1000,
  "gradient_clipping": 1,
  "zero_optimization": {
    "stage": ${ZERO_STAGE},
    "allgather_partitions": false,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "stage3_max_live_parameters" : 2e8,
    "stage3_max_reuse_distance" : 2e8,
    "stage3_prefetch_bucket_size": 2e8,
    "stage3_param_persistence_threshold": 2e8,
    "sub_group_size" : 2e8,
    "round_robin_gradients": true
  },
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 1e-5,
      "betas": [0.9,0.95],
      "eps": 1e-8,
      "weight_decay": 1e-2
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params":{
      "warmup_min_lr": 5e-6,
      "warmup_max_lr": 1e-5
    }
  }
}
EOT

export PL_DEEPSPEED_CONFIG_PATH=$config_json
TRAINER_ARGS="
    --max_epochs 1 \
    --num_nodes 2 \
    --gpus 8 \
    --strategy deepspeed_stage_${ZERO_STAGE}_offload \
    --default_root_dir $ROOT_DIR \
    --dirpath $ROOT_DIR/ckpt \
    --save_top_k 3 \
    --monitor train_loss \
    --mode min \
    --save_last \
"

DATA_DIR=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets
DATA_ARGS="
    --data_dir $DATA_DIR \
    --max_seq_length 64 \
    --train_batchsize $MICRO_BATCH_SIZE \
    --valid_batchsize $MICRO_BATCH_SIZE \
    --train_data test_train.txt \
    --valid_data test.txt \
    --test_data  test.txt
"

PRETRAINED_MODEL_PATH="IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese"
MODEL_ARGS="
    --pretrained_model_path ${PRETRAINED_MODEL_PATH} \
    --output_save_path $ROOT_DIR/predict.json \
    --learning_rate 1e-4 \
    --weight_decay 0.1 \
    --warmup 0.01 \
"

MASTER_ADDR="IP"
MASTER_PORT="9010"
DISTRIBUTED_ARGS="
    --nnodes 2 \
    --nproc_per_node=8 \
    --master_addr ${MASTER_ADDR} \
    --master_port ${MASTER_PORT} \
    --node_rank 1 \
    --max_restarts=0
"

SCRIPTS_PATH=${ROOT_DIR}/finetune_gpt2.py

export CMD=" \
    $DISTRIBUTED_ARGS \
    $SCRIPTS_PATH \
    $TRAINER_ARGS \
    $MODEL_ARGS \
    $DATA_ARGS \
"

export NCCL_SOCKET_IFNAME=enp129s0f0
export NCCL_IB_DISABLE=1

#python ${CMD}
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 torchrun ${CMD}
#python -m torch.distributed.launch ${CMD}

Training script:

# -*- coding: utf-8 -*-
# @Time        : 2022/8/9 11:46
# @File        : finetune_gpt2.py
# @Description : None
# ----------------------------------------------
# ☆ ☆ ☆ ☆ ☆ ☆ ☆
# >>> Author    : Alex
# >>> Mail      : liu_zhao_feng_alex@163.com
# >>> Github    : https://github.com/koking0
# >>> Blog      : https://alex007.blog.csdn.net/
# ☆ ☆ ☆ ☆ ☆ ☆ ☆
import argparse
import os

import pytorch_lightning as pl
import torch as th
from pytorch_lightning import Trainer, loggers
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.distributed.elastic.multiprocessing.errors import record
from torch.utils.data import DataLoader, Dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers.optimization import get_linear_schedule_with_warmup


class GPT2Dataset(Dataset):
	"""
	Dataset used for the yuyuan medical QA task.
	Only supports small datasets; it may be slow on large datasets.
	For large datasets, please use mmap datasets (work in progress).
	"""

	def __init__(self, data_path, name, args):
		super().__init__()
		self.tokenizer = GPT2Tokenizer.from_pretrained(args.pretrained_model_path)
		self.tokenizer.add_special_tokens({'pad_token': '<|endoftext|>'})
		self.data_size = os.path.getsize(data_path) / 1024 / 1024 / 1024
		self.data_type_name = name
		self.data = self.load_data(data_path)
		self.max_seq_length = args.max_seq_length

	def __len__(self):
		return len(self.data)

	def __getitem__(self, index):
		return self.encode(self.data[index])

	def load_data(self, data_path):
		# Small files (<= 5 GB) are read into memory at once; larger files are streamed line by line.
		if self.data_size <= 5:
			with open(data_path, "rt", encoding='utf8') as f:
				lines = f.readlines()
			data_gen = lines
		else:
			data_gen = open(data_path, "rt", encoding='utf8')

		data = []
		for idx, line in enumerate(data_gen):
			data.append(line)

		if self.data_size > 5:
			data_gen.close()
		return data

	def encode(self, item):
		"""
		Convert a raw text line into model inputs (input_ids, attention_mask, labels).
		"""
		inputs_dict = self.tokenizer.encode_plus(item, max_length=self.max_seq_length, padding='max_length',
		                                         truncation=True, return_tensors='pt')
		target = inputs_dict["input_ids"]
		labels = target.clone().detach()
		labels[target == self.tokenizer.pad_token_id] = -100

		labels = labels.squeeze().numpy().tolist()
		if -100 in labels:
			labels[labels.index(-100)] = 50256

		return {
			"input_ids": inputs_dict["input_ids"].squeeze(),
			"attention_mask": inputs_dict["attention_mask"].squeeze(),
			"labels": th.tensor(labels)
		}


class GPT2DataModel(pl.LightningDataModule):
	@staticmethod
	def add_data_specific_args(parent_args):
		parser = parent_args.add_argument_group('GPT2DataModel')
		parser.add_argument('--data_dir', type=str, required=True)
		parser.add_argument('--num_workers', default=0, type=int)
		parser.add_argument('--train_data', default='train.txt', type=str)
		parser.add_argument('--valid_data', default='valid.txt', type=str)
		parser.add_argument('--test_data', default='test.txt', type=str)
		parser.add_argument('--train_batchsize', type=int, required=True)
		parser.add_argument('--valid_batchsize', type=int, required=True)
		parser.add_argument('--max_seq_length', default=512, type=int)
		return parent_args

	def __init__(self, args):
		super().__init__()
		self.args = args
		self.train_batchsize = args.train_batchsize
		self.valid_batchsize = args.valid_batchsize
		if not args.do_eval_only:
			self.train_data = GPT2Dataset(os.path.join(args.data_dir, args.train_data), '训练集', args)
			self.valid_data = GPT2Dataset(os.path.join(args.data_dir, args.valid_data), '验证集', args)
		self.test_data = GPT2Dataset(os.path.join(args.data_dir, args.test_data), '测试集', args)

	def train_dataloader(self):
		return DataLoader(self.train_data, shuffle=True, batch_size=self.train_batchsize, drop_last=True,
		                  pin_memory=True, num_workers=self.args.num_workers)

	def val_dataloader(self):
		return DataLoader(self.valid_data, shuffle=False, batch_size=self.valid_batchsize, drop_last=True,
		                  pin_memory=True, num_workers=self.args.num_workers)

	def predict_dataloader(self):
		return DataLoader(self.test_data, shuffle=False, batch_size=self.valid_batchsize, drop_last=True,
		                  pin_memory=True, num_workers=self.args.num_workers)


class GPT2FinetuneMedicalQAModelCheckpoint:
	@staticmethod
	def add_argparse_args(parent_args):
		parser = parent_args.add_argument_group('BaseModel')

		parser.add_argument('--monitor', default='train_loss', type=str)
		parser.add_argument('--mode', default='min', type=str)
		parser.add_argument('--dirpath', default='./ckpt/', type=str)
		parser.add_argument('--filename', default='model-{epoch:02d}-{train_loss:.4f}', type=str)
		parser.add_argument('--save_last', action='store_true', default=True)
		parser.add_argument('--save_top_k', default=3, type=float)
		parser.add_argument('--every_n_train_steps', default=1000, type=float)
		parser.add_argument('--save_weights_only', default=True, type=bool)

		return parent_args

	def __init__(self, args):
		self.callbacks = ModelCheckpoint(monitor=args.monitor, save_top_k=args.save_top_k, mode=args.mode,
		                                 save_weights_only=args.save_weights_only, dirpath=args.dirpath,
		                                 filename=args.filename, save_last=args.save_last)


class GPT2Finetune(pl.LightningModule):

	@staticmethod
	def add_model_specific_args(parent_args):
		parser = parent_args.add_argument_group("BaseModel")
		parser.add_argument("--learning_rate", default=1e-4, type=float)
		parser.add_argument("--weight_decay", default=0.1, type=float)
		parser.add_argument("--warmup", default=0.01, type=float)
		return parent_args

	def __init__(self, args, num_data):
		super().__init__()
		self.args = args
		self.num_data = num_data
		self.model = GPT2LMHeadModel.from_pretrained(args.pretrained_model_path)

	def setup(self, stage) -> None:
		if stage == 'fit':
			num_gpus = self.trainer.gpus if self.trainer.gpus is not None else 0
			self.total_step = int(self.trainer.max_epochs * self.num_data /
			                      (max(1, num_gpus) * self.trainer.accumulate_grad_batches))
			print('Total training step:', self.total_step)

	def training_step(self, batch, batch_idx):
		output = self.model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'],
		                    labels=batch['labels'])
		# output = self.model(input_ids=batch['input_ids'], labels=batch['labels'])
		# acc = self.comput_metrix(output.logits, batch['labels'])
		self.log('train_loss', output.loss)
		return output.loss

	def comput_metrix(self, logits, labels):
		y_pred = th.argmax(logits, dim=-1)
		y_pred = y_pred.view(size=(-1,))
		y_true = labels.view(size=(-1,)).float()
		corr = th.eq(y_pred, y_true)
		acc = th.sum(corr.float()) / labels.size()[0]
		return acc

	def validation_step(self, batch, batch_idx):
		output = self.model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'],
		                    labels=batch['labels'])
		self.log('val_loss', output.loss)

	def configure_optimizers(self):
		no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
		paras = list(filter(lambda p: p[1].requires_grad, self.named_parameters()))
		paras = [{
			'params':
				[p for n, p in paras if not any(nd in n for nd in no_decay)],
			'weight_decay': self.args.weight_decay
		}, {
			'params': [p for n, p in paras if any(nd in n for nd in no_decay)],
			'weight_decay': 0.0
		}]
		optimizer = th.optim.AdamW(paras, lr=self.args.learning_rate)
		scheduler = get_linear_schedule_with_warmup(
			optimizer, int(self.total_step * self.args.warmup),
			self.total_step)

		return [{
			'optimizer': optimizer,
			'lr_scheduler': {
				'scheduler': scheduler,
				'interval': 'step',
				'frequency': 1
			}
		}]


@record
def train():
	total_parser = argparse.ArgumentParser("Summary Task")
	total_parser.add_argument('--local_rank', type=int)
	total_parser.add_argument('--do_eval_only', action='store_true', default=False)
	total_parser.add_argument('--pretrained_model_path', default=None, type=str)
	total_parser.add_argument('--output_save_path', default='./predict.json', type=str)
	# * Args for data preprocessing
	total_parser = GPT2DataModel.add_data_specific_args(total_parser)
	# * Args for training
	total_parser = Trainer.add_argparse_args(total_parser)
	total_parser = GPT2FinetuneMedicalQAModelCheckpoint.add_argparse_args(total_parser)
	total_parser = GPT2Finetune.add_model_specific_args(total_parser)
	# * Args for base model
	args = total_parser.parse_args()

	data_model = GPT2DataModel(args)
	model = GPT2Finetune(args, len(data_model.train_dataloader()))
	checkpoint_callback = GPT2FinetuneMedicalQAModelCheckpoint(args).callbacks
	logger = loggers.TensorBoardLogger(save_dir=os.path.join(args.default_root_dir, 'log/'), name='MedicalQA-GPT2')
	trainer = Trainer.from_argparse_args(args, logger=logger, callbacks=[checkpoint_callback])
	trainer.tune(model)
	trainer.fit(model, data_model)

	model.model.save_pretrained("./models/finetune/gpt2")


if __name__ == '__main__':
	train()

Node0 log:

$ bash finetune_gpt2.sh 
++ date
+ echo 'START TIME: 2022年 09月 01日 星期四 16:34:02 CST'
START TIME: 2022年 09月 01日 星期四 16:34:02 CST
+ MICRO_BATCH_SIZE=1
++ pwd
+ ROOT_DIR=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog
+ ZERO_STAGE=3
+ config_json=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/training_config.json
+ cat
+ export PL_DEEPSPEED_CONFIG_PATH=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/training_config.json
+ PL_DEEPSPEED_CONFIG_PATH=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/training_config.json
+ TRAINER_ARGS='
    --max_epochs 1     --num_nodes 2     --gpus 8     --strategy deepspeed_stage_3_offload     --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog     --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt     --save_top_k 3     --monitor train_loss     --mode min     --save_last '
+ DATA_DIR=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets
+ DATA_ARGS='
    --data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets     --max_seq_length 64     --train_batchsize 1     --valid_batchsize 1     --train_data test_train.txt     --valid_data test.txt     --test_data  test.txt
'
+ PRETRAINED_MODEL_PATH=IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese
+ MODEL_ARGS='
    --pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese     --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json     --learning_rate 1e-4     --weight_decay 0.1     --warmup 0.01 '
+ MASTER_ADDR=IP
+ MASTER_PORT=9010
+ DISTRIBUTED_ARGS='
    --nnodes 2     --nproc_per_node=8     --master_addr IP     --master_port 9010     --node_rank 0     --max_restarts=0
'
+ SCRIPTS_PATH=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py
+ export 'CMD=     
    --nnodes 2     --nproc_per_node=8     --master_addr IP     --master_port 9010     --node_rank 0     --max_restarts=0
     /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py     
    --max_epochs 1     --num_nodes 2     --gpus 8     --strategy deepspeed_stage_3_offload     --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog     --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt     --save_top_k 3     --monitor train_loss     --mode min     --save_last      
    --pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese     --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json     --learning_rate 1e-4     --weight_decay 0.1     --warmup 0.01      
    --data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets     --max_seq_length 64     --train_batchsize 1     --valid_batchsize 1     --train_data test_train.txt     --valid_data test.txt     --test_data  test.txt
 '
+ CMD='     
    --nnodes 2     --nproc_per_node=8     --master_addr IP     --master_port 9010     --node_rank 0     --max_restarts=0
     /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py     
    --max_epochs 1     --num_nodes 2     --gpus 8     --strategy deepspeed_stage_3_offload     --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog     --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt     --save_top_k 3     --monitor train_loss     --mode min     --save_last      
    --pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese     --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json     --learning_rate 1e-4     --weight_decay 0.1     --warmup 0.01      
    --data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets     --max_seq_length 64     --train_batchsize 1     --valid_batchsize 1     --train_data test_train.txt     --valid_data test.txt     --test_data  test.txt
 '
+ export NCCL_SOCKET_IFNAME=enp129s0f0
+ NCCL_SOCKET_IFNAME=enp129s0f0
+ export NCCL_IB_DISABLE=1
+ NCCL_IB_DISABLE=1
+ OMP_NUM_THREADS=1
+ MKL_NUM_THREADS=1
+ torchrun --nnodes 2 --nproc_per_node=8 --master_addr IP --master_port 9010 --node_rank 0 --max_restarts=0 /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py --max_epochs 1 --num_nodes 2 --gpus 8 --strategy deepspeed_stage_3_offload --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt --save_top_k 3 --monitor train_loss --mode min --save_last --pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json --learning_rate 1e-4 --weight_decay 0.1 --warmup 0.01 --data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets --max_seq_length 64 --train_batchsize 1 --valid_batchsize 1 --train_data test_train.txt --valid_data test.txt --test_data test.txt
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
initializing deepspeed distributed: GLOBAL_RANK: 5, MEMBER: 6/16
initializing deepspeed distributed: GLOBAL_RANK: 7, MEMBER: 8/16
/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:445: LightningDeprecationWarning: Setting `Trainer(gpus=8)` is deprecated in v1.7 and will be removed in v2.0. Please use `Trainer(accelerator='gpu', devices=8)` instead.
  rank_zero_deprecation(
Loading DeepSpeed config from set PL_DEEPSPEED_CONFIG_PATH environment variable
initializing deepspeed distributed: GLOBAL_RANK: 3, MEMBER: 4/16
initializing deepspeed distributed: GLOBAL_RANK: 4, MEMBER: 5/16
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/16
initializing deepspeed distributed: GLOBAL_RANK: 6, MEMBER: 7/16
initializing deepspeed distributed: GLOBAL_RANK: 1, MEMBER: 2/16
initializing deepspeed distributed: GLOBAL_RANK: 2, MEMBER: 3/16
Total training step: 5776
Total training step: 5776
Total training step: 5776
Total training step: 5776
Total training step:Total training step: 5776 5776

Total training step: 5776
/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:2179: LightningDeprecationWarning: `Trainer.gpus` was deprecated in v1.6 and will be removed in v1.8. Please use `Trainer.num_devices` or `Trainer.device_ids` to get device information instead.
  rank_zero_deprecation(
Total training step: 5776
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
You have specified an optimizer and/or scheduler within the DeepSpeed config. It is recommended to define it in `LightningModule.configure_optimizers`.
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/liuzhaofeng/.cache/torch_extensions/py39_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.11814284324646 seconds
Time to load cpu_adam op: 2.935582160949707 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.088909149169922 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.8845674991607666 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.0597143173217773 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.990757942199707 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.0667808055877686 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.0561718940734863 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Emitting ninja build file /home/liuzhaofeng/.cache/torch_extensions/py39_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.41974496841430664 seconds
Loading extension module utils...
Time to load utils op: 0.20378637313842773 seconds
Loading extension module utils...
Time to load utils op: 0.4045596122741699 seconds
Loading extension module utils...
Time to load utils op: 0.40416693687438965 seconds
Loading extension module utils...
Time to load utils op: 0.30461955070495605 seconds
Loading extension module utils...
Time to load utils op: 0.40419793128967285 seconds
Loading extension module utils...
Time to load utils op: 0.5049059391021729 seconds
Loading extension module utils...
Time to load utils op: 0.4040186405181885 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Time to load utils op: 0.00046753883361816406 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005242824554443359 secondsNo modifications detected for re-loaded extension module utils, skipping build step...

Loading extension module utils...
Time to load utils op: 0.0005707740783691406 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005829334259033203 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0007224082946777344 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0009551048278808594 seconds
Time to load utils op: 0.0009477138519287109 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004756450653076172 seconds

  | Name  | Type            | Params | Params per Device
--------------------------------------------------------------
0 | model | GPT2LMHeadModel | 3.6 B  | 222 M            
--------------------------------------------------------------
3.6 B     Trainable params
0         Non-trainable params
3.6 B     Total params
14,225.080Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:219: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Sanity Checking DataLoader 0: 100%|█| 2/2 [01:57<00:00, 58.51s/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:219: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 0:   0%|                       | 0/2896 [00:00<?, ?it/s]

Node1 log:

$ bash finetune_gpt2.sh 
++ date
+ echo 'START TIME: 2022年 09月 01日 星期四 16:34:09 CST'
START TIME: 2022年 09月 01日 星期四 16:34:09 CST
+ MICRO_BATCH_SIZE=1
++ pwd
+ ROOT_DIR=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog
+ ZERO_STAGE=3
+ config_json=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/training_config.json
+ cat
+ export PL_DEEPSPEED_CONFIG_PATH=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/training_config.json
+ PL_DEEPSPEED_CONFIG_PATH=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/training_config.json
+ TRAINER_ARGS='
    --max_epochs 1     --num_nodes 2     --gpus 8     --strategy deepspeed_stage_3_offload     --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog     --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt     --save_top_k 3     --monitor train_loss     --mode min     --save_last '
+ DATA_DIR=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets
+ DATA_ARGS='
    --data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets     --max_seq_length 64     --train_batchsize 1     --valid_batchsize 1     --train_data test_train.txt     --valid_data test.txt     --test_data  test.txt
'
+ PRETRAINED_MODEL_PATH=IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese
+ MODEL_ARGS='
    --pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese     --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json     --learning_rate 1e-4     --weight_decay 0.1     --warmup 0.01 '
+ MASTER_ADDR=IP
+ MASTER_PORT=9010
+ DISTRIBUTED_ARGS='
    --nnodes 2     --nproc_per_node=8     --master_addr IP     --master_port 9010     --node_rank 1     --max_restarts=0
'
+ SCRIPTS_PATH=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py
+ export 'CMD=     
    --nnodes 2     --nproc_per_node=8     --master_addr IP     --master_port 9010     --node_rank 1     --max_restarts=0
     /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py     
    --max_epochs 1     --num_nodes 2     --gpus 8     --strategy deepspeed_stage_3_offload     --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog     --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt     --save_top_k 3     --monitor train_loss     --mode min     --save_last      
    --pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese     --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json     --learning_rate 1e-4     --weight_decay 0.1     --warmup 0.01      
    --data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets     --max_seq_length 64     --train_batchsize 1     --valid_batchsize 1     --train_data test_train.txt     --valid_data test.txt     --test_data  test.txt
 '
+ CMD='     
    --nnodes 2     --nproc_per_node=8     --master_addr IP     --master_port 9010     --node_rank 1     --max_restarts=0
     /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py     
    --max_epochs 1     --num_nodes 2     --gpus 8     --strategy deepspeed_stage_3_offload     --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog     --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt     --save_top_k 3     --monitor train_loss     --mode min     --save_last      
    --pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese     --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json     --learning_rate 1e-4     --weight_decay 0.1     --warmup 0.01      
    --data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets     --max_seq_length 64     --train_batchsize 1     --valid_batchsize 1     --train_data test_train.txt     --valid_data test.txt     --test_data  test.txt
 '
+ export NCCL_SOCKET_IFNAME=enp129s0f0
+ NCCL_SOCKET_IFNAME=enp129s0f0
+ export NCCL_IB_DISABLE=1
+ NCCL_IB_DISABLE=1
+ OMP_NUM_THREADS=1
+ MKL_NUM_THREADS=1
+ torchrun --nnodes 2 --nproc_per_node=8 --master_addr IP --master_port 9010 --node_rank 1 --max_restarts=0 /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py --max_epochs 1 --num_nodes 2 --gpus 8 --strategy deepspeed_stage_3_offload --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt --save_top_k 3 --monitor train_loss --mode min --save_last --pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json --learning_rate 1e-4 --weight_decay 0.1 --warmup 0.01 --data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets --max_seq_length 64 --train_batchsize 1 --valid_batchsize 1 --train_data test_train.txt --valid_data test.txt --test_data test.txt
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
initializing deepspeed distributed: GLOBAL_RANK: 9, MEMBER: 10/16
initializing deepspeed distributed: GLOBAL_RANK: 14, MEMBER: 15/16
initializing deepspeed distributed: GLOBAL_RANK: 10, MEMBER: 11/16
initializing deepspeed distributed: GLOBAL_RANK: 13, MEMBER: 14/16
initializing deepspeed distributed: GLOBAL_RANK: 11, MEMBER: 12/16
initializing deepspeed distributed: GLOBAL_RANK: 15, MEMBER: 16/16
initializing deepspeed distributed: GLOBAL_RANK: 12, MEMBER: 13/16
initializing deepspeed distributed: GLOBAL_RANK: 8, MEMBER: 9/16
Total training step:Total training step:  57765776

Total training step: 5776
Total training step: Total training step:5776 
Total training step:5776 
5776
Total training step: 5776
Total training step: 5776
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/liuzhaofeng/.cache/torch_extensions/py39_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.1943840980529785 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.1576390266418457 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.1438350677490234 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.1594393253326416 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.2506330013275146 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.2735142707824707 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.295503854751587 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.099400281906128 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Emitting ninja build file /home/liuzhaofeng/.cache/torch_extensions/py39_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.4705345630645752 seconds
Loading extension module utils...
Time to load utils op: 0.4043726921081543 seconds
Loading extension module utils...
Time to load utils op: 0.4040102958679199 seconds
Loading extension module utils...
Time to load utils op: 0.5046920776367188 seconds
Loading extension module utils...
Time to load utils op: 0.40497612953186035 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.5046734809875488 seconds
Loading extension module utils...
Time to load utils op: 0.504218339920044 seconds
Time to load utils op: 0.504563570022583 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004253387451171875 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004715919494628906 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...

Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Time to load utils op: 0.0009021759033203125 seconds
Time to load utils op: 0.0009865760803222656 seconds
Time to load utils op: 0.0009179115295410156 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
No modifications detected for re-loaded extension module utils, skipping build step...Loading extension module utils...

Loading extension module utils...
Time to load utils op: 0.0006961822509765625 seconds
Time to load utils op: 0.0010385513305664062 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00115966796875 seconds

GPU, GPU memory, CPU, and host memory usage: [screenshot]

@koking0
Contributor Author

koking0 commented Sep 9, 2022

Solved: after replacing the PyTorch Lightning Trainer with the HuggingFace Trainer, multi-node multi-GPU training works.
My guess is that it was some kind of DataLoader synchronization issue inside PyTorch Lightning. Stuck on this for the better part of a month, finally resolved.
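
A minimal sketch of what such a HuggingFace Trainer setup could look like while reusing the same DeepSpeed config (illustrative only; the original code was not kept, and train_dataset stands in for something like the GPT2Dataset shown earlier in this thread):

from transformers import (GPT2LMHeadModel, Trainer, TrainingArguments,
                          default_data_collator)

model = GPT2LMHeadModel.from_pretrained("IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese")

training_args = TrainingArguments(
    output_dir="./ckpt",
    per_device_train_batch_size=1,     # MICRO_BATCH_SIZE from the launch script
    num_train_epochs=1,
    learning_rate=1e-5,
    bf16=True,                         # mirrors the bf16 block in training_config.json
    deepspeed="training_config.json",  # the same ZeRO-3 offload config generated above
    logging_steps=1000,
    save_total_limit=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,       # e.g. the GPT2Dataset from finetune_gpt2.py
    data_collator=default_data_collator,
)
trainer.train()  # launched with the same torchrun command on both nodes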

@Zyriix

Zyriix commented Apr 25, 2023

Hi, may I ask whether you were using a torch lightning model? I've run into the same problem. Could I see how your huggingface trainer is written?

@Zyriix

Zyriix commented Apr 25, 2023

@koking0

@koking0
Contributor Author

koking0 commented Apr 27, 2023

> Hi, may I ask whether you were using a torch lightning model? I've run into the same problem. Could I see how your huggingface trainer is written?

I don't remember the details anymore; the code from back then wasn't kept. [facepalm]

@YIFanH

YIFanH commented Oct 7, 2023

@Zyriix Did you solve it? Same setup here, pytorch lightning multi-node multi-GPU. I'm also stuck at torch.distributed.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size). Could I ask how you fixed it on your side?

@Zyriix

Zyriix commented Oct 10, 2023

> @Zyriix Did you solve it? Same setup here, pytorch lightning multi-node multi-GPU, also stuck at torch.distributed.init_process_group(...). Could I ask how you fixed it on your side?

This is usually caused by the data being inconsistent across nodes at gather time. Concretely, there are a few possibilities:

  1. Inconsistent loss_dict: e.g. you define several losses, and node1 computes loss1 while node2 computes loss2; if your loss_dict does not set default values for these losses, the gather waits forever
  2. Inconsistent metric_dict: e.g. node1 logs an L2 loss while node2 logs an IoU; the same thing happens

Checking these two usually resolves it. Simply turning off sync in log_dict, i.e. not computing metrics across nodes, also reduces the chance of hitting this problem (a minimal sketch follows).
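
A minimal LightningModule sketch of both points; the two heads and the routing flag are made up for illustration:

import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class BranchyModule(pl.LightningModule):
    """Toy module where different batches exercise different branches."""

    def __init__(self):
        super().__init__()
        self.head_a = torch.nn.Linear(16, 1)
        self.head_b = torch.nn.Linear(16, 1)

    def training_step(self, batch, batch_idx):
        x, y, use_a = batch["x"], batch["y"], batch["use_a"]
        # Defaults keep every loss key present on every rank, even when the
        # corresponding branch is skipped for this batch.
        loss_a = torch.zeros((), device=self.device)
        loss_b = torch.zeros((), device=self.device)
        if use_a:
            loss_a = F.mse_loss(self.head_a(x), y)
        else:
            loss_b = F.mse_loss(self.head_b(x), y)
        loss = loss_a + loss_b
        # sync_dist=False: do not reduce these logged metrics across nodes.
        self.log_dict({"train_loss": loss, "loss_a": loss_a, "loss_b": loss_b},
                      sync_dist=False)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)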

@jacken3

jacken3 commented Nov 16, 2023

> (quoting @Zyriix's explanation above about inconsistent loss_dict / metric_dict across nodes)

Hello, I've recently hit the same problem in my training code, and it matches your description closely: in my program the computation graph is highly data-dependent, and certain inputs never pass through parts of the network, so the loss differs across GPUs. Could we exchange contact details to discuss this further?

@jiemosang

> (quoting @Zyriix's explanation and @jacken3's reply above)

Same kind of thing here: specific data goes through specific branches, and once syncbn is enabled across machines the run hangs. Very similar to your case. Did you manage to solve it?

@Zyriix

Zyriix commented Dec 4, 2023

> (quoting @jacken3's reply above about a data-dependent computation graph)

You can try the fix mentioned earlier: give every loss a default value. For example:
loss1 = 0
loss2 = 0
if cond1:
    loss1 = mse(net1(input), target)
else:
    loss2 = mse(net2(input), target)
loss = loss1 + loss2
loss_dict = {'loss': loss, 'loss1': loss1, 'loss2': loss2}
return loss, loss_dict

Besides the losses, distributed training also gathers gradients. The differing computation graphs you describe can make the gradients differ between nodes, so the program ends up waiting for gradients forever. Using deepspeed stage2 might resolve this kind of problem; I'm not sure, but it's worth a try.
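
For plain DDP runs (no DeepSpeed), a related option is to let DDP detect parameters that took no part in the forward pass instead of waiting for their gradients; a sketch, untested for the setup in this thread:

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# find_unused_parameters=True makes the DDP reducer mark parameters that did
# not participate in this step's forward pass, so it stops waiting for their
# gradients. It adds per-step overhead, so enable it only when the computation
# graph really differs between ranks.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    num_nodes=2,
    strategy=DDPStrategy(find_unused_parameters=True),
)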

@matrix-yang

> Same kind of thing here: specific data goes through specific branches, and once syncbn is enabled across machines the run hangs ... Did you manage to solve it?

Same for me. I'm training an MoE where different tokens take different branches, and it hangs during backward. Did you find a solution?

@Klein-Lan

> Besides the losses, distributed training also gathers gradients ... Using deepspeed stage2 might resolve this kind of problem; I'm not sure, but it's worth a try.

Yes, I also hit the hang when training an MoE. I tried it with only 1 expert and the multi-GPU run was fine, so I suspect that different GPUs picking different experts produce different gradients, and the backward parameter update cannot proceed. I'm using deepspeed stage2 and it still doesn't fix the problem.

@rationalspark

I ran into the same problem. With deepspeed stage1 it runs for a while before failing; with stage 2 it times out on the very first step. I came up with a crude workaround: take the sum of squares of the parameters that sometimes go unused, multiply it by a tiny coefficient, and add it to the loss, so those parameters always appear in the computation graph and a gradient can be computed for them (a sketch is below). See whether that works for you.
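
A sketch of that trick (the coefficient and the module list are placeholders to adapt to your model):

import torch


def anchor_unused_params(loss, maybe_unused_modules, coeff=1e-12):
    """Add a tiny L2 term over parameters that may be skipped this step, so
    they always enter the computation graph and every rank produces a gradient
    for them."""
    anchor = torch.zeros((), device=loss.device)
    for module in maybe_unused_modules:
        for p in module.parameters():
            if p.requires_grad:
                anchor = anchor + p.pow(2).sum().to(loss.device)
    return loss + coeff * anchor

# e.g. inside training_step, for MoE experts a batch might not route to:
# loss = anchor_unused_params(loss, [self.expert_1, self.expert_2])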

@rationalspark

Following the method above, this problem seems to be solved. There are still other errors, though, so the grind continues.
