
DeepSpeed PP requires the same seq-length (pad during collate) and batch size (set the dataloader's drop_last to True) #25

Closed
Youngluc opened this issue Jun 21, 2024 · 11 comments


@Youngluc

There is really very little material online about DeepSpeed pipeline parallelism... I've hit a problem and would appreciate your analysis when you have time.
Following your code, I wrote training code for another VLM, and during training I run into a strange issue.
Setting: num_stages=4, ngpus_per_node=8, so pp=4 and dp=2. rank0 and rank1 each get their own batch, B1 and B2, whose sequence lengths are N1 and N2 respectively.
Then autograd fails with a "Mismatch shape" error: the grad has shape N1 while the output has shape N2, as if autograd paired one rank's batch B1 with the other rank's batch B2 during the backward pass.

I have no idea how to start debugging this, and I couldn't find any related material.

P.S.: I added prints inside the LLM's BlockPipeLayer and found that B1 forwards through all the layers, while B2 only forwards through the first ~20 layers; the later layers never run. Is some synchronization broken somewhere?
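
A small illustrative snippet of the topology described above (the numbers are the ones quoted in this comment; PipeDataParallelTopology is DeepSpeed's helper class, but the snippet itself is only a sketch, not the training script in question):

from deepspeed.runtime.pipe.topology import PipeDataParallelTopology

world_size = 8                      # ngpus_per_node = 8 (single node)
num_stages = 4                      # pipeline-parallel degree (pp)
dp = world_size // num_stages       # data-parallel degree (dp) = 2

# DeepSpeed lays the 8 ranks out on a pipe x data grid: 4 stages, 2 replicas,
# so two different batches (B1 and B2) are indeed in flight at the same time.
topo = PipeDataParallelTopology(num_pp=num_stages, num_dp=dp)
print(topo.get_dim('pipe'), topo.get_dim('data'))  # -> 4 2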

@Youngluc
Author

Looking at it more carefully, it's not across ranks; it's within rank0: when the tensor is transferred from GPU1's pipe (layer0~layer20) to GPU2's pipe (layer21-34), the sequence length changes in transit (2369 -> 2262). Why is that?

@Coobiw
Owner

Coobiw commented Jun 21, 2024 via email

@Youngluc
Author

Hi, two questions first: 1. Is your sequence processing done the same way as in this repo? My preprocess never produces sequences of unequal length. 2. Could you share the error log and a screenshot of your output?


Hi, thanks for the reply.

  1. I use InternVL's sequence-processing method; the sequence length within a batch is definitely uniform, and that is already handled in preprocess and in the collator.
  2. The final error is (one of two; the other one is identical):
    RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([4, 2361, 6144]) and output[0] has a shape of torch.Size([4, 2481, 6144])

A fairly complete error report (I printed each layer's input_embeds.shape, plus a tag I pass in from the collator: a tensor concatenating input_ids.sum() with random.randint(100, 2000)):
dynamic ViT batch size: 29, images per sample: 7.25, dynamic token length: 2262
[2024-06-21 16:40:50,826] [INFO] [checkpointing.py:539:forward] Activation Checkpointing Information
[2024-06-21 16:40:50,826] [INFO] [checkpointing.py:540:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-06-21 16:40:50,826] [INFO] [checkpointing.py:541:forward] ----contiguous Memory Checkpointing False with None total layers
[2024-06-21 16:40:50,826] [INFO] [checkpointing.py:543:forward] ----Synchronization False
[2024-06-21 16:40:50,826] [INFO] [checkpointing.py:544:forward] ----Profiling time in checkpointing False
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.0 cuda:0 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.1 cuda:0 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.2 cuda:0 tensor([712564646, 1668], device='cuda:0')
dynamic ViT batch size: 31, images per sample: 7.75, dynamic token length: 2361
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.3 cuda:0 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.0 cuda:1 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.4 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.1 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.5 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.2 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.6 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.3 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.7 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.4 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.8 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.5 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.9 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.6 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.10 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.7 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.11 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.8 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.12 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.9 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.13 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.10 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.14 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.11 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.15 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.12 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.16 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.13 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.17 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.14 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.18 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.15 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.19 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.16 cuda:1 tensor([712564646, 1668], device='cuda:0')
[W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.17 cuda:1 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.18 cuda:1 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.19 cuda:1 tensor([757955384, 1668], device='cuda:1')
[W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.20 cuda:2 tensor([712564646, 1668], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.21 cuda:2 tensor([712564646, 1668], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.22 cuda:2 tensor([712564646, 1668], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.23 cuda:2 tensor([712564646, 1668], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.24 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.20 cuda:3 tensor([757955384, 1668], device='cuda:3')
tensor([712564646, 1668], device='cuda:2')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.21 cuda:3 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.25 cuda:2 tensor([757955384, 1668], device='cuda:3')
tensor([712564646, 1668], device='cuda:2')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.22 cuda:3 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.26 cuda:2 tensor([757955384, 1668], device='cuda:3')
tensor([712564646, 1668], device='cuda:2')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.23 cuda:3 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.27 cuda:2 tensor([712564646, 1668], device='cuda:2')
tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.28 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.24 cuda:3 tensor([712564646, 1668], device='cuda:2')
tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.29 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.25 cuda:3 tensor([712564646, 1668], device='cuda:2')
tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.30 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.26 cuda:3 tensor([712564646, 1668], device='cuda:2')
tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.31 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.27 cuda:3 tensor([712564646, 1668], device='cuda:2')
tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.32 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.28 cuda:3 tensor([712564646, 1668], device='cuda:2')
tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.33 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.29 cuda:3 tensor([712564646, 1668], device='cuda:2')
tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.30 cuda:3 tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.31 cuda:3 tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.32 cuda:3 tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.33 cuda:3 tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.34 cuda:4 tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.35 cuda:4 tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.36 cuda:4 tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.37 cuda:4 tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.38 cuda:4 tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.39 cuda:4 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.34 cuda:5 tensor([757955384, 1668], device='cuda:5')
tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.35 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.40 cuda:4 tensor([757955384, 1668], device='cuda:5')
tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.36 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.41 cuda:4 tensor([757955384, 1668], device='cuda:5')
tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.37 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.42 cuda:4 tensor([757955384, 1668], device='cuda:5')
tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.38 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.43 cuda:4 tensor([757955384, 1668], device='cuda:5')
tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.39 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.44 cuda:4 tensor([757955384, 1668], device='cuda:5')
tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.40 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.45 cuda:4 tensor([757955384, 1668], device='cuda:5')
tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.41 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.46 cuda:4 tensor([757955384, 1668], device='cuda:5')
tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.42 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.47 cuda:4 tensor([757955384, 1668], device='cuda:5')
tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.43 cuda:5 tensor([757955384, 1668], device='cuda:5')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.44 cuda:5 tensor([757955384, 1668], device='cuda:5')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.45 cuda:5 tensor([757955384, 1668], device='cuda:5')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.46 cuda:5 tensor([757955384, 1668], device='cuda:5')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.47 cuda:5 tensor([757955384, 1668], device='cuda:5')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.48 cuda:6 tensor([712564646, 1668], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.49 cuda:6 tensor([712564646, 1668], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.50 cuda:6 tensor([712564646, 1668], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.51 cuda:6 tensor([712564646, 1668], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.52 cuda:6 tensor([712564646, 1668], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.53 cuda:6 tensor([712564646, 1668], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.54 cuda:6 tensor([712564646, 1668], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.55 cuda:6 tensor([712564646, 1668], device='cuda:6')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.48 cuda:7 tensor([757955384, 1668], device='cuda:7')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.49 cuda:7 tensor([757955384, 1668], device='cuda:7')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.50 cuda:7 tensor([757955384, 1668], device='cuda:7')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.51 cuda:7 tensor([757955384, 1668], device='cuda:7')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.52 cuda:7 tensor([757955384, 1668], device='cuda:7')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.53 cuda:7 tensor([757955384, 1668], device='cuda:7')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.54 cuda:7 tensor([757955384, 1668], device='cuda:7')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.55 cuda:7 tensor([757955384, 1668], device='cuda:7')
[W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[2024-06-21 16:41:08,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 10.02 | optimizer_gradients: 32.98 | optimizer_step: 89.77
0 batch end
0 batch end
0 batch end
0 batch end
0 batch end
0 batch end
0 batch end
0 batch end
06/21/2024 16:41:11 - INFO - main - {'loss': 2.441850185394287, 'learning_rate': 0.0, 'epoch': 0.0}

Epoch 1: 0%| | 0/15698 [00:26<?, ?it/s, loss=2.44, learning_rate=0, epoch=0]�[A

Epoch 1: 0%| | 1/15698 [00:26<115:52:37, 26.58s/it, loss=2.44, learning_rate=0, epoch=0]�[Adynamic ViT batch size: 31, images per sample: 7.75, dynamic token length: 2481
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.0 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.1 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.2 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.3 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.4 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.5 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.6 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.7 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.8 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.9 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.10 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.11 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.12 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.13 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.14 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.15 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.16 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.17 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.18 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.19 cuda:1 tensor([766560813, 1374], device='cuda:1')
Traceback (most recent call last):
File "/mnt/workspace/workgroup/chenghao/video_analysis/internvl_chat_interleaved/internvl/train/intern_vl_chat_finetune_block_pp.py", line 850, in
if name == 'main':
File "/mnt/workspace/workgroup/chenghao/video_analysis/internvl_chat_interleaved/internvl/train/intern_vl_chat_finetune_block_pp.py", line 830, in main
with torch.cuda.amp.autocast(dtype=torch.bfloat16, cache_enabled=False):
File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 373, in train_batch
self._exec_schedule(sched)
File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 1373, in _exec_schedule
self._exec_instr(**cmd.kwargs)
File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 789, in exec_backward_pass
torch.autograd.backward(tensors=out_tensors, grad_tensors=grad_tensors)
File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/torch/autograd/init.py", line 244, in backward
grad_tensors
= make_grads(tensors, grad_tensors, is_grads_batched=False)
File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/torch/autograd/init.py", line 88, in _make_grads
raise RuntimeError(
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([4, 2361, 6144]) and output[0] has a shape of torch.Size([4, 2481, 6144]).
dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2369
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.0 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.1 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.2 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.3 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.4 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.5 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.6 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.7 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.8 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.9 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.10 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.11 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.12 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.13 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.14 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.15 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.16 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.17 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.18 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.19 cuda:0 tensor([790315515, 1374], device='cuda:0')

Epoch 1: 0%| | 1/15698 [00:33<146:05:44, 33.51s/it, loss=2.44, learning_rate=0, epoch=0]
Traceback (most recent call last):
File "/mnt/workspace/workgroup/chenghao/video_analysis/internvl_chat_interleaved/internvl/train/intern_vl_chat_finetune_block_pp.py", line 850, in
if name == 'main':
File "/mnt/workspace/workgroup/chenghao/video_analysis/internvl_chat_interleaved/internvl/train/intern_vl_chat_finetune_block_pp.py", line 830, in main
with torch.cuda.amp.autocast(dtype=torch.bfloat16, cache_enabled=False):
File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 373, in train_batch
self._exec_schedule(sched)
File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 1373, in _exec_schedule
self._exec_instr(**cmd.kwargs)
File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 789, in exec_backward_pass
torch.autograd.backward(tensors=out_tensors, grad_tensors=grad_tensors)
File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/torch/autograd/init.py", line 244, in backward
grad_tensors
= make_grads(tensors, grad_tensors, is_grads_batched=False)
File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/torch/autograd/init.py", line 88, in _make_grads
raise RuntimeError(
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([4, 2262, 6144]) and output[0] has a shape of torch.Size([4, 2369, 6144]).
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.20 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.21 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.22 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.23 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.24 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.25 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.26 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.27 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.28 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.29 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.30 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.31 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.32 cuda:2 tensor([790315515, 1374], device='cuda:2')

Epoch: 0it [02:15, ?it/s]
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.33 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.34 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.35 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.36 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.37 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.38 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.39 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.40 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.41 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.42 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.43 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.44 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.45 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.46 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.47 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.48 cuda:6 tensor([790315515, 1374], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.49 cuda:6 tensor([790315515, 1374], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.50 cuda:6 tensor([790315515, 1374], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.51 cuda:6 tensor([790315515, 1374], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.52 cuda:6 tensor([790315515, 1374], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.53 cuda:6 tensor([790315515, 1374], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.54 cuda:6 tensor([790315515, 1374], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.55 cuda:6 tensor([790315515, 1374], device='cuda:6')
[2024-06-21 16:41:30,578] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 225308
[2024-06-21 16:41:30,578] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 225309
[2024-06-21 16:41:31,133] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 225310

@Youngluc
Author

Youngluc commented Jun 21, 2024

One pipeline partition of the model:
[2024-06-21 16:38:26,030] [INFO] [module.py:375:_partition_layers] Partitioning pipeline stages with method parameters
stage=0 layers=21
0: TokenizerPipeLayer
1: InternLMBlockPipeLayer
2: InternLMBlockPipeLayer
3: InternLMBlockPipeLayer
4: InternLMBlockPipeLayer
5: InternLMBlockPipeLayer
6: InternLMBlockPipeLayer
7: InternLMBlockPipeLayer
8: InternLMBlockPipeLayer
9: InternLMBlockPipeLayer
10: InternLMBlockPipeLayer
11: InternLMBlockPipeLayer
12: InternLMBlockPipeLayer
13: InternLMBlockPipeLayer
14: InternLMBlockPipeLayer
15: InternLMBlockPipeLayer
16: InternLMBlockPipeLayer
17: InternLMBlockPipeLayer
18: InternLMBlockPipeLayer
19: InternLMBlockPipeLayer
20: InternLMBlockPipeLayer
stage=1 layers=14
21: InternLMBlockPipeLayer
22: InternLMBlockPipeLayer
23: InternLMBlockPipeLayer
24: InternLMBlockPipeLayer
25: InternLMBlockPipeLayer
26: InternLMBlockPipeLayer
27: InternLMBlockPipeLayer
28: InternLMBlockPipeLayer
29: InternLMBlockPipeLayer
30: InternLMBlockPipeLayer
31: InternLMBlockPipeLayer
32: InternLMBlockPipeLayer
33: InternLMBlockPipeLayer
34: InternLMBlockPipeLayer
stage=2 layers=14
35: InternLMBlockPipeLayer
36: InternLMBlockPipeLayer
37: InternLMBlockPipeLayer
38: InternLMBlockPipeLayer
39: InternLMBlockPipeLayer
40: InternLMBlockPipeLayer
41: InternLMBlockPipeLayer
42: InternLMBlockPipeLayer
43: InternLMBlockPipeLayer
44: InternLMBlockPipeLayer
45: InternLMBlockPipeLayer
46: InternLMBlockPipeLayer
47: InternLMBlockPipeLayer
48: InternLMBlockPipeLayer
stage=3 layers=11
49: InternLMBlockPipeLayer
50: InternLMBlockPipeLayer
51: InternLMBlockPipeLayer
52: InternLMBlockPipeLayer
53: InternLMBlockPipeLayer
54: InternLMBlockPipeLayer
55: InternLMBlockPipeLayer
56: InternLMBlockPipeLayer
57: FLNPipeLayer
58: LMPipeLayer
59: LossPipeLayer

You can see that the two sequences with lengths 2369 and 2481 seem to get blocked after stage0 (layer 19), and the match between grad_tensor and output also becomes quite confused...

@Coobiw
Owner

Coobiw commented Jun 21, 2024 via email

@Coobiw
Owner

Coobiw commented Jun 21, 2024

My guess is that your stage split is 1357 / 0246, and the out is usually the longer one (i.e. the max_length within the batch) while the grad has the correct length. I'd suggest checking your collator and each block's input and output; make sure they follow the protocol of DeepSpeed's pipeline module and pass everything as tensors as far as possible. Also note that the pipeline model takes a label item as input (in tuple form); check that as well.
For example, the collator should return something roughly like this function:

def collate_fn_minigpt4qwen(batch,preprocess_func):
    image_list, conversation_list = [], []

    for sample in batch:
        image_list.append(sample["image"])
        conversation_list.append(sample["conversations"])

    new_batch = \
        {
            "image": torch.stack(image_list, dim=0),
            "conversations": conversation_list,
        }
    data_dict = preprocess_func(new_batch['conversations'])

    return ((new_batch['image'], data_dict['input_ids'],data_dict['labels'],data_dict['attention_mask']),
                data_dict['labels']
        ) # here the return is Tuple[Tuple[Tensor], Tensor]

You can also check my blog post for some of the pitfalls I ran into: https://zhuanlan.zhihu.com/p/684462477
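
A rough sketch of how a collator that returns (inputs_tuple, labels) is consumed by DeepSpeed's pipeline engine (layer_specs, loss_fn, ds_config and train_loader below are placeholder names, not this repo's actual objects):

import deepspeed
from deepspeed.pipe import PipelineModule

# Each item yielded by the dataloader must be (inputs, labels):
#   - `inputs` is a tuple of tensors handed to the first pipeline stage,
#   - `labels` is forwarded to `loss_fn` on the last stage.
model = PipelineModule(layers=layer_specs, num_stages=4, loss_fn=loss_fn)
engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config,
                                       model_parameters=model.parameters())
loss = engine.train_batch(data_iter=iter(train_loader))  # runs forward + backward + optimizer step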

@Youngluc
Author

(quoting @Coobiw's reply above)

Right, the stage split is indeed 1357 / 0246, and my collator returns the Tuple[Tuple[torch.Tensor], Any] form described in your blog post. After digging further, what I observe in testing is that for the failing batches, the activation goes wrong when moving from GPU0 (or GPU1) to GPU2 (or GPU3): it takes on the sequence shape of that rank's batch from the previous step. An example:
On rank1 (GPUs 0/2/4/6), the first batch's input_embeds has shape (4, 2262, 6144) and the second batch has shape (4, 2361, 6144). The first batch's forward and backward are normal. The second batch is also fine on GPU0, where every layer's hidden_states has shape (4, 2361, 6144), but once it reaches the layers on GPU2 the shape becomes (4, 2262, 6144), and then the error is raised:
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([4, 2262, 6144]) and output[0] has a shape of torch.Size([4, 2361, 6144]).
That is the actual situation, and I'm puzzled why it happens. I changed the pipeline degree (pp=2/4/8) and it's always the same: the first step is fine and the problem starts from the second step. Could you help analyze why? Many thanks!!! (If it takes too much of your time, never mind.)

@Youngluc
Author

(quoting @Coobiw's reply above)

The first training step is fine; the problem starts from the second step. The second step's batch only propagates correctly on the GPUs of stage0; when it moves from stage0 to the stage1 GPUs, the hidden_states shape turns into the shape from step 1. Could you tell me your deepspeed and torch versions? Right now I can't figure out where the real root cause is.

@Coobiw
Owner

Coobiw commented Jun 21, 2024 via email

@Youngluc
Author

Could you show me your block code? Or DM me your contact info on Zhihu and we can look at it together when you have time?


Sure, I'll DM you on Zhihu! As for the code, it's on my work laptop so I can't copy-paste it directly 😭

@Coobiw
Owner

Coobiw commented Jul 1, 2024

Solved.

We find that DeepSpeed pipeline parallelism needs the same seq_length within a mini-batch (i.e. across all of its micro-batches) and the same batch size (so we should set drop_last to True on the dataloader).

This is a good discovery. I'll close but pin this issue.
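
A minimal sketch of the fix described here (purely illustrative: dataset, MAX_LEN and PAD_ID are placeholders, and the keys follow a plain text-only example rather than this repo's VLM collator): pad every sample to one fixed length in the collator, and set drop_last=True on the DataLoader so every batch also has the same batch size.

import torch
from torch.utils.data import DataLoader

MAX_LEN = 2048      # illustrative fixed length; use the model's max sequence length in practice
PAD_ID = 0          # illustrative pad token id

def collate_fixed_len(batch):
    """Pad every sample to MAX_LEN so all micro-batches share one seq_length."""
    ids_list = [torch.as_tensor(s["input_ids"][:MAX_LEN], dtype=torch.long) for s in batch]
    input_ids = torch.full((len(ids_list), MAX_LEN), PAD_ID, dtype=torch.long)
    labels = torch.full((len(ids_list), MAX_LEN), -100, dtype=torch.long)  # -100 is ignored by CE loss
    attention_mask = torch.zeros((len(ids_list), MAX_LEN), dtype=torch.long)
    for i, ids in enumerate(ids_list):
        input_ids[i, :len(ids)] = ids
        labels[i, :len(ids)] = ids
        attention_mask[i, :len(ids)] = 1
    # (inputs tuple, labels) -- the layout DeepSpeed's PipelineModule expects
    return ((input_ids, labels, attention_mask), labels)

# drop_last=True guarantees every batch has the same batch size, per the issue title.
loader = DataLoader(dataset, batch_size=4, collate_fn=collate_fixed_len, drop_last=True)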

@Coobiw Coobiw closed this as completed Jul 1, 2024
@Coobiw Coobiw pinned this issue Jul 1, 2024
@Coobiw Coobiw changed the title Hello, sorry to bother you with a quick question! DeepSpeed PP requires the same seq-length and batch size (just set drop_last to True) Jul 1, 2024
@Coobiw Coobiw changed the title DeepSpeed PP requires the same seq-length and batch size (just set drop_last to True) DeepSpeed PP requires the same seq-length (pad during collate) and batch size (set the dataloader's drop_last to True) Jul 1, 2024