[Auto Parallel] Compatible new comm library upgrade for XPUs. #63817

Merged
merged 1 commit into from
Apr 25, 2024

Contributor

@ZibinGuo ZibinGuo commented Apr 24, 2024

PR Category

Communication Library

PR Types

New features

Description

Following PR #56604, adapt the new static-graph distributed communication library for XPU.

paddle-bot bot commented Apr 24, 2024

Your PR has been submitted. Thanks for your contribution to this open-source project!
Please wait for the CI results first. See the Paddle CI Manual for details.

    if core.is_compiled_with_xpu():
        dev._dtype = DeviceType.XPU
    else:
        dev._dtype = DeviceType.GPU
    visible_devices = os.getenv("CUDA_VISIBLE_DEVICES")
elif 'XPU_VISIBLE_DEVICES' in os.environ:
    dev._dtype = DeviceType.XPU
Contributor

We may also need to add XPULINK_VISIBLE_DEVICES here.

Contributor

@dynamicheart dynamicheart Jul 27, 2024

@XiaociZhang

For PP8, the following must be set:

export CUDA_DEVICE_ORDER=OAM_ID
export XPULINK_VISIBLE_DEVICES=2,3,0,1,5,4,7,6

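However, this makes rank 0 map to dev2, which in turn prevents the communication library from working properly. @Thunderbrook investigated the problem; the code below illustrates it: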

The devices passed in here come from --xpus in the model launch script, which is 0,1,2,3,4,5,6,7:

Here device._labels is parsed from XPULINK_VISIBLE_DEVICES and is 2,3,0,1,5,4,7,6. get_selected_devices also returns 2,3,0,1,5,4,7,6, so rank 0 ends up on dev2.
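The mismatch described above can be reproduced with a minimal sketch. It assumes that each requested device is resolved to its position within the visible-device labels; `select_devices` is a hypothetical stand-in written for illustration, not the launcher's actual `get_selected_devices` code.

```python
def select_devices(labels, devices):
    """Hypothetical stand-in: map each requested device to its index in labels.

    Assumption: the launcher looks up every entry of --xpus inside the
    label list parsed from XPULINK_VISIBLE_DEVICES.
    """
    requested = [d.strip() for d in devices.split(",")]
    return [str(labels.index(d)) for d in requested]


# Labels as parsed from XPULINK_VISIBLE_DEVICES=2,3,0,1,5,4,7,6
labels = "2,3,0,1,5,4,7,6".split(",")

# --xpus as given in the launch script
selected = select_devices(labels, "0,1,2,3,4,5,6,7")

# Rank i uses selected[i]; rank 0 therefore lands on device 2.
print(selected)
print("rank 0 -> dev" + selected[0])
```

Under this assumption the selected list comes out as 2,3,0,1,5,4,7,6, which matches the values reported in the comment.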

Solutions:

Option 1: In addition to setting export XPULINK_VISIBLE_DEVICES=2,3,0,1,5,4,7,6, also pass --xpus "2,3,0,1,5,4,7,6", so that rank 0 still maps to dev0.

Option 2 (recommended): Remove --xpus from the training arguments.

In summary, intra-node PP8 requires:

  1. Set the environment variables:
export CUDA_DEVICE_ORDER=OAM_ID
export XPULINK_VISIBLE_DEVICES=2,3,0,1,5,4,7,6
  2. Remove --xpus from the training arguments.
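Both workarounds can be checked with a small sketch. It assumes that an empty --xpus falls back to the identity order 0..N-1 and that a non-empty one is resolved by index into the labels; `select_devices` is a hypothetical helper for illustration, not the launcher's real code.

```python
def select_devices(labels, devices=""):
    """Hypothetical stand-in for the launcher's device selection.

    Assumptions: with no --xpus the result is simply 0..len(labels)-1;
    otherwise each requested device is looked up by position in labels.
    """
    if not devices:
        return [str(i) for i in range(len(labels))]
    requested = [d.strip() for d in devices.split(",")]
    return [str(labels.index(d)) for d in requested]


# Labels as parsed from XPULINK_VISIBLE_DEVICES=2,3,0,1,5,4,7,6
labels = "2,3,0,1,5,4,7,6".split(",")

# Option 1: pass --xpus in the same order as XPULINK_VISIBLE_DEVICES.
option1 = select_devices(labels, "2,3,0,1,5,4,7,6")

# Option 2 (recommended in the thread): omit --xpus entirely.
option2 = select_devices(labels)

print(option1)
print(option2)
```

In both cases rank 0 resolves to device 0 under these assumptions. Option 2 has fewer moving parts, since the order is stated only once, in the environment variable.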

Contributor

Option 2 is preferred.

Contributor

@RuohengMa RuohengMa left a comment

LGTM

elif 'XPU_VISIBLE_DEVICES' in os.environ:
    dev._dtype = DeviceType.XPU
    visible_devices = os.getenv("XPU_VISIBLE_DEVICES")
elif 'CUDA_VISIBLE_DEVICES' in os.environ:
    if core.is_compiled_with_xpu():
        dev._dtype = DeviceType.XPU
Contributor

@houj04 houj04 Apr 25, 2024

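Could you add a comment here in a follow-up? People who don't know the background may wonder why a CUDA environment variable is being used under XPU.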

Contributor Author

OK, I'll add it in the next PR.
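For readers following the diff hunks, the branch order under review can be sketched as a standalone function, with the kind of comment requested above. This is only an illustration: `detect_device` and its string return values are hypothetical, the environment is passed in as a plain dict, and the thread itself does not state why a CUDA variable is honoured in XPU builds, so the comment stays neutral on that point.

```python
def detect_device(env, compiled_with_xpu):
    """Hypothetical sketch of the env-var branches shown in the diff.

    Returns (device_type, visible_devices); both are plain strings here
    rather than the DeviceType enum used in the real code.
    """
    if "XPU_VISIBLE_DEVICES" in env:
        return "XPU", env["XPU_VISIBLE_DEVICES"]
    if "CUDA_VISIBLE_DEVICES" in env:
        # NOTE: in an XPU build, CUDA_VISIBLE_DEVICES is still read, but the
        # device type is reported as XPU instead of GPU. The background for
        # this is not given in the thread and should be documented here.
        dtype = "XPU" if compiled_with_xpu else "GPU"
        return dtype, env["CUDA_VISIBLE_DEVICES"]
    return None, None


print(detect_device({"XPU_VISIBLE_DEVICES": "0,1"}, compiled_with_xpu=True))
print(detect_device({"CUDA_VISIBLE_DEVICES": "0,1"}, compiled_with_xpu=True))
print(detect_device({"CUDA_VISIBLE_DEVICES": "0,1"}, compiled_with_xpu=False))
```

Passing the environment as an argument instead of reading `os.environ` directly keeps the sketch easy to test without touching the real process environment.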

@houj04 houj04 merged commit 551afbc into PaddlePaddle:develop Apr 25, 2024
28 of 30 checks passed
dynamicheart added a commit to dynamicheart/Paddle that referenced this pull request May 8, 2024
@ZibinGuo
Contributor Author

ZibinGuo commented Jul 27, 2024 via email

@houj04 houj04 added the XPU label Sep 4, 2024
5 participants