[XPU] fix gm allocation on XPUContext::Impl::Init #60260

Merged on Dec 22, 2023 (1 commit)

@dynamicheart (Contributor) commented Dec 22, 2023

PR types

Bug fixes

PR changes

APIs

Description

PR #54674 forces the XPUAPI_DEFAULT_SIZE option of xdnn::Context to 1 by default, even when the environment variable XPUAPI_DEFAULT_SIZE is set to a different value. This triggers a large number of xpu_wait calls.

The following comment explains why XPUAPI_DEFAULT_SIZE was originally set to 1: #54674 (comment)
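
To make the intended behavior concrete, here is a minimal sketch of resolving such a setting so that a user-supplied environment variable takes precedence over a built-in default. The helper name `ResolveDefaultSize` and its fallback policy are assumptions made for illustration; this is not the code changed in this PR, nor the xdnn API.

```cpp
#include <cstdlib>

// Hypothetical helper (not Paddle's or xdnn's actual code): pick the size
// to use, preferring the environment variable when the user has set it.
long ResolveDefaultSize(const char* env_name, long fallback) {
  const char* raw = std::getenv(env_name);
  if (raw == nullptr || *raw == '\0') {
    return fallback;  // variable unset or empty: use the built-in default
  }
  char* end = nullptr;
  long value = std::strtol(raw, &end, 10);
  if (end == raw || *end != '\0' || value < 0) {
    return fallback;  // not a clean non-negative integer: ignore it
  }
  return value;
}
```

With this shape, a call such as `ResolveDefaultSize("XPUAPI_DEFAULT_SIZE", 1)` yields 1 only when the variable is unset or malformed, instead of overriding whatever the user exported.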

paddle-bot bot commented Dec 22, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@houj04 (Contributor) left a comment

LGTM. I suggest pinging the original author and the original approvers to take a look as well.

@houj04 (Contributor) commented Dec 22, 2023

One question: the PR referenced here is from half a year ago. Why was this problem only discovered recently?

@dynamicheart (Contributor, Author) commented

Quoting @houj04: "LGTM. I suggest pinging the original author and the original approvers to take a look as well."

@AlbertVan @zhupengyang Could you two please take a look?

@runzhech (Contributor) left a comment

lgtm

@dynamicheart (Contributor, Author) commented

Quoting @houj04: "One question: the PR referenced here is from half a year ago. Why was this problem only discovered recently?"

The XpuContext implementation in PyTorch was modeled on Paddle's implementation. The problem was first noticed on the PyTorch side, around October 2023.

@shentanyue (Contributor) left a comment

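The underlying problem is that each of the multiple xpu_context instances allocates its own buffer of XPUAPI_DEFAULT_SIZE.
The training side can follow up on this later.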
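
As a rough illustration of the point above, assuming every context reserves its own buffer of the configured size, the total reservation grows linearly with the number of contexts. The function name and the figures used here are hypothetical and not taken from Paddle.

```cpp
#include <cstddef>

// Hypothetical model (not Paddle code): each context reserves its own
// buffer of `per_context_bytes`, so the reservations simply add up.
constexpr std::size_t TotalReservedBytes(std::size_t num_contexts,
                                         std::size_t per_context_bytes) {
  return num_contexts * per_context_bytes;
}
```

For example, under this model eight contexts with a 64 MiB buffer each would reserve 512 MiB in total, which is why the per-context default matters once several contexts exist.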

@houj04 merged commit 39ddd5f into PaddlePaddle:develop on Dec 22, 2023
29 checks passed
@houj04 added the XPU label on Sep 4, 2024
4 participants