
Fix unsupported setting of self._n_gpu in training_args on XPU devices #27716

Merged · 2 commits · Dec 1, 2023

Conversation

Liangliang-Ma (Contributor)

In the current training_args, self._n_gpu is set to device_count on XPU devices, which causes a crash there.
In Trainer, if self.args.n_gpu is greater than one, the model is wrapped with torch.nn.DataParallel. But IPEX (intel_extension_for_pytorch) does not support DataParallel and suggests using DDP instead.
So to make the Hugging Face Trainer work on Intel devices, this fix should be applied.
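The logic being changed can be sketched as follows. This is a minimal, hedged illustration of the idea behind the fix, not the exact upstream diff; `select_n_gpu`, `device_type`, and `device_count` are hypothetical stand-ins for the values `TrainingArguments._setup_devices` derives from torch/accelerate at runtime.

```python
def select_n_gpu(device_type: str, device_count: int) -> int:
    """Return the n_gpu value Trainer should see for a given backend.

    Hypothetical helper illustrating the fix: on XPU, report a single
    device so Trainer never wraps the model in torch.nn.DataParallel,
    which IPEX does not support. Multi-XPU runs instead go through DDP
    (DistributedType.MULTI_XPU in accelerate).
    """
    if device_type == "xpu":
        # Previously this returned device_count, which made Trainer
        # choose DataParallel and crash on Intel GPUs.
        return 1
    # CUDA path unchanged: DataParallel is used when more than one
    # GPU is visible.
    return device_count


print(select_n_gpu("xpu", 4))   # 1 -> no DataParallel wrapping
print(select_n_gpu("cuda", 4))  # 4 -> DataParallel as before
```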

@ArthurZucker (Collaborator) left a comment

Hey! This seems to have been introduced in #25714, so I am not convinced that the fix is as simple as that!
Would you mind also sharing a reproducer of the issue? It might be related to specific hardware/software versions.

@Liangliang-Ma (Contributor, Author) commented Nov 28, 2023

> Hey! This seems to have been introduced in #25714, so I am not convinced that the fix is as simple as that! Would you mind also sharing a reproducer of the issue? It might be related to specific hardware/software versions.

Hi @ArthurZucker! I discussed the issue with @abhilash1910 before this PR and we agreed it can be fixed like this. It can be reproduced in a multi-Intel-GPU environment, since the DistributedType is set to MULTI_XPU in accelerate. Most example scripts that use Trainer with more than one GPU should serve as a reproducer.

@abhilash1910 (Contributor) left a comment

Thanks @Liangliang-Ma for addressing this.
@ArthurZucker yes, the current limitation of pure DP on IPEX only allows us to use n_gpu=1.
Could you help re-trigger the CI test? Thanks.

@ArthurZucker (Collaborator) left a comment

Alright, thanks all for checking; triggering the CI.

@Liangliang-Ma (Contributor, Author)

@ArthurZucker It seems the CI got blocked by a network issue. Could you please check on that? Thanks.

@ArthurZucker (Collaborator)

It seems I can't. Would you mind merging with main to trigger it?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@ArthurZucker ArthurZucker merged commit 9ddbb69 into huggingface:main Dec 1, 2023
20 checks passed
@ArthurZucker (Collaborator)

Thanks both 🤗

4 participants