Fix unsupported setting of self._n_gpu in training_args on XPU devices #27716
Conversation
Hey! This seems to have been introduced in #25714, and thus I am not convinced that the fix is as simple as that!
Would you mind also sharing a reproducer of the issue? It might be related to specific hardware/software versions.
Hi, @ArthurZucker! I discussed the issue with @abhilash1910 before this PR, and we thought it could be fixed like this. It can be reproduced in a multi Intel GPU env, for the
Thanks @Liangliang-Ma for addressing this.
@ArthurZucker yes, the current limitation of pure DP (DataParallel) on IPEX only allows us to use n_gpu=1.
Could you help re-trigger the CI test? Thanks
Alright thanks all for checking, triggering the CI
@ArthurZucker Seems the CI got blocked by an Internet issue. Could you please check that? Thanks
Seems like I can't; would you mind merging with main to trigger it?!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thanks both 🤗
In the current training_args, self._n_gpu is set to device_count on XPU devices, which causes a crash on XPU.
In Trainer, if self.args.n_gpu is greater than one, the model is wrapped with torch.nn.DataParallel. But IPEX (intel_extension_for_pytorch) does not support DataParallel and suggests using DDP instead.
So to make the Hugging Face Trainer work on Intel devices, this fix should be applied.
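For context, here is a minimal sketch of what the fix amounts to, assuming the XPU branch of `TrainingArguments._setup_devices`; the standalone helper name and shape below are illustrative, not the exact upstream code:

```python
import torch
from transformers.utils import is_torch_xpu_available


def setup_xpu_device(args) -> torch.device:
    """Illustrative sketch of the XPU branch in TrainingArguments._setup_devices.

    Before this PR, the branch effectively did
    ``args._n_gpu = torch.xpu.device_count()``. With more than one XPU present,
    Trainer then wraps the model in torch.nn.DataParallel (triggered by
    n_gpu > 1), which IPEX does not support. Pinning _n_gpu to 1 keeps Trainer
    off that path; multi-XPU training goes through DDP instead.
    """
    if not is_torch_xpu_available():
        raise RuntimeError("No XPU device available")
    device = torch.device("xpu:0")
    torch.xpu.set_device(device)  # bind this process to a single XPU
    args._n_gpu = 1  # was: torch.xpu.device_count()
    return device
```

With `_n_gpu == 1`, Trainer skips the `n_gpu > 1` DataParallel wrap; scaling to multiple XPUs is then done by launching one process per device (DDP), which is the path IPEX supports.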