Fix unsupported setting of self._n_gpu in training_args on XPU devices #27716
Conversation
Hey! This seems to have been introduced in #25714, and thus I am not convinced that the fix is as simple as that!
Would you mind also sharing a reproducer of the issue? It might be related to specific hardware/software versions.
Hi, @ArthurZucker! I discussed the issue with @abhilash1910 before this PR, and we thought it could be fixed like this. It can be reproduced in a multi Intel GPU env, for the
Thanks @Liangliang-Ma for addressing this.
@ArthurZucker yes, the current limitation of pure DP (DataParallel) on IPEX only allows us to use n_gpu=1.
Could you help re-trigger the CI test? Thanks
Alright thanks all for checking, triggering the CI
@ArthurZucker Seems the CI got blocked by an Internet issue. Could you please check that? Thanks
Seems like I can't; would you mind merging with main to trigger it?!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thanks both 🤗
In the current training_args, self._n_gpu is set to device_count on XPU devices, which causes a crash on XPU.
In Trainer, if self.args.n_gpu is greater than one, the model is wrapped with torch.nn.DataParallel. But IPEX (intel_extension_for_pytorch) does not support DataParallel and suggests using DDP instead.
So to make the Hugging Face Trainer work on Intel devices, this fix should be applied.
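For context, here is a minimal sketch of what the fix amounts to, assuming the XPU branch of `TrainingArguments._setup_devices`; the standalone helper name and shape below are illustrative, not the exact upstream code:

```python
import torch
from transformers.utils import is_torch_xpu_available


def setup_xpu_device(args) -> torch.device:
    """Illustrative sketch of the XPU branch in TrainingArguments._setup_devices.

    Before this PR, the branch effectively did
    ``args._n_gpu = torch.xpu.device_count()``. With more than one XPU present,
    Trainer then wraps the model in torch.nn.DataParallel (triggered by
    n_gpu > 1), which IPEX does not support. Pinning _n_gpu to 1 keeps Trainer
    off that path; multi-XPU training goes through DDP instead.
    """
    if not is_torch_xpu_available():
        raise RuntimeError("No XPU device available")
    device = torch.device("xpu:0")
    torch.xpu.set_device(device)  # bind this process to a single XPU
    args._n_gpu = 1  # was: torch.xpu.device_count()
    return device
```

With `_n_gpu == 1`, Trainer skips the `n_gpu > 1` DataParallel wrap; scaling to multiple XPUs is then done by launching one process per device (DDP), which is the path IPEX supports.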