[Auto Parallel] Support MoE expert parallelism in dygraph auto parallel #63904
Conversation
Your PR has been submitted successfully. Thank you for contributing to the open-source project!
Force-pushed 0d71cc8 to 18d8dcb
python/paddle/nn/clip.py (outdated)
    # If the gradient's process ids are a proper subset of clip_input's,
    # use clip_input's local value directly; otherwise reshard clip_input
    # onto the gradient's process mesh.
    if set(g.process_mesh.process_ids) < set(
        clip_input.process_mesh.process_ids
    ):
        clip_input = clip_input._local_value()
    else:
        clip_input = paddle.distributed.reshard(
            clip_input, g.process_mesh, clip_input.placements
        )
You can refine reshard() to support resharding clip_input to g.process_mesh.
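A minimal sketch of what that refinement might look like, assuming reshard() were extended to accept a target mesh whose process ids differ from the input tensor's; the helper name align_to_grad_mesh is hypothetical and not part of this PR:

    import paddle.distributed as dist

    def align_to_grad_mesh(clip_input, g):
        # Hypothetical helper: if reshard() itself handled cross-mesh
        # inputs, the subset check above could collapse into one call.
        # Assumes dist.reshard can place clip_input onto g.process_mesh
        # even when the two meshes cover different process ids, which is
        # exactly the extension the review suggests.
        return dist.reshard(clip_input, g.process_mesh, clip_input.placements)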
    h = self.gate(x)
    if self.config.run_ep:
        # Expert parallelism: split the global tensor into per-mesh
        # local tensors, one per expert.
        local_val_list = (
            dist.auto_parallel.api.local_tensor_list_from_dtensor(
                h, self.config.mesh, 0, [dist.Shard(0)]
            )
        )
    else:
        local_val_list = paddle.split(h, num_or_sections=2, axis=0)
    expert_out_list = []
    for i, expert in enumerate(self.experts):
        local_val = local_val_list[i]
        expert_out_list.append(expert(local_val))
    if self.config.run_ep:
        # Reassemble the local expert outputs into one global
        # distributed tensor.
        out = dist.auto_parallel.api.dtensor_from_local_list(
            expert_out_list, self.config.mesh, [dist.Shard(0)], 0
        )
    else:
        out = paddle.stack(expert_out_list, axis=0)
    out = out.reshape((-1, self.config.class_num))
    return out
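For context, a hedged sketch of how a layer like this might be driven; the mesh shape, tensor sizes, and launch command are assumptions mirroring the snippet above, not this PR's actual test setup:

    import paddle
    import paddle.distributed as dist

    # Two processes, one expert per rank; launch with e.g.
    #   python -m paddle.distributed.launch --devices=0,1 demo.py
    mesh = dist.ProcessMesh([0, 1], dim_names=["x"])
    x = paddle.randn([8, 16])
    # Shard the batch dimension so each rank feeds its local expert.
    dist_x = dist.shard_tensor(x, mesh, [dist.Shard(0)])
    # out = layer(dist_x)  # `layer` is an MoE module like the one above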
As discussed offline, we need to polish the expert parallelism (EP) API.
The comments can be addressed in a follow-up PR.
Sorry to inform you that ad7fd06's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
LGTM for set_tests_properties(test_semi_auto_parallel_simple_net_ep PROPERTIES LABELS "RUN_TYPE=EXCLUSIVE" TIMEOUT 120)
LGTM for ops.yaml
LGTM
PR Category
Auto Parallel
PR Types
New features
Description
Pcard-76459
Support MoE expert parallelism in dygraph auto parallel. In auto-parallel expert parallelism, experts' weights live on different process meshes. This PR implements expert parallelism as follows:
Main changes
1. Add local_tensor_list_from_dtensor and dtensor_from_local_list to transform tensors between global and local meshes (a hedged usage sketch follows this list).
2. Add a skip_check_mesh flag in TensorDistAttr to skip checking whether the process meshes differ.
3. Adapt the computation of tensors with local and global process meshes in grad_clip.
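A hedged sketch of the new API pair, mirroring the call pattern from the MoE layer above; the positional `0` is assumed to be the mesh dimension to split along, inferred from this PR rather than from documented behavior:

    import paddle
    import paddle.distributed as dist

    mesh = dist.ProcessMesh([0, 1], dim_names=["x"])
    h = dist.shard_tensor(paddle.randn([8, 16]), mesh, [dist.Shard(0)])

    # Global -> local: one tensor per sub-mesh along mesh dim 0, each
    # carrying its own (local) process mesh.
    local_list = dist.auto_parallel.api.local_tensor_list_from_dtensor(
        h, mesh, 0, [dist.Shard(0)]
    )

    # Local -> global: reassemble the per-expert outputs into a single
    # distributed tensor on the global mesh.
    out = dist.auto_parallel.api.dtensor_from_local_list(
        local_list, mesh, [dist.Shard(0)], 0
    )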