
Automatic Model Parallelism Through FX #1933

Conversation

@zhenglongjiepheonix (Contributor) commented Jul 1, 2024

What does this PR do?

This PR adds an automatic parallelization backend for torch dynamo. It takes the dynamo-captured fx graph, runs a few passes to automatically identify the parts that can be parallelized, and transforms the graph into its parallelized version. For simplicity it currently focuses on models in the transformers library that support dynamo tracing, and it may not support custom models because of the tricky parts of parallel pattern matching.

For now it only supports parallelization of the linears in the graph, which in the context of transformers means the attention and MLP layers. The following milestones remain:

  • support tensor parallelism on loss layers (requires models to support training-mode tracing)
  • support loading weights from disk/Hub
  • support fusion of parallel linears (QKV fusion, MLP fusion)
  • try supporting sequence parallelism

Please feel free to review and provide suggestions even though it is still in progress and does not yet cover all the features.
According to @michaelbenayoun, we should try to merge this first version; iterations will come in follow-up PRs.
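For context, a dynamo backend here is simply a callable that receives the dynamo-captured fx GraphModule (plus example inputs) and returns the callable to run in its place; the passes described above rewrite that graph before handing it back. Below is a minimal, illustrative sketch of the mechanism only (the backend name and body are made up; the actual parallelize_backend in this PR also receives the parallel execution context and config, bound via functools.partial):

import torch
import torch.nn as nn

def my_parallel_backend(graph_module: torch.fx.GraphModule, example_inputs):
    # ...the analysis/transform passes would rewrite graph_module here...
    graph_module.recompile()  # regenerate forward() after any graph edits
    return graph_module       # dynamo calls the returned callable at runtime

model = nn.Linear(8, 8)
compiled = torch.compile(model, fullgraph=True, backend=my_parallel_backend)
out = compiled(torch.randn(2, 8))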

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@fxmarty (Contributor) left a comment:

this is super cool!

rank = dist.get_rank(group=group)

tensor = tensor.contiguous()
tensors = [torch.empty_like(tensor) for _ in range(world_size)]

Contributor Author:

yes

size = tensor.size()
assert size[split_dim] % world_size == 0
tensors = torch.split(tensor, size[split_dim] // world_size, dim=split_dim)
tensor = tensors[rank].contiguous()
Contributor:

why contiguous?

Contributor Author:

tensors after split may not be contiguous, so I think it's better to make them contiguous
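For illustration, here is a tiny standalone example (not from the PR) of why the extra call matters: torch.split returns views into the original tensor, and splitting along a non-leading dimension produces strided, non-contiguous chunks.

import torch

t = torch.arange(12).reshape(3, 4)
chunks = torch.split(t, 2, dim=1)               # views into t, split along dim 1
print(chunks[0].is_contiguous())                # False: a strided view into t
print(chunks[0].contiguous().is_contiguous())   # True: .contiguous() makes a compact copy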

        self.bias.zero_()

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        input = differentiable_identity(input, self.process_group)
Contributor:

why do we need an identity here?

Contributor Author:

to take care of the gradient all-reduce in the backward pass
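For readers unfamiliar with the pattern, here is a minimal sketch of what such a differentiable identity usually looks like (the standard copy-forward / all-reduce-backward trick used in tensor parallelism; an illustration under that assumption, not necessarily the exact implementation in this PR):

import torch
import torch.distributed as dist

class DifferentiableIdentity(torch.autograd.Function):
    # Forward: pass the tensor through unchanged.
    # Backward: all-reduce the gradient across the tensor-parallel group, because
    # the replicated input feeds a parallel shard on every rank.

    @staticmethod
    def forward(ctx, tensor, group):
        ctx.group = group
        return tensor

    @staticmethod
    def backward(ctx, grad_output):
        dist.all_reduce(grad_output, op=dist.ReduceOp.SUM, group=ctx.group)
        return grad_output, None  # no gradient for the process group argument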

self.clear_marker_per_node(node)


class ParallelLinearAnnotatePass(AnalyzeBase):
Contributor:

I don't get the name parallel here? Isn't it more like successive?

Contributor Author:

it means annotating certain layers so they can be replaced by their parallel counterparts

Comment on lines +67 to +73
    tensors = gather_at_main_process(tensor=logits, group=tp_group, rank=rank, world_size=world_size)

    # check results at main worker process
    if rank == 0:
        assert len(tensors) == world_size
        for i in range(1, world_size):
            torch.testing.assert_close(tensors[i - 1].cpu(), tensors[i].cpu(), rtol=1e-4, atol=1e-4)
Contributor:

it should probably be checked on all ranks

Contributor Author:

checking at the main process should be enough, because it gathers the results from the other ranks at the main process and does the comparison there
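For reference, a sketch of the all-rank variant suggested above could look like the following (reusing logits, tp_group and world_size from the test and assuming torch.distributed is imported as dist; an illustration, not the merged code):

    gathered = [torch.empty_like(logits) for _ in range(world_size)]
    dist.all_gather(gathered, logits.contiguous(), group=tp_group)
    # every rank now holds all copies and can run the comparison itself
    for i in range(1, world_size):
        torch.testing.assert_close(gathered[i - 1].cpu(), gathered[i].cpu(), rtol=1e-4, atol=1e-4)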

move_model_to_device(model, device=device)
initialize_parameter_mapping(model, ctx=ctx)

model = torch.compile(model, fullgraph=True, backend=partial(parallelize_backend, ctx=ctx, config=cfg))
Contributor:

can we compose with inductor?

Contributor Author:

Not confident enough to say right now, but at the very least it won't be a single graph.

Member:

I would say that is the hope for the future.

Comment on lines 61 to 62
    model (Union[torch.nn.Module, str]):
        Model to parallelize, could either be a module or a model id in huggingface space.
Member:

Suggested change:
     model (Union[torch.nn.Module, str]):
-        Model to parallelize, could either be a module or a model id in huggingface space.
+        Model to parallelize, could either be a module or a model id on the Hugging Face Hub.

        Model to parallelize, could either be a module or a model id in huggingface space.
    parallel_ctx (ParallelExecutionCtx):
        Parallel execution context containing process groups the current process belongs to.
    model_args (additional postional arguments, optional):
Member:

Suggested change:
-    model_args (additional postional arguments, optional):
+    *model_args (Any):

Should we also add model_kwargs?

        Whether to use local files only, will avoid downloading from remote if set to `True`.
    skip_load_weights (`bool`, defaults to `False`):
        Whether to skip loading weights from disk to model.
    kwargs (additional keyword arguments, optional):
Member:

Suggested change:
-    kwargs (additional keyword arguments, optional):
+    **kwargs (Dict[str, Any]):

Comment on lines 69 to 75
    cache_dir (`Optional[str]`, defaults to `None`):
        Cache directory to store downloaded weights. Defaults to None.
    local_files_only (`bool`, defaults to `False`):
        Whether to use local files only, will avoid downloading from remote if set to `True`.
    skip_load_weights (`bool`, defaults to `False`):
        Whether to skip loading weights from disk to model.
    kwargs (additional keyword arguments, optional):
Member:

We provide a lot of things here.
IMO we should simplify that. Most of these arguments come from the from_pretrained method.
So I would gather them as one keyword argument: model_kwargs.
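A hypothetical illustration of that suggestion (the function and parameter names below are made up for the sketch, not the merged API): everything that only exists to be forwarded to from_pretrained moves into a single dict.

def parallelize_model(model, parallel_ctx, *model_args, skip_load_weights=False, model_kwargs=None, **kwargs):
    # model_kwargs is forwarded verbatim to from_pretrained, e.g.
    # model_kwargs = {"cache_dir": "/tmp/hf", "local_files_only": True}
    model_kwargs = model_kwargs or {}
    ...  # remaining **kwargs would be treated as ParallelConfig overrides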

Comment on lines 79 to 82
    for k, v in kwargs.items():
        if k in parallel_config.__dict__:
            setattr(parallel_config, k, v)
    kwargs = {k: v for k, v in kwargs.items() if k not in parallel_config.__dict__}
Member:

You can also iterate on a copy of kwargs and pop elements as follows:

Suggested change:
-    for k, v in kwargs.items():
-        if k in parallel_config.__dict__:
-            setattr(parallel_config, k, v)
-    kwargs = {k: v for k, v in kwargs.items() if k not in parallel_config.__dict__}
+    for k, v in dict(kwargs).items():
+        if k in parallel_config.__dict__:
+            setattr(parallel_config, k, v)
+            kwargs.pop(k)

    else:
        hf_folder = model

    # should be able to load config using only local files
Member:

No, because you only allow the patterns to be safetensors and bin files, and the config is a JSON file.

Contributor Author:

here I move all the download logic, including the config and index files, into download_model_from_hf

Comment on lines 112 to 116
    use_safetensors = False
    for pattern in allow_patterns:
        if len(glob.glob(os.path.join(hf_folder, pattern))) > 0:
            use_safetensors = pattern == "*.safetensors"
            break
Member:

Can be simplified.
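One possible compaction of that loop, as a sketch only (glob, os, allow_patterns and hf_folder come from the surrounding function; not necessarily what was merged):

    matched = next((p for p in allow_patterns if glob.glob(os.path.join(hf_folder, p))), None)
    use_safetensors = matched == "*.safetensors"

The behavior matches the original: the first pattern with at least one match inside hf_folder decides, and use_safetensors is True only if that pattern is *.safetensors.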

Comment on lines 117 to 137
    index_path = os.path.join(hf_folder, SAFE_WEIGHTS_INDEX_NAME if use_safetensors else WEIGHTS_INDEX_NAME)
    if os.path.isfile(index_path):
        with open(index_path) as f:
            index_dict = json.load(f)
        parallel_ctx.weight_map = {k: os.path.join(hf_folder, v) for k, v in index_dict["weight_map"].items()}
    weight_files = glob.glob(os.path.join(hf_folder, "*.safetensors" if use_safetensors else "*.bin"))
    if not use_safetensors:
        weight_map = parallel_ctx.weight_map if parallel_ctx.weight_map else {}
        convert_bin_to_safetensors(model, cache_dir, weight_files, weight_map)
        parallel_ctx.weight_map = weight_map

    # try directly construct weight_map from weight files, should have safetensors file on disk in any case
    if not parallel_ctx.weight_map:
        from safetensors import safe_open

        weight_map, weight_files = {}, glob.glob(os.path.join(hf_folder, "*.safetensors"))
        for weight_file in weight_files:
            with safe_open(filename=weight_file, framework="pt") as f:
                for key in f.keys():
                    weight_map[key] = weight_file
        parallel_ctx.weight_map = weight_map
Member:

nit: I think overall it can be simplified.

Contributor Author:

I moved the logic into utils so the API looks cleaner, but the logic itself is indeed complex, because we need to handle the case where a local directory is passed. In that case the only thing we can do is peek inside the folder and see whether there are safetensors/bin files; if there are only bin files, we need to convert them into safetensors. If there is an index file we load the weight_map directly from it, otherwise we scan all the weight files in the folder and assemble a weight_map out of them.

@zhenglongjiepheonix changed the title from "[WIP] Automatic Model Parallelism Through FX" to "Automatic Model Parallelism Through FX" on Jul 24, 2024
@michaelbenayoun (Member) left a comment:

Not sure I am getting everything because it is a very long and complex PR but LGTM!
Let's iterate on smaller PRs from now on.
Thanks @zhenglongjiepheonix !

@zhenglongjiepheonix (Contributor Author):

Merging this as an experimental first version; more fix-ups and features coming in follow-up PRs!

@zhenglongjiepheonix merged commit 5eaf91b into huggingface:main on Aug 12, 2024 (46 of 47 checks passed)
    options: --mount type=tmpfs,destination=/tmp --shm-size 64gb --gpus all --ipc host -v /mnt/hf_cache:/mnt/cache/
    env:
      NCCL_DEBUG: INFO
      HF_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}
Collaborator:

@zhenglongjiepheonix @michaelbenayoun is HF_TOKEN used for the tests (I can't see where), or can we remove it?

Contributor Author:

it's not used, you can remove it

Collaborator:

removed in #2061
