
Add Automicrobatching for Non-Powers-of-2 + Fixes to FSDP deadlocks using Adaptive Sync Hooks #3503

Draft · @JackZ-db wants to merge 23 commits into main
Conversation

@JackZ-db (Contributor, Author):
No description provided.

@mvpatel2000 (Contributor) left a comment:
Rerequest once test passes!

@JackZ-db requested a review from @bigning on July 31, 2024 at 16:39.
@mvpatel2000 (Contributor) left a comment:
First pass: the design looks right, but the code needs some cleanup.

Comment on lines +134 to +144:

```python
@no_type_check
def unshard(self):
    """
    Run the unshard logic.
    This is an unpatched method from pytorch, meant to be reverted to
    whenever automicrobatching turns off its hooks for increased throughput.
    This includes all-gathering the flat parameter
    and switching to using the unsharded flat parameter. If the handle does
    not need unsharding, then this only switches to using the unsharded
    flat parameter. For ``NO_SHARD``, this is a no-op.
    If FSDP is in :meth:`summon_full_params` and the handle uses parameter
```

Reviewer (Contributor): This should probably be in the torch 2.3.1 `if` section.
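
For context on the patch-and-revert flow this docstring describes, here is a minimal sketch, assuming PyTorch >= 2.1's private torch.distributed.fsdp._flat_param module. patch_unshard_for_automicrobatching is the name used elsewhere in this PR, but its body and the _original_unshard/_unshard_with_sync helpers below are illustrative only; the real patched method contains the adaptive sync logic, which is omitted here.

```python
from torch.distributed.fsdp._flat_param import FlatParamHandle  # private API, PyTorch >= 2.1

# Keep a reference to the stock method so it can be restored later.
_original_unshard = FlatParamHandle.unshard


def _unshard_with_sync(self):
    # The PR's patched version adds extra cross-rank synchronization around the
    # all-gather so OOM retries during automicrobatching cannot deadlock ranks.
    return _original_unshard(self)


def patch_unshard_for_automicrobatching(auto_microbatch_size_found: bool) -> None:
    """Swap FlatParamHandle.unshard between the patched and stock versions.

    Once a stable microbatch size is found, reverting to the unpatched method
    avoids the extra synchronization and recovers throughput.
    """
    if auto_microbatch_size_found:
        FlatParamHandle.unshard = _original_unshard
    else:
        FlatParamHandle.unshard = _unshard_with_sync
```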

Comment on lines +567 to +568:

```python

if auto_microbatching:
```

Reviewer (Contributor): Can you add a comment on what this is doing?

Comment on lines +326 to +327:

```python
def _double_device_train_microbatch_size(state: State):
    """Double device_train_microbatch_size when automicrobatching searches upward for a higher non-OOM microbatch size.
```

Reviewer (Contributor): Should this go into the automicrobatching utils folder?
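
As a rough illustration of the doubling step (a sketch only; the PR's real helper may also clamp the value and reset search bookkeeping), assuming Composer's State exposes device_train_microbatch_size:

```python
from composer.core import State


def _double_device_train_microbatch_size(state: State) -> None:
    """Double device_train_microbatch_size while automicrobatching searches
    upward for a larger microbatch size that still fits in memory."""
    # Sketch only: the real helper may cap this at the per-device batch size
    # so the upward search cannot overshoot the full batch.
    state.device_train_microbatch_size *= 2
```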

```python
    num_consecutive_thrashes = 0
    return num_consecutive_thrashes


def _handle_downward_search_in_automicrobatching(state: State, lowest_oom_microbatch_size: int, highest_non_oom_microbatch_size: int, lower_bound_microbatch_size: int, num_search_steps: int, max_search_steps: int):
```

Reviewer (Contributor): Same comment on moving this to utils?
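
The downward search is what makes non-power-of-2 microbatch sizes reachable: rather than only halving, it can bisect between the smallest size known to OOM and the largest size known to fit. A standalone sketch of that step (the hypothetical _next_downward_microbatch_size below ignores the search-step and lower-bound bookkeeping the real helper carries):

```python
def _next_downward_microbatch_size(
    lowest_oom_microbatch_size: int,
    highest_non_oom_microbatch_size: int,
) -> int:
    """Return the midpoint between the smallest size known to OOM and the
    largest size known to fit; this is how non-power-of-2 candidates arise."""
    midpoint = (lowest_oom_microbatch_size + highest_non_oom_microbatch_size) // 2
    return max(midpoint, 1)


# Example: with 16 known to OOM and 8 known to fit, the next candidate is 12.
assert _next_downward_microbatch_size(16, 8) == 12
```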

```diff
@@ -1251,6 +1421,7 @@ def __init__(
         if parallelism_config is not None:
             # Patch PyTorch to fix distributed bugs
             patch_pytorch()
+            patch_unshard_for_automicrobatching(self.auto_microbatch_size_found)
```

Reviewer (Contributor): This should just be part of patch_pytorch to simplify the interface.

@JackZ-db (Contributor, Author): We need to pass in a boolean variable telling it how to patch this one specific method, though. I feel like it would be less readable if we passed self.auto_microbatch_size_found directly into patch_pytorch.
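
To make the trade-off concrete, a sketch of the interface the reviewer is suggesting, with the flag defaulting to False so existing call sites of patch_pytorch stay unchanged. The helper is stubbed here; both names come from the diff above, and the final code may differ.

```python
def patch_unshard_for_automicrobatching(auto_microbatch_size_found: bool) -> None:
    """Stub for the PR's helper that swaps FlatParamHandle.unshard between its
    patched and stock implementations."""
    ...


def patch_pytorch(auto_microbatch_size_found: bool = False) -> None:
    """Single entry point for all PyTorch patches, per the reviewer's suggestion."""
    # ... existing version-gated distributed patches ...
    patch_unshard_for_automicrobatching(auto_microbatch_size_found)


# The Trainer call site then collapses to one line:
# patch_pytorch(self.auto_microbatch_size_found)
```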

Comment on lines +2976 to +2977:

```python
# Sync for OOMs
found_cuda_oom = _found_ooms_across_ranks(self.state, found_cuda_oom)
```

Reviewer (Contributor): This block is really complicated. Let's move it to a helper fn.
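
For reference, the cross-rank OOM sync itself is typically an all-reduce with MAX so every rank agrees whether any rank OOMed. Below is a minimal sketch assuming Composer's dist utilities; the helper's exact signature in the PR may differ, and the surrounding block also handles the thrash detection and hook bookkeeping the reviewer wants factored out.

```python
import torch

from composer.core import State
from composer.utils import dist


def _found_ooms_across_ranks(state: State, found_cuda_oom: bool) -> bool:
    """Return True on every rank if any rank hit a CUDA OOM, so all ranks shrink
    the microbatch size together instead of deadlocking in later collectives."""
    flag = state.device.tensor_to_device(
        torch.tensor([1 if found_cuda_oom else 0], dtype=torch.uint8),
    )
    dist.all_reduce(flag, reduce_operation='MAX')
    return bool(flag.item())
```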

```python
with torch.no_grad(), model_eval_mode(self.state.model):
    if self.state.fsdp_enabled and self.first_batch_complete:
        print("readd hooks for eval")
```

Reviewer (Contributor): Remove?

```diff
@@ -8,6 +8,18 @@
     convert_nested_dict_to_flat_dict,
     extract_hparams,
 )
+from composer.utils.automicrobatching import (
+    # _create_sync_hook,
```

Reviewer (Contributor): Remove?

```diff
@@ -164,4 +176,14 @@
     'validate_credentials',
     'build_remote_backend',
     'RemoteFilesExistingCheckStatus',
+    # '_create_sync_hook',
```

Reviewer (Contributor): Remove?
