
Add FSDP input validation for use_orig_params and activation_cpu_offload flag #3515

Merged
3 commits merged into dev from chuck/add_fsdp_activation_error on Aug 7, 2024

Conversation

@j316chuck (Contributor) commented on Aug 3, 2024

What does this PR do?

Add FSDP input validation for the use_orig_params and activation_cpu_offload flags.
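For context, here is a minimal sketch of the kind of up-front check this adds. The real check runs inside Composer's `State._validate_parallelism_configs` (see the "after" traceback below); the `FSDPConfig` dataclass and `validate_fsdp_config` helper here are stand-ins for illustration, not Composer's actual classes.

```python
# Hedged sketch of the validation added by this PR. `FSDPConfig` and
# `validate_fsdp_config` are illustrative stand-ins, not Composer's real API.
from dataclasses import dataclass


@dataclass
class FSDPConfig:
    use_orig_params: bool = True
    activation_cpu_offload: bool = False


def validate_fsdp_config(config: FSDPConfig) -> None:
    # Fail fast at init time: previously this flag combination only surfaced at
    # checkpoint-save time as a confusing `_flat_param` RuntimeError (see the
    # "before" traceback below).
    if config.activation_cpu_offload and not config.use_orig_params:
        raise ValueError(
            'activation_cpu_offload=True is not supported with use_orig_params=False.',
        )
```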

What issue(s) does this change relate to?

https://databricks.atlassian.net/browse/GRT-3143

Tests:

  • Added unit test (a hedged sketch of such a test follows the error logs below)
  • Before error log: iteration-23-qXpOcN
[rank0]: │ /usr/lib/python3/dist-packages/composer/core/engine.py:493 in _run_callbacks │
[rank0]: │                                                                              │
[rank0]: │   490 │   │   │   ctx = cast(ContextManager, contextlib.nullcontext()) if ma │
[rank0]: │   491 │   │   │   with ctx:                                                  │
[rank0]: │   492 │   │   │   │   self._debug_log(event, f'Running callback {type(cb).__ │
[rank0]: │ ❱ 493 │   │   │   │   cb.run_event(event, self.state, self.logger)           │
[rank0]: │   494 │                                                                      │
[rank0]: │   495 │   def _run_loggers(self, event: Union[Event, str]):                  │
[rank0]: │   496 │   │   loggers = [callback for callback in self.state.callbacks if is │
[rank0]: │                                                                              │
[rank0]: │ /workspace/llm-foundry/llmfoundry/callbacks/hf_checkpointer.py:254 in        │
[rank0]: │ run_event                                                                    │
[rank0]: │                                                                              │
[rank0]: │   251 │   │   │   state,                                                     │
[rank0]: │   252 │   │   │   event,                                                     │
[rank0]: │   253 │   │   ) and self.last_checkpoint_batch != state.timestamp.batch:     │
[rank0]: │ ❱ 254 │   │   │   self._save_checkpoint(state, logger)                       │
[rank0]: │   255 │   │   elif event == Event.INIT:                                      │
[rank0]: │   256 │   │   │   if not isinstance(state.model, HuggingFaceModel):          │
[rank0]: │   257 │   │   │   │   raise ValueError(                                      │
[rank0]: │                                                                              │
[rank0]: │ /workspace/llm-foundry/llmfoundry/callbacks/hf_checkpointer.py:481 in        │
[rank0]: │ _save_checkpoint                                                             │
[rank0]: │                                                                              │
[rank0]: │   478 │   │   for _, module in state_dict_model.named_modules():             │
[rank0]: │   479 │   │   │   hooks.append(module._register_state_dict_hook(tensor_hook) │
[rank0]: │   480 │   │                                                                  │
[rank0]: │ ❱ 481 │   │   state_dict = get_model_state_dict(                             │
[rank0]: │   482 │   │   │   state_dict_model,                                          │
[rank0]: │   483 │   │   │   options=StateDictOptions(                                  │
[rank0]: │   484 │   │   │   │   full_state_dict=True,                                  │
[rank0]: │                                                                              │
[rank0]: │ /usr/lib/python3/dist-packages/torch/distributed/checkpoint/state_dict.py:69 │
[rank0]: │ 6 in get_model_state_dict                                                    │
[rank0]: │                                                                              │
[rank0]: │    693 │   │   │   options=options,                                          │
[rank0]: │    694 │   │   )                                                             │
[rank0]: │    695 │   │   model_state_dict = _get_model_state_dict(model, info)         │
[rank0]: │ ❱  696 │   │   _verify_state_dict(model_state_dict, {}, info)                │
[rank0]: │    697 │   │   return model_state_dict                                       │
[rank0]: │    698                                                                       │
[rank0]: │    699                                                                       │
[rank0]: │                                                                              │
[rank0]: │ /usr/lib/python3/dist-packages/torch/distributed/checkpoint/state_dict.py:35 │
[rank0]: │ 6 in _verify_state_dict                                                      │
[rank0]: │                                                                              │
[rank0]: │    353 │                                                                     │
[rank0]: │    354 │   for key in model_state_dict.keys():                               │
[rank0]: │    355 │   │   if FLAT_PARAM in key:                                         │
[rank0]: │ ❱  356 │   │   │   raise RuntimeError(                                       │
[rank0]: │    357 │   │   │   │   f"{key} contains {FLAT_PARAM}. This can happen if the │
[rank0]: │    358 │   │   │   │   "is not the root module."                             │
[rank0]: │    359 │   │   │   )                                                         │
[rank0]: ╰──────────────────────────────────────────────────────────────────────────────╯
[rank0]: RuntimeError: model.layers.0._flat_param contains _flat_param. This can happen
[rank0]: if the model is not the root module.
wandb: | 0.060 MB of 0.060 MB uploaded
  • After error log: iteration-23-PNrLTN
[rank7]: │ /usr/lib/python3/dist-packages/composer/trainer/trainer.py:1405 in __init__  │
[rank7]: │                                                                              │
[rank7]: │   1402 │   │   log.info('Run name: %s', run_name)                            │
[rank7]: │   1403 │   │                                                                 │
[rank7]: │   1404 │   │   # Create the State                                            │
[rank7]: │ ❱ 1405 │   │   self.state = State(                                           │
[rank7]: │   1406 │   │   │   rank_zero_seed=rank_zero_seed,                            │
[rank7]: │   1407 │   │   │   algorithms=algorithms,                                    │
[rank7]: │   1408 │   │   │   model=model,                                              │
[rank7]: │                                                                              │
[rank7]: │ /usr/lib/python3/dist-packages/composer/core/state.py:553 in __init__        │
[rank7]: │                                                                              │
[rank7]: │    550 │   │   self.automicrobatch_fsdp_hook_handles = []                    │
[rank7]: │    551 │   │   self.fsdp_modules = {}                                        │
[rank7]: │    552 │   │                                                                 │
[rank7]: │ ❱  553 │   │   self._validate_parallelism_configs()                          │
[rank7]: │    554 │   │                                                                 │
[rank7]: │    555 │   │   self.device_mesh: Optional[DeviceMesh] = _create_device_mesh( │
[rank7]: │    556 │   │   if self.fsdp_config is not None and self.device_mesh is not N │
[rank7]: │                                                                              │
[rank7]: │ /usr/lib/python3/dist-packages/composer/core/state.py:645 in                 │
[rank7]: │ _validate_parallelism_configs                                                │
[rank7]: │                                                                              │
[rank7]: │    642 │   │                                                                 │
[rank7]: │    643 │   │   # Validate FSDP config parameters.                            │
[rank7]: │    644 │   │   if self.fsdp_config and self.fsdp_config.activation_cpu_offlo │
[rank7]: │ ❱  645 │   │   │   raise ValueError('activation_cpu_offload=True is not supp │
[rank7]: │    646 │   │                                                                 │
[rank7]: │    647 │   │   # Validate FSDP state dict type                               │
[rank7]: │    648 │   │   if self.fsdp_state_dict_type not in [None, 'full', 'sharded'] │
[rank7]: ╰──────────────────────────────────────────────────────────────────────────────╯
[rank7]: ValueError: activation_cpu_offload=True is not supported with
[rank7]: use_orig_params=False.
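For reference, a hedged sketch of what a unit test for this validation could look like. The module name, test name, and helper functions are assumptions carried over from the sketch above, not the actual test added in this PR; only the error message is taken from the traceback.

```python
# Illustrative pytest sketch, not the actual unit test from this PR.
# Assumes the hypothetical FSDPConfig/validate_fsdp_config sketch above is
# importable as `fsdp_validation_sketch`.
import pytest

from fsdp_validation_sketch import FSDPConfig, validate_fsdp_config


def test_activation_cpu_offload_requires_use_orig_params():
    # The unsupported combination from the "before" error log.
    config = FSDPConfig(use_orig_params=False, activation_cpu_offload=True)
    with pytest.raises(ValueError, match='activation_cpu_offload=True is not supported'):
        validate_fsdp_config(config)
```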

@j316chuck force-pushed the chuck/add_fsdp_activation_error branch from 35b78c1 to c4efc01 on August 6, 2024 at 23:30
@j316chuck marked this pull request as ready for review on August 6, 2024 at 23:57
@j316chuck requested a review from dakinggg on August 7, 2024 at 00:03

@mvpatel2000 (Contributor) left a comment

LGTM!

@j316chuck merged commit a15b18c into dev on Aug 7, 2024
14 checks passed
@j316chuck deleted the chuck/add_fsdp_activation_error branch on August 7, 2024 at 17:18