
v0.23.0

@bigning released this 05 Jun 20:34

What's New

1. Parallelism V2 + Tensor Parallel (#3335)

Composer now supports PyTorch's implementation of tensor parallelism. As part of this, we've revamped and simplified how Composer does distributed training. Previously, Composer accepted an fsdp_config argument in the Trainer:

trainer = Trainer(model, fsdp_config = {'sharding_strategy': 'FULL_SHARD'})

As we generalize to more forms of parallelism, we've deprecated fsdp_config in favor of parallelism_config:

trainer = Trainer(
    model = model,
    ...
    parallelism_config = {
        'fsdp': {
            'sharding_strategy': 'FULL_SHARD',
            'data_parallel_shard_degree': 2,      # Size of shard dimension
            'data_parallel_replicate_degree': 2,  # Size of replicate dimension
        },
        'tp_config': {
            'tensor_parallel_degree': 2,          # Size of TP dimension
            'layer_plan': ...  # describes how to TP layers
        }
    }
)
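
For illustration, here is a minimal sketch of what a layer_plan might look like, assuming a hypothetical model whose feed-forward block exposes submodules named ffn.up_proj and ffn.down_proj; the plan maps module names to PyTorch tensor parallel styles:

from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel

# Hypothetical module names; use the fully qualified names from your own model.
layer_plan = {
    'ffn.up_proj': ColwiseParallel(),    # shard the up projection column-wise across TP ranks
    'ffn.down_proj': RowwiseParallel(),  # shard the down projection row-wise across TP ranks
}

tp_config = {
    'tensor_parallel_degree': 2,
    'layer_plan': layer_plan,
}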

As part of this change, we now default to using DTensor for parallelism with PyTorch FSDP. PyTorch has deprecated ShardedTensor, so migrating to the new backend avoids various checkpointing bugs.

See the tensor parallel docs for more information. Note that tensor parallelism is still experimental and may be subject to breaking API changes, and not all checkpointing features are supported with it yet.

2. MLFlow API Simplification

Previously, the MLFlow logger required a tracking URI and an absolute user path when using MLFlow with Databricks:

mlflow_logger = MLFlowLogger(
    tracking_uri = 'databricks',
    experiment_name = '/Users/xxx.yyy@zzz.com/my-first-project/'
)

trainer = Trainer(
    model = model,
    ...
    loggers = mlflow_logger,
)

Now, if you are using Databricks secrets as environment variables, Composer will auto-populate tracking_uri and the experiment_name prefix:

trainer = Trainer(
    model = model,
    ...
    loggers = MLFlowLogger(experiment_name='my-first-project'),
)
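
As a rough sketch, the auto-population relies on Databricks credentials being present in the environment. The variable names below are the standard Databricks authentication variables, and the values are placeholders:

import os

# Placeholder values; point these at your workspace URL and a personal access token.
os.environ['DATABRICKS_HOST'] = 'https://my-workspace.cloud.databricks.com'
os.environ['DATABRICKS_TOKEN'] = '<personal-access-token>'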

3. Wallclock Save Interval

Composer now supports setting a save interval in wallclock time:

trainer = Trainer(
    model = model,
    ...
    save_interval='30m',
)

Note that most duration arguments, such as max_duration, do not accept wallclock time; the initial version of this feature is limited to a subset of time-based arguments, such as save_interval.
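
For example (a sketch; the one-epoch max_duration and save_folder below are illustrative values), a wallclock save interval can be combined with the usual epoch- or batch-based training duration:

trainer = Trainer(
    model = model,
    ...
    max_duration = '1ep',          # training length still uses epoch/batch/token units
    save_folder = 'checkpoints',   # illustrative local folder for checkpoints
    save_interval = '30m',         # save a checkpoint every 30 minutes of wallclock time
)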

Bug Fixes

  • Don't close the engine if it's already closed in #3143
  • Fix HF tests with Pin in #3248
  • Fix backwards compatibility tests in #3252
  • Fix unexpected remote checkpointing downloading in #3271
  • Fix HSDP with ShardDegree < 8 in #3313

What's Changed

New Contributors

Full Changelog: v0.22.0...v0.23.0