v0.23.0
What's New
1. Parallelism V2 + Tensor Parallel (#3335)
Composer now supports PyTorch's implementation of tensor parallelism. As part of this, we've revamped and simplified how Composer does distributed training. Previously, Composer accepted an `fsdp_config` attribute in the Trainer:

```python
trainer = Trainer(model, fsdp_config={'sharding_strategy': 'FULL_SHARD'})
```

As we generalize to more forms of parallelism, we've deprecated `fsdp_config` in favor of `parallelism_config`:
```python
trainer = Trainer(
    model=model,
    ...
    parallelism_config={
        'fsdp': {
            'sharding_strategy': 'FULL_SHARD',
            'data_parallel_shard_degree': 2,      # Size of shard dimension
            'data_parallel_replicate_degree': 2,  # Size of replicate dimension
        },
        'tp_config': {
            'tensor_parallel_degree': 2,  # Size of TP dimension
            'layer_plan': ...,            # Describes how to TP layers
        },
    },
)
```
As part of this change, we now default to using DTensor for parallelism with PyTorch FSDP. PyTorch has deprecated ShardedTensor, so migrating to the new backend avoids various checkpointing bugs.

See the docs for tensor parallel for more information. Note that tensor parallel is still experimental and may be subject to breaking API changes, and not all checkpointing features may work with this parallelism yet.
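For context, the `layer_plan` maps submodule names to PyTorch's tensor-parallel styles. Below is a minimal sketch, assuming a model with two linear submodules named `fc1` and `fc2`; the module names and surrounding setup are hypothetical illustrations, not part of Composer's documented API:

```python
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel

from composer import Trainer

# Hypothetical plan for a two-layer MLP: shard fc1's weight column-wise
# and fc2's weight row-wise across the tensor parallel dimension.
layer_plan = {
    'fc1': ColwiseParallel(),
    'fc2': RowwiseParallel(),
}

trainer = Trainer(
    model=model,  # assumes `model` is a ComposerModel wrapping the MLP
    parallelism_config={
        'fsdp': {'sharding_strategy': 'FULL_SHARD'},
        'tp_config': {
            'tensor_parallel_degree': 2,
            'layer_plan': layer_plan,
        },
    },
)
```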
2. MLFlow API Simplification
Previously, the MLFlow logger required a tracking URI and an absolute user path when using MLFlow with Databricks:
```python
mlflow_logger = MLFlowLogger(
    tracking_uri='databricks',
    experiment_name='/Users/xxx.yyy@zzz.com/my-first-project/',
)

trainer = Trainer(
    model=model,
    ...
    loggers=mlflow_logger,
)
```
Now, if you are using Databricks secrets as environment variables, Composer will autopopulate `tracking_uri` and the `experiment_name` prefix:
```python
trainer = Trainer(
    model=model,
    ...
    loggers=MLFlowLogger(experiment_name='my-first-project'),
)
```
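For reference, a minimal sketch of the environment variables this relies on, assuming the standard Databricks SDK conventions; the workspace URL and token below are placeholders, and in practice these values are typically injected from Databricks secrets rather than set in code:

```python
import os

# Placeholder values; in practice these come from Databricks secrets.
os.environ['DATABRICKS_HOST'] = 'https://my-workspace.cloud.databricks.com'
os.environ['DATABRICKS_TOKEN'] = 'dapi-xxxx'  # personal access token
```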
3. Wallclock Save Interval
Composer now supports setting a save interval in wallclock time:
```python
trainer = Trainer(
    model=model,
    ...
    save_interval='30m',
)
```
Note that most durations, such as `max_duration`, do not accept wallclock time; the initial version of this feature is limited to a subset of time features like `save_interval`.
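For comparison, `save_interval` continues to accept Composer's progress-based time strings, so the wallclock form is simply a new option alongside them. A short sketch (the `save_folder` path is a hypothetical placeholder):

```python
from composer import Trainer

# Progress-based intervals still work; wallclock is a new alternative.
trainer = Trainer(
    model=model,                  # assumes a ComposerModel from earlier setup
    save_folder='./checkpoints',  # hypothetical local path
    save_interval='1ep',          # every epoch; '1000ba' (batches) also works
)
```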
Bug Fixes
- Don't close the engine if it's already closed in #3143
- Fix HF tests with Pin in #3248
- Fix backwards compatibility tests in #3252
- Fix unexpected remote checkpointing downloading in #3271
- Fix HSDP with ShardDegree < 8 in #3313
What's Changed
- Remove CPU offload for DDP/single-gpu by @mvpatel2000 in #3242
- Adding more checkpoint backwards compatibility tests by @snarayan21 in #3244
- Don't close the engine if it's already closed by @dakinggg in #3143
- Replace `evaluator.dataloader.device_eval_batch_size` with `evaluator.device_eval_microbatch_size` by @ShashankMosaicML in #3247
- Fix HF tests with Pin by @mvpatel2000 in #3248
- Remove ICL metrics by @mvpatel2000 in #3243
- Add offset and length arguments for checkpoint validation functions by @irenedea in #3246
- Fix backwards compatibility tests, raise error for torch version mismatch by @snarayan21 in #3252
- Bump cryptography from 41.0.5 to 42.0.6 by @dependabot in #3256
- Bump databricks-sdk from 0.25.1 to 0.27.0 by @dependabot in #3257
- Improve GCS Object Store by @mvpatel2000 in #3251
- add retry to gcs.upload_file by @bigning in #3232
- Add unit test support for full state dict + load_weights_only and save_weights_only by @eracah in #3260
- will/bump_aws_ofi_nccl by @willgleich in #3253
- Fix daily GCS tests by @mvpatel2000 in #3268
- Fix: SAM not working with FSDP/DeepSpeed and LR scheduler. by @Joqsan in #3259
- Add upload timeout patch to mlflow on azure by @dakinggg in #3265
- Add option to stagger uploads based on local rank by @dakinggg in #3275
- explicit close by @dakinggg in #3276
- Update NCCL_ASYNC_ERROR_HANDLING env variable by @priba in #3267
- new dist_cp save planner to fix issue that each rank needs to download all checkpoint files by @bigning in #3271
- Bump to torch 2.2.2 by @mvpatel2000 in #3283
- Fix UCObjectStore.list_objects by @dakinggg in #3284
- Update peft version by @dakinggg in #3287
- replace `load_fsdp_monolith_` with `load_monolith_` by @milocress in #3288
- Return PyTorch Latest by @mvpatel2000 in #3290
- Fix daily tests by filtering a warning by @mvpatel2000 in #3291
- remove orig_params check by @milocress in #2981
- [ckpt-rewr] Get Model State Dict Util Function by @eracah in #3250
- Skip compression check with symlink files by @mvpatel2000 in #3300
- Monkeypatch Device Mesh ND Slicing by @mvpatel2000 in #3302
- Bump coverage[toml] from 7.4.4 to 7.5.1 by @dependabot in #3305
- Bump databricks-sdk from 0.27.0 to 0.27.1 by @dependabot in #3306
- Update transformers requirement from !=4.34.0,<4.41,>=4.11 to >=4.11,!=4.34.0,<4.42 by @dependabot in #3307
- Allow overwrite on upload retry in remote uploader downloader by @irenedea in #3310
- Update platform references by @aspfohl in #3304
- Fix cometml unit tests by @j316chuck in #3314
- Fix HSDP with ShardDegree < 8 by @bigning in #3313
- Update docstring for get_model_state_dict by @eracah in #3318
- Tensor Parallelism Integration by @mvpatel2000 in #3269
- Bugfixes to FSDP + TP by @mvpatel2000 in #3323
- Wct save interval by @KuuCi in #3264
- Wrap ChunkedEncodingError from UCObjectStore by @irenedea in #3321
- Add checkpoint events to mosaicml logger by @b-chu in #3316
- Bump timeout to fix daily tests by @j316chuck in #3325
- Fix FSDP ckpt by filtering User Waring by @j316chuck in #3327
- Revert TP integration by @dakinggg in #3328
- Bump databricks-sdk from 0.27.1 to 0.28.0 by @dependabot in #3331
- Bump sphinxcontrib-katex from 0.9.6 to 0.9.10 by @dependabot in #3333
- Update peft requirement from <0.11,>=0.10.0 to >=0.10.0,<0.12 by @dependabot in #3332
- Bump coverage[toml] from 7.5.1 to 7.5.2 by @dependabot in #3330
- Update protobuf requirement from <5.27 to <5.28 by @dependabot in #3329
- Improving memory snapshot by @cli99 in #3315
- Add A10 to speed monitor by @mvpatel2000 in #3336
- change ComposerModel output type by @hyenal in #3341
- Remove evaluator state by @snarayan21 in #3339
- [ckpt-rewr] Generate Metadata State Dict API by @eracah in #3311
- Tensor Parallelism v2 by @mvpatel2000 in #3335
- Migrate Type Hints for PEP 585 by @mvpatel2000 in #3344
- [checkpoint v2] add remote uploader class by @bigning in #3303
- Raise errors on all ranks for checkpoint download failures by @irenedea in #3345
- Add return type annotation when init doesn't take any argument by @antoinebrl in #3347
- [ckpt-rewr] Get Optim State Dict Util API by @eracah in #3299
- Fix type check issue with device train microbatch size by @mvpatel2000 in #3349
- Add torch distributed checkpointing monkeypatches to enable TE checkpointing for extra_state attribute by @j316chuck in #3298
- Bump coverage[toml] from 7.5.2 to 7.5.3 by @dependabot in #3353
- Update wandb requirement from <0.17,>=0.13.2 to >=0.13.2,<0.18 by @dependabot in #3352
- Optional `CheckpointSaver` instantiation inside the `Trainer` by @antoinebrl in #3334
- MLFlow better experiment defaults by @mvpatel2000 in #3356
- Rename metadata keys by @mvpatel2000 in #3354
- Dataclasses for ParallelismConfig by @mvpatel2000 in #3346
- Upgrade Mofed with apt by @willgleich in #3340
- Multi gpu ci test by @KuuCi in #3312
- Autoresume Validation with Max Duration by @mvpatel2000 in #3358
- Deprecate and bump version to 0.23.0 by @bigning in #3359
New Contributors
Full Changelog: v0.22.0...v0.23.0