v0.23.0
What's New
1. Parallelism V2 + Tensor Parallel (#3335)
Composer now supports PyTorch's implementation of tensor parallelism. As part of this, we've revamped and simplified how Composer does distributed training. Previously, Composer accepted an `fsdp_config` attribute in the Trainer:

```python
trainer = Trainer(model, fsdp_config={'sharding_strategy': 'FULL_SHARD'})
```

As we generalize to more forms of parallelism, we've deprecated `fsdp_config` in favor of `parallelism_config`:
```python
trainer = Trainer(
    model=model,
    ...
    parallelism_config={
        'fsdp': {
            'sharding_strategy': 'FULL_SHARD',
            'data_parallel_shard_degree': 2,      # Size of shard dimension
            'data_parallel_replicate_degree': 2,  # Size of replicate dimension
        },
        'tp_config': {
            'tensor_parallel_degree': 2,  # Size of TP dimension
            'layer_plan': ...,            # Describes how to TP layers
        },
    },
)
```
As part of this change, we now default to using DTensor for parallelism with PyTorch FSDP. PyTorch has deprecated ShardedTensor, so migrating to the new backend avoids various checkpointing bugs.

See the docs for tensor parallel for more information. Note that tensor parallel is still experimental and may be subject to breaking API changes, and not all checkpointing features may work with this parallelism yet.
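For context, the `layer_plan` maps submodule names to PyTorch's tensor-parallel styles. Below is a minimal sketch, assuming a model with two linear submodules named `fc1` and `fc2`; the module names and surrounding setup are hypothetical illustrations, not part of Composer's documented API:

```python
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel

from composer import Trainer

# Hypothetical plan for a two-layer MLP: shard fc1's weight column-wise
# and fc2's weight row-wise across the tensor parallel dimension.
layer_plan = {
    'fc1': ColwiseParallel(),
    'fc2': RowwiseParallel(),
}

trainer = Trainer(
    model=model,  # assumes `model` is a ComposerModel wrapping the MLP
    parallelism_config={
        'fsdp': {'sharding_strategy': 'FULL_SHARD'},
        'tp_config': {
            'tensor_parallel_degree': 2,
            'layer_plan': layer_plan,
        },
    },
)
```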
2. MLFlow API Simplification
Previously, the MLFlow logger required a tracking URI and an absolute user path when using MLFlow with Databricks:
```python
mlflow_logger = MLFlowLogger(
    tracking_uri='databricks',
    experiment_name='/Users/xxx.yyy@zzz.com/my-first-project/',
)

trainer = Trainer(
    model=model,
    ...
    loggers=mlflow_logger,
)
```
Now, if you are using Databricks secrets as environment variables, Composer will autopopulate `tracking_uri` and the `experiment_name` prefix:
```python
trainer = Trainer(
    model=model,
    ...
    loggers=MLFlowLogger(experiment_name='my-first-project'),
)
```
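For reference, a minimal sketch of the environment variables this relies on, assuming the standard Databricks SDK conventions; the workspace URL and token below are placeholders, and in practice these values are typically injected from Databricks secrets rather than set in code:

```python
import os

# Placeholder values; in practice these come from Databricks secrets.
os.environ['DATABRICKS_HOST'] = 'https://my-workspace.cloud.databricks.com'
os.environ['DATABRICKS_TOKEN'] = 'dapi-xxxx'  # personal access token
```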
3. Wallclock Save Interval
Composer now supports setting a save interval in wallclock time:
```python
trainer = Trainer(
    model=model,
    ...
    save_interval='30m',
)
```
Note that most durations, such as `max_duration`, do not accept wallclock time; the initial version of this feature is limited to a subset of time features like `save_interval`.
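For comparison, `save_interval` continues to accept Composer's progress-based time strings, so the wallclock form is simply a new option alongside them. A short sketch (the `save_folder` path is a hypothetical placeholder):

```python
from composer import Trainer

# Progress-based intervals still work; wallclock is a new alternative.
trainer = Trainer(
    model=model,                  # assumes a ComposerModel from earlier setup
    save_folder='./checkpoints',  # hypothetical local path
    save_interval='1ep',          # every epoch; '1000ba' (batches) also works
)
```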
Bug Fixes
- Don't close the engine if it's already closed in #3143
- Fix HF tests with Pin in #3248
- Fix backwards compatibility tests in #3252
- Fix unexpected remote checkpointing downloading in #3271
- Fix HSDP with ShardDegree < 8 in #3313
What's Changed
- Remove CPU offload for DDP/single-gpu by @mvpatel2000 in #3242
- Adding more checkpoint backwards compatibility tests by @snarayan21 in #3244
- Don't close the engine if it's already closed by @dakinggg in #3143
- Replace `evaluator.dataloader.device_eval_batch_size` with `evaluator.device_eval_microbatch_size` by @ShashankMosaicML in #3247
- Fix HF tests with Pin by @mvpatel2000 in #3248
- Remove ICL metrics by @mvpatel2000 in #3243
- Add offset and length arguments for checkpoint validation functions by @irenedea in #3246
- Fix backwards compatibility tests, raise error for torch version mismatch by @snarayan21 in #3252
- Bump cryptography from 41.0.5 to 42.0.6 by @dependabot in #3256
- Bump databricks-sdk from 0.25.1 to 0.27.0 by @dependabot in #3257
- Improve GCS Object Store by @mvpatel2000 in #3251
- add retry to gcs.upload_file by @bigning in #3232
- Add unit test support for full state dict + load_weights_only and save_weights_only by @eracah in #3260
- will/bump_aws_ofi_nccl by @willgleich in #3253
- Fix daily GCS tests by @mvpatel2000 in #3268
- Fix: SAM not working with FSDP/DeepSpeed and LR scheduler. by @Joqsan in #3259
- Add upload timeout patch to mlflow on azure by @dakinggg in #3265
- Add option to stagger uploads based on local rank by @dakinggg in #3275
- explicit close by @dakinggg in #3276
- Update NCCL_ASYNC_ERROR_HANDLING env variable by @priba in #3267
- new dist_cp save planner to fix issue that each rank needs to download all checkpoint files by @bigning in #3271
- Bump to torch 2.2.2 by @mvpatel2000 in #3283
- Fix UCObjectStore.list_objects by @dakinggg in #3284
- Update peft version by @dakinggg in #3287
- replace `load_fsdp_monolith_` with `load_monolith_` by @milocress in #3288
- Return PyTorch Latest by @mvpatel2000 in #3290
- Fix daily tests by filtering a warning by @mvpatel2000 in #3291
- remove orig_params check by @milocress in #2981
- [ckpt-rewr] Get Model State Dict Util Function by @eracah in #3250
- Skip compression check with symlink files by @mvpatel2000 in #3300
- Monkeypatch Device Mesh ND Slicing by @mvpatel2000 in #3302
- Bump coverage[toml] from 7.4.4 to 7.5.1 by @dependabot in #3305
- Bump databricks-sdk from 0.27.0 to 0.27.1 by @dependabot in #3306
- Update transformers requirement from !=4.34.0,<4.41,>=4.11 to >=4.11,!=4.34.0,<4.42 by @dependabot in #3307
- Allow overwrite on upload retry in remote uploader downloader by @irenedea in #3310
- Update platform references by @aspfohl in #3304
- Fix cometml unit tests by @j316chuck in #3314
- Fix HSDP with ShardDegree < 8 by @bigning in #3313
- Update docstring for get_model_state_dict by @eracah in #3318
- Tensor Parallelism Integration by @mvpatel2000 in #3269
- Bugfixes to FSDP + TP by @mvpatel2000 in #3323
- Wct save interval by @KuuCi in #3264
- Wrap ChunkedEncodingError from UCObjectStore by @irenedea in #3321
- Add checkpoint events to mosaicml logger by @b-chu in #3316
- Bump timeout to fix daily tests by @j316chuck in #3325
- Fix FSDP ckpt by filtering User Waring by @j316chuck in #3327
- Revert TP integration by @dakinggg in #3328
- Bump databricks-sdk from 0.27.1 to 0.28.0 by @dependabot in #3331
- Bump sphinxcontrib-katex from 0.9.6 to 0.9.10 by @dependabot in #3333
- Update peft requirement from <0.11,>=0.10.0 to >=0.10.0,<0.12 by @dependabot in #3332
- Bump coverage[toml] from 7.5.1 to 7.5.2 by @dependabot in #3330
- Update protobuf requirement from <5.27 to <5.28 by @dependabot in #3329
- Improving memory snapshot by @cli99 in #3315
- Add A10 to speed monitor by @mvpatel2000 in #3336
- change ComposerModel output type by @hyenal in #3341
- Remove evaluator state by @snarayan21 in #3339
- [ckpt-rewr] Generate Metadata State Dict API by @eracah in #3311
- Tensor Parallelism v2 by @mvpatel2000 in #3335
- Migrate Type Hints for PEP 585 by @mvpatel2000 in #3344
- [checkpoint v2] add remote uploader class by @bigning in #3303
- Raise errors on all ranks for checkpoint download failures by @irenedea in #3345
- Add return type annotation when init doesn't take any argument by @antoinebrl in #3347
- [ckpt-rewr] Get Optim State Dict Util API by @eracah in #3299
- Fix type check issue with device train microbatch size by @mvpatel2000 in #3349
- Add torch distributed checkpointing monkeypatches to enable TE checkpointing for extra_state attribute by @j316chuck in #3298
- Bump coverage[toml] from 7.5.2 to 7.5.3 by @dependabot in #3353
- Update wandb requirement from <0.17,>=0.13.2 to >=0.13.2,<0.18 by @dependabot in #3352
- Optional `CheckpointSaver` instantiation inside the `Trainer` by @antoinebrl in #3334
- MLFlow better experiment defaults by @mvpatel2000 in #3356
- Rename metadata keys by @mvpatel2000 in #3354
- Dataclasses for ParallelismConfig by @mvpatel2000 in #3346
- Upgrade Mofed with apt by @willgleich in #3340
- Multi gpu ci test by @KuuCi in #3312
- Autoresume Validation with Max Duration by @mvpatel2000 in #3358
- Deprecate and bump version to 0.23.0 by @bigning in #3359
New Contributors
Full Changelog: v0.22.0...v0.23.0