Releases: mosaicml/composer
v0.23.5
What's New
1. Variable length dataloaders (#3416)
Adds support for dataloaders with rank-dependent lengths. Iteration now terminates on all ranks as soon as the first rank's dataloader is exhausted.
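For illustration, the cross-rank coordination can be sketched with raw `torch.distributed` collectives. This is a conceptual sketch, not Composer's actual implementation, and assumes a process group is already initialized:

```python
import torch
import torch.distributed as dist

def iterate_until_any_rank_finishes(dataloader, device):
    """Yield batches, stopping on all ranks once any rank is exhausted."""
    it = iter(dataloader)
    while True:
        try:
            batch = next(it)
            exhausted = torch.tensor([0], device=device)
        except StopIteration:
            batch, exhausted = None, torch.tensor([1], device=device)
        # Every rank participates in this all-reduce each step, so all ranks
        # agree on when to stop and collectives stay aligned.
        dist.all_reduce(exhausted, op=dist.ReduceOp.MAX)
        if exhausted.item() == 1:
            return
        yield batch
```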
Bug Fixes
1. Remove close flush for mosaicml logger (#3446)
Previously, the MosaicML Logger sporadically raised an error when the Python interpreter was shutting down, as it attempted to flush data on `Event.CLOSE` using futures, which cannot be scheduled at that time. Instead, we now only block on finishing existing data uploads on `Event.CLOSE`, avoiding scheduling new futures.
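As a rough illustration of the change (hypothetical class and method names, not the logger's real internals):

```python
from concurrent.futures import ThreadPoolExecutor, wait

class UploaderSketch:
    """Sketch: block on in-flight uploads at close instead of submitting new ones."""

    def __init__(self):
        self._executor = ThreadPoolExecutor(max_workers=2)
        self._futures = []

    def log_data(self, data):
        self._futures.append(self._executor.submit(self._upload, data))

    def _upload(self, data):
        pass  # send data to the server

    def close(self):
        wait(self._futures)  # block only on uploads already in flight
        # Crucially, no new futures are scheduled here; submitting work during
        # interpreter shutdown raises RuntimeError.
        self._executor.shutdown(wait=False)
```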
What's Changed
- Update numpy requirement from <1.27.0,>=1.21.5 to >=1.21.5,<2.1.0 by @dependabot in #3406
- Restore dev version by @karan6181 in #3417
- Save checkpoint to disk for API with new save layout by @eracah in #3399
- Patch PyTorch 2.3.1 by @mvpatel2000 in #3419
- Fixes some typing issues by @dakinggg in #3418
- Fix style by @b-chu in #3420
- Bump coverage[toml] from 7.5.3 to 7.5.4 by @dependabot in #3422
- Update psutil requirement from <6,>=5.8.0 to >=5.8.0,<7 by @dependabot in #3424
- Add support for variable length dataloaders in DDP by @JAEarly in #3416
- Hsdp + MoE CI tests by @KuuCi in #3378
- Bumping MLflow version to 2.14.1 by @JackZ-db in #3425
- Skip HSDP + TP pytests that require torch 2.3 or above by @KuuCi in #3426
- Remove CodeQL workflow by @mvpatel2000 in #3429
- Remove save overwrite by @mvpatel2000 in #3431
- Fixes to TP Docs by @snarayan21 in #3430
- Lower the system metrics logging frequency to reduce MLflow server's load by @chenmoneygithub in #3436
- Update paramiko requirement from <3,>=2.11.0 to >=3.4.0,<4 by @dependabot in #3439
- Bump CI testing version by @mvpatel2000 in #3433
- Fix docstring for EVAL_AFTER_ALL/EVAL_BEFORE_ALL by @mvpatel2000 in #3445
- Remove close flush for mosaicml logger by @mvpatel2000 in #3446
- Remove MosaicMLLambdaEvalClient by @aspfohl in #3432
- Relax hf hub pin by @dakinggg in #3435
- Pytest skip 2 by @KuuCi in #3448
- bump version v0.23.5 by @XiaohanZhangCMU in #3450
Full Changelog: v0.23.4...v0.23.5
v0.23.4
Bug Fixes
1. Patch PyTorch 2.3.1 (#3419)
Fixes missing import when monkeypatching device mesh functions in PyTorch 2.3.1. This is necessary for MoE training.
Full Changelog: v0.23.3...v0.23.4
v0.23.3
New Features
1. Update MLflow logger to use the new API with time-dimension to view images in MLflow (#3286)
We've enhanced the MLflow logger's `log_image` function to use the new API with time-dimension support, enabling images to be viewed in MLflow.
2. Add logging buffer time to MLflow logger (#3401)
We've added the `logging_buffer_seconds` argument to the MLflow logger, which specifies how many seconds to buffer before sending logs to the MLflow tracking server.
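For example (the argument name comes from this release; the value shown is arbitrary):

```python
from composer.loggers import MLFlowLogger

mlflow_logger = MLFlowLogger(
    experiment_name='my-first-project',
    logging_buffer_seconds=10,  # buffer logs for 10s before sending to the tracking server
)
```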
Bug Fixes
1. Only require `databricks-sdk` when on the Databricks platform (#3389)
Previously, the MLflow logger always imported `databricks-sdk`. Now, we only require the SDK when running on the Databricks platform and using Databricks secrets to access managed MLflow.
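Conceptually, the conditional import looks something like the following sketch (assumed environment-variable check and helper name, not Composer's exact logic):

```python
import os

def maybe_get_workspace_client():
    # Defer the databricks-sdk import until we know we're on Databricks.
    if 'DATABRICKS_HOST' in os.environ or 'DATABRICKS_TOKEN' in os.environ:
        from databricks.sdk import WorkspaceClient
        return WorkspaceClient()
    return None
```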
2. Skip extra dataset state load during job resumption (#3393)
Previously, when loading a checkpoint with `train_dataloader`, the `dataset_state` would load first, and if `train_dataloader` was set again afterward, `load_state_dict` would be called with a `None` value. Now, we've added a check in the `train_dataloader` setter to skip this redundant load.
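A minimal sketch of the setter guard (assumed attribute names, not Composer's exact code):

```python
class StateSketch:
    def __init__(self):
        self._train_dataloader = None
        self._dataset_state = None  # populated while loading a checkpoint

    @property
    def train_dataloader(self):
        return self._train_dataloader

    @train_dataloader.setter
    def train_dataloader(self, loader):
        self._train_dataloader = loader
        # Only load dataset state if it is still pending, so load_state_dict
        # is never called with a None value when the dataloader is reset.
        if loader is not None and self._dataset_state is not None:
            loader.dataset.load_state_dict(self._dataset_state)
            self._dataset_state = None
```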
3. Fix auto-microbatching on CUDA 12.4 (#3400)
In CUDA 12.4, the out-of-memory error message has changed to `CUDA error: out of memory`. Previously, our logic hardcoded checks for `CUDA out of memory` when using `device_train_microbatch_size="auto"`. Now, we check for both `CUDA out of memory` and `CUDA error: out of memory`.
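The detection amounts to a substring check on the raised error, along these lines (a simplified sketch with an assumed helper name):

```python
def is_cuda_oom(e: RuntimeError) -> bool:
    msg = str(e)
    # CUDA <= 12.3 raises 'CUDA out of memory'; CUDA 12.4 raises
    # 'CUDA error: out of memory'. Match both.
    return 'CUDA out of memory' in msg or 'CUDA error: out of memory' in msg
```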
4. Fix MLflow logging to Databricks workspace file paths that start with the `/Shared/` prefix (#3410)
Previously, for MLflow logging on the Databricks platform, we prepended `/Users/` to all user-provided logging paths that did not already specify a prefix, including paths starting with `/Shared/`. This was incorrect, since `/Shared/` indicates a shared workspace. Now, the `/Users/` prepend is skipped for paths starting with `/Shared/`.
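The corrected behavior can be sketched as follows (hypothetical helper, not Composer's exact code):

```python
def resolve_experiment_path(path: str, user: str) -> str:
    # Workspace-absolute paths are kept as-is; only bare names get the
    # per-user prefix.
    if path.startswith(('/Users/', '/Shared/')):
        return path
    return f'/Users/{user}/{path}'
```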
What's Changed
- Bump CI from 0.0.7 to 0.0.8 by @KuuCi in #3383
- Fix backward compatibility caused by missing eval metrics class by @bigning in #3385
- Bump version v0.23.2 by @bigning in #3386
- Restore dev version by @bigning in #3388
- Only requires `databricks-sdk` when inside the Databricks platform by @antoinebrl in #3389
- Update packaging requirement from <24.1,>=21.3.0 to >=21.3.0,<24.2 by @dependabot in #3392
- Bump cryptography from 42.0.6 to 42.0.8 by @dependabot in #3391
- Skip extra dataset state load by @mvpatel2000 in #3393
- Remove FSDP restriction from PyTorch 1.13 by @mvpatel2000 in #3395
- Check for 'CUDA error: out of memory' when auto-microbatching by @JAEarly in #3400
- Add tokens to iterations by @b-chu in #3374
- Busy wait utils in dist by @dakinggg in #3396
- Add buffering time to mlflow logger by @chenmoneygithub in #3401
- Add missing import for PyTorch 2.3.1 device mesh slicing by @mvpatel2000 in #3402
- Add pynvml to mlflow dep group by @dakinggg in #3404
- min/max flagging added to system_metrics_monitor with only non-redundant, necessary gpu metrics logged by @JackZ-db in #3373
- Simplify launcher world size parsing by @mvpatel2000 in #3398
- Optionally use `flash-attn`'s CE loss for metrics by @snarayan21 in #3394
- log image fix by @jessechancy in #3286
- [ckpt-rewr] Save state dict API by @eracah in #3372
- Revert "Optionally use `flash-attn`'s CE loss for metrics (#3394)" by @snarayan21 in #3408
- CPU tests image fix by @snarayan21 in #3409
- Add setter for epoch in iteration by @b-chu in #3407
- Move pillow dep as required by @mvpatel2000 in #3412
- fixing mlflow logging to Databricks workspace file paths with /Shared/ prefix by @JackZ-db in #3410
- Bump version v0.23.3 by @karan6181 in #3414
Full Changelog: v0.23.2...v0.23.3
v0.23.2
v0.23.1
What's New
1. PyTorch 2.3.1 Upgrade
Composer now supports PyTorch 2.3.1.
What's Changed
- Torch 2.3.1 Upgrade by @mvpatel2000 in #3367
- Fix monkeypatch imports by @mvpatel2000 in #3375
- Remove unnecessary state dict and load_state_dict functions by @eracah in #3361
- Adding checkpoint backwards compatibility tests after 0.23.0 release by @bigning in #3377
- prepare_fsdp_module documentation fix by @KuuCi in #3379
- Composer version bump to v0.23.1 by @snarayan21 in #3380
- Clear caplog and use as context manager in test_logging by @snarayan21 in #3382
Full Changelog: v0.23.0...v0.23.1
v0.23.0
What's New
1. Parallelism V2 + Tensor Parallel (#3335)
Composer now supports PyTorch's implementation of tensor parallelism. As part of this, we've revamped and simplified how Composer does distributed training. Previously, Composer accepted an `fsdp_config` argument in the Trainer:

```python
trainer = Trainer(model, fsdp_config = {'sharding_strategy': 'FULL_SHARD'})
```

As we generalize to more forms of parallelism, we've deprecated `fsdp_config` in favor of `parallelism_config`:
```python
trainer = Trainer(
    model = model,
    ...
    parallelism_config = {
        'fsdp': {
            'sharding_strategy': 'FULL_SHARD',
            'data_parallel_shard_degree': 2,      # Size of shard dimension
            'data_parallel_replicate_degree': 2,  # Size of replicate dimension
        },
        'tp_config': {
            'tensor_parallel_degree': 2,  # Size of TP dimension
            'layer_plan': ...,            # Describes how to TP layers
        },
    },
)
```
As part of this change, we now default to using DTensor for parallelism with PyTorch FSDP. PyTorch has deprecated ShardedTensor, so this migrates to the new backend, which avoids various checkpointing bugs.
See the docs for tensor parallelism for more information. Note that tensor parallelism is still experimental and may be subject to breaking API changes. Not all checkpointing features are guaranteed to work with this form of parallelism.
2. MLflow API Simplification
Previously, the MLflow logger required a tracking URI and an absolute user path for the experiment name when using MLflow with Databricks:

```python
mlflow_logger = MLFlowLogger(
    tracking_uri = 'databricks',
    experiment_name = '/Users/xxx.yyy@zzz.com/my-first-project/',
)

trainer = Trainer(
    model = model,
    ...
    loggers = mlflow_logger,
)
```
Now, if you are using Databricks secrets as environment variables, Composer will autopopulate `tracking_uri` and the `experiment_name` prefix:

```python
trainer = Trainer(
    model = model,
    ...
    loggers = MLFlowLogger(experiment_name='my-first-project'),
)
```
3. Wallclock Save Interval
Composer now supports setting a save interval in wallclock time:
```python
trainer = Trainer(
    model = model,
    ...
    save_interval='30m',
)
```

Note that most durations, such as `max_duration`, do not accept wallclock time; the initial version of this feature is limited to a subset of time fields like `save_interval`.
Bug Fixes
- Don't close the engine if it's already closed in #3143
- Fix HF tests with Pin in #3248
- Fix backwards compatibility tests in #3252
- Fix unexpected remote checkpointing downloading in #3271
- Fix HSDP with ShardDegree < 8 in #3313
What's Changed
- Remove CPU offload for DDP/single-gpu by @mvpatel2000 in #3242
- Adding more checkpoint backwards compatability tests by @snarayan21 in #3244
- Don't close the engine if its already closed by @dakinggg in #3143
- Replace `evaluator.dataloader.device_eval_batch_size` with `evaluator.device_eval_microbatch_size` by @ShashankMosaicML in #3247
- Fix HF tests with Pin by @mvpatel2000 in #3248
- Remove ICL metrics by @mvpatel2000 in #3243
- Add offset and length arguments for checkpoint validation functions by @irenedea in #3246
- Fix backwards compatibility tests, raise error for torch version mismatch by @snarayan21 in #3252
- Bump cryptography from 41.0.5 to 42.0.6 by @dependabot in #3256
- Bump databricks-sdk from 0.25.1 to 0.27.0 by @dependabot in #3257
- Improve GCS Object Store by @mvpatel2000 in #3251
- add retry to gcs.upload_file by @bigning in #3232
- Add unit test support for full state dict + load_weights_only and save_weights_only by @eracah in #3260
- will/bump_aws_ofi_nccl by @willgleich in #3253
- Fix daily GCS tests by @mvpatel2000 in #3268
- Fix: SAM not working with FSDP/DeepSpeed and LR scheduler. by @Joqsan in #3259
- Add upload timeout patch to mlflow on azure by @dakinggg in #3265
- Add option to stagger uploads based on local rank by @dakinggg in #3275
- explicit close by @dakinggg in #3276
- Update NCCL_ASYNC_ERROR_HANDLING env variable by @priba in #3267
- new dist_cp save planner to fix issue that each rank needs to download all checkpoint files by @bigning in #3271
- Bump to torch 2.2.2 by @mvpatel2000 in #3283
- Fix UCObjectStore.list_objects by @dakinggg in #3284
- Update peft version by @dakinggg in #3287
- replace `load_fsdp_monolith_` with `load_monolith_` by @milocress in #3288
- Return PyTorch Latest by @mvpatel2000 in #3290
- Fix daily tests by filtering a warning by @mvpatel2000 in #3291
- remove orig_params check by @milocress in #2981
- [ckpt-rewr] Get Model State Dict Util Function by @eracah in #3250
- Skip compression check with symlink files by @mvpatel2000 in #3300
- Monkeypatch Device Mesh ND Slicing by @mvpatel2000 in #3302
- Bump coverage[toml] from 7.4.4 to 7.5.1 by @dependabot in #3305
- Bump databricks-sdk from 0.27.0 to 0.27.1 by @dependabot in #3306
- Update transformers requirement from !=4.34.0,<4.41,>=4.11 to >=4.11,!=4.34.0,<4.42 by @dependabot in #3307
- Allow overwrite on upload retry in remote uploader downloader by @irenedea in #3310
- Update platform references by @aspfohl in #3304
- Fix cometml unit tests by @j316chuck in #3314
- Fix HSDP with ShardDegree < 8 by @bigning in #3313
- Update docstring for get_model_state_dict by @eracah in #3318
- Tensor Parallelism Integration by @mvpatel2000 in #3269
- Bugfixes to FSDP + TP by @mvpatel2000 in #3323
- Wct save interval by @KuuCi in #3264
- Wrap ChunkedEncodingError from UCObjectStore by @irenedea in #3321
- Add checkpoint events to mosaicml logger by @b-chu in #3316
- Bump timeout to fix daily tests by @j316chuck in #3325
- Fix FSDP ckpt by filtering User Waring by @j316chuck in #3327
- Revert TP integration by @dakinggg in #3328
- Bump databricks-sdk from 0.27.1 to 0.28.0 by @dependabot in #3331
- Bump sphinxcontrib-katex from 0.9.6 to 0.9.10 by @dependabot in #3333
- Update peft requirement from <0.11,>=0.10.0 to >=0.10.0,<0.12 by @dependabot in #3332
- Bump coverage[toml] from 7.5.1 to 7.5.2 by @dependabot in #3330
- Update protobuf requirement from <5.27 to <5.28 by @dependabot in #3329
- Improving memory snapshot by @cli99 in #3315
- Add A10 to speed monitor by @mvpatel2000 in #3336
- change ComposerModel output type by @hyenal in #3341
- Remove evaluator state by @snarayan21 in #3339
- [ckpt-rewr] Generate Metadata State Dict API by @eracah in #3311
- Tensor Parallelism v2 by @mvpatel2000 in #3335
- Migrate Type Hints for PEP 585 by @mvpatel2000 in #3344
- [checkpoint v2] add remote uploader class by @bigning in #3303
- Raise errors on all ranks for checkpoint download failures by @irenedea in #3345
- Add return type annotation when init doesn't take any argument by @antoinebrl in #3347
- [ckpt-rewr] Get Optim State Dict Util API by @eracah in #3299
- Fix type check issue with device train microbatch size by @mvpatel2000 in https://github.com/...
v0.22.0
What's New
🔥 Support for PyTorch v2.3.0
Composer now supports the recently-released PyTorch version 2.3.0! Please raise any issues with us so we can address them.
Bug Fixes
- Fixing checks for device microbatch size for sequence parallelism in #3200
- Fixing token logging in #3206
- Search for run name in MLFlowLogger in #3215
- Fix FQN names with activation checkpointing in #3210
- Strict weight matching for checkpoint loading in #3219
What's Changed
- Bump transformers by @dakinggg in #3197
- Add deprecation warnings for ICL datasets/helper functions/metrics by @bmosaicml in #3125
- Bump traitlets from 5.14.2 to 5.14.3 by @dependabot in #3204
- Raise LR schedule warnings only when necessary by @snarayan21 in #3207
- Add torch 2.3 support by @mvpatel2000 in #3209
- Add torch 2.3 CI/CD by @mvpatel2000 in #3211
- Fix daily test images by @mvpatel2000 in #3212
- Try FAv2 2.5.7 from source by @mvpatel2000 in #3213
- Update tests by @mvpatel2000 in #3217
- Fix torch 2.3 GPU tests by @mvpatel2000 in #3218
- Use flash-attn 2.5.8 with no build isolation in docker images by @snarayan21 in #3224
- Add a torch.cuda.empty_cache() in utils.save_checkpoint by @bfontain in #3216
- Require 2 steps for GS object store by @mvpatel2000 in #3228
- Add `rename_metrics` to Mlflow logger by @hanlint in #3225
- Fix daily tests by @mvpatel2000 in #3229
- Change precision for daily tests by @mvpatel2000 in #3231
- Create new Mlflow run by default and introduce `run_group` by @chenmoneygithub in #3208
- Fix daily test pt 4 by @mvpatel2000 in #3233
- Deprecate and bump version to 0.22 by @mvpatel2000 in #3230
- Fix daily tests v5 by @mvpatel2000 in #3234
- Fix daily v6 by @mvpatel2000 in #3235
- fix daily tests v7 by @mvpatel2000 in #3236
- Raise the daily test timeout by @dakinggg in #3241
- Accelerate GPU tests by @mvpatel2000 in #3237
- Make sharded checkpoint loading backwards-compatible by @snarayan21 in #3240
Full Changelog: v0.21.3...v0.22.0
v0.21.3
Bug Fixes
1. Increased Robustness to Checkpoint Loading
We've patched several edge cases in loading sharded checkpoints, especially with DTensors, which should decrease memory usage when loading checkpoints. We've also hardened retry logic against cloud object store failures, ensuring higher robustness to transient network issues.
What's Changed
- Raise daily test timeout by @mvpatel2000 in #3172
- fix remote file naming by @cli99 in #3173
- [fix] DTensor + SHARD_GRAD_OP + use_orig_params by @bigning in #3175
- Bump db sdk by @dakinggg in #3176
- Build latest pytorch nightly images by @dakinggg in #3179
- Add FP8 TransformerEngine activation checkpointing by @cli99 in #3156
- Enabling the computation of validation loss and other metrics when using sequence parallelism by @ShashankMosaicML in #3183
- Update mosaic_fsdp_utils.py by @vchiley in #3185
- Fix the FSDP.optim_state_dict_to_load OOM by @bigning in #3184
- Revert "Update mosaic_fsdp_utils.py" by @vchiley in #3187
- Bump databricks-sdk from 0.24.0 to 0.25.1 by @dependabot in #3190
- Add version tag to local builds by @mvpatel2000 in #3188
- Update `NeptuneLogger` by @AleksanderWWW in #3165
- Filter neptune warning in doctests by @mvpatel2000 in #3195
- Removal of metrics deepcopy before computing the metrics by @gregjauvion in #3180
- Fix MLFlow Tag Name for Resumption by @KuuCi in #3194
- Fix mistral gating by @dakinggg in #3199
- Bump version to 0.21.3 by @mvpatel2000 in #3198
New Contributors
- @gregjauvion made their first contribution in #3180
Full Changelog: v0.21.2...v0.21.3
v0.21.2
Bug Fixes
1. Enable torch 2.2.2 (#3161)
Composer currently monkeypatches PyTorch for nightly versions in order to fix upstream bugs. With the release of torch 2.2.2, these monkeypatches were mistakenly applied to the stable release due to incorrect gating on imports. This release fixes the gating, enabling torch 2.2.2.
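The corrected gate boils down to checking whether the installed torch build is a dev/nightly release, roughly like this sketch (an assumed check and helper name, not Composer's exact code):

```python
import torch
from packaging import version

def should_apply_nightly_patches() -> bool:
    # A nightly reports a dev version such as '2.3.0.dev20240312+cu121';
    # a stable release like '2.2.2' must be left unpatched.
    return version.parse(torch.__version__).is_devrelease
```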
2. MPS Metric Computation on CPU (#3105)
Due to bugs in computing torchmetrics on Mac devices, we move metric computation onto the CPU. Previously, this path had issues with data not being properly moved to the CPU.
Thank you to @hyenal for this contribution!
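The fix is conceptually a device redirect for metric computation, e.g. (illustrative helper, not Composer's exact code):

```python
import torch

def metrics_device(device: torch.device) -> torch.device:
    # torchmetrics has known issues on MPS, so compute metrics on the CPU there.
    return torch.device('cpu') if device.type == 'mps' else device
```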
3. Batch Sampler Support (#3124)
Composer now supports batch samplers, which previously raised an error if specified in the dataloader.
Thank you to @Ghelfi for this contribution!
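For example, a dataloader like the following, which previously errored, now works (standard PyTorch APIs, arbitrary toy data):

```python
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler

dataset = list(range(100))
batch_sampler = BatchSampler(SequentialSampler(dataset), batch_size=8, drop_last=False)
# Passing batch_sampler (instead of batch_size/sampler/shuffle) is now supported.
dataloader = DataLoader(dataset, batch_sampler=batch_sampler)
```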
What's Changed
- Make codequality callable by @mvpatel2000 in #3133
- Explicitly print checkpoint downloading exception by @bigning in #3131
- Change release actions by @mvpatel2000 in #3136
- Passing rank and num_replicas to dist.get_sampler by @ShashankMosaicML in #3137
- Fix broadcast by @mvpatel2000 in #3138
- Compressor fixes by @mbway in #3142
- In case of MPS device also copy batch to CPU by @hyenal in #3105
- Composer object store download retry by @bigning in #3140
- Bump databricks-sdk from 0.22.0 to 0.23.0 by @dependabot in #3144
- Update transformers requirement from !=4.34.0,<4.39,>=4.11 to >=4.11,!=4.34.0,<4.40 by @dependabot in #3148
- Update protobuf requirement from <3.21 to <5.27 by @dependabot in #3147
- Bump traitlets from 5.14.1 to 5.14.2 by @dependabot in #3145
- Bump to 0.21 by @mvpatel2000 in #3150
- Fixing sequence parallel error conditions and adding type float for microbatch_size in typehints by @ShashankMosaicML in #3139
- Fix torch monkeypatch version check by @dakinggg in #3155
- Update torchmetrics requirement from <1.3.2,>=0.10.0 to >=0.10.0,<1.3.3 by @dependabot in #3157
- Bump gitpython from 3.1.42 to 3.1.43 by @dependabot in #3160
- Prevent crash if signal handler cannot be set by @mbway in #3152
- Pin pillow for code quality workflow by @dakinggg in #3162
- Fix torch version check by @dakinggg in #3161
- add more retry to checkpoint downloading by @bigning in #3164
- Append to gpu rank log files instead of throwing error by @jjanezhang in #3166
- Call `set_epoch` on `Dataloader.batch_sampler` if defined by @Ghelfi in #3124
- Bump version to 0.21.2 by @mvpatel2000 in #3168
Full Changelog: v0.21.1...v0.21.2
v0.21.1
Bug Fixes
1. Fix to HSDP checkpoint loading
The previous release broke checkpoint loading when using HSDP with multiple replicas. This patch release fixes checkpoint loading.
What's Changed
- Fix broadcast by @mvpatel2000 in #3138
Full Changelog: v0.21.0...v0.21.1