Releases: mosaicml/composer
v0.15.0
🚀 Composer v0.15.0
What's New
- **Exact Eval (#2218)**

  Composer now supports exact evaluation! Now, evaluation will give the exact same results regardless of the number of GPUs by removing any duplicated samples from the dataloader.
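The idea can be illustrated with a minimal plain-Python sketch (not Composer's actual implementation): distributed samplers pad the dataset so every rank gets an equal share, and exact eval instead shards so that every sample appears exactly once.

```python
# Minimal sketch (not Composer's actual implementation) of why multi-GPU
# eval used to differ: samplers pad the dataset so each rank gets an equal
# share, and the padded duplicates skew averaged metrics.

def shard_with_padding(dataset, num_ranks):
    """Pad the dataset by wrapping around, then split it across ranks."""
    per_rank = -(-len(dataset) // num_ranks)  # ceil division
    padded = [dataset[i % len(dataset)] for i in range(per_rank * num_ranks)]
    return [padded[r::num_ranks] for r in range(num_ranks)]

def exact_eval_shards(dataset, num_ranks):
    """Shard without padding so every sample appears exactly once."""
    return [dataset[r::num_ranks] for r in range(num_ranks)]

dataset = list(range(10))  # 10 samples across 4 GPUs
padded_shards = shard_with_padding(dataset, 4)
exact_shards = exact_eval_shards(dataset, 4)

assert sum(len(s) for s in padded_shards) == 12  # 2 duplicated samples
assert sorted(x for s in exact_shards for x in s) == dataset  # exact
```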
- **Monolithic Checkpoint Loading (#2288)**

  When training large models, loading the model and optimizer on every rank can use up all the system memory. With FSDP, Composer can now load the model and optimizer on only rank 0 and broadcast them to all other ranks. To enable:

  ```python
  from composer import Trainer

  # Construct Trainer
  trainer = Trainer(
      ...,
      fsdp_config={'load_monolith_rank0_only': True},
  )

  # Train!
  trainer.fit()
  ```

  and ensure the model on rank 0 is on CPU/GPU (as opposed to meta).
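Conceptually, the flow looks like the following plain-Python sketch (not Composer/FSDP internals; the function names here are illustrative only): rank 0 materializes the full state, and every other rank receives a broadcast copy instead of loading from disk itself.

```python
# Conceptual sketch (plain Python, not Composer/FSDP internals) of monolithic
# rank-0 loading: only rank 0 reads the checkpoint from disk, then all ranks
# end up with identical state via a broadcast.

def monolith_rank0_load(rank, load_from_disk, broadcast_from_rank0):
    if rank == 0:
        state = load_from_disk()       # full model + optimizer, rank 0 only
    else:
        state = None                   # other ranks start empty ("meta")
    return broadcast_from_rank0(state)  # all ranks receive the same state

# Simulate 4 ranks sharing one broadcast channel.
checkpoint = {'w': [1.0, 2.0]}
results = [
    monolith_rank0_load(r, lambda: checkpoint,
                        lambda s: s if s is not None else checkpoint)
    for r in range(4)
]
assert all(r == checkpoint for r in results)
```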
- **Spin Dataloaders**

  By default, Composer spins dataloaders back to the current timestamp to ensure deterministic resumption. However, dataloader spinning can be very slow, so `Trainer` now has a new flag to disable spinning if determinism is not required. To enable:

  ```python
  from composer import Trainer

  # Construct Trainer
  trainer = Trainer(
      ...,
      spin_dataloaders=False,
  )

  # Train!
  trainer.fit()
  ```
Deprecations
- `HealthChecker` is now deprecated and will be removed in `v0.17.0`
Bug Fixes
- Add support for saving HF info in state dict when using DDP by @dakinggg in #2206
- Change state dict loading default to strict by @dakinggg in #2216
- CE loss vs CE metric equivalence by @dakinggg in #2241
- Move sharded checkpoints into their own intermediate prefix folder by @eracah in #2205
- Fix typo depricated -> deprecated by @eracah in #2270
- Spin dataloader arg by @mvpatel2000 in #2267
- Confirming the output variable has two dimensions before confirming the shape of the second element. by @jimmiemunyi in #2275
- Add loss_dict keyword to closure lambda function by @Landanjs in #1952
- Strip spacing icl by @bmosaicml in #2306
What's Changed
- Update FFCV by @mvpatel2000 in #2197
- Add support for saving HF info in state dict when using DDP by @dakinggg in #2206
- Bump junitparser from 3.0.0 to 3.1.0 by @dependabot in #2212
- Bump sentencepiece from 0.1.98 to 0.1.99 by @dependabot in #2208
- Add docs for Checkpointing with Cloudflare R2 by @eracah in #2215
- Working slack link by @growlix in #2217
- Change state dict loading default to strict by @dakinggg in #2216
- Fix typo in evaluation docs by @dakinggg in #2225
- Clean soft cross entropy by @mvpatel2000 in #2227
- add cmake by @dakinggg in #2229
- Upgrade to mcli0.4, smaller mcli improvements by @aspfohl in #2226
- Bump to torch 2.0.1 by @mvpatel2000 in #2235
- Deprecate healthchecker by @mvpatel2000 in #2236
- Update torch 2.0.1 workflows by @mvpatel2000 in #2239
- Log wandb URL to metadata by @mvpatel2000 in #2240
- Bump ipykernel from 6.22.0 to 6.23.1 by @dependabot in #2244
- Update transformers requirement from <4.29,>=4.11 to >=4.11,<4.30 by @dependabot in #2245
- CE loss vs CE metric equivalence by @dakinggg in #2241
- Exact Eval by @mvpatel2000 in #2218
- bump torchmetrics pin by @nik-mosaic in #2247
- Remove deprecated code / torch 1.11 / torch 1.12 by @mvpatel2000 in #2234
- Rename `backwards_create_graph` description by @mvpatel2000 in #2248
- Move sharded checkpoints into their own intermediate prefix folder by @eracah in #2205
- Fix daily tests by fixing test_fsdp_load_old_checkpoint by @eracah in #2249
- Support for multiple optimizer groups in torch 2.0 + FSDP by @sashaDoubov in #2230
- Change AdamW step to a tensor instead of an int by @eracah in #2237
- Update to cuda 11.8 by @mvpatel2000 in #2250
- Fix daily tests by adding s3 secrets to daily-gpu tests by @eracah in #2254
- Typo in s3_prefix: epemeral -> ephemeral 🤦♂️ by @eracah in #2255
- Bump yamllint from 1.31.0 to 1.32.0 by @dependabot in #2256
- Bump coverage[toml] from 7.2.5 to 7.2.6 by @dependabot in #2258
- Add callbacks for EVAL_BEFORE_ALL and EVAL_AFTER_ALL by @rishab-partha in #2264
- Update torch device naming convention for h100 gpus by @vchiley in #2265
- Fix typo depricated -> deprecated by @eracah in #2270
- alerts for daily tests by @mvpatel2000 in #2272
- Fix daily tests by patching cupy version by @mvpatel2000 in #2274
- Skip ffcv notebook by @mvpatel2000 in #2277
- Spin dataloader arg by @mvpatel2000 in #2267
- Confirming the output variable has two dimensions before confirming the shape of the second element. by @jimmiemunyi in #2275
- Bump coverage[toml] from 7.2.6 to 7.2.7 by @dependabot in #2282
- Patch for tokenizers that have python files in save_pretrained output by @dakinggg in #2279
- fix get file(overwite=True) to properly handle pre-existing files by @bmosaicml in #2284
- Fix Checkpointing Docs Link by @rishab-partha in #2278
- Add errors for Mixed Dataloader Eval by @rishab-partha in #2269
- Fix autoresume with slashed directory by @rishab-partha in #2287
- Delete symlinks when not saving checkpoints locally by @rishab-partha in #2285
- fixed adding tokenizer to hf by @KuuCi in #2290
- New Console Logger Test + Discard before Eval by @rishab-partha in #2273
- Enabled kv caching during generate to speed up QA Task by @bmosaicml in #2293
- Update monai requirement from <1.2,>=0.9.1 to >=0.9.1,<1.3 by @dependabot in #2298
- Bump sphinxcontrib-katex from 0.9.4 to 0.9.5 by @dependabot in #2296
- Training Checkpoint Fix by @KuuCi in #2294
- Update transformers requirement from <4.30,>=4.11 to >=4.11,<4.31 by @dependabot in #2295
- Fixed how save_checkpoint_to_save_folder called CheckpointSaver object to save state and logger by @KuuCi in #2300
- Update Slack link in README.md by @ejyuen in #2261
- Change progress bar logger to print all eval metrics by @rishab-partha in #2286
- Add pytest clear cache by @rishab-partha in #2305
- Fix tests for wandb and mlflow loggers by @b-chu in #2302
- Monolithic Loading by @mvpatel2000 in #2288
- Add loss_dict keyword to closure lambda function by @Landanjs in #1952
- Strip spacing icl by @bmosaicml in #2306
- Add additional error with auto microbatching by @mvpatel2000 in #2308
- Group autoresume messages by @mvpatel2000 in #2307
- Move deepspeed enabled to state by @mvpatel2000 in #2309
- Jiggling tests and adding gc collect by @bcui19 in #2312
- Monolithic loading improvements by @mvpatel2000 in #2313
- Update version to 0.15 by @mvpatel2000 in #2315
New Contributors
- @aspfohl made their first contribution in #2226
- @sashaDoubov made their first contribution in #2230
- @rishab-partha made their first contribution in...
v0.14.1
Bug Fixes
Fixes a bug related to sentpiece tokenizers and ICL eval.
What's Changed
- Update docs to remove gradient clipping in events by @mvpatel2000 in #2193
- remove explorer info from readme by @nik-mosaic in #2174
- bugfix sentpiece by @bmosaicml in #2198
- Fix Broken Training Loop Image Link by @eracah in #2199
- Fix broken image link for GLU by @eracah in #2201
- bugfix sentpiece (#2198) by @bmosaicml in #2200
- Bump version to v0.14.1 by @mvpatel2000 in #2202
- Pin protobuf by @mvpatel2000 in #2203
Full Changelog: v0.14.0...v0.14.1
v0.14.0
🚀 Composer v0.14.0
Composer v0.14.0 is released! Install via `pip`:

```bash
pip install composer==0.14.0
```

The legacy package name still works via `pip`:

```bash
pip install mosaicml==0.14.0
```
New Features
- **🆕 PyTorch 2.0 Support (#2172)**

  We're thrilled to announce official support for PyTorch 2.0! We've got all initial unit tests passing and run through our examples. We've also made some updates to start taking advantage of all the great new features.
Initial support also includes:
  - Support for `torch.compile`

    | Model | Dataset | Without compile throughput (samples/sec) | With compile throughput (samples/sec) | Performance % |
    |---|---|---|---|---|
    | ResNet50 | ImageNet | 5557 | 7424 | 33.60% |
    | DeepLab V3 | ADE20K | 81.60 | 98.82 | 21.10% |
    | HF BERT | C4 | 3360 | 4259 | 26.75% |
    | HF Causal LM | C4 | 50.61 | 103.29 | 100.05% |

    To start using, simply add the `compile_config` argument to the `Trainer`:

    ```python
    # To use the default `torch.compile` config
    trainer = Trainer(
        ...,
        compile_config={},
    )

    # To use a custom `torch.compile` config, provide an argument as a dictionary, for example:
    trainer = Trainer(
        ...,
        compile_config={'mode': 'reduce-overhead'},
    )
    ```
  The `Trainer` also supports pre-compiled models passed via the `models` argument. If the model has been pre-compiled, the `compile_config` argument is ignored if provided.

  Note: We recommend baselining your model with and without `torch.compile`, as there are scenarios where enabling compile does not yield any throughput improvements, and in some cases it can lead to a regression.

- **PyTorch 2.0 Docker Images**
We've added the following new official MosaicML Docker Images with PyTorch 2.0 support:
  | Linux Distro | Flavor | PyTorch Version | CUDA Version | Python Version | Docker Tags |
  |---|---|---|---|---|---|
  | Ubuntu 20.04 | Base | 2.0.0 | 11.7.1 (Infiniband) | 3.10 | `mosaicml/pytorch:2.0.0_cu117-python3.10-ubuntu20.04` |
  | Ubuntu 20.04 | Base | 2.0.0 | 11.7.1 (EFA) | 3.10 | `mosaicml/pytorch:2.0.0_cu117-python3.10-ubuntu20.04-aws` |
  | Ubuntu 20.04 | Base | 2.0.0 | cpu | 3.10 | `mosaicml/pytorch:2.0.0_cpu-python3.10-ubuntu20.04` |
  | Ubuntu 20.04 | Vision | 2.0.0 | 11.7.1 (Infiniband) | 3.10 | `mosaicml/pytorch_vision:2.0.0_cu117-python3.10-ubuntu20.04` |
  | Ubuntu 20.04 | Vision | 2.0.0 | cpu | 3.10 | `mosaicml/pytorch_vision:2.0.0_cpu-python3.10-ubuntu20.04` |
- **🦾 New Callbacks**

  - **Activation monitor (#2066)**

    Monitors activations in the network. Every `interval` batches, it attaches a forward hook and logs the max, average, L2 norm, and kurtosis for the input and output activations. To enable:

    ```python
    from composer import Trainer
    from composer.callbacks import ActivationMonitor

    # Construct Trainer
    trainer = Trainer(
        ...,
        callbacks=[ActivationMonitor()],
    )

    # Train!
    trainer.fit()
    ```
  - **Slack Logger (#2133)**

    You can now send custom training metrics to Slack! To enable:

    ```python
    from composer import Trainer
    from composer.loggers import SlackLogger

    trainer = Trainer(
        ...,
        loggers=[
            SlackLogger(
                log_interval='10ba',  # or 1ep, 2ep
                include_keys=['algorithm_traces*', 'loss*'],
                formatter_func=(lambda data, **kwargs: [
                    {
                        'type': 'section',
                        'text': {'type': 'mrkdwn', 'text': f'*{k}:* {v}'},
                    }
                    for k, v in data.items()
                ]),
            )
        ],
    )

    trainer.fit()
    ```

    Please see PR #2133 for additional details.
API changes

- The `grad_accum` argument has been removed from `Trainer`; users are now required to use `device_train_microbatch_size` instead (#2040)
Deprecations
- We no longer support PyTorch 1.11 and 1.12 due to security vulnerabilities. New features will not be tested against these versions.
Bug Fixes
- Eval subset num batches bug fix (#2028)
- Protect for missing slack_sdk import (#2031)
- Adjust HuggingFaceModel token embedding resizing to only occur when necessary (#2027)
- Update FSDP meta weight tying tests to include precision testing (#2050)
- Backward Compat with Torchmetrics (#2046)
- Busy wait for local rank 0 download to avoid timeout on large file download (#2054)
- Fix OCIObjectStore save_overwrite=False bug (#2053)
- Busy wait so that non local rank zeros don't timeout while local rank zero downloads a monolithic checkpoint (#2071)
- Skip extra downloads when not using a format string (#2073)
- fix name_or_path usage in HF save/load usage (#2075)
- Fix EMA resumption issue with calling trainer.eval() before trainer.fit() (#2088)
- Patch EMA with FSDP (#2091)
- Updating gradient clipping to be torch 2.0 compatible (#2089)
- Adding checks for weight tying s.t. we don't think None attributes are weight tied (#2103)
- gate the extra forward call specifically for fsdp (#2102)
- Allow user to set ONNX opset version when Exporting for Inference (#2101)
- Runtime estimator (#2124)
- Use state_dict Torchmetrics Serialization (#2116)
- Fix filelock in checkpoint download (#2184)
What's Changed
- Eval subset num batches bug fix by @mvpatel2000 in #2028
- Protect for missing `slack_sdk` import by @hanlint in #2031
- switch code quality workflow to dev target and smoketest by @dakinggg in #2032
- Generate composer PyPi package by @bandish-shah in #2034
- HealthChecker should only send test message on global rank zero by @hanlint in #2035
- Bump version to 0.13.1 by @bandish-shah in #2033
- Use follow in mcp script by @mvpatel2000 in #2022
- Bump pytest from 7.2.1 to 7.2.2 by @dependabot in #2039
- Bump pypandoc from 1.10 to 1.11 by @dependabot in #2038
- Adds a PR guidelines section to contributing.md by @dakinggg in #1993
- Adjust HuggingFaceModel token embedding resizing to only occur when necessary by @dakinggg in #2027
- Remove deprecated code by @mvpatel2000 in #2026
- test and fix composer package name usage in composer_collect_env by @dakinggg in #2049
- Log nodename information in composer by @eracah in #2043
- Update FSDP meta weight tying tests to include precision testing by @bcui19 in #2050
- Backward Compat with Torchmetrics by @mvpatel2000 in #2046
- update fsdp mixed precision by @vchiley in #2047
- Checkpoints Simplified by @mvpatel2000 in #2041
- Add composer PyPI package tests to daily workflow by @bandish-shah in #2052
- Delete composer package GPU workflow by @dakinggg in #2055
- Revert "Checkpoints Simplified (#2041)" by @dakinggg in #2056
- Raise error if attempting to export FSDP model by @hanlint in #2051
- Busy wait for local rank 0 download to avoid timeout on large file download by @dakinggg in #2054
- Fix OCIObjectStore save_overwrite=False bug by @eracah in #2053
- Update docs with non-rank zero logs instructions by @hanlint in #2058
- Pin torchmetrics by @mvpatel2000 in #2065
- Add `NO_REENTRANT` activation checkpointing by @bmosaicml in #20...
v0.13.5
Full Changelog: v0.13.4...v0.13.5
- Add support for EMA + FSDP
v0.13.4
Full Changelog: v0.13.3...v0.13.4
Bumps streaming version pin to <1.0
v0.13.3
🚀 Composer v0.13.3
Introducing the `composer` PyPi package!

Composer v0.13.3 is released!

Composer can also now be installed using the new `composer` PyPi package via `pip`:

```bash
pip install composer==0.13.3
```

The legacy package name still works via `pip`:

```bash
pip install mosaicml==0.13.3
```
Bug Fixes
What's Changed
- Bump version to 0.13.3 by @bandish-shah in #2115
- add missing import by @dakinggg in #2113
- add sentencepiece support by @dakinggg in #2093
- Pin mcli version until API change is resolved by @dakinggg in #2111
Full Changelog: v0.13.2...v0.13.3
v0.13.2
🚀 Composer v0.13.2
Introducing the `composer` PyPi package!

Composer v0.13.2 is released!

Composer can also now be installed using the new `composer` PyPi package via `pip`:

```bash
pip install composer==0.13.2
```

The legacy package name still works via `pip`:

```bash
pip install mosaicml==0.13.2
```
Bug Fixes
- test and fix composer package name usage in composer_collect_env (#2049)
- Backward Compat with Torchmetrics by @mvpatel2000 (#2046)
- Fix OCIObjectStore save_overwrite=False bug (#2053)
- busy wait for the rank 0 download (#2071)
- Skip extra downloads when not using a format string (#2073)
What's Changed
- Pin transformers package to <4.27 by @dakinggg in #2076
- Bump version to v0.13.2 (#2068) by @bandish-shah
- Skip extra downloads when not using a format string by @dakinggg in #2073
- add support for autoresume + FSDP + sharding by @dakinggg in #2072
- busy wait for the rank 0 download by @dakinggg in #2071
- Revert "Checkpoints Simplified (#2059)" by @dakinggg in #2070
- Add `device` and `dtype` back to `LPLayerNorm` (#2067) by @abhi-mosaic
- Checkpoints Simplified by @mvpatel2000 in #2059
- Allow `LPLayerNorm` and `LPGroupNorm` to support `self.bias` or `self.weight` = None (#2044) by @abhi-mosaic
- Add `NO_REENTRANT` activation checkpointing (#2042) by @bmosaicml
- pin torchmetrics by @mvpatel2000 in #2065
- Update docs with non-rank zero logs instructions by @hanlint in #2058
- Fix OCIObjectStore save_overwrite=False bug by @eracah in #2053
- Busy wait for local rank 0 download to avoid timeout on large file download by @dakinggg in #2054
- Raise error if attempting to export FSDP model by @hanlint in #2051
- Revert "Checkpoints Simplified (#2041)" by @dakinggg in #2056
- Delete composer package GPU workflow by @dakinggg in #2055
- Add composer PyPI package tests to daily workflow (#2052) by @bandish-shah
- Checkpoints Simplified by @mvpatel2000 in #2041
- update fsdp mixed precision by @vchiley in #2047
- Backward Compat with Torchmetrics by @mvpatel2000 in #2046
- Update FSDP meta weight tying tests to include precision testing by @bcui19 in #2050
- Log nodename information in composer by @eracah in #2043
- test and fix composer package name usage in composer_collect_env by @dakinggg in #2049
- Adjust how HuggingFaceModel handles embedding resizing by @dakinggg in #2027
- Adds a PR guidelines section to contributing.md by @dakinggg in #1993
- Bump pypandoc from 1.10 to 1.11 (#2038) by @dependabot[bot]
- Bump pytest from 7.2.1 to 7.2.2 (#2039) by @dependabot[bot]
- Use follow in mcp script by @mvpatel2000 in #2022
Full Changelog: v0.13.1...v0.13.2
v0.13.1
🚀 Composer v0.13.1
Introducing the `composer` PyPi package!

Composer v0.13.1 is released!

Composer can also now be installed using the new `composer` PyPi package via `pip`:

```bash
pip install composer==0.13.1
```

The legacy package name still works via `pip`:

```bash
pip install mosaicml==0.13.1
```

Note: The `mosaicml==0.13.0` PyPi package was yanked due to some minor packaging issues discovered after release. The package was re-released as Composer v0.13.1, thus these release notes contain details for both v0.13.0 and v0.13.1.
New Features
- **🤙 New and Updated Callbacks**

  - **New `HealthChecker` Callback (#2002)**

    The callback will log a warning if the GPUs on a given node appear to be in poor health (low utilization). The callback can also be configured to send a Slack message!

    ```python
    from composer import Trainer
    from composer.callbacks import HealthChecker

    # Warn if GPU utilization difference drops below 10%
    health_checker = HealthChecker(
        threshold=10,
    )

    # Construct Trainer
    trainer = Trainer(
        ...,
        callbacks=health_checker,
    )

    # Train!
    trainer.fit()
    ```
  - Updated `MemoryMonitor` to use gigabyte (GB) units (#1940)

  - **New `RuntimeEstimator` Callback (#1991)**

    Estimate the remaining runtime of your job! Approximates the time remaining by observing the throughput and comparing to the number of batches remaining.

    ```python
    from composer import Trainer
    from composer.callbacks import RuntimeEstimator

    # Construct trainer with RuntimeEstimator callback
    trainer = Trainer(
        ...,
        callbacks=RuntimeEstimator(),
    )

    # Train!
    trainer.fit()
    ```
  - **Updated `SpeedMonitor` throughput metrics (#1987)**

    Expands throughput metrics to track relative to several different time units and per device:

    - `throughput/batches_per_sec` and `throughput/device/batches_per_sec`
    - `throughput/tokens_per_sec` and `throughput/device/tokens_per_sec`
    - `throughput/flops_per_sec` and `throughput/device/flops_per_sec`
    - `throughput/device/samples_per_sec`

    Also adds a `throughput/device/mfu` metric to compute per-device MFU. Simply enable the `SpeedMonitor` callback per usual to log these new metrics! Please see the SpeedMonitor documentation for more information.
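As a rough illustration of what the per-device metrics and MFU represent (hypothetical numbers and simplified formulas; see the SpeedMonitor source for the exact computation):

```python
# Rough illustration of per-device throughput and MFU (hypothetical numbers;
# simplified relative to the formulas Composer's SpeedMonitor actually uses).

def per_device(global_rate, world_size):
    """throughput/device/* is the global rate split evenly across devices."""
    return global_rate / world_size

def mfu(achieved_flops_per_sec, peak_flops):
    """Model FLOPs Utilization: achieved FLOPs as a fraction of hardware peak."""
    return achieved_flops_per_sec / peak_flops

world_size = 8
samples_per_sec = 4096.0   # hypothetical global throughput/samples_per_sec
flops_per_sec = 2.4e15     # hypothetical global throughput/flops_per_sec

device_samples = per_device(samples_per_sec, world_size)  # 512.0
device_flops = per_device(flops_per_sec, world_size)      # 3e14

# Assuming a peak of ~3.12e14 FLOP/s per device (e.g. A100 bf16):
utilization = mfu(device_flops, 3.12e14)
assert round(utilization, 3) == 0.962
```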
- **⣿ FSDP Sharded Checkpoints (#1902)**

  Users can now specify the `state_dict_type` in the `fsdp_config` dictionary to enable sharded checkpoints. For example:

  ```python
  from composer import Trainer

  fsdp_config = {
      'sharding_strategy': 'FULL_SHARD',
      'state_dict_type': 'local',
  }

  trainer = Trainer(
      ...,
      fsdp_config=fsdp_config,
      save_folder='checkpoints',
      save_filename='ba{batch}_rank{rank}.pt',
      save_interval='10ba',
  )
  ```
Please see the PyTorch FSDP docs and Composer's Distributed Training notes for more information.
- **🤗 HuggingFace Improvements**

  - Update `HuggingFaceModel` class to support encoder-decoder batches without `decoder_input_ids` (#1950)
  - Allow evaluation metrics to be passed to `HuggingFaceModel` directly (#1971)
  - Add a utility function to load a Composer checkpoint of a `HuggingFaceModel` and write out the expected `config.json` and `pytorch_model.bin` in the HuggingFace pretrained folder (#1974)
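For context on the encoder-decoder change: when `decoder_input_ids` are absent, the standard HuggingFace convention is to derive them by right-shifting the labels. A minimal sketch of that convention (the token IDs below are placeholders, not values from Composer):

```python
# Sketch of the standard HuggingFace convention for deriving decoder_input_ids
# when a batch only contains labels: right-shift the labels and prepend the
# decoder start token. Token IDs below are placeholders, not Composer values.

def shift_labels_right(labels, decoder_start_token_id, pad_token_id):
    shifted = [decoder_start_token_id] + labels[:-1]
    # Masked label positions (-100) must become real pad tokens as inputs.
    return [pad_token_id if tok == -100 else tok for tok in shifted]

labels = [42, 7, 13, -100]
decoder_input_ids = shift_labels_right(labels, decoder_start_token_id=0, pad_token_id=1)
assert decoder_input_ids == [0, 42, 7, 13]
```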
- **🛟 Nvidia H100 Alpha Support - Added `amp_fp8` data type**

  In preparation for H100's arrival, we've added the `amp_fp8` precision type. Currently, setting `amp_fp8` specifies a new precision context using `transformer_engine.pytorch.fp8_autocast`. For more details, please see Nvidia's new Transformer Engine and the specific fp8 recipe we utilize.

  ```python
  from composer import Trainer

  trainer = Trainer(
      ...,
      precision='amp_fp8',
  )
  ```
API changes
- The `torchmetrics` package has been upgraded to 0.11.x.

  The `torchmetrics.Accuracy` metric now requires a `task` argument, which can take on a value of `binary`, `multiclass`, or `multilabel`. Please see the Torchmetrics Accuracy docs for details.

  Additionally, since specifying `task='multiclass'` requires an additional `num_classes` field, we've had to update `ComposerClassifier` to accept the additional `num_classes` argument. Please see PRs #2017 and #2025 for additional details.

- Surgery algorithms used in functional form return a value of `None` (#1543)
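To illustrate the distinction the new `task` argument makes explicit, here is what binary vs. multiclass accuracy compute, in plain Python (a conceptual sketch, not the torchmetrics implementation):

```python
# Conceptual sketch (not the torchmetrics implementation) of what the new
# `task` argument selects: binary accuracy thresholds probabilities at 0.5,
# while multiclass accuracy takes an argmax over `num_classes` scores.

def binary_accuracy(probs, targets, threshold=0.5):
    preds = [int(p >= threshold) for p in probs]
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

def multiclass_accuracy(scores, targets):
    # Each row of `scores` has `num_classes` entries; predict the argmax.
    preds = [row.index(max(row)) for row in scores]
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

assert binary_accuracy([0.9, 0.2, 0.7, 0.4], [1, 0, 0, 0]) == 0.75
assert multiclass_accuracy([[0.1, 0.8, 0.1], [0.6, 0.3, 0.1]], [1, 0]) == 1.0
```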
Deprecations
- Deprecate HFCrossEntropy and Perplexity (#1857)
- Remove Jenkins CI (#1943, #1954)
- Change Deprecation Warnings to Warnings for specifying `ProgressBarLogger` and `ConsoleLogger` to `loggers` (#1846)
Bug Fixes
- Fixed an issue introduced in 0.12.1 where `HuggingFaceModel` crashes if `config.return_dict = False` (#1948)
- Refactor EMA to improve memory efficiency (#1941)
- Make wandb checkpoint logging compatible with wandb model registry (#1973)
- Fix ICL race conditions (#1978)
- Update `epoch` metric name to `trainer/epoch` (#1986)
- reset scaler (#1999)
- Bug/sync optimization logger across ranks (#1970)
- Update Docker images to fix resolve vulnerability scan issues (#2007)
- Fix eval duplicate logging issue (#2018)
- extend test and patch bug (#2028)
- Protect for missing slack_sdk import (#2031)
Known Issues
- Docker Image Security Vulnerability
  - CVE-2022-45907: The `mosaicml/pytorch:1.12.1*`, `mosaicml/pytorch:1.11.0*`, `mosaicml/pytorch_vision:1.12.1*`, and `mosaicml/pytorch_vision:1.11.0*` images are impacted and currently supported for legacy use cases. We recommend users upgrade to images with PyTorch >1.13. The affected images will be removed in the next Composer release.
What's Changed
- Raise error if max duration is in epochs and dataloader is infinite by @dakinggg in #1942
- Bump traitlets from 5.8.0 to 5.9.0 by @dependabot in #1946
- Deprecate HFCrossEntropy and Perplexity by @dakinggg in #1857
- Change functional surgery method return values to None by @nik-mosaic in #1543
- Retire Jenkins by @bandish-shah in #1943
- Update MCP GHA Name by @mvpatel2000 in #1951
- update memory monitor by @mvpatel2000 in #1940
- Move ffcv up in test order by @dskhudia in #1953
- Fix memory monitor test by @mvpatel2000 in #1957
- Fix model surgery failure due to functional API change by @nik-mosaic in #1949
- Change how we check for forwards args in models for HF models by @bcui19 in #1955
- add return dict false test and bug fix by @dakinggg in #1948
- remove jenkins ci by @mvpatel2000 in #1954
- add support for enc-dec batches without decoder_input_ids by @dakinggg in #1950
- Refactor EMA to improve memory efficiency by @coryMosaicML in #1941
- Add warning for untrusted checkpoints by @mvpatel2000 in #1959
- permit opt tokenizer by @bmosaicml in #1958
- GHA Docker build flow for PR's by @bandish-shah in #1883
- Update download badge link to pepy by @karan6181 in #1966
- Update python version in setup.py and fixed pypi download badge by @karan6181 in #1969
- allow eval metrics to be passed in to HuggingFaceModel directly by @dakinggg in #1971
- Make wandb checkpoint logging compatible with wandb model registry by @growlix in #1973
- Add support for FP8 on H100 using NVidia's TransformerEngine by @dskhudia in #1965
- Util for writing HuggingFace save_pretrained from a composer checkpoint by @dakinggg in #1974
- Enable sharded checkpoint save and load (support local, sharded, and full state dicts for FSDP) by @eracah in #1902
- Bump custom-inherit from 2.4.0 to 2.4.1 by @dependabot in #1981
- Bump gitpython from 3.1.30 to 3.1.31 by @dependabot in #1982
- Fix ICL race conditions by @dakinggg in #1978
- add...
v0.13.0
This release has been yanked due to a minor packaging issue, please skip directly to Composer v0.13.1
What's Changed
- Raise error if max duration is in epochs and dataloader is infinite by @dakinggg in #1942
- Bump traitlets from 5.8.0 to 5.9.0 by @dependabot in #1946
- Deprecate HFCrossEntropy and Perplexity by @dakinggg in #1857
- Change functional surgery method return values to None by @nik-mosaic in #1543
- Retire Jenkins by @bandish-shah in #1943
- Update MCP GHA Name by @mvpatel2000 in #1951
- update memory monitor by @mvpatel2000 in #1940
- Move ffcv up in test order by @dskhudia in #1953
- Fix memory monitor test by @mvpatel2000 in #1957
- Fix model surgery failure due to functional API change by @nik-mosaic in #1949
- Change how we check for forwards args in models for HF models by @bcui19 in #1955
- add return dict false test and bug fix by @dakinggg in #1948
- remove jenkins ci by @mvpatel2000 in #1954
- add support for enc-dec batches without decoder_input_ids by @dakinggg in #1950
- Refactor EMA to improve memory efficiency by @coryMosaicML in #1941
- Add warning for untrusted checkpoints by @mvpatel2000 in #1959
- permit opt tokenizer by @bmosaicml in #1958
- GHA Docker build flow for PR's by @bandish-shah in #1883
- Update download badge link to pepy by @karan6181 in #1966
- Update python version in setup.py and fixed pypi download badge by @karan6181 in #1969
- allow eval metrics to be passed in to HuggingFaceModel directly by @dakinggg in #1971
- Make wandb checkpoint logging compatible with wandb model registry by @growlix in #1973
- Add support for FP8 on H100 using NVidia's TransformerEngine by @dskhudia in #1965
- Util for writing HuggingFace save_pretrained from a composer checkpoint by @dakinggg in #1974
- Enable sharded checkpoint save and load (support local, sharded, and full state dicts for FSDP) by @eracah in #1902
- Bump custom-inherit from 2.4.0 to 2.4.1 by @dependabot in #1981
- Bump gitpython from 3.1.30 to 3.1.31 by @dependabot in #1982
- Fix ICL race conditions by @dakinggg in #1978
- add map location to huggingface utils by @dakinggg in #1980
- fix log epoch by @mvpatel2000 in #1986
- GHA release workflow, refactor PR and Daily workflows by @bandish-shah in #1968
- Remove python-version input from Daily CPU tests by @bandish-shah in #1989
- Add some logic to pass the correct github ref to mcp script by @bandish-shah in #1990
- Fix typo in docstring for eval with missing space by @mvpatel2000 in #1992
- Fix failing sharded_checkpoint tests that fail when pytorch 1.13 is not installed by @eracah in #1988
- Add merge_group event trigger to GHA daily workflow by @bandish-shah in #1996
- Runtime estimator by @mvpatel2000 in #1991
- Reset scaler state by @mvpatel2000 in #1999
- Speed monitor refactor by @mvpatel2000 in #1987
- Test hf fsdp by @dakinggg in #1972
- Bug/sync optimization logger across ranks by @bmosaicml in #1970
- Fix optimizer monitor test gating with FSDP by @mvpatel2000 in #2000
- Low precision groupnorm by @mvpatel2000 in #1976
- Bump coverage[toml] from 7.1.0 to 7.2.1 by @dependabot in #2008
- Update docs to include runtime estimator by @mvpatel2000 in #2009
- Tag surgery algorithms LPLN and LPGN by @mvpatel2000 in #2011
- Update SpeedMonitor short-description for docs table by @mvpatel2000 in #2010
- Update Low Precision LayerNorm arguments by @nik-mosaic in #1994
- Medical Segmentation Example Typo by @mvpatel2000 in #2014
- Update wallclock logging to default hours by @mvpatel2000 in #2005
- Add HealthChecker Callback by @hanlint in #2002
- Allow FX graph mode post-training dynamic quantisation of BlurConv2d operations. by @BrettRyland in #1995
- Add multi-gpu testing to test_algorithm_resumption by @eracah in #2016
- Add backwards compatible checkpoint loading for EMA by @coryMosaicML in #2012
- fsdp with custom process groups by @vchiley in #2006
- Patch Speed Monitor MFU by @mvpatel2000 in #2013
- Remove runtime estimator state dict by @mvpatel2000 in #2015
- Update Docker images to fix resolve vulnerability scan issues by @bandish-shah in #2007
- Change Deprecation Warnings to Warnings for specifying ProgressBarLogger and ConsoleLogger to loggers by @eracah in #1846
- Fix eval duplicate logging issue by @mvpatel2000 in #2018
- Add workflow_dispatch trigger to pr-docker workflow by @bandish-shah in #2019
- Bump streaming version to less than 0.4.0 by @karan6181 in #2020
- Upgrade ipython installed in Docker images by @bandish-shah in #2021
- Upgrade torchmetrics by @nik-mosaic in #2017
- Complete upgrade of torchmetrics accuracy by @nik-mosaic in #2025
- Bump version to v0.13.0 by @bandish-shah in #2024
New Contributors
- @BrettRyland made their first contribution in #1995
Full Changelog: v0.12.1...v0.13.0
v0.12.1
🚀 Composer v0.12.1
Composer v0.12.1 is released! Install via `pip`:

```bash
pip install --upgrade mosaicml==0.12.1
```
New Features
- **📚 In-Context Learning (#1876)**

  With Composer and MosaicML Cloud you can now evaluate LLMs on in-context learning tasks (LAMBADA, HellaSwag, PIQA, and more) hundreds of times faster than other evaluation harnesses. Please see our "Blazingly Fast LLM Evaluation for In-Context Learning" blog post for more details!
- **💾 Added support for Coreweave Object Storage (#1915)**

  Coreweave object store is compatible with `boto3`. Uploading objects to Coreweave object store is almost exactly like writing to S3, except an `endpoint_url` must be set via the `S3_ENDPOINT_URL` environment variable. For example:

  ```python
  import os
  os.environ['S3_ENDPOINT_URL'] = 'https://object.las1.coreweave.com'

  from composer.trainer import Trainer

  # Save checkpoints every epoch to s3://my_bucket/checkpoints
  trainer = Trainer(
      model=model,
      train_dataloader=train_dataloader,
      max_duration='10ep',
      save_folder='s3://my_bucket/checkpoints',
      save_interval='1ep',
      save_overwrite=True,
      save_filename='ep{epoch}.pt',
      save_num_checkpoints_to_keep=0,  # delete all checkpoints locally
  )

  trainer.fit()
  ```
Please see our checkpointing documentation for more details.
- **🪵 Automatic logging of Trainer hparams (#1855)**

  Hyperparameter arguments passed to the `Trainer` are now automatically logged. Simply set the `Trainer` argument `auto_log_hparams=True`.
Bug Fixes
- Update Docker images to use ‘posix_prefix’ paths (#1854)
- Disable new notebook in CI (#1875)
- [Fix] Enable logging of metrics from Callbacks to ConsoleLogging (#1884)
- Ensure loggers run init event before callbacks in Engine (#1890)
- Raise an error in FSDP meta tensor initialization if there's no initialization functions, fix associated flaky FSDP test (#1905)
- Add primitive list support (#1906)
- Add logic for shifting labels before computing metrics (#1913)
- Fixes mis specified dependency (#1919)
- pin setuptools in build requirements (#1926)
- Pin pip<23 in Docker images (#1936)
- Fix bug in trainer.eval and add test cases for test_console_logger (#1937)
What's Changed
- Rename GradMonitor -> OptimizerMonitor; add functionality to log optimizer-specific metrics to assist loss spike investigation by @bmosaicml in #1743
- Add GCS uri support for loading and saving checkpoints by @eracah in #1833
- HF factory function tests by @dakinggg in #1832
- Fix doc issue, Trainer hparam log_to_console defaults to False by @eracah in #1840
- Removed YAHP references from Docs by @bandish-shah in #1841
- Typo by @nguyenhoan1988 in #1843
- Fix source code links in docs by @bandish-shah in #1844
- add importorskip by @dakinggg in #1847
- Update Docker images to use ‘posix_prefix’ paths by @mvpatel2000 in #1854
- Fix typo by @standardAI in #1849
- ConsoleLogger: log first batch and first epoch when using console_log_interval by @eracah in #1860
- Simpler auto log hparams by @eracah in #1855
- Fix typos by @cclauss in #1850
- Bump sphinxext-opengraph from 0.7.3 to 0.7.4 by @dependabot in #1851
- Bump coverage[toml] from 6.5.0 to 7.0.1 by @dependabot in #1853
- Bump traitlets from 5.7.0 to 5.8.0 by @dependabot in #1852
- Bump ipython from 7.32.0 to 8.8.0 by @dependabot in #1865
- Update monai requirement from <0.10,>=0.9.1 to >=0.9.1,<1.2 by @dependabot in #1869
- Bump sphinxcontrib-katex from 0.9.3 to 0.9.4 by @dependabot in #1868
- Bump coverage[toml] from 7.0.1 to 7.0.4 by @dependabot in #1867
- Upgrade docker images to `torch==1.13.1` by @abhi-mosaic in #1863
- add more useful info to state by @dakinggg in #1848
- Feature/lambada evaluator by @bmosaicml in #1845
- multi-node distributed training, submitit & composer integration demo by @YilunKuang in #1753
- Daily tests by @mvpatel2000 in #1870
- Disable new notebook in CI by @mvpatel2000 in #1875
- Update deepspeed by @mvpatel2000 in #1864
- fix fail fast in daily by @mvpatel2000 in #1880
- Fix getting started docs by @mvpatel2000 in #1878
- Speed up test_lm_task_evaluation by @mvpatel2000 in #1879
- Fix unprotected import by @mvpatel2000 in #1874
- add ignore_modules to fsdp by @vchiley in #1877
- Change vision image by @mvpatel2000 in #1881
- Fix eval_forward in the ComposerModel ABC by @eracah in #1871
- Fix fsdp weight tying by @bcui19 in #1856
- Bump pytest from 7.2.0 to 7.2.1 by @dependabot in #1886
- Bump ipykernel from 6.19.2 to 6.20.1 by @dependabot in #1887
- Bump gitpython from 3.1.28 to 3.1.30 by @dependabot in #1888
- Update Vision Image in Pytest by @mvpatel2000 in #1882
- Streaming data tests by @dakinggg in #1842
- Add NLP Algorithms Tests by @nik-mosaic in #1839
- rename HF notebook by @dakinggg in #1873
- Ensure loggers run init event before callbacks in Engine by @eracah in #1890
- [Fix] Enable logging of metrics from Callbacks to ConsoleLogging by @eracah in #1884
- Updating how we load metrics in a state_dict so we don't add extra memory overhead by @bcui19 in #1892
- Getting daily tests passing by @dakinggg in #1893
- Bump nbsphinx from 0.8.10 to 0.8.12 by @dependabot in #1897
- Fix docker image by @mvpatel2000 in #1894
- Add primitive list support by @mvpatel2000 in #1906
- Raise an error in FSDP `meta` tensor initialization if there's no initialization functions, fix associated flaky FSDP test by @bcui19 in #1905
- Gpu Test by @mvpatel2000 in #1907
- Update docker with FFCV fix by @mvpatel2000 in #1908
- Restore GPU tests by @mvpatel2000 in #1909
- Update workflow names by @mvpatel2000 in #1910
- Enable daily gpu tests by @mvpatel2000 in #1911
- Tweak daily GPU tests by @mvpatel2000 in #1912
- Daily GPU Tests -- Change to Git Commit by @mvpatel2000 in #1914
- Add logic for shifting labels before computing metrics by @alextrott16 in #1913
- Add coreweave object store support. by @eracah in #1915
- Fixes mis specified dependency by @dakinggg in #1919
- Bump coverage[toml] from 7.0.4 to 7.1.0 by @dependabot in #1923
- Update importlib-metadata requirement from <6,>=5.0.0 to >=5.0.0,<7 by @dependabot in #1921
- pin setuptools in build requirements by @dakinggg in #1926
- Remove synthetic testing infrastructure for HF/NLP by @dakinggg in #1895
- Add upgrade flags to pip installs by @dakinggg in #1916
- Temporarily pin pip to <23 by @dakinggg in #1930
- add link protection by @mvpatel2000 in #1927
- Cleaning up error checking for FSDP sharding strategies with fp32 precision by @bcui19 in #1925
- Fix mcp script to avoid follow by @mvpatel2000 in #1932
- Emit Eval progress in console logging by @eracah in #1917
- Remove Fused LayerNorm deprecation by @nik-mosaic in https://github.com/mosaicml/comp...