Releases: mosaicml/composer

v0.6.1

06 May 02:25

🚀 Composer v0.6.1

Composer v0.6.1 is released!

Go ahead and upgrade; it's fully backwards compatible with Composer v0.6.0.

Install via pip:

pip install --upgrade mosaicml==0.6.1

Alternatively, install Composer with Conda:

conda install -c mosaicml mosaicml=0.6.1

What's New?

  1. 📎 Adaptive Gradient Clipping (AGC)

    Adaptive Gradient Clipping (AGC) clips gradients based on the ratio of their norms with weights' norms. This technique helps stabilize training with large batch sizes, especially for models without batchnorm layers.
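The core rule can be sketched in plain Python (this is an illustrative simplification over whole-parameter norms, not Composer's unit-wise implementation; the `clipping` and `eps` values are hypothetical defaults):

```python
import math

def agc_clip(weights, grads, clipping=0.01, eps=1e-3):
    """Scale each gradient so its norm is at most `clipping` times the
    corresponding weight norm (simplified whole-parameter AGC sketch)."""
    clipped = []
    for w, g in zip(weights, grads):
        w_norm = max(math.sqrt(sum(x * x for x in w)), eps)
        g_norm = math.sqrt(sum(x * x for x in g))
        max_norm = clipping * w_norm
        if g_norm > max_norm:
            # Rescale the gradient down to the allowed norm.
            scale = max_norm / g_norm
            g = [x * scale for x in g]
        clipped.append(g)
    return clipped
```

Gradients whose norm is already small relative to the weights pass through unchanged; only outsized gradients are rescaled.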

  2. 🚚 Exponential Moving Average (EMA)

    Exponential Moving Average (EMA) is a model averaging technique that maintains an exponentially weighted moving average of the model parameters during training. The averaged parameters are used for model evaluation. EMA typically results in less noisy validation metrics over the course of training, and sometimes increased generalization.
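The update rule itself is simple; a minimal sketch (not Composer's API; the class name and `decay` default are illustrative):

```python
class EMA:
    """Maintain an exponentially weighted moving average of parameters."""

    def __init__(self, params, decay=0.99):
        self.decay = decay
        self.shadow = list(params)  # averaged copy, used for evaluation

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current
        d = self.decay
        self.shadow = [d * s + (1 - d) * p for s, p in zip(self.shadow, params)]
```

After each optimizer step, `update` folds the latest parameters into the running average; evaluation then reads from `shadow` instead of the live parameters.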

  3. 🪵 Logger is available in the ComposerModel

    The Logger is bound to the ComposerModel via the self.logger attribute. During training, it is available in all methods other than __init__.

    For example, to log hidden activation:

    class Net(ComposerModel):

        def forward(self, x):
            x = F.relu(F.max_pool2d(self.conv1(x), 2))
            x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
            if self.logger:
                # Log the norm of the hidden activation for this batch.
                self.logger.data_batch({
                    "hidden_activation_norm": x.norm(2).item(),
                })
            x = x.view(-1, 320)
            x = F.relu(self.fc1(x))
            x = F.dropout(x, training=self.training)
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)
  4. 🐛 Environment Collection Script

    Composer v0.6.1 includes an environment collection script which generates a printout of your system configuration and python environment. If you run into a bug, the results from this script will help us debug the issue and fix Composer.

    To collect your environment information:

    $ pip install mosaicml  # if composer is not already installed
    $ composer_collect_env

    Then, include the output in your GitHub Issue.

What's Improved?

  1. 📜 TorchScriptable Algorithms

    BlurPool, Ghost BatchNorm, and Stochastic Depth are now TorchScript-compatible. Try exporting your models with these algorithms enabled!

  2. 🏛️ ColOut on Segmentation

    ColOut now supports segmentation-style models.

What's Fixed?

  1. 🚑️ Loggers capture the Traceback

    We fixed a bug so the Loggers, such as the Weights & Biases Logger and the File Logger, will capture the traceback of any exception that crashes the training process.

  2. 🏋️ Weights & Biases Logger Config

    We fixed a bug where the Weights & Biases Logger was not properly recording the configuration.

Full Changelog

v0.6.0...v0.6.1

v0.6.0

21 Apr 01:49

🚀 Composer v0.6.0

Composer v0.6.0 is released! Install via pip:

pip install --upgrade mosaicml==0.6.0

Alternatively, install Composer with Conda:

conda install -c mosaicml mosaicml=0.6.0

Major Changes

  1. 🗃️ Automatic Gradient Accumulation

    Composer v0.6.0 can automatically pick an appropriate value for gradient accumulation. The trainer will automatically catch OutOfMemory exceptions and handle them gracefully. No need to manually tune this parameter for each model, batch size, and hardware combination!

    To use automatic gradient accumulation, set grad_accum='auto'. For example:

    trainer = Trainer(
        ...,
        grad_accum='auto',
    )
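Conceptually, the mechanism resembles a retry loop: if a micro-batch does not fit in memory, split the batch into more accumulation steps and try again. A minimal sketch (not Composer's actual implementation; the function names and the `max_grad_accum` cap are hypothetical):

```python
def train_with_auto_grad_accum(run_batch, grad_accum=1, max_grad_accum=16):
    """Retry a training step, doubling grad_accum on out-of-memory errors."""
    while grad_accum <= max_grad_accum:
        try:
            run_batch(grad_accum)
            return grad_accum  # this setting fits in memory
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise  # unrelated error: re-raise
            grad_accum *= 2  # split the batch into more micro-batches
    raise RuntimeError("could not find a grad_accum value that fits in memory")
```

Because the per-device micro-batch shrinks as grad_accum grows, the loop converges on the smallest accumulation factor that fits.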
  2. 💾 Artifact Logging

    Training on spot instances? Composer v0.6.0 introduces artifact logging, making it possible to store checkpoints and other artifacts directly to cloud storage. See the Object Store Logger and the Checkpointing Guide for more information.

    Artifact Logging has replaced the run directory and the run directory uploader, which have been removed.

  3. 📊 Metric Values on the State

    Composer v0.6.0 binds the computed metric values to the State. Go ahead and read these values from your own callbacks! We'll be releasing an early stopping callback in an upcoming Composer release.

  4. ⚠️ NoEffectWarning and NotIntendedUseWarning for Algorithms

    Some algorithms, such as BlurPool, now emit a NoEffectWarning or a NotIntendedUseWarning when they're not being used appropriately.
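The pattern is standard Python warning categories; a hedged sketch of how such a check might look (the class and function names here are illustrative, not Composer's actual definitions):

```python
import warnings

class NoEffectWarning(UserWarning):
    """Illustrative stand-in for an algorithm no-effect warning category."""

def apply_blurpool_like(model_has_pooling: bool) -> None:
    # A model-surgery algorithm that finds nothing to modify should warn
    # rather than silently doing nothing.
    if not model_has_pooling:
        warnings.warn(
            "No pooling layers found; the algorithm has no effect.",
            NoEffectWarning,
        )
```

Callers can escalate these to errors with the standard warnings filters if they want misconfigured runs to fail fast.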

Minor Improvements

  1. 🏃‍♀️ Training Run Names

    We introduced a run_name parameter in the Trainer to help organize training runs.

    trainer = Trainer(
        ...,
        run_name='awesome-training-run',
    )

    We'll automatically pick one if the run name is not specified.

  2. 💈 Automatic Progress Bars

    The ProgressBarLogger, formerly called the TQDMLogger, is automatically enabled for all training runs.

    To disable the progress bar, set progress_bar=False. For example:

    trainer = Trainer(
        ...,
        progress_bar=False,
    )
  3. 🪵 Logged Data in the Console

    To print Logger calls to the console, set the log_to_console and the console_log_level arguments.

    trainer = Trainer(
        ...,
        log_to_console=True,
        console_log_level="epoch",
    )

    By default, the console logger will only be enabled when progress_bar=False. The default console log level is epoch.

  4. 📃 Capturing stdout and stderr in Log Files

    The FileLogger now captures stdout and stderr by default, so tracebacks are recorded alongside other logging statements.

  5. ⬆️ PyTorch 1.11 Support

    We've tested Composer on PyTorch 1.11. Go ahead and upgrade your dependencies!

  6. ✅ Checkpointing

    We changed the checkpoint format to store the underlying model, not the DistributedDataParallel wrapped model. If you're using Composer to read checkpoints, there's nothing to change. But if you're reading Composer checkpoints manually, note that the module checkpoints will be formatted differently.
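For readers consuming checkpoints manually, the practical difference is the module. prefix that DistributedDataParallel adds to parameter names. A minimal sketch of normalizing keys (an illustrative helper, not part of Composer's API):

```python
def strip_ddp_prefix(state_dict, prefix="module."):
    """Remove the DistributedDataParallel key prefix so the state dict
    matches the underlying (unwrapped) model."""
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }
```

Keys without the prefix (e.g. buffers saved outside the wrapper) pass through unchanged.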

    In addition, we changed the checkpointing argument names for the trainer.

    • The new parameters save_artifact_name and save_latest_artifact_name allow checkpoints to be saved directly to artifact stores.
    • The new parameter save_num_checkpoints_to_keep helps preserve local disk storage by automatically removing old checkpoints.
    • load_path replaces load_path_format.
    • save_name replaces save_path_format.
    • save_latest_filename replaces save_latest_format.
  7. 🏎️ Profiling

    We added support for custom scheduling functions and re-designed how the profiler saves traces. Each profiling cycle will now have its own trace file. Trace merging happens automatically throughout the training process. Long-running profiling is now possible without the long wait at the end of training for the trace merge.

    As part of this refactor, the profiler arguments have changed:

    • prof_trace_handlers replaces prof_event_handlers.
    • prof_schedule replaces prof_skip_first, prof_wait, prof_warmup, prof_active, and prof_repeat. See the cyclic schedule function.
    • torch_prof_folder replaces torch_profiler_trace_dir.
    • The new arguments torch_prof_filename, torch_prof_artifact_name, torch_prof_overwrite, and torch_prof_num_traces_to_keep allow for customization on how PyTorch Profiler traces are saved.
  8. 🏗️ TorchVision Model Architectures

    We switched our vision models to use the TorchVision model architecture implementations where possible.

Bug Fixes

  • Fixed a bug with MixUp and gradient accumulation
  • Fixed numerous issues with the Composer launch script for distributed training. Composer v0.6.0 includes environment variable support, better defaults and warnings, and proper handling of crashed processes.

Changelog


Release version v0.5.0

16 Mar 14:02

We are excited to share Composer v0.5, a library of speed-up methods for efficient neural network training. This release features:

  • Revamped checkpointing API based on community feedback
  • New baselines: ResNet34-SSD, GPT-3, and Vision Transformers
  • Additional improvements to our documentation
  • Support for bfloat16
  • Streaming dataset support
  • Unified functional API for our algorithms

Highlights

Checkpointing API

Checkpointing is now implemented as a Callback, so users can easily write and add their own callbacks. The callback is automatically appended if a save_folder is provided to the Trainer.

trainer = Trainer(
    model=model,
    algorithms=algorithms,
    save_folder="checkpoints",
    save_interval="1ep"
)

Alternatively, CheckpointSaver can be directly added as a callback:

trainer = Trainer(..., callbacks=[
    CheckpointSaver(
        save_folder='checkpoints',
        name_format="ep{epoch}-ba{batch}/rank_{rank}",
        save_latest_format="latest/rank_{rank}",
        save_interval="1ep",
        weights_only=False,
    )
])

Subclass from CheckpointSaver to add your own logic for saving the best model, or saving at specific intervals. Thanks to @mansheej @siriuslee and other users for their feedback.
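The placeholders in name_format expand like Python's str.format. A quick illustration with hypothetical values:

```python
# Expand a checkpoint name template the way str.format does.
name_format = "ep{epoch}-ba{batch}/rank_{rank}"
path = name_format.format(epoch=2, batch=512, rank=0)
# path is now "ep2-ba512/rank_0"
```

Each rank writes to its own file, so distributed runs do not clobber one another's checkpoints.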

bfloat16

We've added experimental support for bfloat16, which can be provided via the precision argument to the Trainer:

trainer = Trainer(
    ...,
    precision="bfloat16"
)

Streaming datasets

We've added support for fast streaming datasets. For NLP-based datasets such as C4, we use the HuggingFace datasets backend, and add dataset-specific shuffling, tokenization, and grouping on-the-fly. To support data parallel training, we added specific sharding logic for efficiency. See C4Datasets for more details.

Vision streaming datasets are supported via a patched version of the webdatasets package, with added support for data sharding by workers for fast augmentations. See composer.datasets.webdataset for more details.

Baseline GPT-3, ResNet34-SSD, and Vision Transformer benchmarks

Configurations for GPT-3-like models ranging from 125M to 760M parameters are now released; they use DeepSpeed ZeRO Stage 0 for memory-efficient training.

We've also added the Single Shot Detection (SSD) model (Liu et al., 2016) with a ResNet34 backbone, based on the MLPerf reference implementation.

Our first Vision Transformer benchmark is the ViT-S/16 model from Touvron et al., 2021, and is based on the vit-pytorch package.

See below for the full details:

What's Changed


Release Version 0.4.0

01 Mar 02:34

What's Changed


Release Version 0.3.1

01 Dec 00:27
d17e69f

Hotfix

Hotfix for the installation of the composer package

Release Version 0.3.0

30 Nov 01:30
08944ed

Release PR

Major Changes

  • Python 3.7 Compatibility
  • Adds CutMix Method
  • New Pre-Fork DDP entrypoint

Minor Changes

  • Lazy-Loading of dependencies
  • General Docs updates for readability and correctness
  • DDP Port auto-selection by default (no more conflicting ports upon reuse of trainer)
  • Small bug fixes for YAHP inheritance

Notes

  • Google Colab may have issues installing composer with !pip install mosaicml
    • Known workaround: Install through git with !pip install git+https://github.com/mosaicml/composer@main