Releases: mosaicml/composer

v0.6.1

06 May 02:25

🚀 Composer v0.6.1

Composer v0.6.1 is released!

Go ahead and upgrade; it's fully backwards compatible with Composer v0.6.0.

Install via pip:

pip install --upgrade mosaicml==0.6.1

Alternatively, install Composer with Conda:

conda install -c mosaicml mosaicml=0.6.1

What's New?

  1. 📎 Adaptive Gradient Clipping (AGC)

    Adaptive Gradient Clipping (AGC) clips gradients based on the ratio of their norms with weights' norms. This technique helps stabilize training with large batch sizes, especially for models without batchnorm layers.
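The core rule can be sketched in plain Python (this is an illustrative simplification over whole-parameter norms, not Composer's unit-wise implementation; the `clipping` and `eps` values are hypothetical defaults):

```python
import math

def agc_clip(weights, grads, clipping=0.01, eps=1e-3):
    """Scale each gradient so its norm is at most `clipping` times the
    corresponding weight norm (simplified whole-parameter AGC sketch)."""
    clipped = []
    for w, g in zip(weights, grads):
        w_norm = max(math.sqrt(sum(x * x for x in w)), eps)
        g_norm = math.sqrt(sum(x * x for x in g))
        max_norm = clipping * w_norm
        if g_norm > max_norm:
            # Rescale the gradient down to the allowed norm.
            scale = max_norm / g_norm
            g = [x * scale for x in g]
        clipped.append(g)
    return clipped
```

Gradients whose norm is already small relative to the weights pass through unchanged; only outsized gradients are rescaled.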

  2. 🚚 Exponential Moving Average (EMA)

    Exponential Moving Average (EMA) is a model averaging technique that maintains an exponentially weighted moving average of the model parameters during training. The averaged parameters are used for model evaluation. EMA typically results in less noisy validation metrics over the course of training, and sometimes increased generalization.
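The update rule itself is simple; a minimal sketch (not Composer's API; the class name and `decay` default are illustrative):

```python
class EMA:
    """Maintain an exponentially weighted moving average of parameters."""

    def __init__(self, params, decay=0.99):
        self.decay = decay
        self.shadow = list(params)  # averaged copy, used for evaluation

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current
        d = self.decay
        self.shadow = [d * s + (1 - d) * p for s, p in zip(self.shadow, params)]
```

After each optimizer step, `update` folds the latest parameters into the running average; evaluation then reads from `shadow` instead of the live parameters.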

  3. 🪵 Logger is available in the ComposerModel

    The Logger is bound to the ComposerModel via the self.logger attribute. During training, it is available in all methods other than __init__.

    For example, to log hidden activation:

    class Net(ComposerModel):

        def forward(self, x):
            x = F.relu(F.max_pool2d(self.conv1(x), 2))
            x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
            if self.logger:
                # Log the norm of the hidden activation for this batch.
                self.logger.data_batch({
                    "hidden_activation_norm": x.norm(2).item(),
                })
            x = x.view(-1, 320)
            x = F.relu(self.fc1(x))
            x = F.dropout(x, training=self.training)
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)
  4. 🐛 Environment Collection Script

    Composer v0.6.1 includes an environment collection script which generates a printout of your system configuration and python environment. If you run into a bug, the results from this script will help us debug the issue and fix Composer.

    To collect your environment information:

    $ pip install mosaicml  # if composer is not already installed
    $ composer_collect_env

    Then, include the output in your GitHub Issue.

What's Improved?

  1. 📜 TorchScriptable Algorithms

    BlurPool, Ghost BatchNorm, and Stochastic Depth are now TorchScript-compatible. Try exporting your models with these algorithms enabled!

  2. 🏛️ ColOut on Segmentation

    ColOut now supports segmentation-style models.

What's Fixed?

  1. 🚑️ Loggers capture the Traceback

    We fixed a bug so the Loggers, such as the Weights & Biases Logger and the File Logger, will capture the traceback of any exception that crashes the training process.

  2. 🏋️ Weights & Biases Logger Config

    We fixed a bug where the Weights & Biases Logger was not properly recording the configuration.

Full Changelog

v0.6.0...v0.6.1

v0.6.0

21 Apr 01:49

🚀 Composer v0.6.0

Composer v0.6.0 is released! Install via pip:

pip install --upgrade mosaicml==0.6.0

Alternatively, install Composer with Conda:

conda install -c mosaicml mosaicml=0.6.0

Major Changes

  1. 🗃️ Automatic Gradient Accumulation

    Composer v0.6.0 can automatically pick an appropriate value for gradient accumulation. The trainer will automatically catch OutOfMemory exceptions and handle them gracefully. No need to manually tune this parameter for each model, batch size, and hardware combination!

    To use automatic gradient accumulation, set grad_accum='auto'. For example:

    trainer = Trainer(
        ...,
        grad_accum='auto',
    )
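Conceptually, the mechanism resembles a retry loop: if a micro-batch does not fit in memory, split the batch into more accumulation steps and try again. A minimal sketch (not Composer's actual implementation; the function names and the `max_grad_accum` cap are hypothetical):

```python
def train_with_auto_grad_accum(run_batch, grad_accum=1, max_grad_accum=16):
    """Retry a training step, doubling grad_accum on out-of-memory errors."""
    while grad_accum <= max_grad_accum:
        try:
            run_batch(grad_accum)
            return grad_accum  # this setting fits in memory
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise  # unrelated error: re-raise
            grad_accum *= 2  # split the batch into more micro-batches
    raise RuntimeError("could not find a grad_accum value that fits in memory")
```

Because the per-device micro-batch shrinks as grad_accum grows, the loop converges on the smallest accumulation factor that fits.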
  2. 💾 Artifact Logging

    Training on spot instances? Composer v0.6.0 introduces artifact logging, making it possible to store checkpoints and other artifacts directly to cloud storage. See the Object Store Logger and the Checkpointing Guide for more information.

    Artifact Logging has replaced the run directory and the run directory uploader, which have been removed.

  3. 📊 Metric Values on the State

    Composer v0.6.0 binds the computed metric values to the State. Go ahead and read these values from your own callbacks! We'll be releasing an early stopping callback in an upcoming Composer release.

  4. ⚠️ NoEffectWarning and NotIntendedUseWarning for Algorithms

    Some algorithms, such as BlurPool, now emit a NoEffectWarning or a NotIntendedUseWarning when they're not being used appropriately.
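The pattern is standard Python warning categories; a hedged sketch of how such a check might look (the class and function names here are illustrative, not Composer's actual definitions):

```python
import warnings

class NoEffectWarning(UserWarning):
    """Illustrative stand-in for an algorithm no-effect warning category."""

def apply_blurpool_like(model_has_pooling: bool) -> None:
    # A model-surgery algorithm that finds nothing to modify should warn
    # rather than silently doing nothing.
    if not model_has_pooling:
        warnings.warn(
            "No pooling layers found; the algorithm has no effect.",
            NoEffectWarning,
        )
```

Callers can escalate these to errors with the standard warnings filters if they want misconfigured runs to fail fast.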

Minor Improvements

  1. 🏃‍♀️ Training Run Names

    We introduced a run_name parameter in the Trainer to help organize training runs.

    trainer = Trainer(
        ...,
        run_name='awesome-training-run',
    )

    We'll automatically pick one if the run name is not specified.

  2. 💈 Automatic Progress Bars

    The ProgressBarLogger, formerly called the TQDMLogger, is automatically enabled for all training runs.

    To disable the progress bar, set progress_bar=False. For example:

    trainer = Trainer(
        ...,
        progress_bar=False,
    )
  3. 🪵 Logged Data in the Console

    To print Logger calls to the console, set the log_to_console and the console_log_level arguments.

    trainer = Trainer(
        ...,
        log_to_console=True,
        console_log_level="epoch",
    )

    By default, the console logger will only be enabled when progress_bar=False. The default console log level is epoch.

  4. 📃 Capturing stdout and stderr in Log Files

    The FileLogger now captures stdout and stderr by default, so tracebacks are recorded alongside other logging statements.

  5. ⬆️ PyTorch 1.11 Support

    We've tested Composer on PyTorch 1.11. Go ahead and upgrade your dependencies!

  6. ✅ Checkpointing

    We changed the checkpoint format to store the underlying model, not the DistributedDataParallel wrapped model. If you're using Composer to read checkpoints, there's nothing to change. But if you're reading Composer checkpoints manually, note that the module checkpoints will be formatted differently.
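For readers consuming checkpoints manually, the practical difference is the module. prefix that DistributedDataParallel adds to parameter names. A minimal sketch of normalizing keys (an illustrative helper, not part of Composer's API):

```python
def strip_ddp_prefix(state_dict, prefix="module."):
    """Remove the DistributedDataParallel key prefix so the state dict
    matches the underlying (unwrapped) model."""
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }
```

Keys without the prefix (e.g. buffers saved outside the wrapper) pass through unchanged.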

    In addition, we changed the checkpointing argument names for the trainer.

    • The new parameters save_artifact_name and save_latest_artifact_name allow checkpoints to be saved directly to artifact stores.
    • The new parameter save_num_checkpoints_to_keep helps preserve local disk storage by automatically removing old checkpoints.
    • load_path replaces load_path_format.
    • save_name replaces save_path_format.
    • save_latest_filename replaces save_latest_format.
  7. 🏎️ Profiling

    We added support for custom scheduling functions and re-designed how the profiler saves traces. Each profiling cycle will now have its own trace file. Trace merging happens automatically throughout the training process. Long-running profiling is now possible without the long wait at the end of training for the trace merge.

    As part of this refactor, the profiler arguments have changed:

    • prof_trace_handlers replaces prof_event_handlers.
    • prof_schedule replaces prof_skip_first, prof_wait, prof_warmup, prof_active, and prof_repeat. See the cyclic schedule function.
    • torch_prof_folder replaces torch_profiler_trace_dir.
    • The new arguments torch_prof_filename, torch_prof_artifact_name, torch_prof_overwrite, and torch_prof_num_traces_to_keep allow for customization on how PyTorch Profiler traces are saved.
  8. 🏗️ TorchVision Model Architectures

    We switched our vision models to use the TorchVision model architecture implementations where possible.

Bug Fixes

  • Fixed a bug with MixUp and gradient accumulation
  • Fixed numerous issues with the Composer launch script for distributed training. Composer v0.6.0 includes environment variable support, better defaults and warnings, and proper handling of crashed processes.

Changelog


Release version v0.5.0

16 Mar 14:02

We are excited to share Composer v0.5, a library of speed-up methods for efficient neural network training. This release features:

  • Revamped checkpointing API based on community feedback
  • New baselines: ResNet34-SSD, GPT-3, and Vision Transformers
  • Additional improvements to our documentation
  • Support for bfloat16
  • Streaming dataset support
  • Unified functional API for our algorithms

Highlights

Checkpointing API

Checkpointing is now implemented as a Callback, so users can easily write and add their own callbacks. The callback is automatically appended if a save_folder is provided to the Trainer.

trainer = Trainer(
    model=model,
    algorithms=algorithms,
    save_folder="checkpoints",
    save_interval="1ep"
)

Alternatively, CheckpointSaver can be directly added as a callback:

trainer = Trainer(..., callbacks=[
    CheckpointSaver(
        save_folder='checkpoints',
        name_format="ep{epoch}-ba{batch}/rank_{rank}",
        save_latest_format="latest/rank_{rank}",
        save_interval="1ep",
        weights_only=False,
    )
])

Subclass from CheckpointSaver to add your own logic for saving the best model, or saving at specific intervals. Thanks to @mansheej @siriuslee and other users for their feedback.
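The placeholders in name_format expand like Python's str.format. A quick illustration with hypothetical values:

```python
# Expand a checkpoint name template the way str.format does.
name_format = "ep{epoch}-ba{batch}/rank_{rank}"
path = name_format.format(epoch=2, batch=512, rank=0)
# path is now "ep2-ba512/rank_0"
```

Each rank writes to its own file, so distributed runs do not clobber one another's checkpoints.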

bfloat16

We've added experimental support for bfloat16, which can be provided via the precision argument to the Trainer:

trainer = Trainer(
    ...,
    precision="bfloat16"
)

Streaming datasets

We've added support for fast streaming datasets. For NLP-based datasets such as C4, we use the HuggingFace datasets backend, and add dataset-specific shuffling, tokenization, and grouping on-the-fly. To support data parallel training, we added specific sharding logic for efficiency. See C4Datasets for more details.

Vision streaming datasets are supported via a patched version of the webdatasets package, with added support for data sharding by workers for fast augmentations. See composer.datasets.webdataset for more details.

Baseline GPT-3, ResNet34-SSD, and Vision Transformer benchmarks

Configurations for GPT-3-like models ranging from 125M to 760M parameters are now released; they use DeepSpeed ZeRO Stage 0 for memory-efficient training.

We've also added the Single Shot Detection (SSD) model (Liu et al., 2016) with a ResNet34 backbone, based on the MLPerf reference implementation.

Our first Vision Transformer benchmark is the ViT-S/16 model from Touvron et al., 2021, and is based on the vit-pytorch package.

See below for the full details:

What's Changed


Release Version 0.4.0

01 Mar 02:34

What's Changed


Release Version 0.3.1

01 Dec 00:27
d17e69f

Hotfix

Hotfix for the installation of the composer package

Release Version 0.3.0

30 Nov 01:30
08944ed

Release PR

Major Changes

  • Python 3.7 Compatibility
  • Adds CutMix Method
  • New Pre-Fork DDP entrypoint

Minor Changes

  • Lazy-Loading of dependencies
  • General Docs updates for readability and correctness
  • DDP Port auto-selection by default (no more conflicting ports upon reuse of trainer)
  • Small bug fixes for YAHP inheritance

Notes

  • Google Colab may have issues installing composer with !pip install mosaicml
    • Known workaround: Install through git with !pip install git+https://github.com/mosaicml/composer@main