v0.9.0

@KuuCi KuuCi released this 08 Jun 04:58

🚀 LLM Foundry v0.9.0

New Features

More Token Encoding Types (#1254)

We've expanded the ways token IDs can be encoded by supporting uint32 and uint16 formats, which saves significant space for datasets with smaller vocab sizes. We also extended ndarray type support for MDS dataset columns to the generic text dataset and updated the conversion scripts accordingly.
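
To get a feel for the space savings, here is a minimal standard-library sketch, not LLM Foundry's actual encoding path. The helper names (`encode_tokens`, `decode_tokens`, `smallest_format`), the sample token IDs, and the 64-bit baseline are all assumptions made for illustration.

```python
import struct

def encode_tokens(token_ids, fmt):
    """Pack token IDs as little-endian fixed-width integers."""
    return struct.pack(f"<{len(token_ids)}{fmt}", *token_ids)

def decode_tokens(data, fmt):
    """Inverse of encode_tokens for the same format character."""
    count = len(data) // struct.calcsize(f"<{fmt}")
    return list(struct.unpack(f"<{count}{fmt}", data))

def smallest_format(vocab_size):
    """Pick the narrowest unsigned format that can hold every token ID."""
    if vocab_size <= 2**16:
        return "H"  # uint16
    if vocab_size <= 2**32:
        return "I"  # uint32
    return "Q"  # uint64

# Hypothetical token IDs from a tokenizer with a 50257-entry vocabulary.
token_ids = [17, 512, 50256, 3, 42]

as_int64 = encode_tokens(token_ids, "q")                      # 8 bytes per token
as_uint16 = encode_tokens(token_ids, smallest_format(50257))  # 2 bytes per token

print(len(as_int64), len(as_uint16))
```

Under these assumptions, a vocabulary that fits in 16 bits needs a quarter of the bytes per token that a 64-bit encoding does, and the IDs round-trip unchanged through `decode_tokens`.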

Enforced Stricter Configs (#1254, #1225, #1202)

We've implemented stricter enforcement on our Train and Eval configs to keep users from training with invalid configs. Together with numerous other PRs, this gives us stronger error handling to help users use LLM Foundry smoothly.

Previously, this was allowed:

parameters:
  train_dataloader:
    ...
    seed: ${global_seed}
    random_other_key_that's_not_in_the_dataloader_constructor # this is no longer allowed
  ...
  global_seed: 17 # this is also no longer allowed

But we've added a variables section. Please do this instead:

parameters:
  variables:
    global_seed: 42
  ...
  train_dataloader:
    seed: ${variables.global_seed}

Chunked Text to MDS Conversion (#1240)

We've updated our text to MDS conversion script to convert files to MDS in chunks. This protects us from loading entire large files at once (potentially causing OOMs), and drastically speeds up converting long sequences.
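
The general pattern of chunked reading can be sketched as follows. This is a simplified illustration, not the conversion script itself: `iter_chunks` and the tiny `chunk_size` are hypothetical, and an in-memory stream stands in for a large text file.

```python
import io

def iter_chunks(stream, chunk_size):
    """Yield successive pieces of at most chunk_size characters."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk

# An in-memory stream stands in for a large file opened with open(path).
stream = io.StringIO("abcdefghij")
chunks = list(iter_chunks(stream, 4))
print(chunks)
```

Because only one chunk is held at a time, peak memory depends on `chunk_size` rather than on the size of the file. A real converter would tokenize and write each chunk inside the loop instead of collecting them in a list.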

Breaking Changes and Deprecations

What's Changed

Full Changelog: v0.8.0...v0.9.0