Releases: huggingface/text-generation-inference

v2.2.0

23 Jul 16:30

Notable changes

  • Llama 3.1 support (including 405B), with FP8 support in many mixed configurations (FP8, AWQ, GPTQ, FP8+FP16).
  • Gemma2 softcap support.
  • DeepSeek V2 support.
  • Lots of internal reworks/cleanup, paving the way for upcoming features.
  • Lots of AWQ/GPTQ work with Marlin kernels (everything should be faster by default).
  • Flash decoding support: set the FLASH_DECODING=1 environment variable to opt in (see the sketch below); this should enable further improvements in future releases.
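
A minimal sketch of opting in, assuming a Docker deployment in the same style as the Command R+ example further down; the model id and image tag are illustrative placeholders:

model=meta-llama/Meta-Llama-3.1-8B-Instruct # placeholder model id
volume=$PWD/data # share a volume with the Docker container to avoid re-downloading weights

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data -e FLASH_DECODING=1 ghcr.io/huggingface/text-generation-inference:2.2.0 --model-id $model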

What's Changed

New Contributors

Full Changelog: v2.1.1...v2.2.0

v2.1.1

04 Jul 10:43

Main changes

  • Bugfixes
  • Added FlashDecoding support (beta): set FLASH_DECODING=1 to run TGI with flash decoding (large speedups on long queries). #1940
  • Marlin kernels are now used over the default GPTQ kernels for faster GPTQ inference (see the sketch below). #2111
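
A minimal sketch of a GPTQ deployment picking up the Marlin path, assuming a GPTQ-quantized checkpoint; the model id is an illustrative placeholder, and with #2111 supported GPTQ configurations are routed to the Marlin kernels automatically when --quantize gptq is set:

model=TheBloke/Llama-2-7B-GPTQ # placeholder GPTQ checkpoint
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.1.1 --model-id $model --quantize gptq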

What's Changed

  • Fixing the CI to also run in release when it's a tag? by @Narsil in #2138
  • fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_… by @sywangyi in #2148
  • Fixing clippy. by @Narsil in #2149
  • fix: use weights from base_layer by @drbh in #2141
  • feat: download lora adapter weights from launcher by @drbh in #2140
  • Use GPTQ-Marlin for supported GPTQ configurations by @danieldk in #2111
  • fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' by @icyxp in #2123
  • refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform by @sywangyi in #2132
  • fix: prefer serde structs over custom functions by @drbh in #2127
  • Fixing test. by @Narsil in #2152
  • GH router. by @Narsil in #2153
  • Fixing baichuan override. by @Narsil in #2158
  • [Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. by @Narsil in #1940
  • Fixing graph capture for flash decoding. by @Narsil in #2163
  • fix FlashDecoding change's regression in intel platform by @sywangyi in #2161
  • fix: use the base layers weight in mistral rocm by @drbh in #2155
  • Fixing rocm. by @Narsil in #2164
  • Ci test by @glegendre01 in #2124
  • Hotfixing qwen2 and starcoder2 (which also get clamping). by @Narsil in #2167
  • feat: improve update_docs for openapi schema by @drbh in #2169
  • Fixing the dockerfile warnings. by @Narsil in #2173
  • Fixing missing object field for regular completions. by @Narsil in #2175

New Contributors

Full Changelog: v2.1.0...v2.1.1

v2.1.0

28 Jun 06:26

Notable changes

  • New models: Gemma2.

  • Multi-LoRA adapters: you can now run multiple LoRAs on the same TGI deployment (see the sketch after this list). #2010

  • Faster GPTQ inference and Marlin support (up to 2x speedup).

  • Reworked the entire scheduling logic (better block allocation, enabling further speedups in future releases).

  • Lots of ROCm support and bugfixes.

  • Lots of new contributors! Thanks a lot for these contributions.
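
A minimal sketch of a multi-LoRA deployment per #2010; the base model and adapter ids are illustrative placeholders, and the LORA_ADAPTERS variable plus the adapter_id request parameter are the multi-LoRA entry points as I understand them:

model=mistralai/Mistral-7B-v0.1 # placeholder base model
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data -e LORA_ADAPTERS=some-org/adapter-a,some-org/adapter-b ghcr.io/huggingface/text-generation-inference:2.1.0 --model-id $model

# Select an adapter per request via the adapter_id parameter:
curl 127.0.0.1:8080/generate -X POST -H 'Content-Type: application/json' -d '{"inputs": "Hello", "parameters": {"adapter_id": "some-org/adapter-a"}}'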

What's Changed


v2.0.4

24 May 10:55

Main changes

What's Changed

New Contributors

Full Changelog: v2.0.3...v2.0.4

v2.0.3

16 May 05:05

Important changes

What's Changed

New Contributors

Full Changelog: v2.0.2...v2.0.3

v2.0.2

01 May 07:22

TL;DR

  • New models (Idefics2, Phi-3).
  • Cleaner VLM support in the OpenAI-compatible layer (see the sketch below).
  • Upgraded to PyTorch 2.3.0.
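
A minimal sketch of the OpenAI-compatible VLM request path, assuming an Idefics2 deployment listening on port 8080; the image URL is an illustrative placeholder:

curl 127.0.0.1:8080/v1/chat/completions -X POST -H 'Content-Type: application/json' -d '{"model": "tgi", "messages": [{"role": "user", "content": [{"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}}, {"type": "text", "text": "What is in this image?"}]}], "max_tokens": 64}'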

What's Changed

New Contributors

Full Changelog: v2.0.1...v2.0.2

v2.0.1

18 Apr 15:22

What's Changed

  • feat: improve tools to include name and add tests by @drbh in #1693
  • Update response type for /v1/chat/completions and /v1/completions by @Wauplin in #1747
  • accept list as prompt for OpenAI API by @drbh in #1702 (see the sketch after this list)
  • fix ROCm docker image
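
A minimal sketch of #1702, which lets /v1/completions accept a list of prompts in one request; the port is an illustrative placeholder and the request shape follows the OpenAI completions API:

curl 127.0.0.1:8080/v1/completions -X POST -H 'Content-Type: application/json' -d '{"model": "tgi", "prompt": ["Say hello.", "Say goodbye."], "max_tokens": 16}'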

Full Changelog: v2.0.0...v2.0.1

v2.0.0

12 Apr 16:44

TGI is back to Apache 2.0!

Highlights

  • License was reverted to Apache 2.0.
  • CUDA graphs are now used by default. They improve latency substantially on high-end nodes.
  • Llava-next was added. It is the second multimodal model available on TGI after Idefics.
  • Cohere Command R+ support. TGI is the fastest open-source backend for Command R+.
  • FP8 support.
  • We now share the vocabulary for all Medusa heads, greatly improving latency and memory use.

Try out Command R+ with Medusa heads on 4xA100s with:

model=text-generation-inference/commandrplus-medusa
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --speculate 3 --num-shard 4
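
Once the server is up, requests go through TGI's standard /generate route (the prompt and parameters below are illustrative):

curl 127.0.0.1:8080/generate -X POST -H 'Content-Type: application/json' -d '{"inputs": "Write a short note about speculative decoding.", "parameters": {"max_new_tokens": 64}}'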

What's Changed

New Contributors

Full Changelog: v1.4.5...v2.0.0

v1.4.5

29 Mar 18:18

Highlights

  • DBRX support #1685. See #1679 for how to prompt the model.

What's Changed

Full Changelog: v1.4.4...v1.4.5

v1.4.4

22 Mar 17:45

Highlights

  • CohereForAI/c4ai-command-r-v01 model support

What's Changed

New Contributors

Full Changelog: v1.4.3...v1.4.4