Releases: facebookresearch/xformers

torch.compile support, bug fixes & more

26 Jul 15:41
@lw

Pre-built binary wheels require PyTorch 2.4.0

Added

  • fMHA: PagedBlockDiagonalGappyKeysMask
  • fMHA: heterogeneous queries in triton_splitk
  • fMHA: support for paged attention in flash
  • fMHA: Added a backward pass for merge_attentions
  • fMHA: Added torch.compile support for three attention biases (LowerTriangularMask, LowerTriangularMaskWithTensorBias and BlockDiagonalMask) - some might require PyTorch 2.4
  • fMHA: Added torch.compile support in memory_efficient_attention when the flash operator is passed explicitly (e.g. memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp)))
  • fMHA: memory_efficient_attention now expects its attn_bias argument to be on the same device as the other input tensors. Previously, it would convert the bias to the right device itself.
  • fMHA: AttentionBias subclasses are now constructed on the CUDA device by default if one is available - they used to be created on the CPU
  • 2:4 sparsity: Added xformers.ops.sp24.sparsify24_ste for a Straight-Through Estimator (STE), with options to rescale the gradient differently for masked-out and kept values
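The straight-through-estimator entry can be pictured with a plain-PyTorch sketch: keep the 2 largest-magnitude values in every group of 4 in the forward pass, and let the gradient pass straight through with different (hypothetical) scales for masked-out vs. kept positions. This is an illustration only, not the xformers CUDA kernel or its exact API.

```python
import torch

class Sparsify24STE(torch.autograd.Function):
    """Illustrative 2:4 sparsification with a straight-through gradient.

    The two gradient scales are hypothetical stand-ins for the rescaling
    options described in the release notes."""

    @staticmethod
    def forward(ctx, x, grad_scale_masked=1.0, grad_scale_kept=1.0):
        groups = x.reshape(-1, 4)
        # indices of the 2 largest-magnitude entries in each group of 4
        topk = groups.abs().topk(2, dim=-1).indices
        mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, topk, True)
        mask = mask.reshape(x.shape)
        ctx.save_for_backward(mask)
        ctx.scales = (grad_scale_masked, grad_scale_kept)
        return x * mask

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        s_masked, s_kept = ctx.scales
        # straight-through: gradient flows to every entry, with per-position scaling
        grad = torch.where(mask, grad_out * s_kept, grad_out * s_masked)
        return grad, None, None

x = torch.randn(8, 16, requires_grad=True)
y = Sparsify24STE.apply(x, 0.5, 1.0)  # masked-out entries get half the gradient
y.sum().backward()
```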

Improved

  • fMHA: Fixed an out-of-bounds read in the Split-K Triton implementation
  • Profiler: Fixed a bug with modules that take a single tuple as an argument
  • Profiler: Added a manual trigger for a profiling step: create a trigger file in the profiling directory
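A file-based trigger of this kind can be sketched in a few lines of standard-library Python. The file name and exact protocol here are assumptions for illustration, not the actual xformers profiler implementation.

```python
from pathlib import Path

def profiling_triggered(profile_dir: str, trigger_name: str = "trigger") -> bool:
    """Return True (and consume the trigger file) if one exists.

    Hypothetical sketch: the training loop would call this once per step
    and start a profiling step when it returns True."""
    trigger = Path(profile_dir) / trigger_name
    if trigger.exists():
        trigger.unlink()  # consume it so only one step is profiled
        return True
    return False
```

With a scheme like this, profiling a single step from outside the process is just `touch <profile_dir>/trigger` in a shell.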

Removed

  • Removed support for PyTorch versions older than 2.2.0

torch.compile support, bug fixes & more

25 Jul 11:59
@lw

Pre-built binary wheels require PyTorch 2.4.0

Added

  • fMHA: PagedBlockDiagonalGappyKeysMask
  • fMHA: heterogeneous queries in triton_splitk
  • fMHA: support for paged attention in flash
  • fMHA: Added backwards pass for merge_attentions
  • fMHA: Added torch.compile support for 3 biases (LowerTriangularMask, LowerTriangularMaskWithTensorBias and BlockDiagonalMask) - some might require PyTorch 2.4
  • fMHA: Added torch.compile support in memory_efficient_attention when passing the flash operator explicitely (eg memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp)))
  • fMHA: memory_efficient_attention now expects its attn_bias argument to be on the same device as the other input tensor. Previously, it would convert the bias to the right device.
  • fMHA: AttentionBias subclasses are now constructed by default on the cuda device if available - they used to be created on the CPU device
  • 2:4 sparsity: Added xformers.ops.sp24.sparsify24_ste for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values

Improved

  • fMHA: Fixed out-of-bounds reading for Split-K triton implementation
  • Profiler: fix bug with modules that take a single tuple as argument
  • Profiler: Added manual trigger for a profiling step, by creating a trigger file in the profiling directory

Removed

  • Removed support for PyTorch version older than 2.2.0

[v0.0.27] torch.compile support, bug fixes & more

09 Jul 16:35


2:4 sparsity & `torch.compile`-ing memory_efficient_attention

29 Apr 14:40

Pre-built binary wheels require PyTorch 2.3.0

Added

  • [2:4 sparsity] Added support for a Straight-Through Estimator for the sparsify24 gradient (GRADIENT_STE)
  • [2:4 sparsity] sparsify24_like now supports the cuSparseLt backend and the STE gradient
  • Basic support for torch.compile for the memory_efficient_attention operator. This currently only supports Flash-Attention without any bias; we want to expand this coverage progressively.
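A minimal sketch of the compile pattern, using plain PyTorch scaled-dot-product attention as a CPU-runnable stand-in for memory_efficient_attention, and the debugging "eager" backend so the sketch runs without a GPU or compiler toolchain. Real usage would compile with the default inductor backend (and needs a GPU for the Flash-Attention path):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # plain scaled-dot-product attention standing in for
    # xformers.ops.memory_efficient_attention in this sketch
    return F.scaled_dot_product_attention(q, k, v)

# backend="eager" keeps this runnable anywhere; drop it for real use
compiled_attention = torch.compile(attention, backend="eager")

q = k = v = torch.randn(1, 2, 8, 16)  # (batch, heads, seq_len, head_dim)
out = compiled_attention(q, k, v)
```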

Improved

  • merge_attentions no longer needs inputs to be stacked.
  • fMHA: triton_splitk now supports additive bias
  • fMHA: benchmark cleanup
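The additive-bias support mentioned above boils down to adding a bias tensor to the scaled q·kᵀ scores before the softmax. A few lines of plain PyTorch show the semantics (shapes are illustrative; this is not the triton_splitk kernel itself):

```python
import torch

torch.manual_seed(0)
q = torch.randn(1, 4, 16)    # (batch, queries, head_dim)
k = torch.randn(1, 6, 16)    # (batch, keys, head_dim)
v = torch.randn(1, 6, 16)
bias = torch.randn(1, 4, 6)  # one additive score per (query, key) pair

# additive bias: added to the scaled q @ k^T scores before the softmax
scores = q @ k.transpose(-1, -2) / 16 ** 0.5 + bias
out = scores.softmax(dim=-1) @ v
```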

`v0.0.25.post1`: Building binaries for PyTorch 2.2.2

29 Mar 14:05
7fffd3d

Pre-built binary wheels require PyTorch 2.2.2

2:4 sparsity, fused sequence parallel, torch compile & more

31 Jan 08:42

Pre-built binary wheels require PyTorch 2.2.0

Added

  • Added components for model/sequence parallelism, as near-drop-in replacements for the FairScale/Megatron ColumnParallelLinear and RowParallelLinear modules. They support fusing communication and computation for sequence parallelism, making the communication effectively free.
  • Added kernels for training models with 2:4 sparsity. We introduced a very fast kernel for converting a matrix A into 2:4-sparse format, which can be used during training to dynamically sparsify weights, activations, etc. xFormers also provides an API that is compatible with torch.compile, see xformers.ops.sparsify24.
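To make the column-parallel idea concrete, here is a single-process sketch with two hypothetical ranks: each rank holds a shard of the weight's output columns, and concatenating the shard outputs reproduces the full linear. The actual components distribute the shards and fuse the communication step into the matmul, which this sketch does not attempt.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)  # (tokens, in_features)
w = torch.randn(8, 6)  # (in_features, out_features)

full = x @ w

# column-parallel: split the output features across 2 hypothetical ranks
shards = w.chunk(2, dim=1)             # two (8, 3) weight shards
partials = [x @ s for s in shards]     # each rank computes its own columns
gathered = torch.cat(partials, dim=1)  # the all-gather step, done locally here
```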

Improved

  • Made selective activation checkpointing compatible with torch.compile.

Removed

  • Triton kernels now require a GPU with compute capability 8.0 or higher (A100 or newer), as newer versions of Triton do not correctly support older GPUs
  • Removed support for PyTorch versions older than 2.1.0

Binary builds for PyTorch 2.1.2

15 Dec 12:14

Binary wheels and conda binary builds for PyTorch 2.1.2.
Users who need a previous version of PyTorch can either:

  • Install a previous version of xFormers
  • Build from source

Bugfixes/improvements in `memory_efficient_attention`

06 Dec 16:05

Pre-built binary wheels require PyTorch 2.1.1

Fixed

  • fMHA: Fixed a bug in the cutlass backend's forward pass where the logsumexp was not correctly calculated, resulting in wrong results in the backward pass. This would happen with MQA when one sequence has a query whose length satisfies length % 64 == 1
  • fMHA: Updated Flash-Attention to v2.3.6 - this fixes a performance regression in causal backward passes and adds support for BlockDiagonalCausalWithOffsetPaddedKeysMask

Added

  • fMHA: Added LocalAttentionFromBottomRightMask (local)
  • fMHA: Added LowerTriangularFromBottomRightMask (causal)
  • fMHA: Added LowerTriangularFromBottomRightLocalAttentionMask (local + causal)
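The "from bottom right" naming means the mask is aligned so that the last query lines up with the last key: with Q queries and K keys, query i may attend key j when j - i <= K - Q. A boolean sketch of the causal variant (the real classes produce additive biases with these semantics; this helper is illustrative):

```python
import torch

def causal_from_bottom_right(num_queries: int, num_keys: int) -> torch.Tensor:
    """Boolean mask where True means "query i may attend key j".

    When num_queries == num_keys this is the usual lower-triangular causal
    mask; with more keys than queries the diagonal shifts right so the
    last query can see every key."""
    i = torch.arange(num_queries).unsqueeze(1)
    j = torch.arange(num_keys).unsqueeze(0)
    return j - i <= num_keys - num_queries

mask = causal_from_bottom_right(2, 4)
```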

Removed

  • Removed xformers.triton.sum_strided

[0.0.22.post7] Wheels for Flash-Attention on windows [cu121]

25 Oct 12:54

We also added support for cu118/cu121 - we will update the README once the wheels are ready

[0.0.22.post4] Build binaries for pytorch 2.1.0 / cuda12.1

13 Oct 16:41
16e4245

This release also adds back support for Flash-Attention on Windows (only for the cuda 12.1 build). For now, the wheels won't include Flash-Attention on Windows, as we first have some issues to fix in our CI (hopefully done in about a week).