Releases: oneapi-src/oneDNN
v3.4.1
This is a patch release containing the following changes to v3.4:
- Fixed an issue with caching and serialization of primitives in deterministic mode (7ed604a)
- Introduced memory descriptor serialization API (4cad420, 929a27a, 9b848c8)
- Fixed incorrect results in fp64 convolution and deconvolution on Intel GPUs based on Xe-LPG architecture (ebe77b5, 0b399ac, d748d64, 9f4f3d5, 21a8cae)
- Fixed incorrect results in reorder with large sizes on Intel CPUs and GPUs (69a111e, 4b72361, 74a343b)
- Reduced creation time for deconvolution primitive on Intel CPUs (bec487e, 1eab005)
- Fixed performance regression in deconvolution on Intel CPUs (fbe5b97, 1dd3c6a)
- Removed dangling symbols from static builds (e92c404, 6f5621a)
- Fixed crash during platform detection on some AArch64-based systems (406a079)
- Fixed performance regression in int8 deconvolution on Intel CPUs (7e50e15)
- Fixed handling of zero points for matmul in verbose logs converter (15c7916)
v3.3.6
This is a patch release containing the following changes to v3.3.5:
- Fixed crash during platform detection on some AArch64-based systems (3e0e69b)
- Improved inner product performance with Arm Compute Library (ACL) (e7abee2, 214fb9e, 8aacc8f)
- Fixed incorrect results in int8 depthwise convolution with post-ops on processors with Intel AVX2 instruction set support (0c922e0)
- Fixed performance regression in fp32 convolution on processors with Intel AVX2 instruction set support (4efc0ad)
v3.4
Performance Optimizations
- Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). These optimizations are now included by default on compatible processors.
- Improved RNN primitive performance with LBR_GRU cell.
- Improved softmax performance on processors with Intel AVX2 or Intel AVX-512 instruction set support.
- Improved fp32 inner product performance on processors with Intel AVX2 instruction set support.
- Improved fp32, fp16, bf16 matmul primitive performance on processors with Intel AVX-512 and Intel AMX instruction set support.
- Improved int8 matmul performance with transposed A tensor.
- Improved performance of resampling primitive on processors with Intel AVX2 instruction set support.
- Improved performance of int8 convolution with post-ops.
- Optimized batch matmul with binary post-op and broadcast masks `1` and `14`.
- Improved the Scaled Dot Product Attention (SDPA) subgraph performance with Graph API.
- Improved performance of subgraphs including `matmul` and `add` operations and mixed int8 and bfloat16 data types with Graph API.
- [experimental] Improved performance of `reduction`, `softmax`, and `layernorm` operations with experimental Graph Compiler backend.
- [experimental] Improved performance for llama2 MLP subgraph with experimental Graph Compiler backend.
- Intel Graphics Products:
- Introduced initial optimizations for Processor Graphics based on Xe2 architecture.
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
- Improved matmul performance for cases relevant to Large Language Models (LLMs) and Transformer-like models.
- Improved convolution performance for cases relevant to the Stable Diffusion model.
- Improved RNN primitive performance.
- Improved pooling forward propagation performance.
- Improved batched matmul performance for cases with 5 dimensions or more.
- AArch64-based Processors:
- Added an option to build oneDNN with macOS Accelerate library to improve performance on Apple silicon.
- Improved reorder primitive performance with Compute Library for the Arm architecture (ACL).
- Improved bf16 inner product primitive performance with ACL.
Functionality
- Introduced GPTQ support to improve Large Language Models (LLMs) performance with compressed weights. Optimized implementation is available for Intel Graphics Products and supports matmul with int8 weight compression.
- Introduced fp8 data type support in primitives and Graph API. Optimized implementation is available for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Introduced support for fp16 and bf16 scale and shift arguments for layer normalization. Optimized implementation is available for Intel Graphics Products.
- [experimental] Introduced unstructured sparsity support for processors with Intel AMX support relying on VCOMPRESS/VPEXPAND instructions.
- Intel Graphics Products:
- Introduced support for Intel Data Center GPU Max 1550VG.
- Introduced PReLU post-op support for inner product and matmul primitives.
Usability
- Added opt-in deterministic mode support. Deterministic mode guarantees that results are bitwise identical between runs in a fixed environment.
- Introduced accumulation mode control.
- Extended oneDNN verbose diagnostics with information on dispatching decisions in convolution and matmul implementations.
- Extended verbose diagnostics for Graph API with information for operation schema check results and pattern matching results.
- Reduced RNN primitive memory consumption on GPUs.
- Added examples demonstrating use of oneDNN Graph API in eager mode use cases.
- Extended tensor constructor in Graph API to support memory allocation and management by the library.
- Introduced new API and environment variable to manage Graph API constant tensor cache capacity.
- Improved the efficiency of pattern matching in Graph API by optimizing pattern registration, reducing pattern numbers, and skipping patterns more wisely.
- Changed default optimization flags for AArch64 builds to `-mcpu=generic` to improve portability.
Validation
- Improved benchdnn performance by optimizing bottlenecks in validation code.
- Introduced `--num-streams` knob in benchdnn to support benchmarking in multi-stream scenarios.
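For example, the new knob combines with existing benchdnn options (a hypothetical invocation; the exact problem descriptor depends on the driver used):

```sh
# Benchmark an fp32 matmul in performance mode across 4 streams.
./benchdnn --matmul --mode=P --num-streams=4 64x128:128x256
```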
Known Limitations
- Intel Data Center GPU Flex Series driver for Windows has an issue resulting in program hangs or crashes when oneDNN primitives are created concurrently.
- int8 concat primitive may produce incorrect results on integrated GPUs with current GPU driver.
- fp32 pooling primitive may produce incorrect results in rare conditions on Intel Data Center GPU Max Series with current GPU driver.
- reorder primitive causes segmentation fault for prime sizes exceeding 2^31 on Intel CPUs.
- fp64 convolution and deconvolution produce incorrect results on integrated graphics in future Intel Core processors (code-named Arrow Lake).
- int8 matmul primitive creation with fp32 bias fails on Intel GPU Flex Series and Intel Arc Graphics.
Breaking Changes
- Updated minimal supported ACL version to 23.11 (was 23.02.1).
Thanks to these Contributors
This release contains contributions from the project core team as well as Alexander Grund @Flamefire, David Svantesson @davsva01, Fadi Arafeh @fadara01, Hugh Delaney @hdelan, Ilya Lavrenov @ilya-lavrenov, Jacob Kahn @jacobkahn, Nathan John Sircombe @nSircombe, Renato Barros Arantes @renato-arantes, Sergey Shalnov @shssf, Sunita Nadampalli @snadampal, and Svetlozar Georgiev @sgeor255. We would also like to thank everyone who asked questions and reported issues.
v3.3.5
This is a patch release containing the following changes to v3.3.4:
v3.4-rc
Performance Optimizations
- Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). These optimizations are now included by default on compatible processors.
- Improved RNN primitive performance with LBR_GRU cell.
- Improved softmax performance on processors with Intel AVX2 or Intel AVX-512 instruction set support.
- Improved fp32 inner product performance on processors with Intel AVX2 instruction set support.
- Improved fp32, fp16, bf16 matmul primitive performance on processors with Intel AVX-512 and Intel AMX instruction set support.
- Improved int8 matmul performance with transposed A tensor.
- Improved performance of resampling primitive on processors with Intel AVX2 instruction set support.
- Improved performance of int8 convolution with post-ops.
- Optimized batch matmul with binary post-op and broadcast masks `1` and `14`.
- Improved the Scaled Dot Product Attention (SDPA) subgraph performance with Graph API.
- Improved performance of subgraphs including `matmul` and `add` operations and mixed int8 and bfloat16 data types with Graph API.
- [experimental] Improved performance of `reduction`, `softmax`, and `layernorm` operations with experimental Graph Compiler backend.
- [experimental] Improved performance for llama2 MLP subgraph with experimental Graph Compiler backend.
- Intel Graphics Products:
- Introduced initial optimizations for Processor Graphics based on Xe2 architecture.
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
- Improved matmul performance for cases relevant to Large Language Models (LLMs) and Transformer-like models.
- Improved convolution performance for cases relevant to the Stable Diffusion model.
- Improved RNN primitive performance.
- Improved pooling forward propagation performance.
- Improved batched matmul performance for cases with 5 dimensions or more.
- AArch64-based Processors:
- Added an option to build oneDNN with macOS Accelerate library to improve performance on Apple silicon.
- Improved reorder primitive performance with Compute Library for the Arm architecture (ACL).
- Improved bf16 inner product primitive performance with ACL.
Functionality
- Introduced GPTQ support to improve Large Language Models (LLMs) performance with compressed weights. Optimized implementation is available for Intel Graphics Products and supports matmul with int8 weight compression.
- Introduced fp8 data type support in primitives and Graph API. Optimized implementation is available for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Introduced support for fp16 and bf16 scale and shift arguments for layer normalization. Optimized implementation is available for Intel Graphics Products.
- [experimental] Introduced unstructured sparsity support for processors with Intel AMX support relying on VCOMPRESS/VPEXPAND instructions.
- Intel Graphics Products:
- Introduced PReLU post-op support for inner product and matmul primitives.
Usability
- Added opt-in deterministic mode support. Deterministic mode guarantees that results are bitwise identical between runs in a fixed environment.
- Introduced accumulation mode control.
- Extended oneDNN verbose diagnostics with information on dispatching decisions in convolution and matmul implementations.
- Extended verbose diagnostics for Graph API with information for operation schema check results and pattern matching results.
- Reduced RNN primitive memory consumption on GPUs.
- Added examples demonstrating use of oneDNN Graph API in eager mode use cases.
- Extended tensor constructor in Graph API to support memory allocation and management by the library.
- Introduced new API and environment variable to manage Graph API constant tensor cache capacity.
- Improved the efficiency of pattern matching in Graph API by optimizing pattern registration, reducing pattern numbers, and skipping patterns more wisely.
- Changed default optimization flags for AArch64 builds to `-mcpu=generic` to improve portability.
Validation
- Improved benchdnn performance by optimizing bottlenecks in validation code.
- Introduced `--num-streams` knob in benchdnn to support benchmarking in multi-stream scenarios.
Breaking Changes
- Updated minimal supported ACL version to 23.11 (was 23.02.1).
Thanks to these Contributors
This release contains contributions from the project core team as well as Alexander Grund @Flamefire, David Svantesson @davsva01, Fadi Arafeh @fadara01, Hugh Delaney @hdelan, Ilya Lavrenov @ilya-lavrenov, Jacob Kahn @jacobkahn, Nathan John Sircombe @nSircombe, Renato Barros Arantes @renato-arantes, Sergey Shalnov @shssf, Sunita Nadampalli @snadampal, and Svetlozar Georgiev @sgeor255. We would also like to thank everyone who asked questions and reported issues.
v3.3.4
This is a patch release containing the following changes to v3.3.3:
- Fixed performance regression in convolution, matmul and inner product primitives with post-ops on Intel CPUs (2e3c94c)
- Fixed performance regression in bfloat16 matmul on processors with Intel AMX instruction set support (c0ae38c, fa43640)
- Fixed `SEGFAULT` in 3D convolutions with different `h` and `w` parameters on Intel CPUs (b5f916e)
- Fixed performance regression in fp32 convolution backpropagation on Intel CPUs (ee3b12d)
- Reduced benchdnn memory consumption on Intel GPUs (84a8f57)
v3.3.3
This is a patch release containing the following changes to v3.3.2:
- Fixed performance regression in int8 convolutions on processors with Intel AVX-512 and Intel DL Boost support (a00661f)
- Fixed race condition during library initialization on Intel Data Center GPU Max Series (7dfcd11)
- Fixed accuracy issue in experimental Graph Compiler with LLVM code generator (8892e7e)
- Disabled int8 RNN implementation for cases with non-trivial strides (2195e4b)
- Fixed incorrect results in bfloat16 convolution implementation on processors with Intel AMX support (9f00af9)
- Fixed incorrect results in fp16 and int8 convolution on Intel Core Ultra integrated GPUs (69cef84, 79bc6cc, c9c0b09)
v3.3.2
This is a patch release containing the following changes to v3.3.1:
- Fixed incorrect results in bfloat16 reorder on Intel Core Ultra integrated GPUs (9025980, ed9de2a, 0c6bda1)
- Fixed incorrect results in matmul, inner product, and RNN primitives on Intel Core Ultra integrated GPUs (6edab9f)
- Updated compiler optimization flags for AArch64 processors to make build portable (8829c24)
- Fixed segmentation fault during library initialization on AArch64 processors (3e15c61)
v3.3.1
This is a patch release containing the following changes to v3.3:
- Fixed int8 convolution accuracy issue on Intel GPUs (09c87c7)
- Switched internal stream to in-order mode for NVIDIA and AMD GPUs to avoid synchronization issues (db01d62)
- Fixed runtime error for `avgpool_bwd` operation in Graph API (d025ef6, 9e0602a, e0dc1b3)
- Fixed benchdnn error reporting for some Graph API cases (98dc9db)
- Fixed accuracy issue in experimental Graph Compiler for int8 MHA variant from StarCoder model (5476ef7)
- Fixed incorrect results for layer normalization with trivial dimensions on Intel GPUs (a2ec0a0)
- Removed redundant synchronization for out-of-order SYCL queues (a96e9b1)
- Fixed runtime error in experimental Graph Compiler for int8 MLP subgraph from LLAMA model (595543d)
- Fixed `SEGFAULT` in experimental Graph Compiler for fp32 MLP subgraph (4207105)
- Fixed incorrect results in experimental Graph Compiler for MLP subgraph (57e14b5)
- Fixed the issue with f16 inner product primitive with s8 output returning `unimplemented` on Intel GPUs (bf12207, 800b5e9, ec7054a)
- Fixed incorrect results for int8 deconvolution with zero points on processors with Intel AMX instructions support (55d2cec)
v3.3
Performance Optimizations
- Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved int8 convolution performance with zero points on processors with Intel AMX instruction set support.
- Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). This functionality is disabled by default and can be enabled via CPU dispatcher control.
- Improved fp32 and int8 convolution performance for cases with small numbers of input channels for processors with Intel AVX-512 and/or Intel AMX instruction set support.
- Improved s32 binary primitive performance.
- Improved fp16, fp32, and int8 convolution performance for processors with Intel AVX2 instructions support.
- Improved performance of subgraphs with convolution, matmul, avgpool, maxpool, and softmax operations followed by unary or binary operations with Graph API.
- Improved performance of convolution for depthwise cases with Graph API.
- [experimental] Improved performance of LLAMA2 MLP block with Graph Compiler.
- Intel Graphics Products:
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
- Reduced RNN primitive initialization time on Intel GPUs.
- AArch64-based Processors:
- Improved fp32 to bf16 reorder performance.
- Improved max pooling performance with Arm Compute Library (ACL).
- Improved dilated convolution performance for depthwise cases with ACL.
Functionality
- Introduced group normalization primitive support. The functionality is currently available on CPUs.
- Intel CPUs:
- Introduced support for zero points in int8 convolution with groups and 3D spatial.
Usability
- Extended verbose mode output:
- Improved diagnostics on engine creation errors.
- Added information on Graph API calls.
- Added information on strides for non-dense memory objects.
- Added values of runtime dimensions.
- Added indication that primitive descriptor was created with `any` memory format tag.
- Introduced examples for Graph API.
- Graph API constant tensor cache is now disabled by default and requires opt-in with a `dnnl::graph::set_constant_tensor_cache()` call.
- Reduced oneDNN Graph API memory consumption in certain scenarios.
Validation
- Extended benchdnn performance reporting with primitive creation time.
- Introduced cold cache mode in benchdnn.
Known Limitations
- Current GPU OpenCL runtime for Linux has an issue resulting in convolution producing incorrect results on integrated GPUs based on Xe architecture. SYCL configuration is not affected.
- Pooling, resampling, prelu, batch normalization, layer normalization, and eltwise primitives may sporadically produce incorrect results on Intel Arc GPUs on Windows.
- Current GPU driver for Linux has an issue resulting in program hangs or crashes when oneDNN primitives are executed concurrently on Intel Data Center GPU Max Series.
- Extensive use of RNN primitive on Intel GPUs with default primitive cache setting may lead to a device reboot. Workaround: consider reducing primitive cache size to 100.
- Int8 deconvolution with signed weights and activations may produce incorrect results on processors with Intel AMX support.
- Int8 softmax may crash on Windows in SYCL debug configuration.
Thanks to these Contributors
This release contains contributions from the project core team as well as Amy Wignall @AmyWignall-arm, @baibeta, Benjamin Taylor @bentaylorhk-arm, Ilya Lavrenov @ilya-lavrenov, Kentaro Kawakami @kawakami-k, Milos Puzovic @milpuz01, Renato Barros Arantes @renato-arantes, @snadampal, @sparkyrider, and Thomas KΓΆppe @tkoeppe. We would also like to thank everyone who asked questions and reported issues.