Always use size 16 sub-groups in single work-group radix sort if supported #1833

mmichel11 · 2024-09-10T14:39:49Z

Previously, an IGC workaround caused kernels compiled with [[sycl::reqd_sub_group_size(32)]] to be silently compiled with size 16 sub-groups with -O0 on certain devices. To comply with the SYCL spec, this case will now throw a synchronous exception at JIT time.

Single work-group radix sort uses a sub-group size of 32 for the smallest inputs and 16 for larger inputs to minimize register pressure. I experimented with the current mainline version, a version using only a sub-group size of 16, and a version removing the requirement altogether. The first two versions performed similarly with no measurable difference, and explicitly requiring a sub-group size of 16 offered up to a ~15% benefit over the version with no requirement at all.

By using only sub-group sizes of 16 in single work-group radix sort, we are able to avoid the IGC issue while not impacting performance.
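As a rough illustration of the "size 16 if supported" policy in the title, here is a minimal host-side sketch in plain C++. The helper name `pick_one_wg_sub_group_size` is hypothetical and not part of oneDPL, and the fallback of requesting no particular size when 16 is unavailable is an assumption of this sketch. The actual change applies the requirement through a kernel attribute rather than a runtime function.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper (not oneDPL code): models "use size 16 if supported".
// `supported` stands in for the sizes a device reports, e.g. the result of
// querying sycl::info::device::sub_group_sizes in a real SYCL program.
// Returns 16 when the device lists it; otherwise std::nullopt, meaning
// "request no particular sub-group size" (an assumption of this sketch).
inline std::optional<std::size_t>
pick_one_wg_sub_group_size(const std::vector<std::size_t>& supported)
{
    constexpr std::size_t preferred = 16;
    const bool has_preferred =
        std::find(supported.begin(), supported.end(), preferred) != supported.end();
    return has_preferred ? std::optional<std::size_t>{preferred} : std::nullopt;
}
```

Under this sketch, size 32 is never requested, which is the property that sidesteps the -O0 issue described above.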

…ely" This reverts commit 7572f84. After experimentation with cold cache benchmarks, a benefit of up to ~15% was observed when requiring size 16 sub-groups for the larger single work-group cases. IGC compiles the kernels with sub-group size 32 here, so keeping the requirement is needed to maximize performance.

danhoeflinger · 2024-09-11T13:27:38Z

There is a comment on line 819 of parallel_backend_sycl_radix_sort.h which needs to be updated with this PR.

danhoeflinger

LGTM, but I'd like @MikeDvorskiy to also be able to look at this.

mmichel11 · 2024-09-11T14:14:05Z

> There is a comment on line 819 of parallel_backend_sycl_radix_sort.h which needs to be updated with this PR.

I made a few tweaks to the comment to match the new behavior.

MikeDvorskiy

I don't mind the changes, as long as they don't make performance worse. (As written in the PR description, they don't.)

mmichel11 added 3 commits September 10, 2024 06:23

Only use sub-group sizes of 16 in one-wg radix sort

c5b799e

This change is added to avoid a bug where IGC cannot compile SIMD32 kernels with -O0 compilation flags. No performance impact is observed.

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Remove sub-group size requirements in one-wg radix sort entirely

7572f84

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

mmichel11 requested review from MikeDvorskiy, danhoeflinger and adamfidel September 10, 2024 14:39

Adjust comment on sub-group sizes in __subgroup_radix_sort

9791eaf

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

danhoeflinger approved these changes Sep 11, 2024

MikeDvorskiy approved these changes Sep 13, 2024

mmichel11 merged commit 87d85d5 into main Sep 13, 2024
22 checks passed

mmichel11 deleted the dev/mmichel11/one_wg_radix_sort_simd16 branch September 13, 2024 15:44

This pull request was closed.
