Always use size 16 sub-groups in single work-group radix sort if supported #1833
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Previously, an IGC workaround has caused kernels compiled with
[[sycl::reqd_sub_group_size(32)]]
to be silently compiled with size 16 sub-groups with-O0
on certain devices. To comply with the SYCL spec, this case will now throw a synchronous exception at JIT time.Single work-group radix sort uses a sub-group sizes of 32 for the smallest inputs and 16 for larger inputs to minimize register pressure. I have experimented with the current mainline version, a version only using sub-group size of 16, and a version removing the requirement altogether. The first two versions performed similarly with no measurable difference, and explicitly requiring a sub-group size of 16 offered up to a ~15% benefit over the version with no requirement at all.
By using only sub-group sizes of 16 in single work-group radix sort, we are able to avoid the IGC issue while not impacting performance.