
gpu: intel: Optimize reusable layer normalization using work-group based reductions #1990

Open · wants to merge 6 commits into main from umar456:uarshad/reusable_vectorized_lnorm
Conversation

@umar456 (Contributor) commented on Jul 10, 2024

Description

This pull request adds an alternative kernel that performs better under certain conditions than the previously implemented sub-group reduction kernel. The new kernel uses the work_group_reduce_add function to perform the mean and variance reductions instead of the sub_group based reductions. One benefit of this kernel is that it performs better for sizes that do not fully utilize the device when sub-group based reductions are used.
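As a minimal illustration of the approach (an editor's sketch, not the actual kernel in this PR; LNORM_AXIS_SIZE is an assumed build-time define and the work-group is assumed to span the lnorm axis), the mean and variance can each be obtained with a work-group reduction:

// Sketch only; requires OpenCL 2.0 work-group functions.
__kernel void lnorm_wg_reduce_sketch(__global const float *src,
        __global float *mean_out, __global float *var_out) {
    // One work-item per element along the lnorm axis.
    const int i = get_local_id(0);
    const float x = src[get_group_id(0) * LNORM_AXIS_SIZE + i];

    // work_group_reduce_add returns the sum to every work-item.
    const float mean = work_group_reduce_add(x) / LNORM_AXIS_SIZE;
    const float diff = x - mean;
    const float var = work_group_reduce_add(diff * diff) / LNORM_AXIS_SIZE;

    if (i == 0) {
        mean_out[get_group_id(0)] = mean;
        var_out[get_group_id(0)] = var;
    }
}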

Optimizations

work-group based reductions vs sub-group based reductions

There are two kernels implemented for the reusable layer normalization implementation. They differ in how the summation for the mean and variance calculation is performed. The work-group kernel launches one work-item for each element along the lnorm axis, while the sub-group kernel launches one SIMD's worth of work-items along the lnorm axis. The work-group kernel uses the work_group_reduce_add function and the sub-group version uses the sub_group_reduce_add function to perform the summation. Here is a heatmap of how the two kernels perform over different shapes of the input tensor.

[Heatmap: work-group vs. sub-group reduction kernel performance across input tensor shapes]
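For contrast with the work-group sketch above, the sub-group variant's launch geometry roughly looks like this (editor's sketch; SG_SIZE, LNORM_AXIS_SIZE, the single-sub-group-per-row assumption, and the strided accumulation are illustrative, not the PR's code):

// Sketch only: one sub-group (SG_SIZE work-items) per lnorm axis; each
// work-item accumulates a strided partial sum before the sub-group
// reduction combines them.
__attribute__((intel_reqd_sub_group_size(SG_SIZE)))
__kernel void lnorm_sg_reduce_sketch(__global const float *src,
        __global float *mean_out) {
    const int lane = get_sub_group_local_id();
    const int row = get_group_id(0);

    float acc = 0.0f;
    for (int i = lane; i < LNORM_AXIS_SIZE; i += SG_SIZE)
        acc += src[row * LNORM_AXIS_SIZE + i];

    const float mean = sub_group_reduce_add(acc) / LNORM_AXIS_SIZE;
    if (lane == 0) mean_out[row] = mean;
}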

Use of fixed-size loops vs. variable-size loops

Example: https://github.com/oneapi-src/oneDNN/compare/main...umar456:oneDNN:uarshad/reusable_vectorized_lnorm?expand=1#diff-399297f4e437e8a12e0e654089b8af8c938a7a8430c6efab4ea61a029a683f8cR53

There is a significant penalty when a runtime variable is used in the loop exit condition. Here are heatmaps comparing a runtime loop bound against a compile-time loop bound:

[Heatmap: runtime loop bound vs. compile-time loop bound]
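To make the comparison concrete (editor's sketch; the kernel names and the N_BLOCKS define are illustrative), the only difference is whether the trip count is a kernel argument or a build-time define such as -DN_BLOCKS=4:

// Runtime bound: the exit condition depends on a kernel argument, so the
// compiler must keep the loop and cannot fully unroll it.
__kernel void sum_runtime_bound(__global const float *src,
        __global float *dst, int n_blocks) {
    float acc = 0.0f;
    for (int i = 0; i < n_blocks; i++)
        acc += src[i * get_global_size(0) + get_global_id(0)];
    dst[get_global_id(0)] = acc;
}

// Compile-time bound: N_BLOCKS is known when the kernel is built, so the
// loop can be fully unrolled and the bound checks disappear.
__kernel void sum_compile_time_bound(__global const float *src,
        __global float *dst) {
    float acc = 0.0f;
    for (int i = 0; i < N_BLOCKS; i++)
        acc += src[i * get_global_size(0) + get_global_id(0)];
    dst[get_global_id(0)] = acc;
}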

Use a macro to avoid the loop in the work-group kernel

The ifdef here: https://github.com/oneapi-src/oneDNN/compare/main...umar456:oneDNN:uarshad/reusable_vectorized_lnorm?expand=1#diff-399297f4e437e8a12e0e654089b8af8c938a7a8430c6efab4ea61a029a683f8cR51

is used to avoid adding a loop in the work-group kernel. I had originally assumed that if the compiler knew the loop only iterates once, it would be able to remove the overhead of the loop, but that does not seem to be the case. Here is a heatmap of using the macro to remove the loop in the work-group kernel.

[Heatmap: work-group kernel with and without the loop-removal macro]
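On the host side, the mechanism amounts to baking PVT_MEM_SIZE in as a preprocessor define when the program is built. A hedged C sketch using the plain OpenCL API (oneDNN actually goes through its own kernel-build infrastructure, and the option values here are illustrative):

#include <stdio.h>
#include <CL/cl.h>

/* Sketch only. With PVT_MEM_SIZE == 1 the "#if PVT_MEM_SIZE > 1" branch is
 * stripped by the preprocessor, so the work-group kernel contains no loop
 * at all instead of a single-iteration loop the compiler may fail to
 * remove. */
static cl_int build_lnorm_program(cl_program program, cl_device_id device,
        int pvt_mem_size, int n_unroll) {
    char options[128];
    snprintf(options, sizeof(options),
            "-cl-std=CL2.0 -DPVT_MEM_SIZE=%d -DN_UNROLL=%d",
            pvt_mem_size, n_unroll);
    return clBuildProgram(program, 1, &device, options, NULL, NULL);
}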

Use large GRF for certain shapes in the sub-group based kernel

The large GRF flag can significantly improve the speed of the kernel in certain situations. The greatest speedup appears when the tensor is small enough to fit in the device cache and the lnorm axis is larger than 768. I suspect this is because it allows the device to queue more load transactions than without the flag. There is also a significant slowdown when the lnorm axis is small and the number of sub-groups launched is greater than one wave; I suspect this is because fewer sub-groups are active when the large GRF flag is used. Here is the heatmap of the sub-group kernel with and without the large GRF flag.

[Heatmap: sub-group kernel with and without the large GRF flag]
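A hedged sketch of how such a heuristic could be expressed on the host side (editor's illustration; the thresholds mirror the observations above, but the function name and the cache-residency check are assumptions, not the PR's actual condition):

#include <stddef.h>

/* Sketch only. Returns the extra build option for the sub-group kernel.
 * The 768 threshold and the cache-residency condition come from the
 * observations above; the exact heuristic in the PR may differ. */
static const char *grf_build_option(size_t lnorm_axis, size_t tensor_bytes,
        size_t device_cache_bytes) {
    if (tensor_bytes <= device_cache_bytes && lnorm_axis > 768)
        return "-cl-intel-256-GRF-per-thread";
    return "";
}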

Overall Speedup

512 EU PVC

Heatmap vs. the original vectorized implementation:
[Image: reusable_pvc512]

@umar456 added the performance and platform:gpu-intel (Codeowner: @oneapi-src/onednn-gpu-intel) labels on Jul 10, 2024
@umar456 (Contributor, Author) commented on Jul 10, 2024

make test
enable device_gpu
disable device_cpu
disable benchdnn_all
enable benchdnn_lnorm
enable arch_xe-hpc
enable arch_xe-hpg-atsm
enable arch_xe-hpg-dg2
enable arch_xe-lpg
enable arch_xe2-lpg
enable arch_xe2-hpg-bmg

/// The number of work-items in a sub-group
uint32_t sg_size;

/// The number of work-items in the work group
uint32_t wg_size = 0;
A reviewer (Contributor) commented:
I suspect you don't actually need this variable. It's used in two places:

  1. Set as a build option that gets passed to reqd_work_group_size: you can probably just remove this and the performance changes will be minimal (see the sketch below this comment).
  2. Computing the nd_range_t: Reconstruct it directly in the execute function (based on select_work_group_kernel, vector_size, and pd())

If you can remove this, the kernel will be far more reusable.
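For reference, a minimal sketch of point 1 (editor's illustration; the WG_SIZE define and kernel name are assumptions, not the PR's source): the work-group size build option typically only feeds an OpenCL reqd_work_group_size attribute, which is what would be dropped.

// Sketch only. Dropping the attribute removes the compiled kernel's
// dependence on a specific work-group size, at the cost of losing the
// hint to the compiler.
#ifdef WG_SIZE
__attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))
#endif
__kernel void lnorm_reusable_wg_sketch(__global const float *src,
        __global float *dst) {
    dst[get_global_id(0)] = src[get_global_id(0)];
}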

Comment on lines +88 to +89
/// Use the cl-intel-256-GRF-per-thread flag
bool large_grf = false;
A reviewer (Contributor) commented:
I worry about using large GRF mode as a heuristic. On most intel GPUs, switching the GRF mode requires stalling the pipeline which can lead to performance losses. You can (probably) see this by running a benchdnn batch on layers that get small/large/small GRF modes, and you should see performance much lower than when they're run separately.

Usually, the GRF mode is passed in by the user as a GPU attr, and the kernels are just tasked with sticking to it.

Comment on lines +51 to +61
#if PVT_MEM_SIZE > 1
    VECT_FLOAT_T val[PVT_MEM_SIZE];
    unroll_for_by(N_UNROLL)(int sg_idx = 0, i = 0; i < PVT_MEM_SIZE;
            sg_idx += GROUP_STRIDE, i++) {
        val[i] = CONVERT_VECT_FLOAT_T(AS_VECT_DATA_T(VECT_BLOCK_READ(
                (const __global BLOCK_DATA_T *)(&src[sg_idx]))));
    }
#else
    VECT_FLOAT_T val = CONVERT_VECT_FLOAT_T(AS_VECT_DATA_T(
            VECT_BLOCK_READ((const __global BLOCK_DATA_T *)(src))));
#endif
A reviewer (Contributor) commented:
I think the compiler should be able to optimize this incantation. Give it a shot and let me know.

Suggested change (replace the #if/#else block quoted above with an unconditional loop):
    VECT_FLOAT_T val[PVT_MEM_SIZE];
    int sg_idx = 0;
    for (int i = 0; i < PVT_MEM_SIZE; i++) {
        val[i] = CONVERT_VECT_FLOAT_T(AS_VECT_DATA_T(VECT_BLOCK_READ(
                (const __global BLOCK_DATA_T *)(&src[sg_idx]))));
        sg_idx += GROUP_STRIDE;
    }

@vpirogov added this to the v3.6 milestone on Jul 18, 2024
Labels: performance, platform:gpu-intel (Codeowner: @oneapi-src/onednn-gpu-intel)
3 participants