cpu: aarch64: remove unnecessary workaround for f16 eltwise_tanh #1984

Merged: 1 commit, Aug 12, 2024

@ghost commented Jul 5, 2024

Description

This patch removes a previously implemented workaround that falls back to the reference implementation of eltwise when the algorithm is tanh in f16. The original bug (ARM-software/ComputeLibrary#998) that the workaround addressed has been fixed.
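
As a rough illustration of what removing the workaround changes, here is a toy model of implementation dispatch. It is not oneDNN code: `EltwiseProblem`, `acl_supports`, `ref_supports` and `dispatch` are hypothetical names, and the only details taken from this PR are the implementation names `acl` and `ref:any` seen in the verbose logs and the f16 tanh condition described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EltwiseProblem:
    data_type: str  # e.g. "f16", "f32"
    alg: str        # e.g. "tanh", "relu"


def acl_supports(problem, with_workaround):
    # Hypothetical guard modelling the removed workaround:
    # reject f16 tanh so that dispatch moves on to the next candidate.
    if with_workaround and problem.data_type == "f16" and problem.alg == "tanh":
        return False
    return True


def ref_supports(problem):
    # In this toy model the reference implementation accepts everything.
    return True


def dispatch(problem, with_workaround):
    """Return the name of the first implementation that accepts the problem."""
    candidates = [
        ("acl", lambda p: acl_supports(p, with_workaround)),
        ("ref:any", ref_supports),
    ]
    for name, accepts in candidates:
        if accepts(problem):
            return name
    raise RuntimeError("no implementation accepts this problem")


f16_tanh = EltwiseProblem(data_type="f16", alg="tanh")
print(dispatch(f16_tanh, with_workaround=True))
print(dispatch(f16_tanh, with_workaround=False))
```

With the guard in place the f16 tanh problem lands on `ref:any`; without it the same problem is picked up by `acl`, which matches the implementation column in the before and after logs.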

Checklist

General

  • [x] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • [x] Have you formatted the code using clang-format?

Performance improvements

  • [x] Have you submitted performance data that demonstrates performance improvements?
    The direct performance benefit is marginal (see the verbose logs below), but because the ACL implementation is now used, oneDNN can benefit from future targeted optimisations in ACL.

Before:

onednn_verbose,info,oneDNN v3.6.0 (commit e88463601057ecdee0f67906e75a44759bce79da)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:32
onednn_verbose,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,info,gpu,runtime:none
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,primitive,create:dispatch,eltwise,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,unsupported datatype,src/cpu/ref_eltwise.hpp:51
onednn_verbose,primitive,create:dispatch,eltwise,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,unsupported datatype,src/cpu/ref_eltwise.hpp:51
onednn_verbose,primitive,create:cache_miss,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,0.0170898
onednn_verbose,primitive,create:cache_hit,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,0.00195312
onednn_verbose,primitive,create:cache_miss,cpu,reorder,simple:any,undef,src_f32::blocked:abcd::f0 dst_f16::blocked:abcd::f0,,,500x192x55x55,0.0158691
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd::f0 dst_f16::blocked:abcd::f0,,,500x192x55x55,433.831
onednn_verbose,primitive,exec,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,218.724
onednn_verbose,primitive,exec,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,217.358
onednn_verbose,primitive,exec,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,217.757
onednn_verbose,primitive,exec,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,217.313
onednn_verbose,primitive,exec,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,217.449
onednn_verbose,primitive,exec,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,217.365
onednn_verbose,primitive,exec,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,217.956
onednn_verbose,primitive,exec,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,217.279
onednn_verbose,primitive,exec,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,218.101
onednn_verbose,primitive,exec,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,217.43
onednn_verbose,primitive,exec,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,217.997
onednn_verbose,primitive,exec,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,217.275
onednn_verbose,primitive,exec,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,217.505
onednn_verbose,primitive,exec,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,217.997
onednn_verbose,primitive,exec,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,217.333
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,ref:any,,--mode=P --eltwise --dt=f16 --tag=nchw --alg=tanh --alpha=0 --beta=0 500x192x55x55,0,0.682373,217.337,0,217.655,0
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):217.337 avg(ms):217.655
total: 5.08s; fill: 1.40s (28%);

After:

onednn_verbose,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,info,gpu,runtime:none
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,primitive,create:cache_miss,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,0.27417
onednn_verbose,primitive,create:cache_hit,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,0.0490723
onednn_verbose,primitive,create:cache_miss,cpu,reorder,simple:any,undef,src_f32::blocked:abcd::f0 dst_f16::blocked:abcd::f0,,,500x192x55x55,0.0151367
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd::f0 dst_f16::blocked:abcd::f0,,,500x192x55x55,432.868
onednn_verbose,primitive,exec:external,CpuActivationKernel/neon_fp16_activation,215.335
onednn_verbose,primitive,exec,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,215.684
onednn_verbose,primitive,exec:external,CpuActivationKernel/neon_fp16_activation,215.805
onednn_verbose,primitive,exec,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,215.9
onednn_verbose,primitive,exec:external,CpuActivationKernel/neon_fp16_activation,215.544
onednn_verbose,primitive,exec,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,215.626
onednn_verbose,primitive,exec:external,CpuActivationKernel/neon_fp16_activation,215.575
onednn_verbose,primitive,exec,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,215.654
onednn_verbose,primitive,exec:external,CpuActivationKernel/neon_fp16_activation,215.698
onednn_verbose,primitive,exec,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,215.756
onednn_verbose,primitive,exec:external,CpuActivationKernel/neon_fp16_activation,215.498
onednn_verbose,primitive,exec,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,215.57
onednn_verbose,primitive,exec:external,CpuActivationKernel/neon_fp16_activation,215.535
onednn_verbose,primitive,exec,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,215.59
onednn_verbose,primitive,exec:external,CpuActivationKernel/neon_fp16_activation,215.723
onednn_verbose,primitive,exec,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,215.81
onednn_verbose,primitive,exec:external,CpuActivationKernel/neon_fp16_activation,215.562
onednn_verbose,primitive,exec,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,215.617
onednn_verbose,primitive,exec:external,CpuActivationKernel/neon_fp16_activation,215.68
onednn_verbose,primitive,exec,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,215.734
onednn_verbose,primitive,exec:external,CpuActivationKernel/neon_fp16_activation,215.747
onednn_verbose,primitive,exec,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,215.826
onednn_verbose,primitive,exec:external,CpuActivationKernel/neon_fp16_activation,215.613
onednn_verbose,primitive,exec,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,215.703
onednn_verbose,primitive,exec:external,CpuActivationKernel/neon_fp16_activation,215.594
onednn_verbose,primitive,exec,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,215.683
onednn_verbose,primitive,exec:external,CpuActivationKernel/neon_fp16_activation,215.527
onednn_verbose,primitive,exec,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,215.61
onednn_verbose,primitive,exec:external,CpuActivationKernel/neon_fp16_activation,215.679
onednn_verbose,primitive,exec,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,215.757
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,acl,,--mode=P --eltwise --dt=f16 --tag=nchw --alg=tanh --alpha=0 --beta=0 500x192x55x55,0,1.21045,215.58,0,215.713,0
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):215.58 avg(ms):215.713
total: 5.04s; fill: 1.39s (28%);
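
To put a number on "marginal", the logs above can be summarised with a short script. This is a minimal sketch, not part of the patch: the helper names `exec_times_ms` and `relative_gain_percent` are made up for illustration, the field positions are assumed from the template line printed in the logs (operation, engine, primitive, ..., exec_time), and the sample lines are copied from the before and after runs.

```python
from statistics import mean


def exec_times_ms(log_text, primitive="eltwise"):
    """Collect exec times (ms) for one primitive kind from onednn_verbose output."""
    times = []
    for line in log_text.splitlines():
        fields = line.strip().split(",")
        # Expected layout: onednn_verbose,primitive,exec,<engine>,<primitive>,...,<exec_time>
        # Lines marked "exec:external" (ACL kernel timings) are skipped on purpose.
        if len(fields) < 6 or fields[:3] != ["onednn_verbose", "primitive", "exec"]:
            continue
        if fields[4] != primitive:
            continue
        times.append(float(fields[-1]))
    return times


def relative_gain_percent(before_ms, after_ms):
    """Percentage reduction in time going from before_ms to after_ms."""
    return (before_ms - after_ms) / before_ms * 100.0


BEFORE = """\
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd::f0 dst_f16::blocked:abcd::f0,,,500x192x55x55,433.831
onednn_verbose,primitive,exec,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,218.724
onednn_verbose,primitive,exec,cpu,eltwise,ref:any,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,217.358
"""

AFTER = """\
onednn_verbose,primitive,exec:external,CpuActivationKernel/neon_fp16_activation,215.335
onednn_verbose,primitive,exec,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,215.684
onednn_verbose,primitive,exec:external,CpuActivationKernel/neon_fp16_activation,215.805
onednn_verbose,primitive,exec,cpu,eltwise,acl,forward_training,data_f16::blocked:abcd::f0 diff_undef::undef:::,,alg:eltwise_tanh alpha:0 beta:0,500x192x55x55,215.9
"""

print("sample mean before (ms):", mean(exec_times_ms(BEFORE)))
print("sample mean after  (ms):", mean(exec_times_ms(AFTER)))

# Averages reported by benchdnn in the "total perf" lines above.
print("gain (%):", relative_gain_percent(217.655, 215.713))
```

Using the benchdnn averages (217.655 ms before, 215.713 ms after), the reduction works out to a little under 1%, which is consistent with the description of the benefit as marginal.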

@jondea added the platform:cpu-aarch64 label (Codeowner: @oneapi-src/onednn-cpu-aarch64) Jul 5, 2024

@jondea (Contributor) left a comment

LGTM

@jondea (Contributor) left a comment

This breaks test_benchdnn_modeC_eltwise_ci_cpu; please do not merge until it is fixed.

@vpirogov added this to the v3.6 milestone Jul 18, 2024

@ghost (Author) commented Aug 6, 2024

This patch requires ACL >= 24.06, which is provided by #2022. Once that is merged, we should be good to go.
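
The version requirement can be expressed as a simple comparison. The sketch below is illustrative only: `parse_version` and `acl_is_new_enough` are hypothetical helpers, it assumes ACL version strings are dot-separated integers such as "24.06", and it says nothing about how the oneDNN build actually enforces the minimum.

```python
def parse_version(text):
    """Turn a dotted version string such as '24.06' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def acl_is_new_enough(found, minimum="24.06"):
    """True if the found ACL version is at least the minimum (assumed format)."""
    return parse_version(found) >= parse_version(minimum)


for candidate in ("23.11", "24.04", "24.06", "24.07"):
    print(candidate, acl_is_new_enough(candidate))
```

Comparing integer tuples rather than raw strings avoids surprises with zero padding or differing component widths.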

@vpirogov merged commit 5fa3e7d into oneapi-src:main Aug 12, 2024
8 of 10 checks passed