
[ matrix_transpose/bugfix ] Prevent reading/saving data from/to unallocated memory #2698

Merged · 1 commit into nnstreamer:main from skykongkong8:pr/transpose/biq16 on Aug 9, 2024

Conversation

skykongkong8 (Member)

  • The previous transpose kernel occasionally loaded from / stored to unallocated memory and then masked the spurious lanes.
  • Now it does not touch that memory in the first place; leftover elements are loaded with a for-loop instead (see the sketch after the table below).
  • This slows down the fp16 matrix transpose, but the cost is not dominant in total model latency:
| dim | before | after |
| --- | --- | --- |
| 87, 2049 | 884 ns | 16114 ns |
| 2048, 86 | 34019 ns | 82258 ns |
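
As a rough illustration of the change, here is a minimal sketch of the leftover-safe load pattern (not the PR's exact kernel; the helper name `load_leftover_f16` and the zeroed stack buffer are assumptions for illustration):

```cpp
#include <arm_neon.h> // NEON fp16 intrinsics (requires fp16 support, e.g. -march=armv8.2-a+fp16)

// Hypothetical helper: load up to 4 fp16 values without reading past the end
// of the source row. The old kernel did a full vld1_f16 and masked the
// out-of-range lanes afterwards; this version never touches them.
static inline float16x4_t load_leftover_f16(const float16_t *src, unsigned N) {
  if (N == 4)
    return vld1_f16(src);            // full 4-wide load is entirely in-bounds
  float16_t buf[4] = {0, 0, 0, 0};   // zero padding replaces the masking trick
  for (unsigned j = 0; j < N; ++j)   // for-loop reads only the N valid elements
    buf[j] = src[j];
  return vld1_f16(buf);
}
```

The per-element copy in the leftover path is what accounts for the slowdown in the table above.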

Self evaluation:

  1. Build test: [X]Passed [ ]Failed [ ]Skipped
  2. Run test: [X]Passed [ ]Failed [ ]Skipped

@taos-ci (Collaborator) commented on Aug 6, 2024

📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2698. Please follow the 1 commit/1 PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before the review process by reviewers can start. If you are a new member joining this project, please read the manuals in the documentation folder and the wiki page. In order to monitor the progress status of your PR in more detail, visit http://ci.nnstreamer.ai/.

@skykongkong8 skykongkong8 force-pushed the pr/transpose/biq16 branch 2 times, most recently from 79f34fe to de8b4da on August 6, 2024 08:27
[ matrix_transpose/bugfix ] Prevent reading/saving data from/to unallocated memory

- The previous transpose kernel occasionally loaded from / stored to unallocated memory and then masked the spurious lanes.
- Now it does not touch that memory in the first place; leftover elements are loaded with a for-loop instead.
- This slows down the fp16 matrix transpose, but the cost is not dominant in total model latency.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
      if (N == 4) {
        input[i] = vld1_f16(&src[i * ld_src]); // full 4-wide load, fully in-bounds
      } else {
        float16x4_t tmp = ZEROS;               // leftover path: start from zeros
A Contributor commented:

AFAIK attempting to read the unallocated memory only happens when i == M - 1.
Please let me know if I'm wrong.

@skykongkong8 (Member, Author) commented on Aug 6, 2024:

Your understanding is correct, but M inside this kernel does not mean the global M.
Here M ranges from 1 to 8 (the local row size).
And as you can see below, only the leftover local M is affected, since otherwise execution falls into the fixed-size kernels, or into leftover kernels with template parameter M = 4 or M = 8:

...
//   if (N % 8 > 0 && N % 8 < 4) {
        transpose_kernel_mxn_neon_128<4>(N - jb, &src[i * ld_src + jb], ld_src,
                                         &dst[i + jb * ld_dst], ld_dst);
...
//   } else {
      if (jb < N) {
        transpose_kernel_mxn_neon_256<8>(N - jb, &src[ib * ld_src + jb], ld_src,
                                         &dst[ib + jb * ld_dst], ld_dst);
      }
...

@skykongkong8 (Member, Author) commented on Aug 6, 2024:

And FYI, I think this won't harm total GEMM latency: 82258 ns is 0.082258 ms, while a GEMM computation at similar dimensions would take around 3~4 ms, so the transpose cost is quite trivial.
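
(A rough check with the numbers above, assuming the faster 3 ms GEMM estimate: 0.082258 ms / 3 ms ≈ 2.7 %, so even the slower transpose adds under 3 % to end-to-end GEMM time.)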

@taos-ci (Collaborator) left a comment:

@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

@jijoongmoon (Collaborator) left a comment:

LGTM

@jijoongmoon jijoongmoon merged commit aa1bddf into nnstreamer:main Aug 9, 2024
41 checks passed
@skykongkong8 skykongkong8 deleted the pr/transpose/biq16 branch August 16, 2024 01:23