[ matrix_transpose/bugfix ] Prevent reading/saving data from/to unallocated memory #2698
Conversation
📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2698. Please follow the 1 commit/1 PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before reviewers start a review. If you are a new member joining this project, please read the manuals in the documentation folder and wiki page. To monitor the progress status of your PR in more detail, visit http://ci.nnstreamer.ai/.
…ocated memory

- The previous transpose kernel occasionally loaded/saved unallocated memory and then masked it.
- Now it does not read that memory in the first place, but loads the valid elements with a for-loop.
- This may slow down the fp16 matrix transpose, but the cost won't be dominant in total model latency.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
if (N == 4) {
  input[i] = vld1_f16(&src[i * ld_src]);
} else {
  float16x4_t tmp = ZEROS;
AFAIK attempting to read unallocated memory only happens if i == M - 1.
Please let me know if I'm wrong.
Your understanding is correct, but M inside this kernel does not mean the global M.
M here ranges from 1 to 8 (the local row size).
And as you can see below, it affects only the leftover local M, since otherwise execution falls into fixed-size kernels, or into leftover kernels with template param M = 4 or M = 8:
...
// if (N % 8 > 0 && N % 8 < 4) {
transpose_kernel_mxn_neon_128<4>(N - jb, &src[i * ld_src + jb], ld_src,
                                 &dst[i + jb * ld_dst], ld_dst);
...
// } else {
if (jb < N) {
  transpose_kernel_mxn_neon_256<8>(N - jb, &src[ib * ld_src + jb], ld_src,
                                   &dst[ib + jb * ld_dst], ld_dst);
}
...
And FYI, I think this won't harm total GEMM latency: 82258 ns is 0.082258 ms, while a GEMM computation at similar dimensions would range from about 3 to 4 ms, so the overhead is trivial.
@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.
LGTM