
[ hgemm ] Improve transposed B matrix computation and matrix padding @open sesame 07/15 20:33 #2655

Merged
merged 16 commits into nnstreamer:main from pr/hgemm/Mpadding on Jul 30, 2024

Conversation

skykongkong8
Member

@skykongkong8 skykongkong8 commented Jul 1, 2024

Major changes:

  1. Implement the transposed-B matrix data packing-blocking-kernel sequence. Currently, only the 8x16 kernel is supported.
  2. Implement matrix padding. When the input shape is not compatible with the fixed-size blocking and kernel, zero-padding makes them applicable (a minimal sketch follows this list). Note that fine-grained blocking/kernels should eventually be implemented for optimal performance (in terms of both speed and memory), but padding is a sub-optimal option that is easier to implement.
  3. Split the hgemm interface, padding, and kernel code for easier maintenance.
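
As a rough illustration of item 2, here is a minimal sketch of the zero-padding idea — not the PR's actual code; the helper name, `fp16` alias, and block sizes are assumptions:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

using fp16 = __fp16; // assumes an ARM toolchain providing __fp16

// Hypothetical helper: copy a row-major M x K matrix into a
// zero-initialized buffer whose dimensions are rounded up to
// multiples of blk_m / blk_k (e.g. 8 and 16 for the 8x16 kernel).
std::vector<fp16> pad_matrix(const fp16 *src, unsigned M, unsigned K,
                             unsigned blk_m, unsigned blk_k) {
  const unsigned M_pad = ((M + blk_m - 1) / blk_m) * blk_m;
  const unsigned K_pad = ((K + blk_k - 1) / blk_k) * blk_k;
  std::vector<fp16> dst(std::size_t(M_pad) * K_pad, fp16(0));
  for (unsigned i = 0; i < M; ++i)
    std::memcpy(dst.data() + std::size_t(i) * K_pad,
                src + std::size_t(i) * K, K * sizeof(fp16));
  return dst; // rows >= M and columns >= K stay zero
}
```

Because the extra rows and columns are zero, they contribute nothing to the dot products, so the valid M x N block of the padded GEMM output equals the unpadded result.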

The following unittest results were measured on a Galaxy S23; each value is the mean latency over 100 runs (TC=100).
Additionally, I tested an experimental GEMM kernel that is considerably faster but less accurate (not included in this PR; it will be introduced in the near future).

| TCs | fp32 | fp16 before | fp16 after | fp16 experimental (WIP) |
|---|---|---|---|---|
| conv21 | 802 ms | 1484 ms | 692 ms | (575 ms) |
| conv22 | 810 ms | 1561 ms | 745 ms | (572 ms) |
| conv23 | 900 ms | 1652 ms | 756 ms | (570 ms) |
| conv24_noTrans | 805 ms | 599 ms | 592 ms | (431 ms) |
| conv24_transB | 854 ms | 999 ms | 733 ms | (584 ms) |
| conv768 | 22575 ms | 16448 ms | 15937 ms | (11163 ms) |
| dot_gemm_fc | 76083 ms | 131610 ms | 57408 ms | (39968 ms) |
  • Note that these ratios may differ when the M-K-N configuration changes!

| TCs | time consumed during padding | ratio (padding time / total time * 100) |
|---|---|---|
| conv21 | 0.015 ms | 0.0021% |
| conv22 | 0.018 ms | 0.0024% |
| conv23 | 0.015 ms | 0.0019% |

Self evaluation:

  1. Build test: [X]Passed [ ]Failed [ ]Skipped
  2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>

@taos-ci
Collaborator

taos-ci commented Jul 1, 2024

📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2655. Please follow the 1 commit / 1 PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before the review process by reviewers can start. If you are a new member joining this project, please read the manuals in the documentation folder and the wiki page. To monitor the progress of your PR in more detail, visit http://ci.nnstreamer.ai/.

@skykongkong8 skykongkong8 requested a review from a team as a code owner July 1, 2024 04:53
@skykongkong8 skykongkong8 changed the title from “[ hgemm ] Use zero-padding in non-8-divisible M” to “[ WIP ] [ hgemm ] Use zero-padding in non-8-divisible M” Jul 1, 2024
@skykongkong8 skykongkong8 changed the title from “[ WIP ] [ hgemm ] Use zero-padding in non-8-divisible M” to “[ WIP ] [ hgemm ] Use zero-padding in non-8-divisible M @open sesame 07/01 14:08” Jul 1, 2024
Collaborator

@taos-ci taos-ci left a comment


@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

- Since the current kernel / blocking functions support fixed shapes only, implement a padding function as a temporary solution.
- Note that a flexible kernel / blocking implementation should be added for optimal performance.
- The current implementation separates the padding functions for matrices A and B, but they will eventually be handled by a single function.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- Add stdlib.h to hgemm_util.h

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- Separate the hgemm packing functions for easier implementation and maintenance.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- Previously, the hgemm transB computation relied on transposing the entire matrix and reusing the non-transpose sequence.
- For optimal performance, the matrix packing-blocking-kernel sequence for the transB case is now explicitly implemented (see the sketch after this commit message).
- Note that the current implementation only supports the 8x16 GEMM kernel.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
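
A minimal sketch of the idea behind the transB packing, with assumed names and layout (B stored row-major as N x K; N assumed already padded to a multiple of 16):

```cpp
#include <cstddef>

// Hypothetical packing routine for the transB case: the kernel consumes
// 16-column panels of the effective (K x N) matrix B^T in k-major order.
// Column (j0 + j) of the effective matrix is row (j0 + j) of stored B,
// so no full transpose needs to be materialized first.
void pack_B16_transB(const __fp16 *B, __fp16 *packed, unsigned K,
                     unsigned N, unsigned ldb /* = K for row-major */) {
  for (unsigned j0 = 0; j0 < N; j0 += 16)   // one 16-wide panel at a time
    for (unsigned k = 0; k < K; ++k)        // k-major within the panel
      for (unsigned j = 0; j < 16; ++j)
        *packed++ = B[std::size_t(j0 + j) * ldb + k];
}
```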
@skykongkong8 skykongkong8 changed the title from “[ hgemm ] Improve transposed B matrix computation matrix padding” to “[ hgemm ] Improve transposed B matrix computation and matrix padding” Jul 11, 2024
- Fix typo and add missing doxygen tags
- Add more exact explanation for doxygen tag briefs

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
Collaborator

@taos-ci taos-ci left a comment


@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

@jijoongmoon
Collaborator

This is outside the scope of this PR, but we could also consider using multi-threading for the blocking loop.
If it is ready, please remove the do-not-merge tag.
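
For illustration only (explicitly not part of this PR): a sketch of how the blocking loop might be multi-threaded with OpenMP, assuming packed operands, a hypothetical `hgemm_kernel_8x16` micro-kernel, and M / N already padded to multiples of 8 / 16. Compile with -fopenmp.

```cpp
// Hypothetical micro-kernel: C[m0..m0+8, n0..n0+16) += A-panel * B-panel
void hgemm_kernel_8x16(const __fp16 *packed_A, const __fp16 *packed_B,
                       __fp16 *C, unsigned m0, unsigned n0, unsigned K,
                       unsigned N);

// The 8-row panels of C are disjoint, so the outer loop over M can be
// distributed across threads; packed_B is shared and read-only.
void hgemm_blocked_parallel(const __fp16 *packed_A, const __fp16 *packed_B,
                            __fp16 *C, unsigned M, unsigned K, unsigned N) {
#pragma omp parallel for schedule(static)
  for (int m0 = 0; m0 < static_cast<int>(M); m0 += 8)
    for (unsigned n0 = 0; n0 < N; n0 += 16)
      hgemm_kernel_8x16(packed_A, packed_B, C, m0, n0, K, N);
}
```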

@skykongkong8
Member Author

> This is outside the scope of this PR, but we could also consider using multi-threading for the blocking loop. If it is ready, please remove the do-not-merge tag.

Not ready for the upstream merge yet. It would cause some unittest failures because we currently pad all matrices unconditionally, while some padding cases are NYI. Still, I think we can try it on some model applications.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- Missing implementations might trigger unittest failures on Android.
- This patch now supports the padding function for all combinations of the following conditions: matrix A / B, trans / noTrans, and the M / K / N directions.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- Add TCs checking for padding w.r.t. M, K, N, MK, KN, MKN

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- According to recent papers, values drawn from [0, 1) or [-1, 1) are widely used when comparing fp16 and fp32 precision (example below).

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
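
For example, bounded test inputs could be generated like this (a sketch, not the project's actual test fixture):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Draw n values uniformly from the half-open range [lo, hi), using a
// fixed seed so precision comparisons are reproducible across runs.
std::vector<float> random_inputs(std::size_t n, float lo = 0.0f,
                                 float hi = 1.0f) {
  std::mt19937 rng(42);
  std::uniform_real_distribution<float> dist(lo, hi);
  std::vector<float> v(n);
  for (auto &x : v)
    x = dist(rng);
  return v;
}
```

Keeping inputs in [0, 1) or [-1, 1) avoids the large-magnitude accumulation error that fp16 is prone to, so the fp16-fp32 comparison stays meaningful.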
- When comparing outputs computed at different precisions, the max componentwise relative error is needed (see the sketch below).
- (trivial) Use a more precise comparison for the zero-division guard in the cosine similarity function.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
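
A minimal sketch of such a metric, assuming an fp32 reference (the names are illustrative, not the actual test helpers):

```cpp
#include <cmath>
#include <cstddef>

// Max componentwise relative error between a reference and a result
// computed at lower precision; eps guards against division by zero.
float max_componentwise_relative_error(const float *ref, const float *out,
                                       std::size_t n, float eps = 1e-12f) {
  float max_err = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    const float denom = std::fmax(std::fabs(ref[i]), eps);
    const float err = std::fabs(out[i] - ref[i]) / denom;
    if (err > max_err)
      max_err = err;
  }
  return max_err;
}
```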
@skykongkong8 skykongkong8 changed the title from “[ hgemm ] Improve transposed B matrix computation and matrix padding” to “[ hgemm ] Improve transposed B matrix computation and matrix padding @open sesame 07/15 20:33” Jul 15, 2024
Collaborator

@taos-ci taos-ci left a comment


@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

@EunjuYang
Contributor

Would it be possible to combine the hgemm_padding functions for matrices A and B by taking row-wise / col-wise padding multiples as arguments (instead of using parameters specific to A and B)?
I believe this would allow us to reuse the code for various padding scenarios.

For instance, a generalized version of 'hgemm_padding_A_noTrans_wrt_MK' could handle 'hgemm_padding_B_noTrans_wrt_KN' as well. The only differences between these two functions are the naming of rows and columns and the specific padding sizes used.
By combining them, we would gain flexibility in dealing with different padding sizes.
Currently, the implementation only supports row 8 / col 8 padding for matrix A and row 8 / col 16 padding for matrix B.
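
A sketch of the suggested generalization — the signature and names here are hypothetical, not existing code: one routine parameterized by the row/col block multiples could serve both operands.

```cpp
#include <cstddef>

// Hypothetical unified padding: dst must hold rows_pad * cols_pad
// elements, where each dimension is rounded up to its block multiple.
void hgemm_padding_generic(const __fp16 *src, __fp16 *dst, unsigned rows,
                           unsigned cols, unsigned row_blk,
                           unsigned col_blk) {
  const unsigned rows_pad = ((rows + row_blk - 1) / row_blk) * row_blk;
  const unsigned cols_pad = ((cols + col_blk - 1) / col_blk) * col_blk;
  for (unsigned i = 0; i < rows_pad; ++i)
    for (unsigned j = 0; j < cols_pad; ++j)
      dst[std::size_t(i) * cols_pad + j] =
          (i < rows && j < cols) ? src[std::size_t(i) * cols + j]
                                 : static_cast<__fp16>(0);
}
// hgemm_padding_generic(A, A_pad, M, K, 8, 8);  // matrix A (row 8 / col 8)
// hgemm_padding_generic(B, B_pad, K, N, 8, 16); // matrix B (row 8 / col 16)
```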

@skykongkong8
Member Author

skykongkong8 commented Jul 16, 2024

> Would it be possible to combine the hgemm_padding functions for matrices A and B by taking row-wise / col-wise padding multiples as arguments (instead of using parameters specific to A and B)? I believe this would allow us to reuse the code for various padding scenarios.
>
> For instance, a generalized version of 'hgemm_padding_A_noTrans_wrt_MK' could handle 'hgemm_padding_B_noTrans_wrt_KN' as well. The only differences between these two functions are the naming of rows and columns and the specific padding sizes used. By combining them, we would gain flexibility in dealing with different padding sizes. Currently, the implementation only supports row 8 / col 8 padding for matrix A and row 8 / col 16 padding for matrix B.

That is correct, and I am aware of that point. I implemented it this way simply to code faster. Technically, all functions related to matrix padding will be deleted in the future; from that point of view, adding padding is nonsense in the first place. The current plan is:

  1. Implement explicit padding functions -> 2. fuse them into a single general function (the functions from step 1 will help me debug while implementing) -> 3. eventually delete them.

Collaborator

@jijoongmoon jijoongmoon left a comment


LGTM

@jijoongmoon jijoongmoon merged commit 7132010 into nnstreamer:main Jul 30, 2024
45 of 46 checks passed
@skykongkong8 skykongkong8 deleted the pr/hgemm/Mpadding branch September 23, 2024 01:25