
[ hgemm ] Improve transposed B matrix computation and matrix padding @open sesame 07/15 20:33 #2655

Merged
merged 16 commits into nnstreamer:main from pr/hgemm/Mpadding on Jul 30, 2024

Conversation

skykongkong8
Member

@skykongkong8 skykongkong8 commented Jul 1, 2024

Major changes:

  1. Implement the transposed-B matrix data packing-blocking-kernel sequence. Currently, only the 8x16 kernel is supported.
  2. Implement matrix padding. When the input shape is not compatible with the fixed-size blocking and kernel, zero-padding makes them applicable (a minimal sketch follows this list). Note that fine-grained blocking/kernels should eventually be implemented for optimal performance (in terms of both speed and memory), but padding is a sub-optimal option that is easier to implement.
  3. Split the hgemm interface, padding, and kernel code for easier maintenance.
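
As a rough illustration of item 2, here is a minimal sketch of the zero-padding idea — not the PR's actual code; the helper name, `fp16` alias, and block sizes are assumptions:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

using fp16 = __fp16; // assumes an ARM toolchain providing __fp16

// Hypothetical helper: copy a row-major M x K matrix into a
// zero-initialized buffer whose dimensions are rounded up to
// multiples of blk_m / blk_k (e.g. 8 and 16 for the 8x16 kernel).
std::vector<fp16> pad_matrix(const fp16 *src, unsigned M, unsigned K,
                             unsigned blk_m, unsigned blk_k) {
  const unsigned M_pad = ((M + blk_m - 1) / blk_m) * blk_m;
  const unsigned K_pad = ((K + blk_k - 1) / blk_k) * blk_k;
  std::vector<fp16> dst(std::size_t(M_pad) * K_pad, fp16(0));
  for (unsigned i = 0; i < M; ++i)
    std::memcpy(dst.data() + std::size_t(i) * K_pad,
                src + std::size_t(i) * K, K * sizeof(fp16));
  return dst; // rows >= M and columns >= K stay zero
}
```

Because the extra rows and columns are zero, they contribute nothing to the dot products, so the valid M x N block of the padded GEMM output equals the unpadded result.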

The following unittest results were measured on a Galaxy S23; each value is the mean latency over 100 runs (TC=100).
Additionally, I tested an experimental GEMM kernel that is considerably faster but less accurate (not included in this PR; it will be introduced in the near future).

| TCs | fp32 | fp16 before | fp16 after | fp16 experimental (WIP) |
|---|---|---|---|---|
| conv21 | 802 ms | 1484 ms | 692 ms | (575 ms) |
| conv22 | 810 ms | 1561 ms | 745 ms | (572 ms) |
| conv23 | 900 ms | 1652 ms | 756 ms | (570 ms) |
| conv24_noTrans | 805 ms | 599 ms | 592 ms | (431 ms) |
| conv24_transB | 854 ms | 999 ms | 733 ms | (584 ms) |
| conv768 | 22575 ms | 16448 ms | 15937 ms | (11163 ms) |
| dot_gemm_fc | 76083 ms | 131610 ms | 57408 ms | (39968 ms) |
  • Note that these ratios may differ when the M-K-N configuration changes!

| TCs | time consumed during padding | ratio (padding time / total time * 100) |
|---|---|---|
| conv21 | 0.015 ms | 0.0021% |
| conv22 | 0.018 ms | 0.0024% |
| conv23 | 0.015 ms | 0.0019% |

Self evaluation:

  1. Build test: [X]Passed [ ]Failed [ ]Skipped
  2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>

@taos-ci
Collaborator

taos-ci commented Jul 1, 2024

📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2655. Please follow the 1 commit / 1 PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before the review process by reviewers can start. If you are a new member joining this project, please read the manuals in the documentation folder and the wiki page. To monitor the progress of your PR in more detail, visit http://ci.nnstreamer.ai/.

@skykongkong8 skykongkong8 requested a review from a team as a code owner July 1, 2024 04:53
@skykongkong8 skykongkong8 changed the title from “[ hgemm ] Use zero-padding in non-8-divisible M” to “[ WIP ] [ hgemm ] Use zero-padding in non-8-divisible M” Jul 1, 2024
@skykongkong8 skykongkong8 changed the title from “[ WIP ] [ hgemm ] Use zero-padding in non-8-divisible M” to “[ WIP ] [ hgemm ] Use zero-padding in non-8-divisible M @open sesame 07/01 14:08” Jul 1, 2024
Collaborator

@taos-ci taos-ci left a comment


@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

- Since the current kernel / blocking functions support fixed shapes only, implement a padding function as a temporary solution.
- Note that a flexible kernel / blocking implementation should be added for optimal performance.
- The current implementation separates the padding functions for matrices A and B, but they will eventually be handled by a single function.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- Add stdlib.h to hgemm_util.h

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- Separate the hgemm packing functions for easier implementation and maintenance.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- Previously, the hgemm transB computation relied on transposing the entire matrix and reusing the non-transpose sequence.
- For optimal performance, the matrix packing-blocking-kernel sequence for the transB case is now explicitly implemented (see the sketch after this commit message).
- Note that the current implementation only supports the 8x16 GEMM kernel.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
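
A minimal sketch of the idea behind the transB packing, with assumed names and layout (B stored row-major as N x K; N assumed already padded to a multiple of 16):

```cpp
#include <cstddef>

// Hypothetical packing routine for the transB case: the kernel consumes
// 16-column panels of the effective (K x N) matrix B^T in k-major order.
// Column (j0 + j) of the effective matrix is row (j0 + j) of stored B,
// so no full transpose needs to be materialized first.
void pack_B16_transB(const __fp16 *B, __fp16 *packed, unsigned K,
                     unsigned N, unsigned ldb /* = K for row-major */) {
  for (unsigned j0 = 0; j0 < N; j0 += 16)   // one 16-wide panel at a time
    for (unsigned k = 0; k < K; ++k)        // k-major within the panel
      for (unsigned j = 0; j < 16; ++j)
        *packed++ = B[std::size_t(j0 + j) * ldb + k];
}
```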
@skykongkong8 skykongkong8 changed the title from “[ hgemm ] Improve transposed B matrix computation matrix padding” to “[ hgemm ] Improve transposed B matrix computation and matrix padding” Jul 11, 2024
- Fix typo and add missing doxygen tags
- Add more exact explanation for doxygen tag briefs

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
Collaborator

@taos-ci taos-ci left a comment


@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

@jijoongmoon
Collaborator

This is outside the scope of this PR, but we could also consider using multi-threading for the blocking loop.
If it is ready, please remove the do-not-merge tag.
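
For illustration only (explicitly not part of this PR): a sketch of how the blocking loop might be multi-threaded with OpenMP, assuming packed operands, a hypothetical `hgemm_kernel_8x16` micro-kernel, and M / N already padded to multiples of 8 / 16. Compile with -fopenmp.

```cpp
// Hypothetical micro-kernel: C[m0..m0+8, n0..n0+16) += A-panel * B-panel
void hgemm_kernel_8x16(const __fp16 *packed_A, const __fp16 *packed_B,
                       __fp16 *C, unsigned m0, unsigned n0, unsigned K,
                       unsigned N);

// The 8-row panels of C are disjoint, so the outer loop over M can be
// distributed across threads; packed_B is shared and read-only.
void hgemm_blocked_parallel(const __fp16 *packed_A, const __fp16 *packed_B,
                            __fp16 *C, unsigned M, unsigned K, unsigned N) {
#pragma omp parallel for schedule(static)
  for (int m0 = 0; m0 < static_cast<int>(M); m0 += 8)
    for (unsigned n0 = 0; n0 < N; n0 += 16)
      hgemm_kernel_8x16(packed_A, packed_B, C, m0, n0, K, N);
}
```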

@skykongkong8
Member Author

> This is outside the scope of this PR, but we could also consider using multi-threading for the blocking loop. If it is ready, please remove the do-not-merge tag.

Not ready for the upstream merge yet. It would cause some unittest failures because we currently pad all matrices unconditionally, while some padding cases are NYI. Still, I think we can try it on some model applications.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- Missing implementations might trigger unittest failures on Android.
- This patch now supports the padding function for all combinations of the following conditions: matrix A / B, trans / noTrans, and the M / K / N directions.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- Add TCs checking for padding w.r.t. M, K, N, MK, KN, MKN

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- According to recent papers, values drawn from [0, 1) or [-1, 1) are widely used when comparing fp16 and fp32 precision (example below).

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
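
For example, bounded test inputs could be generated like this (a sketch, not the project's actual test fixture):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Draw n values uniformly from the half-open range [lo, hi), using a
// fixed seed so precision comparisons are reproducible across runs.
std::vector<float> random_inputs(std::size_t n, float lo = 0.0f,
                                 float hi = 1.0f) {
  std::mt19937 rng(42);
  std::uniform_real_distribution<float> dist(lo, hi);
  std::vector<float> v(n);
  for (auto &x : v)
    x = dist(rng);
  return v;
}
```

Keeping inputs in [0, 1) or [-1, 1) avoids the large-magnitude accumulation error that fp16 is prone to, so the fp16-fp32 comparison stays meaningful.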
- When comparing outputs computed at different precisions, the max componentwise relative error is needed (see the sketch below).
- (trivial) Use a more precise comparison for the zero-division guard in the cosine similarity function.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
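
A minimal sketch of such a metric, assuming an fp32 reference (the names are illustrative, not the actual test helpers):

```cpp
#include <cmath>
#include <cstddef>

// Max componentwise relative error between a reference and a result
// computed at lower precision; eps guards against division by zero.
float max_componentwise_relative_error(const float *ref, const float *out,
                                       std::size_t n, float eps = 1e-12f) {
  float max_err = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    const float denom = std::fmax(std::fabs(ref[i]), eps);
    const float err = std::fabs(out[i] - ref[i]) / denom;
    if (err > max_err)
      max_err = err;
  }
  return max_err;
}
```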
@skykongkong8 skykongkong8 changed the title from “[ hgemm ] Improve transposed B matrix computation and matrix padding” to “[ hgemm ] Improve transposed B matrix computation and matrix padding @open sesame 07/15 20:33” Jul 15, 2024
Collaborator

@taos-ci taos-ci left a comment


@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

@EunjuYang
Contributor

Would it be possible to combine the hgemm_padding functions for matrices A and B by taking row-wise / col-wise padding multiples as arguments (instead of using parameters specific to A and B)?
I believe this would allow us to reuse the code for various padding scenarios.

For instance, a generalized version of 'hgemm_padding_A_noTrans_wrt_MK' could handle 'hgemm_padding_B_noTrans_wrt_KN' as well. The only differences between these two functions are the naming of rows and columns and the specific padding sizes used.
By combining them, we would gain flexibility in dealing with different padding sizes.
Currently, the implementation only supports row 8 / col 8 padding for matrix A and row 8 / col 16 padding for matrix B.
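
A sketch of the suggested generalization — the signature and names here are hypothetical, not existing code: one routine parameterized by the row/col block multiples could serve both operands.

```cpp
#include <cstddef>

// Hypothetical unified padding: dst must hold rows_pad * cols_pad
// elements, where each dimension is rounded up to its block multiple.
void hgemm_padding_generic(const __fp16 *src, __fp16 *dst, unsigned rows,
                           unsigned cols, unsigned row_blk,
                           unsigned col_blk) {
  const unsigned rows_pad = ((rows + row_blk - 1) / row_blk) * row_blk;
  const unsigned cols_pad = ((cols + col_blk - 1) / col_blk) * col_blk;
  for (unsigned i = 0; i < rows_pad; ++i)
    for (unsigned j = 0; j < cols_pad; ++j)
      dst[std::size_t(i) * cols_pad + j] =
          (i < rows && j < cols) ? src[std::size_t(i) * cols + j]
                                 : static_cast<__fp16>(0);
}
// hgemm_padding_generic(A, A_pad, M, K, 8, 8);  // matrix A (row 8 / col 8)
// hgemm_padding_generic(B, B_pad, K, N, 8, 16); // matrix B (row 8 / col 16)
```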

@skykongkong8
Member Author

skykongkong8 commented Jul 16, 2024

> Would it be possible to combine the hgemm_padding functions for matrices A and B by taking row-wise / col-wise padding multiples as arguments (instead of using parameters specific to A and B)? I believe this would allow us to reuse the code for various padding scenarios.
>
> For instance, a generalized version of 'hgemm_padding_A_noTrans_wrt_MK' could handle 'hgemm_padding_B_noTrans_wrt_KN' as well. The only differences between these two functions are the naming of rows and columns and the specific padding sizes used. By combining them, we would gain flexibility in dealing with different padding sizes. Currently, the implementation only supports row 8 / col 8 padding for matrix A and row 8 / col 16 padding for matrix B.

That is correct, and I am aware of that point. I implemented it this way simply to code faster. Technically, all functions related to matrix padding will be deleted in the future; from that point of view, adding padding is nonsense in the first place. The current plan is:

  1. Implement explicit padding functions -> 2. fuse them into a single general function (the functions from step 1 will help me debug while implementing) -> 3. eventually delete them.

Collaborator

@jijoongmoon jijoongmoon left a comment


LGTM

@jijoongmoon jijoongmoon merged commit 7132010 into nnstreamer:main Jul 30, 2024
45 of 46 checks passed
@skykongkong8 skykongkong8 deleted the pr/hgemm/Mpadding branch September 23, 2024 01:25