[ hgemm ] Improve transposed B matrix computation and matrix padding @open seasame 07/15 20:33 #2655
📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2655. Please follow the 1commit/1PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before reviewers start the review process. If you are a new member of this project, please read the manuals in the documentation folder and the wiki page. To monitor the progress of your PR in more detail, visit http://ci.nnstreamer.ai/.
@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.
- Since the current kernel / blocking functions support fixed shapes only, implement padding functions as a temporary solution.
- Note that a flexible kernel / blocking implementation should be added for optimal performance.
- The current implementation has separate padding functions for matrix A and B, but they will eventually be handled by a single function.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- Add stdlib.h to hgemm_util.h

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- For easier implementation and maintenance of the hgemm packing functions, separate them.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- Previously, the hgemm transB computation relied on transposing the entire matrix and then using the non-transposed sequence.
- For optimal performance, the matrix packing-blocking-kernel sequence for the transB case is now implemented explicitly.
- Note that the current implementation only supports the 8x16 gemm kernel.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
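To illustrate the idea of packing a transposed B directly instead of transposing the whole matrix first, here is a minimal sketch. It is not the PR's code: `pack_B_trans` is a hypothetical helper, plain `float` stands in for fp16, and it assumes B is stored row-major as N x K with N already a multiple of the panel width `NR` (16 for an 8x16 kernel).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): packs a transposed-B operand,
// stored row-major as N x K, into column panels of width NR so that for
// every k the NR values a micro-kernel consumes are contiguous.
// Assumes N is a multiple of NR (for example after padding).
template <typename T>
std::vector<T> pack_B_trans(const T *B, std::size_t N, std::size_t K,
                            std::size_t NR = 16) {
  assert(N % NR == 0);
  std::vector<T> packed(N * K);
  std::size_t out = 0;
  for (std::size_t n0 = 0; n0 < N; n0 += NR) { // one panel per NR columns of B^T
    for (std::size_t k = 0; k < K; ++k) {      // walk along the K dimension
      for (std::size_t j = 0; j < NR; ++j) {   // gather NR values for this k
        packed[out++] = B[(n0 + j) * K + k];   // element B^T[k][n0 + j]
      }
    }
  }
  return packed;
}
```

The strided reads happen once during packing, so the kernel itself can load contiguous data without a separate full-matrix transpose pass.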
- Fix typos and add missing doxygen tags
- Add more precise explanations to the doxygen briefs

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
2f5b97c to 0f1f607
Not for this PR, but we could also consider multi-threading the blocking loop.
Not ready for upstream merge yet. It would trigger some unit test failures, because we currently force padding on all matrices while some padding cases are not yet implemented (NYI). Still, I think we can check it on some model applications.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- Missing implementations might trigger unit test failures on Android.
- This patch adds padding functions for all combinations of the following conditions: matrix A / B, trans / noTrans, M / K / N direction.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- Add TCs that check padding w.r.t. M, K, N, MK, KN, MKN

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- According to recent papers, values drawn from [0, 1) or [-1, 1) are widely used when comparing fp16 and fp32 precision.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
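As a rough sketch of how such test inputs could be generated with the C++ standard library: `random_uniform` is a hypothetical name, and the fixed default seed is an assumption made here for reproducibility, not something taken from the PR's unit tests.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not from the PR): fills a vector with pseudo-random
// floats drawn uniformly from [lo, hi). Defaults give the [-1, 1) range;
// pass lo = 0 for [0, 1). The fixed seed is an assumption for reproducibility.
std::vector<float> random_uniform(std::size_t n, float lo = -1.0f,
                                  float hi = 1.0f, unsigned seed = 0) {
  assert(lo < hi);
  std::mt19937 gen(seed);
  std::uniform_real_distribution<float> dist(lo, hi);
  std::vector<float> out(n);
  for (auto &v : out) {
    v = dist(gen);
  }
  return out;
}
```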
ac134c7 to d2766d6
d2766d6 to 6823003
- When comparing outputs computed with different precisions, the max componentwise relative error is needed.
- (trivial) Use a more precise comparison in the zero-division check of the cosine similarity function.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
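For reference, the two metrics mentioned above can be sketched as follows. This is not the PR's test utility: the function names, the `eps` clamp for near-zero reference values, and returning 0 for a zero-norm input are all assumptions chosen for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Max componentwise relative error: max_i |test_i - ref_i| / |ref_i|.
// Assumption: |ref_i| is clamped to eps to avoid dividing by zero.
double max_componentwise_relative_error(const std::vector<float> &ref,
                                        const std::vector<float> &test,
                                        double eps = 1e-12) {
  assert(ref.size() == test.size());
  double worst = 0.0;
  for (std::size_t i = 0; i < ref.size(); ++i) {
    const double r = static_cast<double>(ref[i]);
    const double t = static_cast<double>(test[i]);
    const double denom = std::max(std::abs(r), eps);
    worst = std::max(worst, std::abs(t - r) / denom);
  }
  return worst;
}

// Cosine similarity accumulated in double precision.
// Assumption: returns 0 when either vector has (near) zero norm.
double cosine_similarity(const std::vector<float> &a,
                         const std::vector<float> &b, double eps = 1e-12) {
  assert(a.size() == b.size());
  double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double x = static_cast<double>(a[i]);
    const double y = static_cast<double>(b[i]);
    dot += x * y;
    norm_a += x * x;
    norm_b += y * y;
  }
  const double denom = std::sqrt(norm_a) * std::sqrt(norm_b);
  if (denom < eps) {
    return 0.0; // zero-division guard
  }
  return dot / denom;
}
```

Cosine similarity alone can hide a single badly wrong element in a large output, which is why a componentwise maximum is a useful complement.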
6823003 to 53f7ce2
Would it be possible to combine the For instance, we could use a generalized version of 'hgemm_padding_A_noTrans_wrt_MK' to handle 'hgemm_padding_B_noTrans_wrt_KN' as well. The only differences between these two functions are in the naming of rows and columns as well as the specific padding sizes used.
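A minimal sketch of what such a generalized helper might look like, assuming row-major storage and zero padding. `pad_row_major` and `round_up` are hypothetical names, `float` stands in for fp16, and the tile sizes in the trailing comments are placeholders rather than the PR's values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Round x up to the next multiple of tile.
inline std::size_t round_up(std::size_t x, std::size_t tile) {
  return (x + tile - 1) / tile * tile;
}

// Hypothetical generic helper (not a function from the PR): copies a
// row-major rows x cols matrix into a zero-filled
// padded_rows x padded_cols buffer.
template <typename T>
std::vector<T> pad_row_major(const T *src, std::size_t rows, std::size_t cols,
                             std::size_t padded_rows,
                             std::size_t padded_cols) {
  assert(padded_rows >= rows && padded_cols >= cols);
  std::vector<T> dst(padded_rows * padded_cols, T(0));
  for (std::size_t r = 0; r < rows; ++r) {
    std::copy(src + r * cols, src + (r + 1) * cols,
              dst.begin() + r * padded_cols);
  }
  return dst;
}

// The same helper would serve both operands; only the dimension names differ:
//   A (M x K): pad_row_major(A, M, K, round_up(M, tile_m), round_up(K, tile_k));
//   B (K x N): pad_row_major(B, K, N, round_up(K, tile_k), round_up(N, tile_n));
```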
That is correct, and I am aware of that point. I implemented it this way simply to code faster. Technically, all functions related to matrix padding will be deleted in the future, so from that point of view adding padding makes little sense in the first place. Currently thinking like:
LGTM
Interesting information to share:
Major changes:
The following is the mean latency from unit tests run on a Galaxy S23 (TC=100).
Additionally, I tested further with an experimental GEMM kernel that is considerably faster but less accurate (not included in this PR; it will be introduced in the near future).
Self evaluation:
Signed-off-by: skykongkong8 <ss.kong@samsung.com>