fix float16 and float64 kernels #2412
Conversation
Can you explain why this can solve the issue?
The error:
I guess this will generate wrong results when the original threads per block is larger than 256?
No, it's a resource-exceeded bug.
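A hedged illustration of the kind of resource-exceeded failure being discussed here (this is not the project's code, and the register counts are made-up example values): a CUDA launch fails when per-thread register usage times threads per block exceeds the per-block register file, and float64/float16 kernel variants often need more registers per thread than the float32 one, so a thread count that works for float32 can fail for float64.

```python
# Hypothetical numbers, for illustration only.
REGISTERS_PER_BLOCK = 65536  # typical CUDA register-file size per block/SM

def launch_fits(registers_per_thread, thread_per_block):
    # A launch only succeeds if the block's total register demand fits
    # the register file; otherwise CUDA reports a resource-exceeded error.
    return registers_per_thread * thread_per_block <= REGISTERS_PER_BLOCK

print(launch_fits(64, 512))   # lighter (float32-like) usage: fits
print(launch_fits(160, 512))  # heavier (float64-like) usage: launch fails
print(launch_fits(160, 128))  # same kernel, fewer threads per block: fits
```

This is why shrinking `thread_per_block` can fix the error even though the kernel code itself is unchanged.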
Hmm, if I understand correctly, the implementation has one thread do the EDT for one (h, w) plane. So if block_count == 1024 and thread_per_block == 512 (which means batch × channel == 2^19), this change will launch 2^18 threads instead, and some image planes are not going to be updated, right?
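The coverage concern above can be sketched with simple arithmetic (assuming, as the comment does, that each thread processes exactly one (h, w) plane): `block_count * thread_per_block` must be at least `batch * channel`, so halving `thread_per_block` without raising `block_count` leaves half the planes unprocessed.

```python
def planes_covered(block_count, thread_per_block):
    # Total threads launched == number of planes that get an EDT pass,
    # under the one-thread-per-plane assumption from the discussion.
    return block_count * thread_per_block

batch_times_channel = 2 ** 19            # the example from the comment
original = planes_covered(1024, 512)     # 2**19 threads: every plane covered
halved = planes_covered(1024, 256)       # 2**18 threads: half the planes missed

print(original >= batch_times_channel)   # True
print(halved >= batch_times_channel)     # False
```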
The results are the same as before, but inference may be a little slower.
Can you add a test of batch x channels == 2048 or larger? |
Good point, I'll try it.
Force-pushed from 0efcc5f to 30a901e (Compare)
Force-pushed from 30a901e to d32b28d (Compare)
I found I should multiply block_count by 4 and divide thread_per_block by 4.
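A sketch of the adjustment described above (illustrative values, not the project's actual launch code): scaling `block_count` up by 4 while scaling `thread_per_block` down by 4 keeps the total thread count, and hence the plane coverage, unchanged, while each block now demands a quarter of the per-block resources.

```python
def rescale(block_count, thread_per_block, factor=4):
    # Trade threads-per-block for more blocks; total threads is preserved
    # as long as thread_per_block is divisible by factor.
    return block_count * factor, thread_per_block // factor

blocks, threads = rescale(1024, 512)
print(blocks, threads)                  # 4096 128
print(blocks * threads == 1024 * 512)   # True: same number of planes covered
```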
Description
Brief Description of the PR:
Fix edt float16 and float64 kernels
Fixes # (issue)
Type of change
Checklist:
How Has This Been Tested?
If you're adding a bugfix or new feature, please describe the tests that you ran to verify your changes: