fix float16 and float64 kernels #2412
Conversation
Can you explain why this can solve the issue?
The error:
I guess this will generate wrong results when the original threads per block is larger than 256?
No, it's a resource-exceeded bug.
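A hedged illustration of the kind of resource-exceeded failure being discussed here (this is not the project's code, and the register counts are made-up example values): a CUDA launch fails when per-thread register usage times threads per block exceeds the per-block register file, and float64/float16 kernel variants often need more registers per thread than the float32 one, so a thread count that works for float32 can fail for float64.

```python
# Hypothetical numbers, for illustration only.
REGISTERS_PER_BLOCK = 65536  # typical CUDA register-file size per block/SM

def launch_fits(registers_per_thread, thread_per_block):
    # A launch only succeeds if the block's total register demand fits
    # the register file; otherwise CUDA reports a resource-exceeded error.
    return registers_per_thread * thread_per_block <= REGISTERS_PER_BLOCK

print(launch_fits(64, 512))   # lighter (float32-like) usage: fits
print(launch_fits(160, 512))  # heavier (float64-like) usage: launch fails
print(launch_fits(160, 128))  # same kernel, fewer threads per block: fits
```

This is why shrinking `thread_per_block` can fix the error even though the kernel code itself is unchanged.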
Hmm, if I understand correctly, the implementation has one thread do the EDT for one (h, w) plane. So if block_count == 1024 and thread_per_block == 512 (which means batch × channel == 2^19), this change will launch 2^18 threads instead, and some image planes are not going to be updated, right?
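The coverage concern above can be sketched with simple arithmetic (assuming, as the comment does, that each thread processes exactly one (h, w) plane): `block_count * thread_per_block` must be at least `batch * channel`, so halving `thread_per_block` without raising `block_count` leaves half the planes unprocessed.

```python
def planes_covered(block_count, thread_per_block):
    # Total threads launched == number of planes that get an EDT pass,
    # under the one-thread-per-plane assumption from the discussion.
    return block_count * thread_per_block

batch_times_channel = 2 ** 19            # the example from the comment
original = planes_covered(1024, 512)     # 2**19 threads: every plane covered
halved = planes_covered(1024, 256)       # 2**18 threads: half the planes missed

print(original >= batch_times_channel)   # True
print(halved >= batch_times_channel)     # False
```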
The results are the same as before, but inference may be a little slower.
Can you add a test of batch x channels == 2048 or larger? |
Good point, I'll try it.
Force-pushed from 0efcc5f to 30a901e (Compare)
Force-pushed from 30a901e to d32b28d (Compare)
I found I should multiply block_count by 4 and divide thread_per_block by 4.
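A sketch of the adjustment described above (illustrative values, not the project's actual launch code): scaling `block_count` up by 4 while scaling `thread_per_block` down by 4 keeps the total thread count, and hence the plane coverage, unchanged, while each block now demands a quarter of the per-block resources.

```python
def rescale(block_count, thread_per_block, factor=4):
    # Trade threads-per-block for more blocks; total threads is preserved
    # as long as thread_per_block is divisible by factor.
    return block_count * factor, thread_per_block // factor

blocks, threads = rescale(1024, 512)
print(blocks, threads)                  # 4096 128
print(blocks * threads == 1024 * 512)   # True: same number of planes covered
```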
Description
Brief Description of the PR:
Fix edt float16 and float64 kernels
Fixes # (issue)
Type of change
Checklist:
How Has This Been Tested?
If you're adding a bugfix or new feature, please describe the tests that you ran to verify your changes: