
[activations] fused gelu kernel #427

Merged (17 commits) on Aug 26, 2019

Conversation

@WindQAQ (Member) commented on Aug 19, 2019

Fused GeLU kernel (forward and backward). On my local machine it achieves a 3x speedup in the forward direction and a 9x speedup in the backward direction compared with the Python implementation. It also supports the original (non-approximate) formulation from https://arxiv.org/pdf/1606.08415.pdf.

References:
https://on-demand.gputechconf.com/ai-conference-2019/T1-3_Minseok%20Lee_Adding%20custom%20CUDA%20C++%20Operations%20in%20Tensorflow%20for%20boosting%20BERT%20Inference.pdf
https://github.com/pytorch/pytorch/blob/master/caffe2/operators/gelu_op.cu
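
For reference, a minimal Python sketch of the two GeLU variants involved (formulas only; this is not the PR's kernel or benchmark code, and the function name is illustrative):

import math

def gelu_reference(x, approximate=True):
    # Tanh approximation from https://arxiv.org/pdf/1606.08415.pdf
    if approximate:
        return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
    # Exact form: x * P(X <= x) for X ~ N(0, 1), written with erf
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))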

@WindQAQ changed the title from "[WIP] fused gelu kernel" to "[activations] fused gelu kernel" on Aug 20, 2019
@AakashKumarNain (Member) commented:

This looks good to me. We can use this with the layer implementation.

@WindQAQ (Member, Author) commented on Aug 20, 2019

@AakashKumarNain Thanks! Do you mind if I also put your name and email in the README? It would be great if you could also help maintain the gelu activation function :-)

@AakashKumarNain (Member) commented:

@WindQAQ No problem at all. I would love to! Thank you

@mostafaelhoushi left a comment

Thanks @WindQAQ

Please check my comments below.

Also, I think we can support integer types for Gelu. According to the image below, when both the input and output are constrained to integers, Gelu should behave like Relu. I know that replacing all the terms in the original Gelu function with integers might lead to a different result, so we might handle integer input data types explicitly to make the forward pass behave as Relu. I'm not sure how to deal with the backward pass for integer data types, though.
[image comparing Gelu and Relu]

if (approximate) {
  const T kAlpha = static_cast<T>(M_2_SQRTPI * M_SQRT1_2);
  const T kBeta = kAlpha * static_cast<T>(0.044715) * static_cast<T>(3);
  const auto y =


I see that the output y is recalculated here; it was already computed in the forward pass.

How about passing the output of the forward pass as an additional input to the backward pass, to save recomputing it? i.e., have typename TTypes<T>::ConstTensor activations passed to the function, and replace this line with:

const auto y = activations
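
For intuition, here is a rough Python-level analogue of this suggestion (a hypothetical sketch using tf.custom_gradient; the PR itself does this inside the C++/CUDA kernel). The cached output y plays the role of the proposed activations input, and recovering the inner tanh from it is exactly what the follow-up replies discuss:

import tensorflow as tf

K_ALPHA = 0.7978845608028654  # sqrt(2 / pi)
K_COEF = 0.044715

@tf.custom_gradient
def gelu_cached(x):
    y = 0.5 * x * (1.0 + tf.tanh(K_ALPHA * (x + K_COEF * x * x * x)))

    def grad(dy):
        # Reuse the cached forward output instead of re-evaluating tanh(...):
        # inner tanh = y / (0.5 * x) - 1 (undefined at x == 0; the division is
        # also the precision concern raised in the replies below).
        t = y / (0.5 * x) - 1.0
        dydx = 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * K_ALPHA * (1.0 + 3.0 * K_COEF * x * x)
        return dy * dydx

    return y, grad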

@WindQAQ (Member, Author) replied on Aug 21, 2019

Oh, sounds great. Let me give this a try. Thanks!

@WindQAQ (Member, Author) replied:

Hello, it seems that

y = tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)), and
gelu(x) = 0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)))

Though we can compute y via y = gelu(x) / (0.5 * x) - 1, the division will indeed lose some precision. So I tend to keep the current implementation. What are your thoughts on this, @mostafaelhoushi and @AakashKumarNain?
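
A quick numerical illustration of the precision point (a sketch, not code from this PR): recovering y from the stored gelu output in float32 loses several significant digits to cancellation for small x.

import numpy as np

sqrt_2_over_pi = np.float32(0.7978846)
c = np.float32(0.044715)

def inner_tanh(x):
    # y = tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)), computed directly in float32
    return np.tanh(sqrt_2_over_pi * (x + c * x * x * x))

def gelu_tanh(x):
    return np.float32(0.5) * x * (np.float32(1.0) + inner_tanh(x))

x = np.float32(1e-3)  # illustrative small input
y_direct = inner_tanh(x)
y_recovered = gelu_tanh(x) / (np.float32(0.5) * x) - np.float32(1.0)
print(y_direct, y_recovered)  # the recovered y drifts from the directly computed one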

Member

The current implementation looks good to me.

@WindQAQ (Member, Author) replied:

So I think further changes could be done in the future. Let's move on to the Layer subclass! Thanks again for your feedback, everyone.


REGISTER_OP("GeluGrad")
.Input("gradients: T")
.Input("features: T")


Referring to my previous comment, we can add:

.Input("activations: T")

so that we don't recalculate the output when calculating the gradient.

class TestGelu(tf.test.TestCase, parameterized.TestCase):
    @parameterized.named_parameters(("float16", np.float16),
                                    ("float32", np.float32),
                                    ("float64", np.float64))


What about testing integer and quantization types?
For integer types, I believe that Gelu will simply behave as Relu.

@WindQAQ (Member, Author) replied on Aug 21, 2019

I don't quite get your point. For int32, do you mean we first cast it to float, do the computation in float, and finally cast back to int32? If so, it seems odd that users wouldn't simply cast int32 to float explicitly and cast the output back to int32 themselves.

Actually, most activation ops in core TF (and PyTorch) support only floating-point inputs. ReLU/ReLU6 is an exception because cwiseMax/cwiseMin can run on non-floating dtypes.
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/ops/nn_ops.cc#L1053-L1144

@WindQAQ (Member, Author) added on Aug 21, 2019

BTW, after a rough calculation with the Google calculator, I found there is a gap between ReLU and GeLU even for integer inputs. For input=2, the approximate version gives 1.95459769409 and the non-approximate version gives 1.9544997361. Going back to the definition of GeLU:

gelu(x) = x * P(X <= x) = x * normcdf(x)

When x=2, gelu(2) = 2 * normcdf(2) ~= 2 * 0.9772 != 2.

[calculator screenshots: approximate and non-approximate results]
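
The two quoted values can be reproduced with a short sketch (illustrative, not code from this PR):

import math

x = 2.0
exact = 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))  # x * normcdf(x) ~= 1.9545
approx = 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))  # ~= 1.9546
print(exact, approx)  # neither equals relu(2) == 2 before any rounding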

@AakashKumarNain (Member) replied on Aug 21, 2019

@mostafaelhoushi, regarding "For integer types, I believe that Gelu will simply behave as Relu": I don't think this is true. Can you elaborate a bit on this?

@mostafaelhoushi replied:

Thanks @WindQAQ and @AakashKumarNain for your feedback.
I meant that if both the input and output are constrained to be integers, then Gelu will behave as Relu, e.g.:

for the example that @WindQAQ mentioned, gelu(2) ~= 2 * 0.9772 ~= 1.9544,
but if the activations are constrained to be integers, then we need to round the output to the nearest integer... so round(gelu(2)) = 2 = relu(2).

However, @WindQAQ raised an important point: most activation ops in core TF (and PyTorch) support only floating-point inputs, with ReLU/ReLU6 being the exception. Hence, I think you may safely ignore this suggestion.

@WindQAQ (Member, Author) replied:

okay! Thanks again for the review :-)

@AakashKumarNain (Member) commented:

@WindQAQ I think we should inject this in #424 now. What do you say?

@WindQAQ WindQAQ requested review from a team and removed request for a team August 23, 2019 01:13
@WindQAQ (Member, Author) commented on Aug 23, 2019

@tensorflow/sig-addons-maintainers mind approving this one? Thanks!

@seanpmorgan (Member) left a comment

LGTM! Thanks so much.

@seanpmorgan merged commit 594e183 into tensorflow:master on Aug 26, 2019