[activations] fused gelu kernel #427
Conversation
This looks good to me. We can use this with the layer implementation.
@AakashKumarNain Thanks! Do you mind if I also put your name and email in the README? It would be great if you could also participate in maintaining the gelu activation function :-)
@WindQAQ No problem at all. I would love to! Thank you
Thanks @WindQAQ
Please check my comments below.
Also, I think we can support integer types for Gelu: when both the input and output are constrained to integers, Gelu should behave like Relu. I know that replacing all the terms in the original Gelu function with integers might lead to a different result, so we might handle integer input types explicitly to make the forward pass behave as Relu. I'm not sure how to deal with the backward pass for integer data types, though.
if (approximate) {
  const T kAlpha = static_cast<T>(M_2_SQRTPI * M_SQRT1_2);
  const T kBeta = kAlpha * static_cast<T>(0.044715) * static_cast<T>(3);
  const auto y =
I see that the output y is recalculated here, even though it was already calculated in the forward pass. How about passing the output of the forward pass as an additional input to the backward pass to avoid recalculating it? i.e., have typename TTypes<T>::ConstTensor activations passed to the function, and replace this line with:

const auto y = activations
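To illustrate the idea at the Python level (this is only a sketch using tf.custom_gradient, not the proposed change to the C++ kernel): the forward pass can hand its intermediate tanh value to the gradient function so it is not recomputed.

```python
import tensorflow as tf

@tf.custom_gradient
def gelu_approx(x):
    # Forward pass: tanh approximation of GeLU.
    kAlpha = 0.7978845608028654  # sqrt(2 / pi)
    y = tf.tanh(kAlpha * (x + 0.044715 * tf.pow(x, 3)))
    out = 0.5 * x * (1.0 + y)

    def grad(dy):
        # Reuse `y` captured from the forward pass instead of recomputing the tanh.
        d_inner = kAlpha * (1.0 + 3.0 * 0.044715 * tf.square(x))
        d_gelu = 0.5 * (1.0 + y) + 0.5 * x * (1.0 - tf.square(y)) * d_inner
        return dy * d_gelu

    return out, grad
```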
Oh, sounds great. Let me give it a try. Thanks!
Hello, it seems that

y = tanh(sqrt(2 / pi) * (x + 0.044715 * x^3))

and

gelu(x) = 0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)))

Though we could compute y via y = gelu(x) / (0.5 * x) - 1, this would indeed lose some precision during the division, so I tend to preserve the current implementation. What are your thoughts on this, @mostafaelhoushi and @AakashKumarNain?
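A quick numpy check of this concern (only a sketch, using the tanh approximation written out above): recovering y from the stored activations requires dividing by 0.5 * x, which is undefined at x = 0 and inaccurate near it.

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 101, dtype=np.float32)
y = np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3))
gelu = 0.5 * x * (1.0 + y)  # what the forward pass would store as `activations`

# Recovering y from the activations divides by 0.5 * x: undefined at x = 0,
# whereas recomputing the tanh is well defined everywhere.
with np.errstate(divide="ignore", invalid="ignore"):
    y_recovered = gelu / (0.5 * x) - 1.0

print(np.nanmax(np.abs(y - y_recovered)))
```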
The current implementation looks good to me.
So I think further changes can be done in the future. Let's move on to the Layer subclass! Thanks again for your feedback, everyone.
REGISTER_OP("GeluGrad") | ||
.Input("gradients: T") | ||
.Input("features: T") |
Referring to my previous comment, we can add:
.Input("activations: T")
so that we don't recalculate the output when calculating the gradient.
class TestGelu(tf.test.TestCase, parameterized.TestCase):
    @parameterized.named_parameters(("float16", np.float16),
                                    ("float32", np.float32),
                                    ("float64", np.float64))
What about testing integer and quantization types?
For integer types, I believe that Gelu will simply behave as Relu.
I do not really get your point. For int32, do you mean we first cast it to float, do the computation in float, and finally cast the result back to int32? If so, it seems odd that users wouldn't simply cast int32 to float explicitly themselves and cast the output back to int32.

Actually, most activation ops in core TF (and PyTorch) support only floating-point inputs. ReLU/ReLU6 are exceptions because cwiseMax/cwiseMin can run on non-floating dtypes.
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/ops/nn_ops.cc#L1053-L1144
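For example, a user who really wants integer inputs could already do the cast round-trip explicitly at the Python level. This is a hypothetical usage sketch; it assumes the op is eventually exposed as tfa.activations.gelu, which is not part of this diff.

```python
import tensorflow as tf
import tensorflow_addons as tfa  # assumes this PR's op is exposed as tfa.activations.gelu

x_int = tf.constant([-2, -1, 0, 1, 2], dtype=tf.int32)
# Explicit cast round-trip: int32 -> float32 -> gelu -> int32.
y_int = tf.cast(tfa.activations.gelu(tf.cast(x_int, tf.float32)), tf.int32)
```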
BTW, after a rough computation with the Google calculator, I found there is a gap between ReLU and GeLU even for integer inputs. When the input is 2, the approximate version gives 1.95459769409 and the non-approximate version gives 1.9544997361. Looking deeper at the definition of GeLU:

gelu(x) = x * P(X <= x) = x * normcdf(x)

When x = 2, gelu(2) = 2 * normcdf(2) ~= 2 * 0.9772 != 2.
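A quick way to reproduce these numbers (a small sketch, using math.erf for the exact normal CDF):

```python
import math

def gelu_exact(x):
    # gelu(x) = x * normcdf(x), with normcdf(x) = 0.5 * (1 + erf(x / sqrt(2))).
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_approx(x):
    # Tanh approximation used by the fused kernel's approximate=True path.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

print(gelu_exact(2.0))    # ~1.9544997361
print(gelu_approx(2.0))   # ~1.9545976941
print(max(2.0, 0.0))      # relu(2) = 2.0
```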
@mostafaelhoushi: "For integer types, I believe that Gelu will simply behave as Relu."
I don't think this is true. Can you elaborate a bit on this?
Thanks @WindQAQ and @AakashKumarNain for your feedback.
I meant that if both the input and output are constrained to be integers, then Gelu will behave as Relu. E.g., for the example that @WindQAQ mentioned, gelu(2) ~= 2 * 0.9772 = 1.9544, but if the activations are constrained to be integers then we need to round the output to the nearest integer, so round(gelu(2)) = 2 = relu(2).
However, @WindQAQ made an important point: most activation ops in core TF (and PyTorch) support only floating-point inputs, with ReLU/ReLU6 being the exceptions. Hence, I think you may safely ignore this suggestion.
okay! Thanks again for the review :-)
@tensorflow/sig-addons-maintainers mind approving this one? Thanks!
LGTM! Thanks so much.
Fused GeLU kernel (forward and backward). On my local machine, it achieves a 3x speedup in the forward direction and a 9x speedup in the backward direction compared with the Python implementation. It also supports the original (non-approximate) form stated in https://arxiv.org/pdf/1606.08415.pdf.
References:
https://on-demand.gputechconf.com/ai-conference-2019/T1-3_Minseok%20Lee_Adding%20custom%20CUDA%20C++%20Operations%20in%20Tensorflow%20for%20boosting%20BERT%20Inference.pdf
https://github.com/pytorch/pytorch/blob/master/caffe2/operators/gelu_op.cu
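For context, a hedged sketch of the kind of unfused Python/TF composition that a fused kernel like this replaces (the exact baseline used for the timings above isn't shown in this thread):

```python
import math
import tensorflow as tf

def gelu_reference(x, approximate=True):
    # Unfused composition of standard TF ops; a fused kernel computes the same
    # function in a single forward op and a single backward op.
    x = tf.convert_to_tensor(x)
    if approximate:
        # Tanh approximation from https://arxiv.org/abs/1606.08415.
        coeff = tf.cast(math.sqrt(2.0 / math.pi), x.dtype)
        return 0.5 * x * (1.0 + tf.tanh(coeff * (x + 0.044715 * tf.pow(x, 3))))
    # Original form: x * normcdf(x), written with erf.
    return 0.5 * x * (1.0 + tf.math.erf(x / tf.cast(math.sqrt(2.0), x.dtype)))
```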