
add mish kernel #569

Merged · 17 commits merged into tensorflow:master on Oct 29, 2019

Conversation

WindQAQ
Member

@WindQAQ WindQAQ commented Oct 5, 2019

@WindQAQ
Member Author

WindQAQ commented Oct 5, 2019

@digantamisra98 Hello Diganta, would you mind taking a look at this? Also, I would like to add your name to the maintainer list if you agree. Thanks!
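For context, the activation this kernel implements is mish(x) = x * tanh(softplus(x)). A minimal reference sketch in plain TensorFlow ops (just the naive composition, not the fused kernel added in this PR):

```python
import tensorflow as tf

def mish_reference(x):
    # mish(x) = x * tanh(softplus(x)); tf.math.softplus is already
    # numerically stable, so this composition serves as a correctness
    # baseline for the fused kernel.
    return x * tf.math.tanh(tf.math.softplus(x))

print(mish_reference(tf.constant([-2.0, 0.0, 2.0])))
```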

@digantamisra98

Hey @WindQAQ. Going through it. As of now, everything looks fine. The arXiv identifier in your first comment is incorrect though, just to point out.
Yeah, sure, feel free to add me on the maintainer list.
Thank you!

@WindQAQ
Member Author

WindQAQ commented Oct 5, 2019

> Hey @WindQAQ. Going through it. As of now, everything looks fine. The arXiv identifier in your first comment is incorrect though, just to point out.
> Yeah, sure, feel free to add me on the maintainer list.
> Thank you!

Thanks for the review! Already updated the maintainer list. Please feel free to ping me if the email address or the user ID is wrong 😃

@digantamisra98

@WindQAQ The ID and email are correct. I would also encourage you to take a look at this repository (based on CUDA): https://github.com/thomasbrandon/mish-cuda.
It's much faster, though I'm not sure it is the right approach (since it still doesn't work very well in double precision).
Also, what is the wait time on the review process?
Thanks

@WindQAQ
Member Author

WindQAQ commented Oct 6, 2019

> @WindQAQ The ID and email are correct. I would also encourage you to take a look at this repository (based on CUDA): https://github.com/thomasbrandon/mish-cuda.
> It's much faster, though I'm not sure it is the right approach (since it still doesn't work very well in double precision).

@digantamisra98 Eigen's GPUDevice is also based on CUDA (and ROCm). My previous experiment on the gelu activation shows that Eigen's parallelism has the same performance as a customized CUDA kernel in TensorFlow, as well as in PyTorch.

> Also, what is the wait time on the review process?

Maybe next week. Thanks!

@WindQAQ
Member Author

WindQAQ commented Oct 6, 2019

My performance test on gelu: TFA (Eigen's parallelism) vs PyTorch.
https://colab.research.google.com/drive/1pQFOp5sEp3Gw0LGHlZ6dPsNGfZPWXcgc

Will do a thorough test of TFA versus PyTorch on activations after all of these get merged :P

@digantamisra98

Thanks for that notebook link. Now I have better clarity on it. Keep me posted on the progress of the TFA/Torch tests. Thanks!

@thomasbrandon

As the author of the PyTorch kernel linked by @digantamisra98, I'll just expand a little.
I tried a gradient calculation similar to the one you have (though with less optimised reuse of intermediates). I found it to be very unstable, and training an actual network generally resulted in NaN outputs killing the whole network (though TensorFlow/Eigen might have some special handling for this, at least not killing the whole network). This mostly affected float16 calculations, but it would also occur with float32 sometimes.
Switching to a symbolically derived gradient calculation resulted in much better stability and also improved performance (likely less pronounced in the more optimised version here, which reuses more of the expensive exp calculations).
I just use a fairly naive backward that re-calculates the entire forward and then derives the gradient. I have tested a method that stores intermediates in the forward to avoid this, but found that the extra memory accesses actually reduced performance on CUDA float32 (I didn't test, but I would assume this also applies to float16). Though I was only testing synthetically (with CUDA events), there may well be less impact on overall network performance.
Storing intermediates may be more useful for CUDA float64 and CPU, where the extra calculations are likely slower than the extra memory accesses, so the ideal method would probably be to selectively use intermediates depending on device/type, which I will likely try at some point.
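For reference, a symbolically derived gradient can be obtained mechanically; here is a minimal SymPy sketch (the closed form in the comment is one standard simplification, not necessarily the exact expression either kernel uses):

```python
import sympy

x = sympy.symbols('x', real=True)
softplus = sympy.log(1 + sympy.exp(x))
mish = x * sympy.tanh(softplus)          # mish(x) = x * tanh(softplus(x))

grad = sympy.simplify(sympy.diff(mish, x))
print(grad)
# One equivalent closed form:
#   mish'(x) = tanh(softplus(x)) + x * sigmoid(x) * (1 - tanh(softplus(x))**2)
# where sigmoid(x) = 1 / (1 + exp(-x)) is the derivative of softplus.
```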

@WindQAQ
Member Author

WindQAQ commented Oct 7, 2019

> As the author of the PyTorch kernel linked by @digantamisra98, I'll just expand a little.
> I tried a gradient calculation similar to the one you have (though with less optimised reuse of intermediates). I found it to be very unstable, and training an actual network generally resulted in NaN outputs killing the whole network (though TensorFlow/Eigen might have some special handling for this, at least not killing the whole network). This mostly affected float16 calculations, but it would also occur with float32 sometimes.
> Switching to a symbolically derived gradient calculation resulted in much better stability and also improved performance (likely less pronounced in the more optimised version here, which reuses more of the expensive exp calculations).
> I just use a fairly naive backward that re-calculates the entire forward and then derives the gradient. I have tested a method that stores intermediates in the forward to avoid this, but found that the extra memory accesses actually reduced performance on CUDA float32 (I didn't test, but I would assume this also applies to float16). Though I was only testing synthetically (with CUDA events), there may well be less impact on overall network performance.
> Storing intermediates may be more useful for CUDA float64 and CPU, where the extra calculations are likely slower than the extra memory accesses, so the ideal method would probably be to selectively use intermediates depending on device/type, which I will likely try at some point.

Storing intermediate values is kind of a trade-off, but as you noted, I will investigate it for different combinations of device/type. Regarding stability, I am aware that the softplus operation in the forward pass is very likely to underflow/overflow. I will see how you deal with the backward gradients too! Thank you so much for the suggestion!

@thomasbrandon

Yeah, the intermediate stuff was mainly because I gather that's what @digantamisra98 was referring to with "not sure it is the right approach (since it still doesn't work very well in double precision)".
I was mainly noting what seems to be a stable forward/backward. For the forward I used a trick from PyTorch's softplus, which is inp < 20 ? log1p(exp(inp)) : inp. I gather the threshold of 20 is because exp(inp) overflows a float16 at around inp >= 20.
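As a rough illustration of that thresholding idea, here is a minimal NumPy sketch (not the kernel code from this PR; the threshold value and dtype handling are just assumptions for demonstration):

```python
import numpy as np

def softplus_stable(x, threshold=20.0):
    # For x above the threshold, log1p(exp(x)) equals x to within float
    # precision and exp(x) may overflow, so just pass x through.
    safe_x = np.minimum(x, threshold)  # keep exp() from overflowing in the unused branch
    return np.where(x < threshold, np.log1p(np.exp(safe_x)), x)

def mish(x):
    return x * np.tanh(softplus_stable(x))

print(mish(np.array([-50.0, -1.0, 0.0, 1.0, 50.0], dtype=np.float32)))
```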

@WindQAQ
Member Author

WindQAQ commented Oct 11, 2019

Hi @digantamisra98 @thomasbrandon, I have already dealt with the precision problem. The softplus part is mainly copied from core TF, and the backward part is from @thomasbrandon's implementation. Would you mind reviewing the PR again? Thank you so much for your time.
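For anyone who wants to sanity-check the fused op against the naive composition once it is built, a rough sketch along these lines might help (tfa.activations.mish is assumed to be the Python entry point this PR exposes; the input range and float32 dtype are arbitrary choices):

```python
import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa  # assumes a build containing this PR

def reference_mish(x):
    return x * tf.math.tanh(tf.math.softplus(x))

# Include large-magnitude inputs to exercise the softplus overflow/underflow paths.
x = tf.constant(np.linspace(-40.0, 40.0, 2001), dtype=tf.float32)

with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y_kernel = tfa.activations.mish(x)
    y_ref = reference_mish(x)

print("forward max diff:", np.max(np.abs(y_kernel - y_ref)))
print("grad max diff:", np.max(np.abs(tape.gradient(y_kernel, x) - tape.gradient(y_ref, x))))
```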

@digantamisra98

@WindQAQ will check tomorrow and let you know. Thanks!

@digantamisra98

@WindQAQ I did a quick check and went through the log of the failed Ubuntu GPU build, but couldn't interpret it correctly. Could you shed some light on it?

@WindQAQ
Member Author

WindQAQ commented Oct 12, 2019

> @WindQAQ I did a quick check and went through the log of the failed Ubuntu GPU build, but couldn't interpret it correctly. Could you shed some light on it?

It seems to be an upstream issue.

@seanpmorgan
Member

Hi @WindQAQ, when time allows, would you mind refactoring to use the custom_op_library from #581?

@seanpmorgan
Member

> @WindQAQ I did a quick check and went through the log of the failed Ubuntu GPU build, but couldn't interpret it correctly. Could you shed some light on it?

@digantamisra98 would you mind taking a look again? The failure was fixed in upstream TensorFlow.

@digantamisra98

@seanpmorgan yeah, checked, all good. Thanks for the heads-up.

@seanpmorgan
Member

@WindQAQ mind resolving conflicts when you get a chance?

Member

@seanpmorgan seanpmorgan left a comment


LGTM. Thanks for the contribution @WindQAQ, and thanks @digantamisra98 for the review.

@seanpmorgan seanpmorgan merged commit 093cdfa into tensorflow:master Oct 29, 2019
@WindQAQ
Member Author

WindQAQ commented Oct 30, 2019

For anyone concerned about speed: I benchmarked some activations in TensorFlow/TensorFlow Addons against PyTorch's implementations. I am not testing XLA or JIT here :-)

Because it's hard to build a toolchain on Colab, @thomasbrandon would you mind installing tfa-nightly to run some tests on Mish if you still have concerns about it? Bug reports would be really appreciated. Thank you all!

https://colab.research.google.com/drive/1zKuef-upkN_4jFnBRoHLk06xmtIDRemi
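For a quick local timing outside the notebook, a minimal sketch along these lines should work (the tensor shape and iteration count are arbitrary, and tfa refers to the tensorflow_addons package from the tfa-nightly build mentioned above):

```python
import time
import tensorflow as tf
import tensorflow_addons as tfa  # e.g. the tfa-nightly build mentioned above

x = tf.random.normal([8, 1024, 1024])

@tf.function
def fwd_bwd(x):
    # One forward + backward pass through the mish activation.
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = tfa.activations.mish(x)
    return tape.gradient(y, x)

fwd_bwd(x)  # warm-up: trace the function and initialize kernels
start = time.perf_counter()
for _ in range(100):
    grad = fwd_bwd(x)
_ = grad.numpy()  # force device sync before stopping the clock
print("avg forward+backward (s):", (time.perf_counter() - start) / 100)
```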

@digantamisra98

@seanpmorgan is there any scope for moving Mish into TF and Keras, now that it is included in PyTorch 1.9? Reference

@bhack
Contributor

bhack commented Jun 18, 2021

@digantamisra98 On our side, we no longer handle migrations within the TF ecosystem, as you can see in tensorflow/community#241 (comment).

Also, as the ecosystem review draft process is still not multilateral across the TF ecosystem, if you are going to create a PR to introduce Mish again in Keras, like:

keras-team/keras#13440
keras-team/keras#13439

we don't know how the deprecation would be triggered/handled. The process isn't currently documented, and my best bet is to collaboratively expand and improve the ecosystem review process across the whole ecosystem.

/cc @seanpmorgan @yarri-oss @theadactyl @qlzh727
