Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: TensorFlow on DirectML #243

Merged
merged 12 commits into from
Mar 4, 2022
Merged

RFC: TensorFlow on DirectML #243

merged 12 commits into from
Mar 4, 2022

Conversation

wchao1115
Copy link
Contributor

@wchao1115 wchao1115 commented May 12, 2020

This RFC will be open for comment until Monday, May 25th, 2020.

Status Proposed
RFC # 243
Author(s) Chai Chaoweeraprasit (wchao@microsoft.com), Justin Stoecker (justoeck@microsoft.com), Adrian Tsai (adtsai@microsoft.com), Patrice Vignola (pavignol@microsoft.com)
Sponsor Penporn Koanantakool (penporn@google.com)
Updated 2020-06-08

Objective

Implement a new TensorFlow device type and a new set of kernels based on DirectML, a hardware-accelerated machine learning library on the DirectX 12 Compute platform. This change broadens the reach of TensorFlow beyond its existing GPU footprint and enables high-performance training and inferencing on Windows devices with any DirectX12-capable GPU.

@googlebot
Copy link

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@googlebot
Copy link

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@googlebot
Copy link

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@penpornk
Copy link
Member

penpornk commented May 14, 2020

Thank you for your interest!

For the current TensorFlow stack, we recommend adding the new device and kernels as external modules, connecting to TensorFlow through C APIs. Please see the Modular TensorFlow RFC for more context. Kernels and ops can be registered through the APIs in this ongoing RFC. We will share more updates on device addition APIs next week. In addition, there are a few other uncertainties with this approach:

  • Modular TF cannot guarantee compatibility with the new TensorFlow runtime (TFRT) (but we will try to maximize compatibility if possible). TFRT will be modular, but it is fundamentally different from the current runtime and many things are still evolving.
  • The C APIs will initially be subject to changes, we cannot yet guarantee API stability

What is your integration timeline? TensorFlow is transitioning to a new backend based on TFRT and MLIR, which will not be compatible with the current stack. If your goal is to get your integration productized sometime in 2021, we recommend waiting to integrate with TFRT and MLIR to save throw-away efforts.

@wchao1115
Copy link
Contributor Author

Thank you for your comment @penpornk. Our initial goal is to be compatible with the existing runtime behavior as that's what our audience is expecting. We aim to integrate a new backend to achieve broader hardware and platform reach while maintaining compatibility as a top priority.

We definitely want to learn more about TFRT. Based on the info in the blog, it does look like the refactoring will actually make it easier for us to integrate our device runtime and kernels. So, definitely interested to be in the loop for future development. We have not finalized our initial release timeline, but we have some good progress in our fork to date.

Given the size of the change, I would also like to know your recommendation on how we should think about the code integration strategy. The code change is large but mostly additive.

@mihaimaruseac
Copy link
Contributor

@wchao1115 regarding CLA, you will have to sign with all associated email accounts

@mihaimaruseac
Copy link
Contributor

Just checked the internal data. Your email at ntdev.microsoft.com has not signed the CLA

@penpornk
Copy link
Member

I manually checked and don't see the CLA for wchao@ntdev.microsoft.com. Did you sign with this address or a different alias?

I noticed that both your personal emails are associated with the same github account. Maybe you can add your corp email as the third one (in Github settings) too. (Or do we still require a separate CLA for a corp email address, @mihaimaruseac ?)

@yongtang
Copy link
Member

I think an easy way is to modify the commit and reset the author to associate the email that signed CLA instead.

You can use git commit -s --amend --reset-author to reset the last commit's author info.

To modify all 9 commit, you can use:

git rebase -i HEAD~9 -x "git commit --amend --author 'Author Name <author.name@mail.com>' --no-edit"

@penpornk
Copy link
Member

@yongtang That's a great idea. Thank you!

@googlebot
Copy link

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@googlebot googlebot removed the cla: no label May 16, 2020
@ematejska ematejska moved this from In Revision to Open reviews in RFC management May 12, 2021
@penpornk
Copy link
Member

is there a plan to support an ARM64 build either on Windows or the Mac OS? I think we'll eventually need that on Windows.

@wchao1115 There is no set plan that we can share yet. But please stay tuned. :)

In case this is useful: We have an aarch64 build config (--config=mkl_aarch64) recently added by @cfRod and @alenik01. This build uses oneDNN with Arm Compute Library support.

@penpornk
Copy link
Member

penpornk commented Jun 8, 2021

Hi @wchao1115,
A friendly reminder that we plan to discuss opportunities for improving TF Windows build infrastructure at the SIG Build meeting today at 2pm PT. If you are interested, please see how to join the call here. :)

@wchao1115
Copy link
Contributor Author

Thanks @penpornk for the reminder. I indeed plan to attend this call at 2 today. Someone else at MS has routed this meeting to me due to our current work on getting TensorFlow to run on WSL.

@wchao1115
Copy link
Contributor Author

@penpornk You mentioned toward the end of the SIG call today that there is this conversation on the Windows build topic that I'm part of. That's true, but it seems I do not have the permission to post on that group. Is there any step I need to take to request for the permission?

@penpornk
Copy link
Member

penpornk commented Jun 8, 2021

@wchao1115 I think you'll need to join the group first. If you go to the group page, there should be a button "Join group" after the group name.

@wchao1115
Copy link
Contributor Author

Just wanted to drop a quick update for the community that the official release of TensorFlow-DirectML 1.15.5 is now available at PyPI. More update can be found here.

We're moving right along to enable DirectML support for TF2 code base with the focus on the TensorFlow pluggable device for DirectML that runs on the framework's latest version from the official branch both on Windows and WSL.

@ematejska
Copy link
Contributor

@penpornk What's the status of this RFC?

@bhack
Copy link
Contributor

bhack commented Feb 4, 2022

What is your integration timeline? TensorFlow is transitioning to a new backend based on TFRT and MLIR, which will not be compatible with the current stack. If your goal is to get your integration productized sometime in 2021, we recommend waiting to integrate with TFRT and MLIR to save throw-away efforts.

Can we have a fresh overview on this claim? What Is the status of TF/TFRT integration in 2022?

@penpornk
Copy link
Member

penpornk commented Feb 4, 2022

@ematejska Sorry for the late reply! We are taking a slightly different approach than initially proposed in this RFC, i.e., making a TF-DirectML PluggableDevice plug-in instead of directly adding a new device type DmlDevice in the TF source code. So far there is no need for a design review yet and this PR has mostly been used for communication / giving updates.

Can we have a fresh overview on this claim? What Is the status of TF/TFRT integration in 2022?

@bhack I believe @wchao1115 and his team are working on a TF-DirectML PluggableDevice plug-in. There is no device API device plug-in C API for TFRT yet.

Update history:
Edited "device API" to "device plug-in C API".

@bhack
Copy link
Contributor

bhack commented Feb 4, 2022

Thanks for the update. As in TFRT I see that we have already Nvidia CUDA and AMD HIP driver wrappers at:

https://github.com/tensorflow/runtime/blob/master/backends/gpu/lib/wrapper/driver_wrapper.cc

See more at:
https://github.com/tensorflow/runtime/blob/master/backends/gpu/README.md

@penpornk
Copy link
Member

penpornk commented Feb 4, 2022

Thanks for the update. As in TFRT I see that we have already Nvidia CUDA and AMD HIP driver wrappers at:

https://github.com/tensorflow/runtime/blob/master/backends/gpu/lib/wrapper/driver_wrapper.cc

See more at: https://github.com/tensorflow/runtime/blob/master/backends/gpu/README.md

@bhack Oops. Sorry. I meant TRFT doesn't have a device plug-in C API yet. (Fixed my earlier post as well.)

@bhack
Copy link
Contributor

bhack commented Feb 4, 2022

@penpornk Thanks so I suppose that as with the original point we don't know anything about a future TFRT device plug-in C API and the PluggableDevice design compatibilty and It will be probably required to integrate twice. Right?

@wchao1115
Copy link
Contributor Author

@bhack As what @penpornk has mentioned here. I just wanted to confirm that we have no plan in place as of now to work on TFRT. Our focus at the moment is on getting a DirectML-based plug-in working with TF2 official builds, both on Linux and on Windows.

@penpornk Given that we've taken your feedback on this RFC and pivot towards the pluggable device on TF2 now, shall we move ahead and approve this RFC for the record? This is a long road that we've started a while back, and would love to close on it.

@bhack
Copy link
Contributor

bhack commented Feb 4, 2022

@wchao1115 Thanks it is is clear I see also some (related) pending MS PRs in TF.
My point was just to understand, when TFRT is ready, if it will require a brand new integration effort like this one.

@penpornk
Copy link
Member

penpornk commented Feb 7, 2022

@penpornk Thanks so I suppose that as with the original point we don't know anything about a future TFRT device plug-in C API and the PluggableDevice design compatibilty and It will be probably required to integrate twice. Right?

@bhack Yes. The future MLIR/TFRT stack will likely require a separate integration. If possible, we would like to make it so that it doesn't require too big of a migration effort, but there is no guarantee.

@penpornk Given that we've taken your feedback on this RFC and pivot towards the pluggable device on TF2 now, shall we move ahead and approve this RFC for the record? This is a long road that we've started a while back, and would love to close on it.

@wchao1115 Sorry about that! I originally kept the PR open in case we'd need a redesign, but didn't think of closing it once you decided to go with PluggableDevice. Let me double check with the team if we need to update the RFC content to reflect the new development.

@ematejska
Copy link
Contributor

ematejska commented Feb 8, 2022

@penpornk How about updating the design document with the current state and we can mark as accepted per your recommendation?

@penpornk
Copy link
Member

penpornk commented Feb 8, 2022

@ematejska Sounds great! Thank you, Ewa!
@wchao1115 Could you please help update the RFC file to explicitly mention the plug-in approach?

@boyedarat
Copy link

@wchao1115 can we have an update on PluggableDevice? Your last comment was that it was something your team had been working on for TF 2.4. Since the API for PluggableDevice is official since TF 2.5, is there a preview build of some kind available? Been waiting for DirectML for TF2 for a very long time.
Thanks.

@wchao1115
Copy link
Contributor Author

We are working on it. Should be out some time soon.

@wchao1115
Copy link
Contributor Author

@penpornk Please take a look at the design change update in the latest commit.

Copy link
Member

@penpornk penpornk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the update! And looking forward to the plug-in release! :D

@ematejska ematejska merged commit eff9c82 into tensorflow:master Mar 4, 2022
RFC management automation moved this from Open reviews to Accepted RFCs Mar 4, 2022
@ematejska ematejska added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Mar 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes RFC: Accepted RFC Design Document: Accepted by Review
Projects
RFC management
  
Accepted RFCs
Development

Successfully merging this pull request may close these issues.

None yet