Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kernel launch latency has regressed significantly over time #3619

Closed
bertmaher opened this issue Apr 9, 2024 · 5 comments
Closed

Kernel launch latency has regressed significantly over time #3619

bertmaher opened this issue Apr 9, 2024 · 5 comments

Comments

@bertmaher
Copy link
Collaborator

I've heard some user complaints about launch latency (when not using cudagraphs, of course), so I wrote a simple launch latency benchmark (https://gist.github.com/bertmaher/e8869ebf5297dfc77e26d51037d21f80) and backfilled data over the last several months. It appears that indeed latency has gone from 70us to 350us in that time period.

Full data for this benchmark is here: https://gist.github.com/bertmaher/7912b1735cf5b7c6427ef62cad4f515c

The exact numbers depend on the number and types of arguments passed into the kernel. For the benchmark I chose arg types from a kernel that a user suggested.

@jlebar
Copy link
Contributor

jlebar commented Apr 9, 2024

See #3503

@apgoucher
Copy link
Collaborator

@bertmaher @jlebar Okay, I've dug into it and I think that I can explain large parts of the regression. Consider this nearly-doubling of wall-clock time from a pair of commits:

55bb88744 [CACHE] Adding RuntimeError on signature mismatch with the cached function (#3389)
walltime ms: 0.366
d42ca115c Adding tl.const annotation to mark and validate that const tensors are not being stored to (#3360)
walltime ms: 0.269
72cba380a [AMD] Add amd f8 datatype (#3322)
walltime ms: 0.191

The first one of those commits (adding 78 microseconds) seems innocuous until you notice that it causes every argument to go through _type_of, which currently assembles a dictionary mapping possibly non-canonical type names to canonical type names in the body of the function.

The second commit (adding 97 microseconds) reruns mangled_type on every argument again to produce the signature, even though we've already done that to get part of the cache key. This compounds with the previous _type_of issue because it calls those same functions again.

Both of these issues -- amongst other things, such as the expensive inspect module function calls -- are fixed in #3638 (tests pass, not yet merged)

I'll see if there's anything egregious in the other significant latency-adding commits.

@apgoucher
Copy link
Collaborator

Yes, commit 12f9062 which adds 63 microseconds is (most probably) a result of using inspect instead of exec, so that should be another 63 microseconds reclaimed by PR #3638

apgoucher added a commit that referenced this issue Apr 12, 2024
This improves kernel launch latency by 2.2x (from 108us to 49us using
@bertmaher's benchmarking script in issue
#3619 ). Thanks also to
@liboyue's analysis and suggestions.

See the discussion in the third-party PR
#3503 (comment)
@apgoucher
Copy link
Collaborator

apgoucher commented Apr 13, 2024

As of #3648 the kernel launch latency is now 6x faster than it was two days ago. I'm marking this issue as closed because it's now faster than it's ever been before (although there may still be some minor opportunities for further improvement).

@bertmaher
Copy link
Collaborator Author

@apgoucher awesome! Thank you for the fast optimizations!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants