Kernel launch latency has regressed significantly over time #3619

bertmaher · 2024-04-09T16:52:02Z

I've heard some user complaints about launch latency (when not using cudagraphs, of course), so I wrote a simple launch latency benchmark (https://gist.github.com/bertmaher/e8869ebf5297dfc77e26d51037d21f80) and backfilled data over the last several months. It appears that indeed latency has gone from 70us to 350us in that time period.

Full data for this benchmark is here: https://gist.github.com/bertmaher/7912b1735cf5b7c6427ef62cad4f515c

The exact numbers depend on the number and types of arguments passed into the kernel. For the benchmark I chose arg types from a kernel that a user suggested.
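
For readers who cannot open the gist, here is a minimal sketch of how host-side per-call overhead can be measured. It is not the linked benchmark: `measure_latency_us` and `fake_launch` are hypothetical names, and `fake_launch` is only a stand-in for a kernel launch. A real measurement would call a Triton kernel with the argument list of interest.

```python
import time


def measure_latency_us(fn, warmup=100, iters=1000):
    """Return the mean wall-clock time per call of fn, in microseconds."""
    for _ in range(warmup):  # warm up so one-time work is excluded
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1e6


def fake_launch(*args):
    """Hypothetical stand-in for a kernel launch: inspects each argument's type."""
    return tuple(type(a).__name__ for a in args)


if __name__ == "__main__":
    args = (1, 2.0, "x", None, True)
    latency = measure_latency_us(lambda: fake_launch(*args))
    print(f"{latency:.2f} us per call")
```

Because the cost scales with the number and types of arguments, the argument tuple should mirror the kernel being investigated.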

jlebar · 2024-04-09T16:56:54Z

See #3503

apgoucher · 2024-04-12T12:55:23Z

@bertmaher @jlebar Okay, I've dug into it and I think that I can explain large parts of the regression. Consider this near-doubling of wall-clock time (0.191 ms to 0.366 ms) from a pair of commits, listed newest first:

55bb88744 [CACHE] Adding RuntimeError on signature mismatch with the cached function (#3389)
walltime ms: 0.366
d42ca115c Adding tl.const annotation to mark and validate that const tensors are not being stored to (#3360)
walltime ms: 0.269
72cba380a [AMD] Add amd f8 datatype (#3322)
walltime ms: 0.191
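
The per-commit costs quoted in the next two paragraphs are the differences between consecutive measurements. A small sketch of that arithmetic, assuming the listing above is ordered newest first (`deltas_us` is a hypothetical helper name):

```python
# Wall-clock times copied from the listing above, reordered oldest first.
walltime_ms = [
    ("72cba380a", 0.191),
    ("d42ca115c", 0.269),
    ("55bb88744", 0.366),
]


def deltas_us(series):
    """Return (commit, added microseconds) for each commit after the first."""
    out = []
    for (_, prev), (commit, cur) in zip(series, series[1:]):
        out.append((commit, round((cur - prev) * 1000)))
    return out


print(deltas_us(walltime_ms))  # [('d42ca115c', 78), ('55bb88744', 97)]
```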

The earlier of those two commits, d42ca115c (adding 78 microseconds), seems innocuous until you notice that it causes every argument to go through _type_of, which currently rebuilds a dictionary mapping possibly non-canonical type names to canonical type names inside the function body on every call.
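
To illustrate the cost pattern, here is a sketch of building a lookup table inside the function body versus once at module level. This is not Triton's actual code: `type_of_slow`, `type_of_fast`, and `_CANONICAL` are hypothetical names, and the table entries are illustrative.

```python
import timeit


def type_of_slow(name):
    # The table is rebuilt on every call, which is the pattern described above.
    canonical = {
        "float16": "fp16",
        "bfloat16": "bf16",
        "float32": "fp32",
        "float64": "fp64",
        "int32": "i32",
        "int64": "i64",
    }
    return canonical.get(name, name)


# Built once at import time and shared by every call.
_CANONICAL = {
    "float16": "fp16",
    "bfloat16": "bf16",
    "float32": "fp32",
    "float64": "fp64",
    "int32": "i32",
    "int64": "i64",
}


def type_of_fast(name):
    return _CANONICAL.get(name, name)


if __name__ == "__main__":
    n = 100_000
    for fn in (type_of_slow, type_of_fast):
        total = timeit.timeit(lambda: fn("float32"), number=n)
        print(f"{fn.__name__}: {total / n * 1e6:.3f} us per call")
```

The per-call saving is small, but it is paid once per argument on every launch, so it adds up for kernels with many arguments.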

The later commit, 55bb88744 (adding 97 microseconds), reruns mangled_type on every argument to produce the signature, even though we've already done that to compute part of the cache key. This compounds with the _type_of issue above because it calls those same functions again.
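
A sketch of the duplicated work and one way to avoid it, under the assumption that the cache key and the signature can both be derived from the same per-argument type strings. `fake_mangled_type`, `launch_twice`, and `launch_once` are hypothetical names, not Triton functions. The call counter exists only to make the difference visible.

```python
def fake_mangled_type(arg):
    """Hypothetical stand-in for an expensive per-argument type computation."""
    fake_mangled_type.calls += 1
    return type(arg).__name__


fake_mangled_type.calls = 0


def launch_twice(args):
    # Pattern described above: one pass for the cache key, another for the signature.
    key = tuple(fake_mangled_type(a) for a in args)
    signature = {i: fake_mangled_type(a) for i, a in enumerate(args)}
    return key, signature


def launch_once(args):
    # Compute each argument's type once and derive both results from it.
    types = [fake_mangled_type(a) for a in args]
    return tuple(types), dict(enumerate(types))


if __name__ == "__main__":
    args = (1, 2.0, "x")
    for fn in (launch_twice, launch_once):
        fake_mangled_type.calls = 0
        fn(args)
        print(f"{fn.__name__}: {fake_mangled_type.calls} type computations")
```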

Both of these issues -- amongst other things, such as the expensive inspect module function calls -- are fixed in #3638 (tests pass, not yet merged).

I'll see if there's anything egregious in the other significant latency-adding commits.

apgoucher · 2024-04-12T13:02:20Z

Yes, commit 12f9062, which adds 63 microseconds, is (most probably) slower because it uses inspect instead of exec, so that should be another 63 microseconds reclaimed by PR #3638.

@bertmaher

This improves kernel launch latency by 2.2x (from 108us to 49us using @bertmaher's benchmarking script in issue #3619). Thanks also to @liboyue's analysis and suggestions. See the discussion in the third-party PR #3503 (comment).

apgoucher · 2024-04-13T09:37:12Z

As of #3648 the kernel launch latency is now 6x faster than it was two days ago. I'm marking this issue as closed because it's now faster than it's ever been before (although there may still be some minor opportunities for further improvement).

bertmaher · 2024-04-14T14:26:27Z

@apgoucher awesome! Thank you for the fast optimizations!

apgoucher mentioned this issue Apr 12, 2024

attempt to substantially reduce kernel launch overhead #3638

Merged

apgoucher closed this as completed Apr 13, 2024
