WIP: Add tensor descriptor API backed by device-side TMA creation #4916

Open

wants to merge 9 commits into main
Conversation

Contributor

@peterbell10 peterbell10 commented Oct 15, 2024

Commits in this PR

  1. WIP: Add tensor descriptor API backed by device-side TMA creation
  2. Fix lit failure
  3. More fixes
  4. Fix aot compile
  5. Fix hip compile
  6. Support allocating memory in noinline functions
  7. Move GlobalScratchAllocOp to TritonGPUIR
  8. Add descriptor lowering lit test
  9. Add note explaining the calling convention

PR chain

  1. 👉 WIP: Add tensor descriptor API backed by device-side TMA creation #4916 👈 YOU ARE HERE
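
For context, here is a rough sketch of the user-facing flow this PR is building toward: the kernel creates the TMA descriptor on the device, and the runtime obtains the global scratch memory backing descriptor creation from a user-registered allocation function. This is only a sketch; the spellings `triton.set_allocator` and `tl.make_tensor_descriptor` follow later Triton releases and may differ from the exact API in this WIP.

```python
from typing import Optional

import torch
import triton
import triton.language as tl

# User-provided allocator for global scratch memory (signature as in this PR's
# tests). The launcher computes `size`; we just return a buffer of that size.
def alloc_fn(size: int, align: int, stream: Optional[int]):
    return torch.empty(size, dtype=torch.int8, device="cuda")

triton.set_allocator(alloc_fn)  # registration hook; name assumed from later releases

@triton.jit
def tile_copy_kernel(in_ptr, out_ptr, M, N,
                     M_BLOCK: tl.constexpr, N_BLOCK: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # Device-side TMA descriptor creation (name assumed; lowered from this
    # PR's MakeTensorDescOp).
    in_desc = tl.make_tensor_descriptor(in_ptr, shape=[M, N], strides=[N, 1],
                                        block_shape=[M_BLOCK, N_BLOCK])
    out_desc = tl.make_tensor_descriptor(out_ptr, shape=[M, N], strides=[N, 1],
                                         block_shape=[M_BLOCK, N_BLOCK])
    block = in_desc.load([pid_m * M_BLOCK, pid_n * N_BLOCK])
    out_desc.store([pid_m * M_BLOCK, pid_n * N_BLOCK], block)
```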

@peterbell10 peterbell10 force-pushed the pb/pr-chain/wip_add_tensor_descriptor_api_backed_by__a0ac branch from 6d6fbfa to bdcbf91 on October 15, 2024 17:28
Base automatically changed from pb/pr-chain/frontend_factor_out_block_shape_validati_9ccb to main on October 15, 2024 21:38
@peterbell10 peterbell10 force-pushed the pb/pr-chain/wip_add_tensor_descriptor_api_backed_by__a0ac branch from bdcbf91 to d0cd5c2 on October 15, 2024 21:44
@peterbell10 peterbell10 force-pushed the pb/pr-chain/wip_add_tensor_descriptor_api_backed_by__a0ac branch from 53f5d50 to cae0fdf on October 17, 2024 14:49
Collaborator

@ThomasRaoux ThomasRaoux left a comment


Looks good, a few minor comments.

@@ -77,4 +77,18 @@ def TritonNvidiaGPUTMALoweringPass : Pass<"triton-nvidia-tma-lowering", "mlir::M
];
}

def TritonNvidiaGPUGlobalScratchAllocationPass : Pass<"triton-tensor-memory-allocation", "mlir::ModuleOp"> {
Collaborator


Why is this NVIDIA-specific? It sounds like we could make it a generic pass.

callOp->getLoc(), rewriter, targetInfo, callOp));

auto opOffsetAttr = caller->getAttrOfType<mlir::IntegerAttr>(
"triton_nvidia_gpu.global_scratch_memory_offset");
Collaborator


It's a bit odd to have this in the triton_nvidia_gpu namespace, especially since this is shared code.

@@ -101,6 +102,102 @@ class TMAStoreLowering
}
};

class TMACreateDescLowering : public OpRewritePattern<MakeTensorDescOp> {
Collaborator


Can you add a lit test for this? It's a nice way to have an example of the IR.

@@ -368,7 +369,7 @@ inline bool isKernel(FunctionOpInterface funcOp) {
inline Value getStackPointer(RewriterBase &rewriter,
FunctionOpInterface funcOp) {
if (!isKernel(funcOp)) {
return funcOp.getArgument(funcOp.getNumArguments() - 1);
return funcOp.getArgument(funcOp.getNumArguments() - 2);
Contributor


We may need to document this more explicitly: what is funcOp.getNumArguments() - 2, and what is funcOp.getNumArguments() - 1?

idx = tl.arange(0, M_BLOCK)[:, None] * N_BLOCK + tl.arange(0, N_BLOCK)[None, :]
tl.store(out_ptr + idx, block)

def alloc_fn(size: int, align: int, stream: Optional[int]):
Contributor


So the user must know how to figure out the size of the scratch buffer. Can you explain how this is compatible with our autotuner?

Contributor Author

@peterbell10 peterbell10 Oct 22, 2024


I'm not sure I understand your question. The user-provided allocation function is called with the correct size as computed by the launcher code. During autotuning, the allocation function will be called multiple times with different sizes to allocate new memory.
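
As a minimal sketch of that point (assuming the allocator is registered through a hook such as triton.set_allocator; the exact hook in this WIP may differ), the allocation function only sees the size the launcher computed and needs no knowledge of the autotuner's configs:

```python
from typing import Optional

import torch

requested_sizes = []  # purely to observe what the launcher asks for

def alloc_fn(size: int, align: int, stream: Optional[int]):
    # `size` is the total scratch size computed by the launcher for this
    # particular launch; the allocator does not inspect any triton.Config.
    requested_sizes.append(size)
    return torch.empty(size, dtype=torch.int8, device="cuda")
```

During autotuning, `requested_sizes` simply ends up with one entry per candidate launch, each recomputed by the launcher for that config.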

Contributor


My question is more about how we implement alloc_fn when using the autotuner, since the block sizes change based on the configurations passed via triton.Config.

Contributor Author

@peterbell10 peterbell10 Oct 22, 2024


The size passed to the allocation function is the total allocation size, so the allocation function doesn't need to know anything about the config.

See, for example, the next test down, where there is a larger grid and so I assert:

        assert size == 128 * (grid_m * grid_n)

Contributor


Just to make sure I understand correctly: if the maximum grid is 100x100, will we allocate 100x100x128 anyway for all configurations?

Contributor Author

@peterbell10 peterbell10 Oct 22, 2024


No. Say one config gives a 100x100 grid and another gives 200x100; then your allocation function will be called several times with size = 100*100*128 and several times with size = 200*100*128.
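
To make the arithmetic concrete (using the 128 bytes of scratch per program asserted in the test above):

```python
SCRATCH_PER_PROGRAM = 128  # bytes, matching `assert size == 128 * (grid_m * grid_n)` above

def expected_alloc_size(grid_m: int, grid_n: int) -> int:
    # Total allocation = per-program scratch * number of programs in the grid.
    return SCRATCH_PER_PROGRAM * grid_m * grid_n

assert expected_alloc_size(100, 100) == 100 * 100 * 128  # first config
assert expected_alloc_size(200, 100) == 200 * 100 * 128  # second config
```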

Contributor


I see. Makes sense

@peterbell10 peterbell10 force-pushed the pb/pr-chain/wip_add_tensor_descriptor_api_backed_by__a0ac branch from 57e7f2d to 214f259 on October 22, 2024 16:33
@peterbell10 peterbell10 force-pushed the pb/pr-chain/wip_add_tensor_descriptor_api_backed_by__a0ac branch from 214f259 to 837308f on October 22, 2024 18:18