
[mlir] Implement DestinationStyleOpInterface for scf::ForallOp #66981

Merged (4 commits) on Sep 25, 2023

Conversation

@ubfx (Member) commented Sep 21, 2023

scf::ForallOp has shared_outs tensor operands into which partial results are inserted by the parallel terminator. The scf::ForallOp returns one tensor for each shared_out, which then contains the combined result from all threads. Since the parallel terminator cannot change the shape of the shared_out, ForallOp is a DestinationStyleOp, and this patch implements the interface by declaring the outputs operands as inits in the language of the DPS interface.

For this change to work, we need to add an exception to the pattern that folds tensor.cast ops into DPS ops, because scf::Forall needs special handling of its BlockArgument types during this folding.
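For illustration, here is a minimal hand-written `scf.forall` (a sketch, not taken from the patch; `"some.compute"` is a placeholder for arbitrary per-thread work). Each thread inserts its partial result into the shared tensor through the terminator, and the loop result has exactly the type of the shared_out, which is what makes the op fit the DPS contract:

```mlir
// Sketch only: 4 threads each produce a 16-element slice of a 64-element
// tensor. %init plays the role of the DPS "init"; the result %res has the
// same type, because the terminator may only insert into the shared tensor.
%res = scf.forall (%i) in (%c4) shared_outs(%out = %init) -> (tensor<64xf32>) {
  %off = affine.apply affine_map<(d0) -> (d0 * 16)>(%i)
  %slice = tensor.extract_slice %out[%off] [16] [1]
      : tensor<64xf32> to tensor<16xf32>
  // "some.compute" stands in for the per-thread computation.
  %val = "some.compute"(%slice) : (tensor<16xf32>) -> tensor<16xf32>
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %val into %out[%off] [16] [1]
        : tensor<16xf32> into tensor<64xf32>
  }
}
```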

@llvmbot (Collaborator) commented Sep 21, 2023

@llvm/pr-subscribers-mlir-tensor
@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-scf

Changes


Full diff: https://github.com/llvm/llvm-project/pull/66981.diff

3 Files Affected:

  • (modified) mlir/include/mlir/Dialect/SCF/IR/SCF.h (+1)
  • (modified) mlir/include/mlir/Dialect/SCF/IR/SCFOps.td (+10)
  • (modified) mlir/lib/Dialect/Tensor/IR/TensorOps.cpp (+4)
diff --git a/mlir/include/mlir/Dialect/SCF/IR/SCF.h b/mlir/include/mlir/Dialect/SCF/IR/SCF.h
index 915ab3016b688e7..644118ca884c6b1 100644
--- a/mlir/include/mlir/Dialect/SCF/IR/SCF.h
+++ b/mlir/include/mlir/Dialect/SCF/IR/SCF.h
@@ -19,6 +19,7 @@
 #include "mlir/IR/BuiltinTypes.h"
 #include "mlir/IR/RegionKindInterface.h"
 #include "mlir/Interfaces/ControlFlowInterfaces.h"
+#include "mlir/Interfaces/DestinationStyleOpInterface.h"
 #include "mlir/Interfaces/InferTypeOpInterface.h"
 #include "mlir/Interfaces/LoopLikeInterface.h"
 #include "mlir/Interfaces/ParallelCombiningOpInterface.h"
diff --git a/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td b/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
index 08b71e20a2bc079..adc7b2e4170cb89 100644
--- a/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
+++ b/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
@@ -17,6 +17,7 @@ include "mlir/Interfaces/ControlFlowInterfaces.td"
 include "mlir/Interfaces/LoopLikeInterface.td"
 include "mlir/IR/RegionKindInterface.td"
 include "mlir/Dialect/SCF/IR/DeviceMappingInterface.td"
+include "mlir/Interfaces/DestinationStyleOpInterface.td"
 include "mlir/Interfaces/InferTypeOpInterface.td"
 include "mlir/Interfaces/ParallelCombiningOpInterface.td"
 include "mlir/Interfaces/SideEffectInterfaces.td"
@@ -333,6 +334,7 @@ def ForallOp : SCF_Op<"forall", [
        RecursiveMemoryEffects,
        SingleBlockImplicitTerminator<"scf::InParallelOp">,
        DeclareOpInterfaceMethods<RegionBranchOpInterface>,
+       DestinationStyleOpInterface
      ]> {
   let summary = "evaluate a block multiple times in parallel";
   let description = [{
@@ -630,6 +632,14 @@ def ForallOp : SCF_Op<"forall", [
                                  Location loc);
 
     InParallelOp getTerminator();
+
+    // Implement this to declare all shared_outs as inits/outs to 
+    // DestinationStyleOpInterface
+    std::pair<int64_t, int64_t> getDpsInitsPositionRange() {
+      int64_t numOperands = getNumOperands();
+      int64_t numOuts = getOutputs().size();
+      return {numOperands - numOuts, numOperands};
+    }
   }];
 }
 
diff --git a/mlir/lib/Dialect/Tensor/IR/TensorOps.cpp b/mlir/lib/Dialect/Tensor/IR/TensorOps.cpp
index 3e30e320bee8f83..fa91471f33d4bd3 100644
--- a/mlir/lib/Dialect/Tensor/IR/TensorOps.cpp
+++ b/mlir/lib/Dialect/Tensor/IR/TensorOps.cpp
@@ -10,6 +10,7 @@
 #include "mlir/Dialect/Arith/IR/Arith.h"
 #include "mlir/Dialect/Arith/Utils/Utils.h"
 #include "mlir/Dialect/Complex/IR/Complex.h"
+#include "mlir/Dialect/SCF/IR/SCF.h"
 #include "mlir/Dialect/Tensor/IR/Tensor.h"
 #include "mlir/Dialect/Utils/IndexingUtils.h"
 #include "mlir/Dialect/Utils/ReshapeOpsUtils.h"
@@ -3970,6 +3971,9 @@ struct FoldTensorCastProducerOp
     if (isa<InsertSliceOp>(op.getOperation()))
       return failure();
 
+    if (isa<scf::ForallOp>(op.getOperation()))
+      return failure();
+
     // If no operand comes from a tensor::CastOp and can be folded then fail.
     bool hasTensorCastOperand =
         llvm::any_of(op->getOpOperands(), [&](OpOperand &opOperand) {

Comment on lines 3974 to 3978
// scf::ForallOp also has its own folding logic taking the Types of its
// BlockArguments into consideration.
if (isa<scf::ForallOp>(op.getOperation()))
return failure();

@zero9178 (Member) commented Sep 21, 2023

These lines are problematic because the tensor dialect does not depend on the SCF dialect.
To make matters worse, the SCF dialect currently depends on the tensor dialect, creating a circular dependency between the two.
This causes builds with BUILD_SHARED_LIBS to fail, and any downstream user depending on the tensor dialect now gets linker errors telling them to link SCF as well:

FAILED: lib/libMLIRTensorDialect.so.18git 
ld.lld: error: undefined symbol: mlir::detail::TypeIDResolver<mlir::scf::ForallOp, void>::id
>>> referenced by TensorOps.cpp
>>>               tools/mlir/lib/Dialect/Tensor/IR/CMakeFiles/obj.MLIRTensorDialect.dir/TensorOps.cpp.o:(FoldTensorCastProducerOp::matchAndRewrite(mlir::DestinationStyleOpInterface, mlir::PatternRewriter&) const)
clang-16: error: linker command failed with exit code 1 (use -v to see invocation)

I sadly know very little about the tensor dialect/bufferization/all that stuff, but is this kind of layering really necessary?

Member

Oh right, I missed that. We should avoid such references to other dialects. FoldTensorCastProducerOp should not fold casts into destination style ops that have a region without further checks. At the moment there is no way to query whether such folding is safe or not.

Maybe we can do something with a new interface. We already have CastOpInterface. We could have another FoldCastOpInterface that ops can implement that support folding of cast ops into themselves.
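A rough ODS sketch of such an interface (purely hypothetical: the interface name and method are invented here to illustrate the idea, nothing like this exists in-tree) could look like:

```tablegen
// Hypothetical opt-in interface: the cast-folding pattern could bail out
// unless an op explicitly declares folding of producer casts to be safe.
def FoldCastOpInterface : OpInterface<"FoldCastOpInterface"> {
  let cppNamespace = "::mlir";
  let description = [{
    Ops implementing this interface support folding of producing
    tensor.cast ops into themselves.
  }];
  let methods = [
    InterfaceMethod<
      /*desc=*/"Returns true if tensor.cast producers may be folded into "
               "this operation.",
      /*retTy=*/"bool",
      /*methodName=*/"canFoldCastProducers"
    >
  ];
}
```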

@ubfx (Member Author)

Thank you for pointing that out!

I just tried adding a simple check for op->getNumRegions() != 0, but this doesn't work either because it excludes all of the Linalg ops, which are among the primary targets of this pattern. How about excluding ops that have tensor block arguments, or is this getting too specific?

About the potential interface - is there a way to implement it so that it doesn't add extra implementation effort to every DPS op?

Contributor

I think you can just look for LoopLikeOpInterface here... That adds a dependence from DestinationStyleOpInterface to LoopLikeOpInterface, but we might be able to live with that.

@matthias-springer (Member) left a comment

Layering issues (scf.forall is not a tensor dialect op)

@MaheshRavishankar (Contributor) commented

I am in favor of this change, but given the dependence issue can you comment a bit more on the motivation for this change?

@ubfx (Member Author) commented Sep 21, 2023

given the dependence issue can you comment a bit more on the motivation for this change?

My original idea was to implement ValueBoundsOpInterface for scf::ForallOp because the result shapes are coupled to the shared_outs shapes which offers a nice opportunity to propagate shape constraints through loop nests. Matthias mentioned that he had recently implemented shape constraint propagation through DPS ops (#65736) and so implementing the DPS interface for scf::ForallOp would also take care of that.

Additionally, I think forall matches the DPS contract very well - and the language shared_outs already matches the DPS language (outs), so it would be consistent to make it an actual DPS op. But I am unsure whether it would lead to any other improvements regarding bufferization down the line.

One caveat about the ValueBoundsOpInterface: Since the ForallOp also has iteration variables which might get queried through the interface, we might still need an additional implementation of that interface, even if this change goes through.


// Implement this to declare all shared_outs as inits/outs to
// DestinationStyleOpInterface
std::pair<int64_t, int64_t> getDpsInitsPositionRange() {
Member

This has changed in #67015.

@ubfx (Member Author)

Thanks, I updated that part.

@ubfx ubfx force-pushed the forall-dpsinterface branch 2 times, most recently from 95e9841 to 6420451 Compare September 25, 2023 06:31
`scf::ForallOp` has `shared_outs` tensor operands into which partial
results are inserted by the parallel terminator. The `scf::ForallOp`
returns one tensor for each `shared_out`, which then contains the
combined result from all threads. Since the parallel terminator cannot
change the shape of the `shared_out`, ForallOp is a
`DestinationStyleOp` and this patch implements the interface by declaring
the `outputs` operands as `inits` in the language of the DPS interface.

For this change to work, we need to add an exception to the pattern
that folds `tensor.cast` ops into DPS ops, because `scf::Forall` needs
special handling of its `BlockArgument` types during this folding.
@ubfx ubfx merged commit 24e33b5 into llvm:main Sep 25, 2023
3 checks passed
@ubfx ubfx deleted the forall-dpsinterface branch September 26, 2023 18:57