
[mlir] Implement DestinationStyleOpInterface for scf::ForallOp #66981

Merged (4 commits) on Sep 25, 2023

Conversation

@ubfx (Member) commented Sep 21, 2023

scf::ForallOp has shared_outs tensor operands into which partial results are inserted by the parallel terminator. The scf::ForallOp returns one tensor for each shared_out, which then contains the combined result from all threads. Since the parallel terminator cannot change the shape of the shared_out, ForallOp is a DestinationStyleOp, and this patch implements the interface by declaring the outputs operands as inits in the language of the DPS interface.

For this change to work, we need to add an exception to the pattern that folds tensor.cast ops into DPS ops, because scf::Forall needs special handling of its BlockArgument types during this folding.
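For illustration, here is a minimal hand-written `scf.forall` (a sketch, not taken from the patch; `"some.compute"` is a placeholder for arbitrary per-thread work). Each thread inserts its partial result into the shared tensor through the terminator, and the loop result has exactly the type of the shared_out, which is what makes the op fit the DPS contract:

```mlir
// Sketch only: 4 threads each produce a 16-element slice of a 64-element
// tensor. %init plays the role of the DPS "init"; the result %res has the
// same type, because the terminator may only insert into the shared tensor.
%res = scf.forall (%i) in (%c4) shared_outs(%out = %init) -> (tensor<64xf32>) {
  %off = affine.apply affine_map<(d0) -> (d0 * 16)>(%i)
  %slice = tensor.extract_slice %out[%off] [16] [1]
      : tensor<64xf32> to tensor<16xf32>
  // "some.compute" stands in for the per-thread computation.
  %val = "some.compute"(%slice) : (tensor<16xf32>) -> tensor<16xf32>
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %val into %out[%off] [16] [1]
        : tensor<16xf32> into tensor<64xf32>
  }
}
```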

@llvmbot (Collaborator) commented Sep 21, 2023

@llvm/pr-subscribers-mlir-tensor
@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-scf

Changes


Full diff: https://github.com/llvm/llvm-project/pull/66981.diff

3 Files Affected:

  • (modified) mlir/include/mlir/Dialect/SCF/IR/SCF.h (+1)
  • (modified) mlir/include/mlir/Dialect/SCF/IR/SCFOps.td (+10)
  • (modified) mlir/lib/Dialect/Tensor/IR/TensorOps.cpp (+4)
diff --git a/mlir/include/mlir/Dialect/SCF/IR/SCF.h b/mlir/include/mlir/Dialect/SCF/IR/SCF.h
index 915ab3016b688e7..644118ca884c6b1 100644
--- a/mlir/include/mlir/Dialect/SCF/IR/SCF.h
+++ b/mlir/include/mlir/Dialect/SCF/IR/SCF.h
@@ -19,6 +19,7 @@
 #include "mlir/IR/BuiltinTypes.h"
 #include "mlir/IR/RegionKindInterface.h"
 #include "mlir/Interfaces/ControlFlowInterfaces.h"
+#include "mlir/Interfaces/DestinationStyleOpInterface.h"
 #include "mlir/Interfaces/InferTypeOpInterface.h"
 #include "mlir/Interfaces/LoopLikeInterface.h"
 #include "mlir/Interfaces/ParallelCombiningOpInterface.h"
diff --git a/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td b/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
index 08b71e20a2bc079..adc7b2e4170cb89 100644
--- a/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
+++ b/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
@@ -17,6 +17,7 @@ include "mlir/Interfaces/ControlFlowInterfaces.td"
 include "mlir/Interfaces/LoopLikeInterface.td"
 include "mlir/IR/RegionKindInterface.td"
 include "mlir/Dialect/SCF/IR/DeviceMappingInterface.td"
+include "mlir/Interfaces/DestinationStyleOpInterface.td"
 include "mlir/Interfaces/InferTypeOpInterface.td"
 include "mlir/Interfaces/ParallelCombiningOpInterface.td"
 include "mlir/Interfaces/SideEffectInterfaces.td"
@@ -333,6 +334,7 @@ def ForallOp : SCF_Op<"forall", [
        RecursiveMemoryEffects,
        SingleBlockImplicitTerminator<"scf::InParallelOp">,
        DeclareOpInterfaceMethods<RegionBranchOpInterface>,
+       DestinationStyleOpInterface
      ]> {
   let summary = "evaluate a block multiple times in parallel";
   let description = [{
@@ -630,6 +632,14 @@ def ForallOp : SCF_Op<"forall", [
                                  Location loc);
 
     InParallelOp getTerminator();
+
+    // Implement this to declare all shared_outs as inits/outs to 
+    // DestinationStyleOpInterface
+    std::pair<int64_t, int64_t> getDpsInitsPositionRange() {
+      int64_t numOperands = getNumOperands();
+      int64_t numOuts = getOutputs().size();
+      return {numOperands - numOuts, numOperands};
+    }
   }];
 }
 
diff --git a/mlir/lib/Dialect/Tensor/IR/TensorOps.cpp b/mlir/lib/Dialect/Tensor/IR/TensorOps.cpp
index 3e30e320bee8f83..fa91471f33d4bd3 100644
--- a/mlir/lib/Dialect/Tensor/IR/TensorOps.cpp
+++ b/mlir/lib/Dialect/Tensor/IR/TensorOps.cpp
@@ -10,6 +10,7 @@
 #include "mlir/Dialect/Arith/IR/Arith.h"
 #include "mlir/Dialect/Arith/Utils/Utils.h"
 #include "mlir/Dialect/Complex/IR/Complex.h"
+#include "mlir/Dialect/SCF/IR/SCF.h"
 #include "mlir/Dialect/Tensor/IR/Tensor.h"
 #include "mlir/Dialect/Utils/IndexingUtils.h"
 #include "mlir/Dialect/Utils/ReshapeOpsUtils.h"
@@ -3970,6 +3971,9 @@ struct FoldTensorCastProducerOp
     if (isa<InsertSliceOp>(op.getOperation()))
       return failure();
 
+    if (isa<scf::ForallOp>(op.getOperation()))
+      return failure();
+
     // If no operand comes from a tensor::CastOp and can be folded then fail.
     bool hasTensorCastOperand =
         llvm::any_of(op->getOpOperands(), [&](OpOperand &opOperand) {

Comment on lines 3974 to 3978
// scf::ForallOp also has its own folding logic taking the Types of its
// BlockArguments into consideration.
if (isa<scf::ForallOp>(op.getOperation()))
return failure();

@zero9178 (Member) commented Sep 21, 2023

These lines are problematic because the tensor dialect does not depend on the SCF dialect.
To make matters worse, the SCF dialect currently depends on the tensor dialect, creating a circular dependency between the two.
This causes builds with BUILD_SHARED_LIBS to fail, and any downstream user depending on the tensor dialect now gets linker errors telling them to link SCF as well:

FAILED: lib/libMLIRTensorDialect.so.18git 
ld.lld: error: undefined symbol: mlir::detail::TypeIDResolver<mlir::scf::ForallOp, void>::id
>>> referenced by TensorOps.cpp
>>>               tools/mlir/lib/Dialect/Tensor/IR/CMakeFiles/obj.MLIRTensorDialect.dir/TensorOps.cpp.o:(FoldTensorCastProducerOp::matchAndRewrite(mlir::DestinationStyleOpInterface, mlir::PatternRewriter&) const)
clang-16: error: linker command failed with exit code 1 (use -v to see invocation)

I sadly know very little about the tensor dialect/bufferization/all that stuff, but is this kind of layering really necessary?

Member

Oh right, I missed that. We should avoid such references to other dialects. FoldTensorCastProducerOp should not fold casts into destination style ops that have a region without further checks. At the moment there is no way to query whether such folding is safe or not.

Maybe we can do something with a new interface. We already have CastOpInterface. We could have another FoldCastOpInterface that ops can implement that support folding of cast ops into themselves.
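A rough ODS sketch of such an interface (purely hypothetical: the interface name and method are invented here to illustrate the idea, nothing like this exists in-tree) could look like:

```tablegen
// Hypothetical opt-in interface: the cast-folding pattern could bail out
// unless an op explicitly declares folding of producer casts to be safe.
def FoldCastOpInterface : OpInterface<"FoldCastOpInterface"> {
  let cppNamespace = "::mlir";
  let description = [{
    Ops implementing this interface support folding of producing
    tensor.cast ops into themselves.
  }];
  let methods = [
    InterfaceMethod<
      /*desc=*/"Returns true if tensor.cast producers may be folded into "
               "this operation.",
      /*retTy=*/"bool",
      /*methodName=*/"canFoldCastProducers"
    >
  ];
}
```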

@ubfx (Member Author)

Thank you for pointing that out!

I just tried adding a simple check for op->getNumRegions() != 0, but this doesn't work either because it excludes all of the Linalg ops, which are among the primary targets of this pattern. How about excluding ops that have tensor block arguments, or is this getting too specific?

About the potential interface - is there a way to implement it so that it doesn't add extra implementation effort to every DPS op?

Contributor

I think you can just look for LoopLikeOpInterface here... That adds a dependence from DestinationStyleOpInterface to LoopLikeOpInterface, but we might be able to live with that.

@matthias-springer (Member) left a comment

Layering issues (scf.forall is not a tensor dialect op)

@MaheshRavishankar (Contributor) commented

I am in favor of this change, but given the dependence issue can you comment a bit more on the motivation for this change?

@ubfx (Member Author) commented Sep 21, 2023

given the dependence issue can you comment a bit more on the motivation for this change?

My original idea was to implement ValueBoundsOpInterface for scf::ForallOp because the result shapes are coupled to the shared_outs shapes which offers a nice opportunity to propagate shape constraints through loop nests. Matthias mentioned that he had recently implemented shape constraint propagation through DPS ops (#65736) and so implementing the DPS interface for scf::ForallOp would also take care of that.

Additionally, I think forall matches the DPS contract very well - and the language shared_outs already matches the DPS language (outs), so it would be consistent to make it an actual DPS op. But I am unsure whether it would lead to any other improvements regarding bufferization down the line.

One caveat about the ValueBoundsOpInterface: Since the ForallOp also has iteration variables which might get queried through the interface, we might still need an additional implementation of that interface, even if this change goes through.


// Implement this to declare all shared_outs as inits/outs to
// DestinationStyleOpInterface
std::pair<int64_t, int64_t> getDpsInitsPositionRange() {
Member

This has changed in #67015.

@ubfx (Member Author)

Thanks, I updated that part.

@ubfx ubfx force-pushed the forall-dpsinterface branch 2 times, most recently from 95e9841 to 6420451 Compare September 25, 2023 06:31
`scf::ForallOp` has `shared_outs` tensor operands into which partial
results are inserted by the parallel terminator. The `scf::ForallOp`
returns one tensor for each `shared_out`, which then contains the
combined result from all threads. Since the parallel terminator cannot
change the shape of the `shared_out`, ForallOp is a
`DestinationStyleOp` and this patch implements the interface by declaring
the `outputs` operands as `inits` in the language of the DPS interface.

For this change to work, we need to add an exception to the pattern
that folds `tensor.cast` ops into DPS ops, because `scf::Forall` needs
special handling of its `BlockArgument` types during this folding.
@ubfx ubfx merged commit 24e33b5 into llvm:main Sep 25, 2023
3 checks passed
@ubfx ubfx deleted the forall-dpsinterface branch September 26, 2023 18:57