Add fusion_debug dump option with Val log #2326

Draft · wants to merge 7 commits into base: devel

Conversation

jacobhinkle (Collaborator):

This PR adds a new dump option, fusion_debug, which prints all the Vals and Exprs in a Fusion during compileFusion, along with an ordered list of the operations performed on those Vals. For example:

  Logged operations:                                                                                                                      
    0) T2_g : TensorView::reorder(T2_g):                                                                                                  
    1) [ iblockIdx.x4, rS8, rthreadIdx.x7, rUS9 ] : TensorDomain::reorder([ iS4, rS5 ]):   1->0                                           
    2) T2_g : TensorView::reorder(T2_g):                                                                                                  
    3) [ iblockIdx.x4, rS8, rthreadIdx.x7, rUS9 ] : TensorDomain::reorder([ rS5, iS4 ]):   1->0                                           
    4) T2_g : TensorView::split(T2_g): 1, blockDim.x, 1, 0                                                                                
    5) [ iblockIdx.x4, rS8, rthreadIdx.x7, rUS9 ] : TensorDomain::split([ iS4, rS5 ]): 1, blockDim.x, 1, 0                                
    6) rS5 : IterDomain::split(rS5): blockDim.x, 1, 0                                                                                     
    7) rthreadIdx.x7 : IterDomain::parallelize(rS7): threadIdx.x                                                                          
    8) T2_g : TensorView::split(T2_g): 1, 1, 1, 0                                                                                         
    9) [ iblockIdx.x4, rS8, rthreadIdx.x7, rUS9 ] : TensorDomain::split([ iS4, rS6, rthreadIdx.x7 ]): 1, 1, 1, 0                          
    10) rS6 : IterDomain::split(rS6): 1, 1, 0                                                                                             
    11) rUS9 : IterDomain::parallelize(rS9): US                                                                                           
    12) iblockIdx.x4 : IterDomain::parallelize(iS4): blockIdx.x                                                                           
    13) T2_g : TensorView::reorder(T2_g):                                                                                                 
    14) [ iblockIdx.x4, rS8, rthreadIdx.x7, rUS9 ] : TensorDomain::reorder([ iblockIdx.x4, rS8, rUS9, rthreadIdx.x7 ]):   3->2  2->3  1->1  0->0
    15) T2_g : TensorView::rFactor(T2_g):  1 3                                                                                            
    16) ithreadIdx.x13 : IterDomain::parallelize(iS13): threadIdx.x                                                                       
    17) rUS15 : IterDomain::parallelize(rS15): US                                                                                         
    18) iblockIdx.x16 : IterDomain::parallelize(iS16): blockIdx.x                                                                         
    19) rthreadIdx.x17 : IterDomain::parallelize(rS17): threadIdx.x                                                                       
    20) iblockIdx.x2 : IterDomain::parallelize(iS2): blockIdx.x                                                                           
    21) ithreadIdx.x19 : IterDomain::parallelize(iS19): threadIdx.x                                                                       
    22) iUS21 : IterDomain::parallelize(iS21): US                                                                                         
    23) T0_g : TensorView::inlineAt(T0_g): -1, 1                                                                                          
    24) T1_l : TensorView::inlineAt(T1_l): -1, 1                                                                                          
    25) T3_l : TensorView::inlineAt(T3_l): -1, 1                                                                                          
    26) T2_g : TensorView::inlineAt(T2_g): -1, 1 

Notice that the names of Vals in this log are abbreviated: only the names are printed, not the other information that toString() typically emits. To accomplish that, toString now takes a fmt argument whose type SerializationFormat can be one of Default (current behavior), NameOnly, or Debug. The Debug option prints slightly more than Default: for a TensorView, for example, it also prints contiguity_. The SerializationFormat enum could be extended in the future to include formats like JSON, but I haven't explored that much yet.
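
Roughly, the new pieces have this shape (a sketch based on the description above; the exact declarations are in the diff and may differ):

  // Sketch only; see the actual diff for the real declarations.
  enum class SerializationFormat {
    Default,  // existing toString() behavior
    NameOnly, // just the Val/Expr name, e.g. "T2_g" or "iS4"
    Debug,    // Default plus extra internals, e.g. a TensorView's contiguity_
  };

  // toString() gains a format argument that defaults to the current behavior,
  // so existing call sites are unaffected.
  std::string toString(
      int indent_size = 0,
      SerializationFormat fmt = SerializationFormat::Default) const;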

This enables setting the print format, including printing Fusions to
arbitrary ostreams. Currently, there are two options for
`SerializationFormat`: Default and Debug. The environment variable
`PYTHON_NVFUSER_PRINT_FORMAT` is checked whenever a Fusion is printed
(not when you print other individual `Statement`s). Currently the
information printed is somewhat sparse, but I plan to add more detailed
information to help track things like the mismatch in variable name
indices between manual and automatically scheduled tests.

NameOnly is meant to become a shorter printout, but currently not all
Statements have names that I can print. This commit introduces a sorted
`fusion_debug` dump option that enables diffing.

I think we may want this to be a separate dump option since it is a
pretty concise look at what happened, without the noise that comes with
fusion_debug. I'm not currently sure what utility fusion_debug has other
than this log, but that may be because I haven't had to dive into the
internals of the Vals and Exprs yet for a real problem.

This adds the VAL_LOG and VAL_LOG_EXPLICIT macros; the latter is to be
used when not logging to `this`. I made these macros so that we could
avoid computing the arguments to log() in Release builds, but currently
that check is disabled since I'm not sure we'd want to force a full
recompile of pytorch in order to dump this info.
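
For reference, the two macros are roughly of this shape (reconstructed from the call sites shown in the review below; not an exact copy of the diff):

  // VAL_LOG logs to `this`; VAL_LOG_EXPLICIT logs to an explicitly named Val.
  // Wrapping these in an NDEBUG check is what would skip evaluating the
  // arguments in Release builds, but as noted above that guard is currently
  // disabled.
  #define VAL_LOG(op_name, ...) log((op_name), {__VA_ARGS__})
  #define VAL_LOG_EXPLICIT(val, op_name, ...) (val)->log((op_name), {__VA_ARGS__})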
@github-actions github-actions bot added the NNC label Jan 12, 2023
@jacobhinkle jacobhinkle changed the base branch from master to devel January 12, 2023 21:00
@jacobhinkle (Collaborator, Author):

This PR was motivated by cases like the following. Consider this basic fusion:

auto tv0 = makeSymbolicTensor(2); // {i0, i1}
auto c0 = IrBuilder::create<Double>(3.0);
fusion.addInput(tv0);
auto tv1 = mul(tv0, c0); // {i0, i1}
auto tv2 = sum(tv1, {-1}, false, DataType::Float); // {i0, r1}
fusion.addOutput(tv2);

Manual and automatic scheduling give identical fusion_ir printouts, but the generated kernels differ slightly:
[image: side-by-side comparison of the generated kernels from the manual and automatic schedules]
These kernels are clearly equivalent, but there are small differences, possibly in the ordering of operations. When we dump fusion_debug for both and diff the results, we see the following operation log (green is the automatic schedule and red is the manual one):
[image: diff of the fusion_debug operation logs, manual vs. automatic]
Now we see a lot of differences. There are other differences in the detailed dump that precedes the op log, indicating that some intermediate tensors (that are not shown at all in the fusion math or transforms view) may be parallelized differently, e.g.
[image: diff of the detailed Val dump showing intermediate tensors parallelized differently]

I currently print a warning in `Fusion::printDebug` when running a
Release build.
@jacobhinkle (Collaborator, Author):

With the printout above of the Fusion operation log, I was able to exactly match the automatic scheduler for this problem, and learned a few things about how the automatic scheduler works. Here is what my previous manual schedule looked like:

  // Perform manual scheduling
  // Split the reduction axis by blockDim.x, then by 1 (for unswitch),
  // and swap the last two axes
  tv2->split(1, NamedScalar::getParallelDim(ParallelType::TIDx));
  tv2->split(1, 1);
  tv2->reorder({{-1, -2}, {-2, -1}});

  // rFactor axes 1 and 3 into a separate partial-reduction tensor
  auto tv3 = tv2->rFactor({1, 3});

  tv3->axis(0)->parallelize(ParallelType::BIDx);
  tv3->axis(2)->parallelize(ParallelType::TIDx);
  tv3->axis(3)->parallelize(ParallelType::Unswitch);

  // propagate the mapping to other tensors
  TransformPropagatorWithCheck propagator(tv3);
  MaxRootDomainInfoSpanningTree(tv3).traverse(&propagator);
  scheduler_utils::parallelizeAllLike(tv3, {tv0, tv1, tv2});

  tv1->computeAt(tv3, -1);

  inlineMost();

And here is the one that matches the auto scheduler:

  // Perform manual scheduling
  tv2->reorder({{1, 0}});
  tv2->reorder({{1, 0}});
  tv2->split(1, NamedScalar::getParallelDim(ParallelType::TIDx));
  tv2->axis(2)->parallelize(ParallelType::TIDx);
  tv2->split(1, 1);
  tv2->axis(2)->parallelize(ParallelType::Unswitch);
  tv2->axis(0)->parallelize(ParallelType::BIDx);

  // tv2->reorder({{-2, -1}}) has same effect but this shows the mapping explicitly
  tv2->reorder({{0, 0}, {1, 1}, {2, 3}, {3, 2}});

  auto tv3 = tv2->rFactor({1, 3});

  // propagate the mapping to other tensors
  TransformPropagatorWithCheck propagator(tv3);
  MaxRootDomainInfoSpanningTree(tv3).traverse(&propagator);
  scheduler_utils::parallelizeAllLike(tv3, {}, allParallelTypesExcept(
          {ParallelType::Unroll,
           ParallelType::Vectorize,
           ParallelType::MisalignedVectorize}));

  inlineMost();

Note that these both give equal fusion_ir printouts. Here we can see that the reduction scheduler calls parallelize right after each split, which I've found is good practice: it's harder to keep track of which axes you want to parallelize if you do it later, after numerous splits and reorders. We also see that you can call parallelizeAllLike with an empty list of tensors to indicate all of them, and that you can tell it which parallel types to filter on.

One odd thing is the double reorder at the very beginning of the automatic schedule. This doesn't affect the fusion_ir printout or the generated kernel at all, but I put it there because it does show up in the operation log. It is probably a special case of a more general part of the scheduler and is not always a trivial transform, but I haven't yet dug in to find out where that happens in the reduction scheduler.

jacobhinkle added a commit that referenced this pull request Jan 17, 2023
To make these tests pass, including matching the generated kernels, I
used the WIP fusion_debug dump output from
#2326. This let me see a noisy
log of operations that revealed a few patterns I was unaware of. For
example, in the FrontendAdd test, the pointwise scheduler is used, and
the order of inlining at the end is not defined (and does change)
because an unordered_set is used. Also, notice that
getPointwiseHeuristics creates a Val (fusion.oneVal()), so it
technically alters the Fusion by adding a Val to its vals_ list.

Beyond those trivial things, I noticed differences in inlining patterns
between the pointwise and reduction schedulers, and also saw some
different ways to call methods like parallelizeAllLike. One thing I
haven't looked into further is the reorders that often happen at the
beginning of scheduling both pointwise and reduction fusions. They have
no effect on the generated kernels, but it is noticeable and I plan to
read up on it soon.
jacobhinkle added a commit that referenced this pull request Jan 17, 2023
Again I used the fusion_debug dump from
#2326 to trace what the
reduction scheduler is doing. This time I learned about
multiReductionInliner, which uses two calls to parallelizeAllLike for
different ParallelTypes, followed by an undoing of unrolling and
vectorization on the reference tensor. The need for the latter is
still a little unclear to me.
FUSER_PERF_SCOPE("Fusion::printDebug");

out << "Fusion DEBUG INFO {";
std::vector<Val*> inputs_;
Collaborator:

This local vector shadows the fusion's inputs_ member, so the dump always shows that the fusion doesn't have any inputs.

Collaborator:

Yeah, I think we just need to delete this line.
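
A sketch of what the fix presumably looks like: drop the shadowing local so the fusion's registered inputs are printed via the existing inputs() accessor (exact formatting here is illustrative):

  FUSER_PERF_SCOPE("Fusion::printDebug");
  out << "Fusion DEBUG INFO {\n  inputs: [";
  // No local `inputs_` declaration, so this reads the member via inputs()
  for (auto* inp : inputs()) {
    out << " " << inp->toString(0, SerializationFormat::NameOnly);
  }
  out << " ]";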

@naoyam (Collaborator) left a comment:

Overall, looks pretty good. Would be really useful as it would save repeated ad-hoc printf insertions!

My comments are mostly suggestions about whether we could reduce repeated code by using some of the existing utility routines.

@@ -118,6 +120,32 @@ TORCH_CUDA_CU_API bool isOptionEnabled(EnableOption option);
TORCH_CUDA_CU_API const std::vector<std::string>& getEnableOptionArguments(
EnableOption option);

TORCH_CUDA_CU_API bool isDebugDumpEnabled(DebugDumpOption option);
Collaborator:

Is this a duplicate of another declaration at line 77?

Looks like the below getDebugDumpArguments is also a duplicate.

//! Write a message to this object's log
void log(std::string op_name, std::vector<std::string> arg_strings) {
#ifdef NDEBUG
static bool warned = false;
Collaborator:

There's TORCH_WARN and TORCH_WARN_ONCE. Not very familiar with them, but you might want to use either of them.
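
A sketch of what that could look like (the message text here is just illustrative):

  #ifdef NDEBUG
    // TORCH_WARN_ONCE already deduplicates, so the manual `static bool warned`
    // flag above would no longer be needed.
    TORCH_WARN_ONCE(
        "Val logging in a Release (NDEBUG) build; the operation log may be incomplete.");
  #endif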

@@ -361,6 +365,37 @@ class TORCH_CUDA_CU_API Val : public Statement {

void resolveIndexDtype();

//! Get vector of log messages for this object
Collaborator:

nit: I'd prefer to have the type explicitly stated as std::vector<...> rather than a comment saying it's a vector.

}

//! Write a message to this object's log
void log(std::string op_name, std::vector<std::string> arg_strings) {
Collaborator:

Can you move the def to ir_base_nodes.cpp?

}

//! Write a message to this object's log
void log(std::string op_name, std::vector<std::string> arg_strings) {
Collaborator:

What operation does op_name refer to?

Collaborator:

I see what this means now after reading the other parts of the code. Would be helpful to have some (short) comments about the parameters.
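
A sketch of what such parameter comments could look like (the wording is illustrative, not taken from the PR):

  //! Record an operation performed on this Val in its debug log.
  //! \param op_name     Qualified name of the mutating call, e.g.
  //!                    "TensorView::split" or "IterDomain::parallelize".
  //! \param arg_strings Pre-formatted arguments of that call, typically built
  //!                    with toString(0, SerializationFormat::NameOnly).
  void log(std::string op_name, std::vector<std::string> arg_strings);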

@@ -62,6 +62,7 @@ class TORCH_CUDA_CU_API Scalar : public Val {
(c10::is_complex<UnderlyingType>::value && isComplexType(dtype)),
"Invalid data type: ",
dtype);
VAL_LOG("Scalar::Scalar", "IrBuilderPasskey", typePrefix(dtype));
Collaborator:

I'd drop IrBuilderPasskey as it's not really important to be printed out

"IterDomain::split",
factor->toString(0, SerializationFormat::NameOnly),
std::to_string(inner_split),
std::to_string(trim_out_of_bounds));
Collaborator:

There's a Printer utility class in utils.h. Could it help reduce repetition? For example, could VAL_LOG_EXPLICIT just take whatever objects and use Printer to convert them to strings?
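
A sketch of the kind of helper that could reduce the repetition (logArgStrings is a hypothetical name, not in the PR; it assumes C++17 and that operator<< overloads exist for the logged objects):

  #include <sstream>
  #include <string>
  #include <vector>

  // Hypothetical helper: stringify arbitrary arguments via operator<< so that
  // VAL_LOG / VAL_LOG_EXPLICIT call sites don't need explicit toString() or
  // std::to_string() calls. Printer from utils.h could replace the
  // stringstream here if its interface fits.
  template <typename... Args>
  std::vector<std::string> logArgStrings(const Args&... args) {
    std::vector<std::string> out;
    auto stringify = [](const auto& a) {
      std::stringstream ss;
      ss << a; // requires an operator<< overload for each argument type
      return ss.str();
    };
    (out.push_back(stringify(args)), ...); // C++17 fold over the comma operator
    return out;
  }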

@@ -2576,6 +2702,13 @@ void TensorDomain::reorder(const std::unordered_map<int, int>& old2new_) {
TORCH_INTERNAL_ASSERT(
!(nDims() == 0 && old2new_.size() > 0),
"Tried to reorder a 0-dim domain");
#ifndef NDEBUG
Collaborator:

Why is this guarded? Isn't VAL_LOG a no-op when NDEBUG is defined?

bool comma = false;
for (auto t : selected_tvs) {
if (comma)
ss << ", ";
Collaborator:

There's also toDelimitedString in utils.h, which you may be able to just use or extend.
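
A sketch of what that might look like (I haven't checked toDelimitedString's exact signature, so this assumes it takes a container of strings and a delimiter):

  // Build the NameOnly strings first, then join them with toDelimitedString
  // instead of tracking the comma manually.
  std::vector<std::string> names;
  names.reserve(selected_tvs.size());
  for (auto* tv : selected_tvs) {
    names.push_back(tv->toString(0, SerializationFormat::NameOnly));
  }
  ss << toDelimitedString(names, ", ");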
