This is a squash of the following commits:

commit 5377f48
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Wed Feb 14 13:32:36 2024 -0700

    Tapir target tweaks, LTO touch-ups, etc.

commit dbfc195
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Feb 13 16:57:54 2024 -0700

    Fixes for LTO...

commit f9094d3
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Mon Feb 12 14:31:23 2024 -0700

    Chasing a bug in the LoopSpawning pass...

commit 4ddb9d1
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Mon Feb 12 11:09:20 2024 -0700

    Runtime tweaks for refactoring launch parameters (for cuda).
    Tweaks to multi-file (LTO) euler3d experiment.

commit f8ff7c5
Author: Patrick McCormick <>
Date:   Wed Feb 7 16:32:38 2024 -0700

    More launch explorations.

commit d720ced
Author: Patrick McCormick <>
Date:   Wed Feb 7 10:10:59 2024 -0700

    Tweaks on launch heuristics.

commit ab61a3c
Author: Patrick McCormick <>
Date:   Tue Feb 6 13:20:20 2024 -0700

    working on experiments for benchmarking.

commit f510df1
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Feb 6 13:31:53 2024 -0700

    A bit more verbose output.

commit 3eec9c0
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Feb 6 13:12:41 2024 -0700

    Tweaks for launch heuristics (hacks).

commit 75895a7
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Fri Feb 2 08:37:20 2024 -0700

    More launch and compiler related tweaks and tests.
    Fix a mistake in the error reporting for the runtime's dylib
    handling...

commit e8ee550
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Thu Feb 1 13:01:21 2024 -0700

    Experimenting with launch details and some nvvm metadata.

commit 87fb4e4
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Mon Jan 29 11:22:56 2024 -0700

    Tweak to force environment variable to override occupancy-based
    launch parameter settings.

commit 721f9f9
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Mon Jan 29 09:23:19 2024 -0700

    Tweaks for attribute support (launch parameters) and runtime
    auto-adjustment to launch parameters.

commit 82c37ac
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Jan 23 11:56:22 2024 -0700

    Small touch-ups on build details in experiments. Still finding
    some issues with kokkos, latest cuda (13.x), and other details
    (e.g., host compiler).

commit c65d80c
Merge: 1241ae0 acc3dfb
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Jan 23 11:00:07 2024 -0700

    Merge remote-tracking branch 'origin/multi' into dev/16.x

commit acc3dfb
Author: Tarun Prabhu <tarun@lanl.gov>
Date:   Thu May 25 10:35:36 2023 -0600

    A squash of many commits covering a broad scope:

    1. Address some bugs/details/features introduced with the 16.x
       merge.
       - includes some minor tweaks for 16.x testing but this needs
         more work.
       - clang's sema probably needs to be revisited and improved.

    2. A significant overhaul of the runtime to support:
       - binding of calling threads to unique (gpu) streams
       - removal of a lot of crufty code that was no longer being
         used.
       - simplified kernel launch options/interface
       - occupancy-based launch parameters (can cause performance
         regressions)
       - better environment variable support for tweaking behaviors
         and more flexibility for experimentation, testing, and
         debugging.

    3. In alignment with lanl#2 portions of the transforms for CUDA
       and HIP have been cleaned up and simplified (in particular
       kernel launch details are much cleaner now).

    4. Some bug fixes for attempts at post-processing code w/out
       parallel constructs. New "experiment" introduced to catch
       this as a regression.

    5. Some runtime building blocks for driving prefetch operations.

    6. Some new experiments/test codes.

    7. Fix for nested outlining -- assumed dead-code elimination pass
       cleanup but fails with separate host and gpu code
       transformation modules. Had to introduce dead-code removal
       prior to gpu module passes (otherwise, the verifier pass
       fails).

    8. Runtime entry points for numpy allocation entry points (e.g.,
       calloc, realloc, etc.).
       TODO: Potentially some room here for GPU-side operations to
       improve performance.

    9. Attribute support (e.g., target) for Kokkos 'statements'.

    10. General code cleanup -- removing warnings, unused code, etc.

    11. New support for launch parameter exploration within the
        experiments code base.

    12. Some work on -ffast-math crashes and issues.
        TODO: This code needs to be further developed (expanded
        support for double-precision, additional entry points, etc.).
        There are also some issues here in that what is specified on
        the command line can impact code on the host side but does
        not have a similar match on the GPU side of the code
        transformation.
        TODO: ABI and other issues need to be further explored.

    13. Multiple target support within a code base is supported
        (e.g., run opencilk cpu threads and cuda-targeted forall
        loops).

    14. Fixes around multi-thread entry points within the runtime
        components.

    15. Testing and feature support for H100; sync'ing CUDA and PTX
        version info, etc.

commit 1241ae0
Author: Patrick McCormick <>
Date:   Fri Dec 8 16:50:13 2023 -0700

    Dealing with some crufty system libraries on Darwin... This will
    likely break on newer installs (e.g., Arch).

commit c734425
Author: Patrick McCormick <>
Date:   Fri Dec 8 15:23:21 2023 -0700

    Missed cleaning up some debug statements in last commit...
    TODO: -ffastmath stuff...

commit 6d81192
Author: Patrick McCormick <>
Date:   Fri Dec 8 15:00:12 2023 -0700

    Some testing on H100.

commit a7b07c0
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Fri Dec 8 14:58:00 2023 -0700

    Cuda runtime tweaks for multi-target and multi-threads. Likely
    still extremely buggy under duress...

commit a8bbeeb
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Fri Dec 8 12:47:46 2023 -0700

    Quick memory allocation/free mutex for multi-device use cases.

commit ea7a1b8
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Dec 5 12:59:32 2023 -0700

    More work on regressions, fast-math mode, hip performance, etc.

commit 40365ba
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Dec 5 13:02:21 2023 -0700

    More work on regressions, fast-math mode, hip performance, etc.

commit 9b75e11
Author: Patrick McCormick <>
Date:   Tue Dec 5 08:52:30 2023 -0700

    Working on some issues surrounding --ffast-math:

    1. ABI conflicts between the host stage and our module offload
       generation (e.g., host side passes generate vectorized code
       that is not supported on GPU backend(s)).

    2. Host architecture-centric tweaks occur before our GPU
       transform. That leads to addressing host architecture
       specific details as part of the transform (e.g., aarch64 and
       x86_64 will generate different calls vs. sticking with llvm
       intrinsics).

    A combo of ABI issues and/or the fact we're too late in the pass
    pipeline to address this with the current design means more work
    lies ahead...

commit 2668be2
Author: Patrick McCormick <>
Date:   Mon Nov 27 15:09:29 2023 -0700

    work on hip performance details.

commit 243ff11
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Nov 28 12:52:52 2023 -0700

    Testing streams and odd stalls (UVM?). This version seems to
    remove the stalls but also on a system with a newer kernel
    drop... CUDA only at this point.

commit e7d0c09
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Mon Nov 27 15:02:47 2023 -0700

    Working on some runtime tweaks and clean up. Traced a new crash
    to the use of a ptxas whole-program optimization flag.

commit 9512eb5
Author: Patrick McCormick <>
Date:   Fri Nov 17 08:45:58 2023 -0700

    More work to setup the tests for better HIP and CUDA target
    flexibility; including some reduced complexity in the command
    line arg details in the makefile(s) (e.g., strip mining flags for
    GPU targets moved into the config files vs. being necessary in
    the makefile setup).

    Added better (correct) AMDGPU target attribute selection based on
    multiple target options (prior version was too hard-coded for
    gfx90a).

commit afdb9c2
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Thu Nov 16 12:59:11 2023 -0700

    A bit more verbose and shared cuda and hip feature management
    (e.g., streaming modes).

commit 4797933
Author: Patrick McCormick <>
Date:   Thu Nov 16 12:58:38 2023 -0700

    Bug fixes for new prefetch feature set.

commit 6875ccf
Author: Patrick McCormick <>
Date:   Thu Nov 16 11:05:27 2023 -0700

    More work on HIP performance debugging...

commit 3ac2d66
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Thu Nov 16 11:01:31 2023 -0700

    First cut at CUDA prefetch streams support. Needs testing...

commit 7e3a8a2
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Thu Nov 16 08:14:07 2023 -0700

    Some refactoring for HIP details, bug chasing, etc.

commit 3f1e09e
Author: Patrick McCormick <>
Date:   Wed Nov 15 09:40:07 2023 -0700

    Some hacking for trying to debug AMD HIP code gen/runtime issues.
    A few new environment variables to make chasing (our tails)
    easier...

    - KITRT_THREADS_PER_BLOCK=1024 (default 256)
    - KITRT_MAX_NUM_PREFETCH_STREAMS=2 (default 4: size of
      round-robin stream queue for concurrent prefetch calls)
    - KITRT_DEVICE_ID=5 (default 0: change the default GPU selection)
    - KITRT_MIN_WARPS_PER_EXEC_UNIT=1 (default 1: reducing resource
      usage per warp -- impacts register allocation, etc.)

    The prefetch stream queue is enabled via the command line with
    "-mllvm -hipabi-streams".

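The environment variables listed for commit 3f1e09e above could be read along the following lines. This is only a sketch of the general pattern, not the kitrt implementation: the `LaunchSettings` struct, `envIntOr`, and `readLaunchSettings` are hypothetical names introduced for illustration, and the fallback-on-invalid-input behavior is an assumption. Only the variable names and their defaults come from the commit message.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>

// Hypothetical settings record; the field names are illustrative and are
// not taken from the kitrt sources.
struct LaunchSettings {
  int threadsPerBlock;
  int maxPrefetchStreams;
  int deviceId;
  int minWarpsPerExecUnit;
};

// Parse an integer environment variable. Returns `fallback` when the
// variable is unset, empty, not a plain decimal integer, or below
// `minValue`. (Falling back silently is an assumption of this sketch.)
int envIntOr(const char *name, int fallback, int minValue) {
  const char *raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0')
    return fallback;
  errno = 0;
  char *end = nullptr;
  const long value = std::strtol(raw, &end, 10);
  if (errno != 0 || *end != '\0' || value < minValue || value > INT_MAX)
    return fallback;
  return static_cast<int>(value);
}

// Collect the four variables named in the commit message, using the
// defaults stated there (256, 4, 0, 1).
LaunchSettings readLaunchSettings() {
  LaunchSettings s;
  s.threadsPerBlock = envIntOr("KITRT_THREADS_PER_BLOCK", 256, 1);
  s.maxPrefetchStreams = envIntOr("KITRT_MAX_NUM_PREFETCH_STREAMS", 4, 1);
  s.deviceId = envIntOr("KITRT_DEVICE_ID", 0, 0);
  s.minWarpsPerExecUnit = envIntOr("KITRT_MIN_WARPS_PER_EXEC_UNIT", 1, 1);
  return s;
}
```

With `KITRT_THREADS_PER_BLOCK=1024` exported and the other variables unset, `readLaunchSettings()` in this sketch yields 1024 threads per block and the stated defaults for everything else.
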
commit d3f74a0
Author: Patrick McCormick <>
Date:   Wed Nov 8 20:43:08 2023 -0700

    Some cleanup and work to try and chase down HIP target runtime
    variability.

commit c2bb71e
Author: Patrick McCormick <>
Date:   Thu Nov 2 13:04:11 2023 -0600

    chasing build issues/warnings/errors.

commit b4bafb4
Author: Patrick McCormick <>
Date:   Thu Nov 2 09:04:25 2023 -0600

    Chasing bugs...

commit 9299708
Author: Patrick McCormick <>
Date:   Tue Oct 31 16:23:02 2023 -0600

    working on benchmarks

commit 74cb34f
Author: Patrick McCormick <>
Date:   Wed Oct 25 14:31:56 2023 -0600

    Exploring full kokkos builds w/ clang.

commit 4dc3221
Author: Patrick McCormick <>
Date:   Tue Jun 27 14:13:50 2023 -0600

    Some cleanup and small tweaks.

commit 5136b24
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Wed Nov 8 20:27:27 2023 -0700

    Attempt at a quick multi-stream prefetch feature.

commit 8ece574
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Thu Nov 2 11:48:12 2023 -0600

    small tweaks to sort out some performance details.

commit 9b00669
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Wed Nov 1 20:38:03 2023 -0600

    Tweak in attempt to debug potential numa issues that are
    impacting consistent performance across multiple application
    runs.

commit f5d53a6
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Oct 31 14:46:29 2023 -0600

    A bit more cleanup and adding new tests specific to kitsune.

commit ff078df
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Oct 31 09:47:10 2023 -0600

    A bit more cleanup and adding some infrastructure for the
    multi-target test code (added makefile and a kokkos version).
    Not all the pieces are in place to fully test.

commit d884674
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Oct 31 08:53:36 2023 -0600

    Clean up some code cruft -- no need to duplicate else branch
    cases.

commit 88bc75b
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Oct 31 08:39:57 2023 -0600

    Forgot to save a cleaned up comment...

commit bd7941e
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Mon Oct 30 20:01:09 2023 -0600

    New code to handle tapir attributes on Kokkos "statements".
    Some new code for cuda memory management details (calloc,
    realloc, etc.). Along with some prep work for upcoming memory
    management and movement changes.

tarunprabhu · Feb 16, 2024 · df8b5f7 · df8b5f7
1 parent 942c6f4
commit df8b5f7
Showing 132 changed files with 205,025 additions and 4,440 deletions.
diff --git a/clang/include/clang/Basic/Attr.td b/clang/include/clang/Basic/Attr.td
@@ -4206,19 +4206,16 @@ def AvailableOnlyInDefaultEvalMethod : InheritableAttr {
 
 def TapirTarget : StmtAttr {
   let Spellings = [CXX11<"tapir","target">];
-  let Subjects = SubjectList<[ForallStmt, CXXForallRangeStmt],
-                 ErrorDiag, "'forall' statement">;
+  let Subjects = SubjectList<[ForallStmt, CXXForallRangeStmt, Stmt],
+                             ErrorDiag, "'parallel' statements">;
   let Args = [
-    EnumArgument<"TapirTargetAttrType", "TapirTargetAttrTy",
     // TODO: Is there a difference between "serial" and "none"?
-    ["cheetah",    "cilk",   "cuda",
-     "hip",        "libomp", "none",
-     "qthreads",   "realm",  "rocm",
-     "serial",     "zero",   "opencl"],
-    ["CheetahRT",  "CilkRT",  "CudaRT",
-     "HipRT",      "OmpRT",   "QthreadsRT",
-     "RealmRT",    "RocmRT",  "SequentialRT",
-     "ZeroRT",     "OpenCLRT"], 0>
+    EnumArgument<"TapirTargetAttrType", "TapirTargetAttrTy",
+                 ["none", "serial", "cuda", "hip", "lambda", "omptask",
+                  "opencilk", "openmp", "qthreads", "realm"],
+                 ["None", "Serial", "Cuda", "Hip", "Lambda", "OMPTask",
+                  "OpenCilk", "OpenMP", "Qthreads", "Realm"],
+                 0>
   ];
   let Documentation = [TapirRTDocs];
 }
@@ -4249,3 +4246,11 @@ def KitsuneMemAccess : InheritableAttr {
 
     let Documentation = [KitsuneMemAccessDocs];
 }
+
+def KitsuneLaunch : StmtAttr {
+    let Spellings = [CXX11<"kitsune","launch">];
+    let Subjects = SubjectList<[ForallStmt,CXXForallRangeStmt],
+    		   ErrorDiag, "'parallel' statements">;
+    let Args = [ExprArgument<"ThreadsPerBlock">];
+   let Documentation = [KitsuneLaunchDocs];    
+}
diff --git a/clang/include/clang/Basic/AttrDocs.td b/clang/include/clang/Basic/AttrDocs.td
@@ -7078,7 +7078,7 @@ the variables were declared in. It is not possible to check the return value
 }];
 }
 
-def TapirDocs : DocumentationCategory<"Tapir Attributes"> {
+def KitsuneDocs : DocumentationCategory<"Kitsune Attributes"> {
   let Content = [{
 Tapir and Kitsune introduce custom C++ attributres that let developer
 control behaviors at the source code level vs. only through command line
@@ -7087,7 +7087,7 @@ arguments.
 }
 
 def TapirRTDocs : Documentation {
-  let Category = TapirDocs;
+  let Category = KitsuneDocs;
   let Content = [{
 
 **Experimental**
@@ -7115,7 +7115,7 @@ Example:
 }
 
 def TapirStrategyDocs : Documentation {
-  let Category = TapirDocs;
+  let Category = KitsuneDocs;
   let Content = [{
 
 **Experimental**
@@ -7137,7 +7137,7 @@ strategies are:
 }
 
 def KitsuneMemAccessDocs : Documentation {
-  let Category = TapirDocs;
+  let Category = KitsuneDocs;
   let Content = [{
 
 The kitsune memory access attributes lets programmers inform the compiler of
@@ -7160,3 +7160,29 @@ inaccurate).
     ...
   }];
 }
+
+
+def KitsuneLaunchDocs : Documentation {
+  let Category = KitsuneDocs;
+  let Content = [{
+
+The kitsune launch attributes lets programmers inform the compiler how
+to tweak the code generation aspects for launching the GPU kernels for
+parallel constructs in the code.  This information is not verified in
+only a minor fashion at compile time and then again at runtime.  In
+general, it is up to the programmer to communicate the details
+correctly in terms of the semantics of the underlying GPU programming
+mechanisms. In general, this is intended more for kitsune development
+and testing vs. an end-user feature. 
+
+- threads_per_block : the number of threads per block to launch the kernel.
+
+.. code-block:: c++
+
+   [[kitsune:launch:threads_per_block(N)]]
+   forall(...) {
+     // loop body goes here. 
+   } 
+...
+  }];
+}
diff --git a/clang/include/clang/Basic/DiagnosticSemaKinds.td b/clang/include/clang/Basic/DiagnosticSemaKinds.td
@@ -4316,26 +4316,22 @@ def note_called_once_gets_called_twice : Note<
 
 // +===== kitsune/tapir: errors related to custom codegen attributes. 
 def warn_tapir_attr_not_enabled: Warning<
-    "tapir attribute encountered but tapir is not enabled (via -ftapir).  "
+    "tapir attribute found but tapir is not enabled.  "
     "Attribute ignored.">,
     InGroup<Tapir>;
 
 def warn_tapir_target_attr_bad_stmt_class: Warning<
-    "tapir target runtime attribute applied to an unsupported statement.  "
+    "tapir target attribute on unsupported statement.  "
     "Attribute ignored.">,
     InGroup<Tapir>;
 
-def warn_kitsune_not_enabled: Warning<
-    "kitsune not enabled (-fkitsune) tapir attribute will be ignored.">,
-    InGroup<Kitsune>;
-
 // rutnime target-centric errors/warnings. 
 def err_tapir_target_attr_unsupported_stmt: Error<
-    "tapir target runtime attribute applied to unsupported statement kind.">;
+    "tapir target attribute on unsupported statement.">;
 def err_tapir_target_attr_wrong_nargs: Error<
-    "tapir target runtime attribute takes a single argument.">;
+    "tapir target attribute takes a single argument.">;
 def err_tapir_target_unknown: Error<
-    "statement using unknonwn tapir target runtime attribute.">;
+    "unknonwn tapir target.">;
 
 // execution strategy/policy errors/warnings. 
 def err_tapir_strategy_attr_unsupported_stmt: Error<
@@ -4345,6 +4341,13 @@ def err_tapir_strategy_attr_wrong_nargs: Error<
 def err_tapir_strategy_unknown: Error<
     "statement using unknown strategy attribute.">;
 
+def err_kitsune_launch_non_integral_type: Error<
+    "launch attribute: threads-per-block must be a built-in integer type.">;
+def err_kitsune_launch_tpb_too_large: Error<
+    "launch attribute: threads-per-block must be a 32-bit value.">;
+def err_kitsune_launch_tpb_must_be_positive: Error<
+    "launch attribute: threads-per-block must be a positive integer value.">;    
+
 // spawn+sync errors/warnings.
 // 
 def warn_spawn_as_loop_body: Warning<

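The three `err_kitsune_launch_*` diagnostics added above suggest the value checks applied to the threads-per-block argument. A minimal sketch of the two value-range checks follows, assuming the argument has already been evaluated to a signed 64-bit constant and that "32-bit value" means it must fit in an unsigned 32-bit integer (the diagnostic text does not say whether signed or unsigned is meant). `LaunchCheck` and `checkThreadsPerBlock` are hypothetical names, not part of Clang or Kitsune; the integer-type check is a type-level property that Sema would test separately and is not modeled here.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical result codes mirroring two of the new diagnostics.
enum class LaunchCheck {
  Ok,
  TooLarge,       // cf. err_kitsune_launch_tpb_too_large
  MustBePositive  // cf. err_kitsune_launch_tpb_must_be_positive
};

// Validate an already-evaluated threads-per-block value.
// Assumption: the upper bound is the largest unsigned 32-bit value.
LaunchCheck checkThreadsPerBlock(std::int64_t value) {
  if (value <= 0)
    return LaunchCheck::MustBePositive;
  const std::int64_t limit =
      static_cast<std::int64_t>(std::numeric_limits<std::uint32_t>::max());
  if (value > limit)
    return LaunchCheck::TooLarge;
  return LaunchCheck::Ok;
}
```

Note that passing these checks says nothing about whether the hardware accepts the value; the attribute documentation in this commit leaves that to the programmer and to runtime checks.
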
diff --git a/clang/kitsune/CMakeLists.txt b/clang/kitsune/CMakeLists.txt
@@ -138,7 +138,7 @@ endif()
 option(KITSUNE_ENABLE_CUDA_ABI_TARGET
     "Enable the Kitsune+Tapir CUDA Toolkit target and codegen library." OFF)
 if (KITSUNE_ENABLE_CUDA_ABI_TARGET)
-  find_package(CUDAToolkit 10...12 REQUIRED)  # We may need to be a bit more specifc on minor versions here.
+  find_package(CUDAToolkit 12.0 REQUIRED)  
   set(TAPIR_CUDA_ABI_TARGET_CFG_FILENAME
       "cuda.cfg"
       CACHE

diff --git a/clang/kitsune/cuda.cfg.in b/clang/kitsune/cuda.cfg.in
@@ -1,6 +1,9 @@
 # Flags for the CUDA runtime ABI target.
 -D_tapir_cuda_target
--L${CMAKE_INSTALL_PREFIX}/lib
+-mllvm -stripmine-count=1
+-mllvm -stripmine-coarsen-factor=1
 ${KITSUNE_CUDA_ABI_TARGET_EXTRA_LINK_FLAGS}
+-Wl,-rpath=${CMAKE_INSTALL_PREFIX}/lib
+-L${CMAKE_INSTALL_PREFIX}/lib
 -lkitrt
 ${KITSUNE_CUDA_ABI_TARGET_LINK_LIBS}
diff --git a/clang/kitsune/hip.cfg.in b/clang/kitsune/hip.cfg.in
@@ -1,5 +1,8 @@
 # Flags for the HIP runtime ABI target.
+#-ffp-contract=fast-honor-pragmas
 -D_tapir_hip_target
+-mllvm -stripmine-count=1
+-mllvm -stripmine-coarsen-factor=1
 -L${CMAKE_INSTALL_PREFIX}/lib
 ${KITSUNE_HIP_TARGET_EXTRA_LINK_FLAGS} 
 -lkitrt

diff --git a/clang/lib/CodeGen/CGExpr.cpp b/clang/lib/CodeGen/CGExpr.cpp
@@ -5019,7 +5019,14 @@ RValue CodeGenFunction::EmitCallExpr(const CallExpr *E,
       std::string qname = fdecl->getQualifiedNameAsString();
       if (qname == "Kokkos::parallel_for" ||
           qname == "Kokkos::parallel_reduce") {
-        if (EmitKokkosConstruct(E))
+	// We handle the special case of Tapir target attributes on a
+	// Kokkos "statement" elsewhere (as the attribute is not
+	// really attached to the CallExpr but instead the C++ goop
+	// around the call -- implicit and clean up stuff).  If we
+	// have seen such an attribute it was saved and we can simply
+	// pass TapirAttrs on from here for the Kokkos code
+	// transformation/generation.
+        if (EmitKokkosConstruct(E, TapirAttrs))
           return RValue::get(nullptr);
       } else if (getLangOpts().KokkosNoInit &&
                  (qname == "Kokkos::initialize" ||