This is a squash of the following commits:
commit 5377f48
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Wed Feb 14 13:32:36 2024 -0700

    Tapir target tweaks, LTO touch-ups, etc.

commit dbfc195
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Feb 13 16:57:54 2024 -0700

    Fixes for LTO...

commit f9094d3
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Mon Feb 12 14:31:23 2024 -0700

    Chasing a bug in the LoopSpawning pass...

commit 4ddb9d1
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Mon Feb 12 11:09:20 2024 -0700

    Runtime tweaks for refactoring launch parameters (for cuda).

    Tweaks to multi-file (LTO) euler3d experiment.

commit f8ff7c5
Author: Patrick McCormick <>
Date:   Wed Feb 7 16:32:38 2024 -0700

    More launch explorations.

commit d720ced
Author: Patrick McCormick <>
Date:   Wed Feb 7 10:10:59 2024 -0700

    Tweaks on launch heuristics.

commit ab61a3c
Author: Patrick McCormick <>
Date:   Tue Feb 6 13:20:20 2024 -0700

    working on experiments for benchmarking.

commit f510df1
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Feb 6 13:31:53 2024 -0700

    A bit more verbose output.

commit 3eec9c0
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Feb 6 13:12:41 2024 -0700

    Tweaks for launch heuristics (hacks).

commit 75895a7
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Fri Feb 2 08:37:20 2024 -0700

    More launch and compiler related tweaks and tests.  Fix a mistake in
    the error reporting for the runtime's dylib handling...

commit e8ee550
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Thu Feb 1 13:01:21 2024 -0700

    Experimenting with launch details and some nvvm metadata.

commit 87fb4e4
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Mon Jan 29 11:22:56 2024 -0700

    Tweak to force environment variable to override occupancy-based
    launch parameter settings.

commit 721f9f9
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Mon Jan 29 09:23:19 2024 -0700

    Tweaks for attribute support (launch parameters) and runtime
    auto-adjustment to launch parameters.

commit 82c37ac
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Jan 23 11:56:22 2024 -0700

    Small touch-ups on build details in experiments.

    Still finding some issues with kokkos, latest cuda (13.x), and other
    details (e.g., host compiler).

commit c65d80c
Merge: 1241ae0 acc3dfb
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Jan 23 11:00:07 2024 -0700

    Merge remote-tracking branch 'origin/multi' into dev/16.x

commit acc3dfb
Author: Tarun Prabhu <tarun@lanl.gov>
Date:   Thu May 25 10:35:36 2023 -0600

    A squash of many commits covering a broad scope:

       1. Address some bugs/details/features introduced with the 16.x merge.
          - includes some minor tweaks for 16.x testing but this needs more work.
          - clang's sema probably needs to be revisited and improved.
       2. A significant overhaul of the runtime to support:
          - binding of calling threads to unique (gpu) streams
          - removal of a lot of crufty code that was no longer being used.
          - simplified kernel launch options/interface
          - occupancy-based launch parameters (can cause performance regressions)
          - better environment variable support for tweaking behaviors and
            more flexibility for experimentation, testing, and debugging.
       3. In alignment with lanl#2 portions of the transforms for CUDA and HIP have
          been cleaned up and simplified (in particular kernel launch details are
          much cleaner now).
       4. Some bug fixes for attempts at post-processing code w/out parallel
          constructs.  New "experiment" introduced to catch this as a regression.
       5. Some runtime building blocks for driving prefetch operations.
       6. Some new experiments/test codes.
       7. Fix for nested outlining -- assumed dead-code elimination pass cleanup
          but fails with separate host and gpu code transformation modules.  Had
          to introduce dead-code removal prior to gpu module passes (otherwise, the
          verifier pass fails).
       8. Runtime entry points for numpy allocation calls (e.g., calloc,
          realloc, etc.).  TODO: Potentially some room here for GPU-side operations to
          improve performance.
       9. Attribute support (e.g., target) for Kokkos 'statements'.
       10. General code cleanup -- removing warnings, unused code, etc.
       11. New support for launch parameter exploration within the experiments code
           base.
       12. Some work on -ffast-math crashes and issues.  TODO: This code needs to be
           further developed (expanded support for double-precision, additional entry
           points, etc.).  There are also some issues here in that what is specified on
           the command line can impact code on the host side but does not have a
           similar match on the GPU side of the code transformation.  TODO: ABI and
           other issues need to be further explored.
       13. Multiple target support within a code base is supported (e.g., run opencilk
           cpu threads and cuda-targeted forall loops).
       14. Fixes around multi-thread entry points within the runtime components.
       15. Testing and feature support for H100; sync'ing CUDA and PTX version info, etc.

commit 1241ae0
Author: Patrick McCormick <>
Date:   Fri Dec 8 16:50:13 2023 -0700

    Dealing with some crufty system libraries on Darwin... This will likely break on
    newer installs (e.g., Arch).

commit c734425
Author: Patrick McCormick <>
Date:   Fri Dec 8 15:23:21 2023 -0700

    Missed cleaning up some debug statements in last commit...

    TODO: -ffast-math stuff...

commit 6d81192
Author: Patrick McCormick <>
Date:   Fri Dec 8 15:00:12 2023 -0700

    Some testing on H100.

commit a7b07c0
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Fri Dec 8 14:58:00 2023 -0700

    Cuda runtime tweaks for multi-target and multi-threads.  Likely still extremely
    buggy under duress...

commit a8bbeeb
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Fri Dec 8 12:47:46 2023 -0700

    Quick memory allocation/free mutex for multi-device use cases.

commit ea7a1b8
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Dec 5 12:59:32 2023 -0700

    More work on regressions, fast-math mode, hip performance, etc.

commit 40365ba
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Dec 5 13:02:21 2023 -0700

    More work on regressions, fast-math mode, hip performance, etc.

commit 9b75e11
Author: Patrick McCormick <>
Date:   Tue Dec 5 08:52:30 2023 -0700

    Working on some issues surrounding -ffast-math:

      1. ABI conflicts between the host stage and our module offload
         generation (e.g., host side passes generate vectorized code that is
         not supported on GPU backend(s)).

      2. Host architecture-centric tweaks occur before our GPU transform.
         That leads to addressing host architecture specific details as
         part of the transform (e.g., aarch64 and x86_64 will generate
         different calls vs. sticking with llvm intrinsics).

    A combo of ABI issues and/or the fact we're too late in the pass pipeline
    to address this with the current design means more work lies ahead...

commit 2668be2
Author: Patrick McCormick <>
Date:   Mon Nov 27 15:09:29 2023 -0700

    work on hip performance details.

commit 243ff11
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Nov 28 12:52:52 2023 -0700

    Testing streams and odd stalls (UVM?).  This version seems to remove
    the stalls but also on a system with a newer kernel drop...  CUDA only
    at this point.

commit e7d0c09
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Mon Nov 27 15:02:47 2023 -0700

    Working on some runtime tweaks and clean up.  Traced a new crash to
    the use of a ptxas whole-program optimization flag.

commit 9512eb5
Author: Patrick McCormick <>
Date:   Fri Nov 17 08:45:58 2023 -0700

    More work to set up the tests for better HIP and CUDA target flexibility,
    including some reduced complexity in the command-line arg details in the
    makefile(s) (e.g., strip mining flags for GPU targets moved into the
    config files vs. being necessary in the makefile setup).

    Added better (correct) AMDGPU target attribute selection based on
    multiple target options (prior version was too hard-coded for gfx90a).

commit afdb9c2
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Thu Nov 16 12:59:11 2023 -0700

    A bit more verbose and shared cuda and hip feature management (e.g.,
    streaming modes).

commit 4797933
Author: Patrick McCormick <>
Date:   Thu Nov 16 12:58:38 2023 -0700

    Bug fixes for new prefetch feature set.

commit 6875ccf
Author: Patrick McCormick <>
Date:   Thu Nov 16 11:05:27 2023 -0700

    More work on HIP performance debugging...

commit 3ac2d66
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Thu Nov 16 11:01:31 2023 -0700

    First cut at CUDA prefetch streams support.  Needs testing...

commit 7e3a8a2
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Thu Nov 16 08:14:07 2023 -0700

    Some refactoring for HIP details, bug chasing, etc.

commit 3f1e09e
Author: Patrick McCormick <>
Date:   Wed Nov 15 09:40:07 2023 -0700

    Some hacking for trying to debug AMD HIP code gen/runtime issues.  A few new environment variables to
    make chasing (our tails) easier...

     - KITRT_THREADS_PER_BLOCK=1024 (default 256)
     - KITRT_MAX_NUM_PREFETCH_STREAMS=2 (default 4: size of round-robin stream queue for concurrent prefetch calls)
     - KITRT_DEVICE_ID=5 (default 0: change the default GPU selection)
     - KITRT_MIN_WARPS_PER_EXEC_UNIT=1 (default 1: reducing resource usage per warp -- impacts register allocation, etc.)

    The prefetch stream queue is enabled via the command line with "-mllvm -hipabi-streams".
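
    The environment variables listed above each pair a name with a default. As a rough sketch of how a runtime could read such settings, the C++ below falls back to the defaults quoted in this commit message when a variable is unset or malformed. The helper `envIntOrDefault`, the `LaunchSettings` struct, and `readLaunchSettings` are hypothetical names chosen for illustration; they are not taken from the Kitsune runtime sources, and the actual parsing rules there may differ.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>

// Hypothetical helper (not from the Kitsune runtime): read an integer
// setting from the environment. Falls back to the given default when the
// variable is unset, empty, not a plain decimal number, negative, or too
// large to fit in an int.
int envIntOrDefault(const char *name, int fallback) {
  const char *text = std::getenv(name);
  if (text == nullptr || *text == '\0')
    return fallback;

  errno = 0;
  char *end = nullptr;
  const long value = std::strtol(text, &end, 10);
  const bool parsedAll = (end != text) && (*end == '\0');
  if (!parsedAll || errno == ERANGE || value < 0 || value > INT_MAX)
    return fallback;
  return static_cast<int>(value);
}

// Hypothetical aggregate of the four settings named in the commit message.
struct LaunchSettings {
  int threadsPerBlock;
  int maxPrefetchStreams;
  int deviceId;
  int minWarpsPerExecUnit;
};

// Defaults (256, 4, 0, 1) are the ones quoted in the commit message.
LaunchSettings readLaunchSettings() {
  return LaunchSettings{
      envIntOrDefault("KITRT_THREADS_PER_BLOCK", 256),
      envIntOrDefault("KITRT_MAX_NUM_PREFETCH_STREAMS", 4),
      envIntOrDefault("KITRT_DEVICE_ID", 0),
      envIntOrDefault("KITRT_MIN_WARPS_PER_EXEC_UNIT", 1),
  };
}
```

    With `KITRT_THREADS_PER_BLOCK=1024` exported, `readLaunchSettings().threadsPerBlock` would be 1024; with the variable unset or set to a non-numeric string it would be the default of 256. Silently falling back is a design choice of this sketch; a real runtime might prefer to warn on malformed values.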

commit d3f74a0
Author: Patrick McCormick <>
Date:   Wed Nov 8 20:43:08 2023 -0700

    Some cleanup and work to try and chase down HIP target runtime variability.

commit c2bb71e
Author: Patrick McCormick <>
Date:   Thu Nov 2 13:04:11 2023 -0600

    chasing build issues/warnings/errors.

commit b4bafb4
Author: Patrick McCormick <>
Date:   Thu Nov 2 09:04:25 2023 -0600

    Chasing bugs...

commit 9299708
Author: Patrick McCormick <>
Date:   Tue Oct 31 16:23:02 2023 -0600

    working on benchmarks

commit 74cb34f
Author: Patrick McCormick <>
Date:   Wed Oct 25 14:31:56 2023 -0600

    Exploring full kokkos builds w/ clang.

commit 4dc3221
Author: Patrick McCormick <>
Date:   Tue Jun 27 14:13:50 2023 -0600

    Some cleanup and small tweaks.

commit 5136b24
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Wed Nov 8 20:27:27 2023 -0700

    Attempt at a quick multi-stream prefetch feature.

commit 8ece574
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Thu Nov 2 11:48:12 2023 -0600

    small tweaks to sort out some performance details.

commit 9b00669
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Wed Nov 1 20:38:03 2023 -0600

    Tweak in attempt to debug potential numa issues that are impacting
    consistent performance across multiple application runs.

commit f5d53a6
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Oct 31 14:46:29 2023 -0600

    A bit more cleanup and adding new tests specific to kitsune.

commit ff078df
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Oct 31 09:47:10 2023 -0600

    A bit more cleanup and adding some infrastructure for the multi-target
    test code (added makefile and a kokkos version).  Not all the pieces
    are in place to fully test.

commit d884674
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Oct 31 08:53:36 2023 -0600

    Clean up some code cruft -- no need to duplicate else branch cases.

commit 88bc75b
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Oct 31 08:39:57 2023 -0600

    Forgot to save a cleaned up comment...

commit bd7941e
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Mon Oct 30 20:01:09 2023 -0600

    New code to handle tapir attributes on Kokkos "statements".

    Some new code for cuda memory management details (calloc, realloc,
    etc.).  Along with some prep work for upcoming memory management and
    movement changes.
pmccormick authored and tarunprabhu committed Feb 16, 2024
1 parent 942c6f4 commit df8b5f7
Showing 132 changed files with 205,025 additions and 4,440 deletions.
27 changes: 16 additions & 11 deletions clang/include/clang/Basic/Attr.td
@@ -4206,19 +4206,16 @@ def AvailableOnlyInDefaultEvalMethod : InheritableAttr {

def TapirTarget : StmtAttr {
let Spellings = [CXX11<"tapir","target">];
let Subjects = SubjectList<[ForallStmt, CXXForallRangeStmt],
ErrorDiag, "'forall' statement">;
let Subjects = SubjectList<[ForallStmt, CXXForallRangeStmt, Stmt],
ErrorDiag, "'parallel' statements">;
let Args = [
EnumArgument<"TapirTargetAttrType", "TapirTargetAttrTy",
// TODO: Is there a difference between "serial" and "none"?
["cheetah", "cilk", "cuda",
"hip", "libomp", "none",
"qthreads", "realm", "rocm",
"serial", "zero", "opencl"],
["CheetahRT", "CilkRT", "CudaRT",
"HipRT", "OmpRT", "QthreadsRT",
"RealmRT", "RocmRT", "SequentialRT",
"ZeroRT", "OpenCLRT"], 0>
EnumArgument<"TapirTargetAttrType", "TapirTargetAttrTy",
["none", "serial", "cuda", "hip", "lambda", "omptask",
"opencilk", "openmp", "qthreads", "realm"],
["None", "Serial", "Cuda", "Hip", "Lambda", "OMPTask",
"OpenCilk", "OpenMP", "Qthreads", "Realm"],
0>
];
let Documentation = [TapirRTDocs];
}
@@ -4249,3 +4246,11 @@ def KitsuneMemAccess : InheritableAttr {

let Documentation = [KitsuneMemAccessDocs];
}

def KitsuneLaunch : StmtAttr {
let Spellings = [CXX11<"kitsune","launch">];
let Subjects = SubjectList<[ForallStmt,CXXForallRangeStmt],
ErrorDiag, "'parallel' statements">;
let Args = [ExprArgument<"ThreadsPerBlock">];
let Documentation = [KitsuneLaunchDocs];
}
34 changes: 30 additions & 4 deletions clang/include/clang/Basic/AttrDocs.td
@@ -7078,7 +7078,7 @@ the variables were declared in. It is not possible to check the return value
}];
}

def TapirDocs : DocumentationCategory<"Tapir Attributes"> {
def KitsuneDocs : DocumentationCategory<"Kitsune Attributes"> {
let Content = [{
Tapir and Kitsune introduce custom C++ attributres that let developer
control behaviors at the source code level vs. only through command line
@@ -7087,7 +7087,7 @@ arguments.
}

def TapirRTDocs : Documentation {
let Category = TapirDocs;
let Category = KitsuneDocs;
let Content = [{

**Experimental**
@@ -7115,7 +7115,7 @@ Example:
}

def TapirStrategyDocs : Documentation {
let Category = TapirDocs;
let Category = KitsuneDocs;
let Content = [{

**Experimental**
@@ -7137,7 +7137,7 @@ strategies are:
}

def KitsuneMemAccessDocs : Documentation {
let Category = TapirDocs;
let Category = KitsuneDocs;
let Content = [{

The kitsune memory access attributes lets programmers inform the compiler of
@@ -7160,3 +7160,29 @@ inaccurate).
...
}];
}


def KitsuneLaunchDocs : Documentation {
let Category = KitsuneDocs;
let Content = [{

The kitsune launch attributes lets programmers inform the compiler how
to tweak the code generation aspects for launching the GPU kernels for
parallel constructs in the code. This information is not verified in
only a minor fashion at compile time and then again at runtime. In
general, it is up to the programmer to communicate the details
correctly in terms of the semantics of the underlying GPU programming
mechanisms. In general, this is intended more for kitsune development
and testing vs. an end-user feature.

- threads_per_block : the number of threads per block to launch the kernel.

.. code-block:: c++

[[kitsune:launch:threads_per_block(N)]]
forall(...) {
// loop body goes here.
}
...
}];
}
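
The documentation above says the launch value is checked at compile time and again at runtime, and the diagnostics this commit adds to DiagnosticSemaKinds.td name three failure cases for threads-per-block: a non-integer type, a value that does not fit in 32 bits, and a non-positive value. A minimal sketch of the two value checks, assuming the 32-bit limit means an unsigned 32-bit range (the diff does not say whether it is signed or unsigned), could look like the following. The enum and function names are hypothetical and are not the compiler's own code.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical result codes loosely mirroring the value-related launch
// diagnostics in this commit; the names are illustrative only.
enum class TpbCheck { Ok, NotPositive, TooLarge };

// Classify a candidate threads-per-block value that has already been
// evaluated to a 64-bit signed integer.
// Assumption: "32-bit value" is read here as "fits in uint32_t".
TpbCheck checkThreadsPerBlock(std::int64_t value) {
  if (value <= 0)
    return TpbCheck::NotPositive;
  const auto limit =
      static_cast<std::int64_t>(std::numeric_limits<std::uint32_t>::max());
  if (value > limit)
    return TpbCheck::TooLarge;
  return TpbCheck::Ok;
}
```

Under these assumptions a value of 256 is accepted, 0 and negative values are rejected as non-positive, and 4294967296 is rejected as too large. Hardware limits on threads per block are much smaller than the 32-bit range, which is presumably why the documentation mentions a further check at runtime.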
21 changes: 12 additions & 9 deletions clang/include/clang/Basic/DiagnosticSemaKinds.td
@@ -4316,26 +4316,22 @@ def note_called_once_gets_called_twice : Note<

// +===== kitsune/tapir: errors related to custom codegen attributes.
def warn_tapir_attr_not_enabled: Warning<
"tapir attribute encountered but tapir is not enabled (via -ftapir). "
"tapir attribute found but tapir is not enabled. "
"Attribute ignored.">,
InGroup<Tapir>;

def warn_tapir_target_attr_bad_stmt_class: Warning<
"tapir target runtime attribute applied to an unsupported statement. "
"tapir target attribute on unsupported statement. "
"Attribute ignored.">,
InGroup<Tapir>;

def warn_kitsune_not_enabled: Warning<
"kitsune not enabled (-fkitsune) tapir attribute will be ignored.">,
InGroup<Kitsune>;

// rutnime target-centric errors/warnings.
def err_tapir_target_attr_unsupported_stmt: Error<
"tapir target runtime attribute applied to unsupported statement kind.">;
"tapir target attribute on unsupported statement.">;
def err_tapir_target_attr_wrong_nargs: Error<
"tapir target runtime attribute takes a single argument.">;
"tapir target attribute takes a single argument.">;
def err_tapir_target_unknown: Error<
"statement using unknonwn tapir target runtime attribute.">;
"unknonwn tapir target.">;

// execution strategy/policy errors/warnings.
def err_tapir_strategy_attr_unsupported_stmt: Error<
@@ -4345,6 +4341,13 @@ def err_tapir_strategy_attr_wrong_nargs: Error<
def err_tapir_strategy_unknown: Error<
"statement using unknown strategy attribute.">;

def err_kitsune_launch_non_integral_type: Error<
"launch attribute: threads-per-block must be a built-in integer type.">;
def err_kitsune_launch_tpb_too_large: Error<
"launch attribute: threads-per-block must be a 32-bit value.">;
def err_kitsune_launch_tpb_must_be_positive: Error<
"launch attribute: threads-per-block must be a positive integer value.">;

// spawn+sync errors/warnings.
//
def warn_spawn_as_loop_body: Warning<
2 changes: 1 addition & 1 deletion clang/kitsune/CMakeLists.txt
@@ -138,7 +138,7 @@ endif()
option(KITSUNE_ENABLE_CUDA_ABI_TARGET
"Enable the Kitsune+Tapir CUDA Toolkit target and codegen library." OFF)
if (KITSUNE_ENABLE_CUDA_ABI_TARGET)
find_package(CUDAToolkit 10...12 REQUIRED) # We may need to be a bit more specifc on minor versions here.
find_package(CUDAToolkit 12.0 REQUIRED)
set(TAPIR_CUDA_ABI_TARGET_CFG_FILENAME
"cuda.cfg"
CACHE
5 changes: 4 additions & 1 deletion clang/kitsune/cuda.cfg.in
@@ -1,6 +1,9 @@
# Flags for the CUDA runtime ABI target.
-D_tapir_cuda_target
-L${CMAKE_INSTALL_PREFIX}/lib
-mllvm -stripmine-count=1
-mllvm -stripmine-coarsen-factor=1
${KITSUNE_CUDA_ABI_TARGET_EXTRA_LINK_FLAGS}
-Wl,-rpath=${CMAKE_INSTALL_PREFIX}/lib
-L${CMAKE_INSTALL_PREFIX}/lib
-lkitrt
${KITSUNE_CUDA_ABI_TARGET_LINK_LIBS}
3 changes: 3 additions & 0 deletions clang/kitsune/hip.cfg.in
@@ -1,5 +1,8 @@
# Flags for the HIP runtime ABI target.
#-ffp-contract=fast-honor-pragmas
-D_tapir_hip_target
-mllvm -stripmine-count=1
-mllvm -stripmine-coarsen-factor=1
-L${CMAKE_INSTALL_PREFIX}/lib
${KITSUNE_HIP_TARGET_EXTRA_LINK_FLAGS}
-lkitrt
9 changes: 8 additions & 1 deletion clang/lib/CodeGen/CGExpr.cpp
@@ -5019,7 +5019,14 @@ RValue CodeGenFunction::EmitCallExpr(const CallExpr *E,
std::string qname = fdecl->getQualifiedNameAsString();
if (qname == "Kokkos::parallel_for" ||
qname == "Kokkos::parallel_reduce") {
if (EmitKokkosConstruct(E))
// We handle the special case of Tapir target attributes on a
// Kokkos "statement" elsewhere (as the attribute is not
// really attached to the CallExpr but instead the C++ goop
// around the call -- implicit and clean up stuff). If we
// have seen such an attribute it was saved and we can simply
// pass TapirAttrs on from here for the Kokkos code
// transformation/generation.
if (EmitKokkosConstruct(E, TapirAttrs))
return RValue::get(nullptr);
} else if (getLangOpts().KokkosNoInit &&
(qname == "Kokkos::initialize" ||