Commit

WIP
MathiasMagnus committed Jan 18, 2024
1 parent e190a5f commit fc43a47
Showing 11 changed files with 132 additions and 6 deletions.
3 changes: 3 additions & 0 deletions docs/.sphinx/_toc.yml.in
@@ -8,6 +8,9 @@ subtrees:
- file: reference/math_api
- file: reference/glossary
- file: reference/deprecated_api_list
- caption: User Understand
entries:
- file: understand/programming_model
- caption: User How to Guides
entries:
- file: how_to_guides/install.md
3 changes: 3 additions & 0 deletions docs/conf.py
@@ -15,6 +15,9 @@
for sphinx_var in ROCmDocs.SPHINX_VARS:
globals()[sphinx_var] = getattr(docs_core, sphinx_var)

extensions += ["sphinxcontrib.doxylink"]
doxygen_html = ".doxygen/docBin/html"

# rocm-docs-core might or might not have changed these yet (depending on version),
# and we don't want to wipe their settings if they did
if "html_theme_options" not in globals():
1 change: 1 addition & 0 deletions docs/data/understand/programming_model/simt.svg
4 changes: 4 additions & 0 deletions docs/index.md
@@ -14,13 +14,17 @@ These documents are targeted at authors of HIP libraries and applications.

- {doc}`/.doxygen/docBin/html/index`
- {doc}`/.doxygen/docBin/html/modules`
- {doc}`/reference/programming_model`
- {doc}`/reference/kernel_language`
- {doc}`/reference/math_api`
- {doc}`/reference/deprecated_api_list`

:::

:::{grid-item-card} Understand HIP

- {doc}`/understand/programming_model`

:::

:::{grid-item-card} How-to Guides
28 changes: 28 additions & 0 deletions docs/reference/programming_model.md
@@ -200,3 +200,31 @@ SIMD architecture (like GPUs) not by exploiting data parallelism within a
single instance of a kernel and spreading identical instructions over the SIMD
engines at hand, but by scalarizing the entire kernel and issuing the scalar
instructions of multiple kernel instances to each of the SIMD engine lanes.

### Kernel launch

Kernels may be launched in multiple ways, all with different syntaxes and
intended use cases (a sketch contrasting them follows this list).

- Using the triple-chevron `<<<...>>>` operator on a `__global__` annotated
function.

- Using `hipLaunchKernelGGL()` on a `__global__` annotated function.

```{tip}
By default this name is a macro expanding to the triple-chevron syntax. In
cases where language syntax extensions are undesirable, or when launching
templated and/or overloaded kernel functions, define the
`HIP_TEMPLATE_KERNEL_LAUNCH` preprocessor macro before including the HIP
headers to turn it into a templated function.
```

- Using the
{doxygen}`launch APIs supporting the triple-chevron syntax <Clang>` directly.

```{caution}
These APIs are intended to be used or generated by tools such as the HIP
compiler itself and are not meant for end-user code. Should you be writing a
tool that needs to launch device code using HIP, consider using these over
the alternatives.
```
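
A minimal sketch contrasting the first two launch methods; the kernel name
`vec_add`, its arguments, and the launch geometry are assumptions for the
example:

```cu
#include <hip/hip_runtime.h>

__global__ void vec_add(float* out, const float* in)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid] += in[gid];
}

void launch(float* d_out, const float* d_in)
{
    dim3 grid(16);   // number of blocks
    dim3 block(256); // threads per block

    // Triple-chevron language extension; the optional third and fourth
    // arguments are the dynamic shared memory size and the stream.
    vec_add<<<grid, block, 0, 0>>>(d_out, d_in);

    // Macro (or templated function) form taking the same configuration.
    hipLaunchKernelGGL(vec_add, grid, block, 0, 0, d_out, d_in);
}
```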
99 changes: 93 additions & 6 deletions docs/understand/programming_model.md
@@ -6,7 +6,41 @@ architectures, such as GPUs. As a consequence, one needs a basic understanding
of the underlying device architecture to make efficient use of HIP and GPGPU
(General Purpose Graphics Processing Unit) programming in general.

## RDNA & CDNA Architecture Summary

Most GPU architectures, much like RDNA and CDNA, have a hierarchical
structure. The innermost piece is a Single Instruction Multiple Data (SIMD)
enabled vector Arithmetic Logic Unit (ALU). Beside the vector ALU, most
recent GPUs also house matrix ALUs for accelerating algorithms with
well-defined shapes.

Some number of vector and matrix ALUs comprise a larger block, often referred
to as a Compute Unit (OpenCL, AMD block diagrams), which HIP terminology
calls a Multi Processor.

:::{figure-md} rdna3_cu

<img src="../data/understand/programming_model/rdna3_cu.png" alt="Block Diagram of an RDNA3 Compute Unit.">

Block Diagram of an RDNA3 Compute Unit.
:::

:::{figure-md} cdna3_cu

<img src="../data/understand/programming_model/cdna3_cu.png" alt="Block Diagram of a CDNA3 Compute Unit.">

Block Diagram of a CDNA3 Compute Unit.
:::

For hardware implementation reasons, some number of Multi Processors are
grouped together into a Shader Engine or Compute Engine, typically sharing
some fixed-function units or memory subsystem resources.

:::{figure-md} cdna2_gcd

<img src="../data/understand/programming_model/cdna2_gcd.png" alt="Block Diagram of a CDNA2 Graphics Compute Die.">

Block Diagram of a CDNA2 Graphics Compute Die.
:::
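
The number of Multi Processors on a device can be queried at runtime. A
minimal sketch using the HIP runtime API, assuming device `0` is the device
of interest:

```cu
#include <hip/hip_runtime.h>
#include <cstdio>

int main()
{
    // multiProcessorCount reports the number of Compute Units / Multi
    // Processors available on the queried device.
    hipDeviceProp_t props;
    if (hipGetDeviceProperties(&props, 0) == hipSuccess)
        std::printf("Multi Processors: %d\n", props.multiProcessorCount);
    return 0;
}
```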

## Single Instruction Multiple Threads

@@ -25,15 +59,68 @@
Consider the following kernel:

```cu
__global__ void k(float4* a, const float4* b)
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int bdim = blockDim.x;
    a[tid] += (tid + bid - bdim) * b[tid];
}
```

The four-vector of floating-point values `b` is multiplied by a scalar, and
the result is added element-wise to the four-vector `a`. On modern
SIMD-capable architectures, the four-wide vector operations are expected to
compile to single SIMD instructions. GPU execution of this kernel, however,
will typically look like the following:

:::{figure-md} simt

<img src="../data/understand/programming_model/simt.svg" alt="Instruction flow of the sample SIMT program.">

Instruction flow of the sample SIMT program.
:::

In HIP, the lanes of a SIMD architecture are fed by mapping the threads of a
SIMT execution, one thread down each lane of a SIMD engine. Execution
parallelism isn't exploited from the width of the built-in vector types, but
via the built-in thread coordinates `threadIdx.x`, `blockIdx.x`, etc. For
more details, refer to {ref}`inherent_tread_model`.
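
For illustration, a minimal sketch of the flat global index a thread
typically derives from these coordinates; the kernel name `scale` and its
arguments are assumptions for the example:

```cu
__global__ void scale(float* data, float factor)
{
    // Each thread computes a unique global index from its block and thread
    // coordinates and works on its own element.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    data[gid] *= factor;
}
```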

## Heterogeneous Programming

The HIP programming model assumes two execution contexts. One is referred to
as _host_, while compute kernels execute on a _device_. These contexts have
different capabilities, and therefore slightly different rules apply to them.
The _host_ execution is defined by the C++ abstract machine, while _device_
execution follows the HIP model, primarily defined by SIMT. In code, these
execution contexts are signified by the `__host__` and `__device__`
decorators. There are a few key differences between the two:

- The C++ abstract machine assumes a unified memory address space, meaning
  that one can always access any given address in memory (assuming the
  absence of data races). HIP, however, introduces several memory namespaces;
  an address from one means nothing in another. Moreover, not all address
  spaces are accessible from all contexts.

  Looking at {ref}`cdna2_gcd` and inside the {ref}`cdna3_cu`, every Compute
  Unit has an instance of storage backing the namespace `__shared__`. Even if
  the host were to have access to these regions of memory, the performance
  benefits of the segmented memory subsystem rest on the fact that the host
  has no way of meaningfully accessing them asynchronously.

- Not all C++ language features map cleanly to typical device architectures.
  Some are very expensive (meaning: slow) to implement on GPU devices, and
  are therefore forbidden in device contexts so that users don't tap into
  features that unexpectedly decimate their program's performance. Offload
  devices targeted by HIP aren't general-purpose devices, at least not in the
  sense a CPU is. HIP focuses on data-parallel computations and as such
  caters to throughput-optimized architectures, such as GPUs or accelerators
  derived from GPU architectures.

- Asynchrony is at the forefront of the HIP API. Computations launched by HIP
  execute asynchronously on the device, and it is the user's responsibility
  to synchronize their data dispatch/fetch with computations on the device.
  HIP does perform implicit synchronization on occasion, but unlike some APIs
  (OpenCL, SYCL) it by and large places the onus of synchronization on the
  user, as the sketch following this list illustrates.
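
A minimal sketch of this explicit synchronization, reusing the hypothetical
`scale` kernel from the earlier sketch; the buffer names, element count, and
stream are likewise assumptions for the example:

```cu
#include <hip/hip_runtime.h>

__global__ void scale(float* data, float factor); // defined in the earlier sketch

// Dispatch a copy, a kernel, and a copy back on one stream, then block the
// host until all three complete. Error handling omitted for brevity.
void run(float* d_buf, float* h_buf, unsigned int count, hipStream_t stream)
{
    // Copy the input to the device; returns to the host immediately.
    hipMemcpyAsync(d_buf, h_buf, count * sizeof(float),
                   hipMemcpyHostToDevice, stream);

    // Launch the kernel on the same stream; also returns immediately.
    scale<<<dim3(count / 256), dim3(256), 0, stream>>>(d_buf, 2.0f);

    // Queue fetching the results, still without blocking the host.
    hipMemcpyAsync(h_buf, d_buf, count * sizeof(float),
                   hipMemcpyDeviceToHost, stream);

    // Only here does the host wait for the work queued above to finish.
    hipStreamSynchronize(stream);
}
```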
