Commit

WIP
MathiasMagnus committed Jan 18, 2024
1 parent e190a5f commit fc43a47
Showing 11 changed files with 132 additions and 6 deletions.
3 changes: 3 additions & 0 deletions docs/.sphinx/_toc.yml.in
@@ -8,6 +8,9 @@ subtrees:
- file: reference/math_api
- file: reference/glossary
- file: reference/deprecated_api_list
- caption: User Understand
entries:
- file: understand/programming_model
- caption: User How to Guides
entries:
- file: how_to_guides/install.md
3 changes: 3 additions & 0 deletions docs/conf.py
@@ -15,6 +15,9 @@
for sphinx_var in ROCmDocs.SPHINX_VARS:
globals()[sphinx_var] = getattr(docs_core, sphinx_var)

extensions += ["sphinxcontrib.doxylink"]
doxygen_html = ".doxygen/docBin/html"

# rocm-docs-core might or might not have changed these yet (depending on version),
# and we don't want to wipe their settings if they did
if "html_theme_options" not in globals():
1 change: 1 addition & 0 deletions docs/data/understand/programming_model/simt.svg
4 changes: 4 additions & 0 deletions docs/index.md
@@ -14,13 +14,17 @@ These documents are targeted at authors of HIP libraries and applications.

- {doc}`/.doxygen/docBin/html/index`
- {doc}`/.doxygen/docBin/html/modules`
- {doc}`/reference/programming_model`
- {doc}`/reference/kernel_language`
- {doc}`/reference/math_api`
- {doc}`/reference/deprecated_api_list`

:::

:::{grid-item-card} Understand HIP

- {doc}`/understand/programming_model`

:::

:::{grid-item-card} How-to Guides
28 changes: 28 additions & 0 deletions docs/reference/programming_model.md
@@ -200,3 +200,31 @@ SIMD architecture (like GPUs) not by exploiting data parallelism within a
single instance of a kernel and spreading identical instructions over the SIMD
engines at hand, but by scalarizing the entire kernel and issuing the scalar
instructions of multiple kernel instances to each of the SIMD engine lanes.

### Kernel launch

Kernels may be launched in multiple ways, all with different syntaxes and
intended use cases (a sketch contrasting them follows this list).

- Using the triple-chevron `<<<...>>>` operator on a `__global__` annotated
function.

- Using `hipLaunchKernelGGL()` on a `__global__` annotated function.

```{tip}
By default this name is a macro expanding to the triple-chevron syntax. In
cases where language syntax extensions are undesirable, or when launching
templated and/or overloaded kernel functions, define the
`HIP_TEMPLATE_KERNEL_LAUNCH` preprocessor macro before including the HIP
headers to turn it into a templated function.
```

- Using the
{doxygen}`launch APIs supporting the triple-chevron syntax <Clang>` directly.

```{caution}
These APIs are intended to be used or generated by tools such as the HIP
compiler itself and are not meant for end-user code. Should you be writing a
tool that needs to launch device code using HIP, consider using these over
the alternatives.
```
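
A minimal sketch contrasting the first two launch methods; the kernel name
`vec_add`, its arguments, and the launch geometry are assumptions for the
example:

```cu
#include <hip/hip_runtime.h>

__global__ void vec_add(float* out, const float* in)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid] += in[gid];
}

void launch(float* d_out, const float* d_in)
{
    dim3 grid(16);   // number of blocks
    dim3 block(256); // threads per block

    // Triple-chevron language extension; the optional third and fourth
    // arguments are the dynamic shared memory size and the stream.
    vec_add<<<grid, block, 0, 0>>>(d_out, d_in);

    // Macro (or templated function) form taking the same configuration.
    hipLaunchKernelGGL(vec_add, grid, block, 0, 0, d_out, d_in);
}
```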
99 changes: 93 additions & 6 deletions docs/understand/programming_model.md
@@ -6,7 +6,41 @@ architectures, such as GPUs. As a consequence, one needs a basic understanding
of the underlying device architecture to make efficient use of HIP and GPGPU
(General Purpose Graphics Processing Unit) programming in general.

## RDNA & CDNA Architecture Summary

Most GPU architectures, much like RDNA and CDNA, have a hierarchical
structure. The innermost piece is a Single Instruction Multiple Data (SIMD)
enabled vector Arithmetic Logic Unit (ALU). Beside the vector ALU, most
recent GPUs also house matrix ALUs for accelerating algorithms with
well-defined shapes.

Some number of vector and matrix ALUs comprise a larger block, often referred
to as a Compute Unit (OpenCL, AMD block diagrams), which HIP terminology
calls a Multi Processor.

:::{figure-md} rdna3_cu

<img src="../data/understand/programming_model/rdna3_cu.png" alt="Block Diagram of an RDNA3 Compute Unit.">

Block Diagram of an RDNA3 Compute Unit.
:::

:::{figure-md} cdna3_cu

<img src="../data/understand/programming_model/cdna3_cu.png" alt="Block Diagram of a CDNA3 Compute Unit.">

Block Diagram of a CDNA3 Compute Unit.
:::

For hardware implementation reasons, some number of Multi Processors are
grouped together into a Shader Engine or Compute Engine, typically sharing
some fixed-function units or memory subsystem resources.

:::{figure-md} cdna2_gcd

<img src="../data/understand/programming_model/cdna2_gcd.png" alt="Block Diagram of a CDNA2 Graphics Compute Die.">

Block Diagram of a CDNA2 Graphics Compute Die.
:::
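
The number of Multi Processors on a device can be queried at runtime. A
minimal sketch using the HIP runtime API, assuming device `0` is the device
of interest:

```cu
#include <hip/hip_runtime.h>
#include <cstdio>

int main()
{
    // multiProcessorCount reports the number of Compute Units / Multi
    // Processors available on the queried device.
    hipDeviceProp_t props;
    if (hipGetDeviceProperties(&props, 0) == hipSuccess)
        std::printf("Multi Processors: %d\n", props.multiProcessorCount);
    return 0;
}
```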

## Single Instruction Multiple Threads

@@ -25,15 +59,68 @@
Consider the following kernel:

```cu
__global__ void k(float4* a, const float4* b)
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int bdim = blockDim.x;
    a[tid] += (tid + bid - bdim) * b[tid];
}
```

The four-vector of floating-point values `b` is multiplied by a scalar, and
the result is added element-wise to the four-vector `a`. On modern
SIMD-capable architectures, the four-wide vector operations are expected to
compile to single SIMD instructions. GPU execution of this kernel, however,
will typically look like the following:

:::{figure-md} simt

<img src="../data/understand/programming_model/simt.svg" alt="Instruction flow of the sample SIMT program.">

Instruction flow of the sample SIMT program.
:::

In HIP, the lanes of a SIMD architecture are fed by mapping the threads of a
SIMT execution, one thread down each lane of a SIMD engine. Execution
parallelism isn't exploited from the width of the built-in vector types, but
via the built-in thread coordinates `threadIdx.x`, `blockIdx.x`, etc. For
more details, refer to {ref}`inherent_tread_model`.
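
For illustration, a minimal sketch of the flat global index a thread
typically derives from these coordinates; the kernel name `scale` and its
arguments are assumptions for the example:

```cu
__global__ void scale(float* data, float factor)
{
    // Each thread computes a unique global index from its block and thread
    // coordinates and works on its own element.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    data[gid] *= factor;
}
```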

## Heterogeneous Programming

The HIP programming model assumes two execution contexts. One is referred to
as _host_, while compute kernels execute on a _device_. These contexts have
different capabilities, and therefore slightly different rules apply to them.
The _host_ execution is defined by the C++ abstract machine, while _device_
execution follows the HIP model, primarily defined by SIMT. In code, these
execution contexts are signified by the `__host__` and `__device__`
decorators. There are a few key differences between the two:

- The C++ abstract machine assumes a unified memory address space, meaning
  that one can always access any given address in memory (assuming the
  absence of data races). HIP, however, introduces several memory namespaces;
  an address from one means nothing in another. Moreover, not all address
  spaces are accessible from all contexts.

  Looking at {ref}`cdna2_gcd` and inside the {ref}`cdna3_cu`, every Compute
  Unit has an instance of storage backing the namespace `__shared__`. Even if
  the host were to have access to these regions of memory, the performance
  benefits of the segmented memory subsystem rest on the fact that the host
  has no way of meaningfully accessing them asynchronously.

- Not all C++ language features map cleanly to typical device architectures.
  Some are very expensive (meaning: slow) to implement on GPU devices, and
  are therefore forbidden in device contexts so that users don't tap into
  features that unexpectedly decimate their program's performance. Offload
  devices targeted by HIP aren't general-purpose devices, at least not in the
  sense a CPU is. HIP focuses on data-parallel computations and as such
  caters to throughput-optimized architectures, such as GPUs or accelerators
  derived from GPU architectures.

- Asynchrony is at the forefront of the HIP API. Computations launched by HIP
  execute asynchronously on the device, and it is the user's responsibility
  to synchronize their data dispatch/fetch with computations on the device.
  HIP does perform implicit synchronization on occasion, but unlike some APIs
  (OpenCL, SYCL) it by and large places the onus of synchronization on the
  user, as the sketch following this list illustrates.
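
A minimal sketch of this explicit synchronization, reusing the hypothetical
`scale` kernel from the earlier sketch; the buffer names, element count, and
stream are likewise assumptions for the example:

```cu
#include <hip/hip_runtime.h>

__global__ void scale(float* data, float factor); // defined in the earlier sketch

// Dispatch a copy, a kernel, and a copy back on one stream, then block the
// host until all three complete. Error handling omitted for brevity.
void run(float* d_buf, float* h_buf, unsigned int count, hipStream_t stream)
{
    // Copy the input to the device; returns to the host immediately.
    hipMemcpyAsync(d_buf, h_buf, count * sizeof(float),
                   hipMemcpyHostToDevice, stream);

    // Launch the kernel on the same stream; also returns immediately.
    scale<<<dim3(count / 256), dim3(256), 0, stream>>>(d_buf, 2.0f);

    // Queue fetching the results, still without blocking the host.
    hipMemcpyAsync(h_buf, d_buf, count * sizeof(float),
                   hipMemcpyDeviceToHost, stream);

    // Only here does the host wait for the work queued above to finish.
    hipStreamSynchronize(stream);
}
```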
