From b31a2a358f74b013978a41321cccab54776baf11 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?M=C3=A1t=C3=A9=20Ferenc=20Nagy-Egri?= <mate@streamhpc.com>
Date: Fri, 9 Feb 2024 13:14:32 +0100
Subject: [PATCH] SWDEV-427628 - Documentation restructure

Upstream the documentation restructure and add some of the missing math
and async function documentation.

Change-Id: Ia7a6fbe053afe85015b04e20a610115be3c664af
---
 .readthedocs.yaml                             |    3 +
 .../contributing.md                           |   13 +-
 docs/conf.py                                  |    3 +
 docs/developer_guide/build.md                 |  200 -
 docs/developer_guide/logging.md               |  233 -
 docs/doxygen/Doxyfile                         |  782 ++--
 docs/how-to/debugging.rst                     |  381 ++
 docs/how-to/logging.rst                       |  235 +
 docs/how_to_guides/debugging.md               |  301 --
 docs/how_to_guides/install.md                 |   53 -
 docs/index.md                                 |   64 +-
 docs/install/build.rst                        |  256 ++
 docs/install/install.rst                      |   79 +
 docs/{reference => old}/glossary.md           |    0
 docs/old/reference/kernel_language.rst        | 1117 +++++
 docs/{ => old}/reference/terms.md             |    0
 docs/{ => old}/user_guide/faq.md              |    0
 .../user_guide/hip_porting_driver_api.md      |    0
 .../{ => old}/user_guide/hip_porting_guide.md |   33 +-
 docs/{ => old}/user_guide/hip_rtc.md          |    0
 .../user_guide/programming_manual.md          |    0
 docs/reference/deprecated_api_list.md         |   88 -
 docs/reference/deprecated_api_list.rst        |   91 +
 docs/reference/kernel_language.md             |  824 ----
 docs/reference/math_api.md                    | 3846 -----------------
 docs/sphinx/_toc.yml.in                       |   61 +-
 include/hip/hip_runtime_api.h                 |   18 +-
 27 files changed, 2796 insertions(+), 5885 deletions(-)
 rename docs/{developer_guide => about}/contributing.md (95%)
 delete mode 100755 docs/developer_guide/build.md
 delete mode 100644 docs/developer_guide/logging.md
 create mode 100644 docs/how-to/debugging.rst
 create mode 100644 docs/how-to/logging.rst
 delete mode 100644 docs/how_to_guides/debugging.md
 delete mode 100644 docs/how_to_guides/install.md
 create mode 100644 docs/install/build.rst
 create mode 100644 docs/install/install.rst
 rename docs/{reference => old}/glossary.md (100%)
 create mode 100644 docs/old/reference/kernel_language.rst
 rename docs/{ => old}/reference/terms.md (100%)
 rename docs/{ => old}/user_guide/faq.md (100%)
 rename docs/{ => old}/user_guide/hip_porting_driver_api.md (100%)
 rename docs/{ => old}/user_guide/hip_porting_guide.md (93%)
 rename docs/{ => old}/user_guide/hip_rtc.md (100%)
 rename docs/{ => old}/user_guide/programming_manual.md (100%)
 delete mode 100644 docs/reference/deprecated_api_list.md
 create mode 100644 docs/reference/deprecated_api_list.rst
 delete mode 100644 docs/reference/kernel_language.md
 delete mode 100644 docs/reference/math_api.md

diff --git a/.readthedocs.yaml b/.readthedocs.yaml
index 2439bb85c9..127d6c4983 100644
--- a/.readthedocs.yaml
+++ b/.readthedocs.yaml
@@ -19,3 +19,6 @@ build:
    apt_packages:
      - "doxygen"
      - "graphviz" # For dot graphs in doxygen
+   jobs:
+     post_checkout:
+       - git clone --depth=1 --branch rocdoc-195 https://github.com/StreamHPC/llvm-project.git ../llvm-project
diff --git a/docs/developer_guide/contributing.md b/docs/about/contributing.md
similarity index 95%
rename from docs/developer_guide/contributing.md
rename to docs/about/contributing.md
index 7229f7bf28..5b234d4c6d 100644
--- a/docs/developer_guide/contributing.md
+++ b/docs/about/contributing.md
@@ -1,7 +1,11 @@
-# Contributor Guidelines
+# Contributor guidelines
 
-## Make Tips
-`ROCM_PATH` is path where ROCM is installed. BY default `ROCM_PATH` is `/opt/rocm`.
+If you want to contribute to the HIP project, review the following guidelines. If you want to contribute
+to our documentation, refer to {doc}`Contribute to ROCm docs <rocm:contribute/index>`.
+
+## Make tips
+
+`ROCM_PATH` is the path where ROCm is installed. By default, `ROCM_PATH` is `/opt/rocm`.
 When building HIP, you will likely want to build and install to a local user-accessible directory (rather than `<ROCM_PATH>`).
 This can be easily be done by setting the `-DCMAKE_INSTALL_PREFIX` variable when running cmake.  Typical use case is to
 set `CMAKE_INSTALL_PREFIX` to your HIP git root, and then ensure `HIP_PATH` points to this directory. For example
@@ -15,9 +19,8 @@ export HIP_PATH=
 
 After making HIP, don't forget the "make install" step !
 
+## Add a new HIP API
 
-
-## Adding a new HIP API
 - Add a translation to the hipify-clang tool ; many examples abound.
     - For stat tracking purposes, place the API into an appropriate stat category ("dev", "mem", "stream", etc).
 - Add a inlined NVIDIA implementation for the function in include/hip/nvidia_detail/hip_runtime_api.h.
diff --git a/docs/conf.py b/docs/conf.py
index 251b58acd0..5188e069eb 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -37,3 +37,6 @@
 
 for sphinx_var in ROCmDocs.SPHINX_VARS:
     globals()[sphinx_var] = getattr(docs_core, sphinx_var)
+
+cpp_id_attributes = ["__global__", "__device__", "__host__", "__forceinline__", "static"]
+cpp_paren_attributes = ["__declspec"]
diff --git a/docs/developer_guide/build.md b/docs/developer_guide/build.md
deleted file mode 100755
index b13f6f0d5c..0000000000
--- a/docs/developer_guide/build.md
+++ /dev/null
@@ -1,200 +0,0 @@
-# Building HIP from Source
-
-## Prerequisites
-
-HIP code can be developed either on AMD ROCm platform using HIP-Clang compiler, or a CUDA platform with nvcc installed.
-Before build and run HIP, make sure drivers and pre-build packages are installed properly on the platform.
-
-### AMD platform
-Install ROCm packages or pre-built binary packages using the package manager. Refer to the ROCm Installation Guide at https://rocm.docs.amd.com for more information on installing ROCm.
-
-```shell
-sudo apt install mesa-common-dev
-sudo apt install clang
-sudo apt install comgr
-sudo apt-get -y install rocm-dkms
-sudo apt-get install -y libelf-dev
-```
-
-### NVIDIA platform
-
-Install Nvidia driver and pre-build packages (see HIP Installation Guide at https://docs.amd.com/ for the release)
-
-### Branch of repository
-
-Before get HIP source code, set the expected branch of repository at the variable `ROCM_BRANCH`.
-For example, for ROCm 6.1 release branch, set
-```shell
-export ROCM_BRANCH=rocm-6.1.x
-```
-
-ROCm5.7 release branch, set
-```shell
-export ROCM_BRANCH=rocm-5.7.x
-```
-Similiar format for future branches.
-
-`ROCM_PATH` is path where ROCM is installed. BY default `ROCM_PATH` is at `/opt/rocm`.
-
-
-## Build HIP
-
-
-### Get HIP source code
-
-A new repository 'hipother' is added in the ROCm 6.1 release, which is branched out from HIP.
-The 'hipother'provides files required to support the HIP back-end implementation on some non-AMD platforms, like NVIDIA.
-
-```shell
-git clone -b "$ROCM_BRANCH" https://github.com/ROCm/clr.git
-git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip.git
-git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hipother.git
-git clone -b "$ROCM_BRANCH" https://github.com/ROCm/HIPCC.git hipcc
-```
-
-### Set the environment variables
-
-```shell
-export CLR_DIR="$(readlink -f clr)"
-export HIP_DIR="$(readlink -f hip)"
-export HIP_OTHER="$(readlink -f hipother)"
-export HIPCC_DIR="$(readlink -f hipcc)"
-```
-
-Note, starting from ROCM 5.6 release, clr is a new repository including the previous ROCclr, HIPAMD and OpenCl repositories.
-ROCclr is defined on AMD platform that HIP uses Radeon Open Compute Common Language Runtime (ROCclr), which is a virtual device interface that HIP runtimes interact with different backends.
-HIPAMD provides implementation specifically for the AMD platform.
-OpenCL provides headers that ROCclr runtime currently depends on.
-hipother provides headers and implementation specifically for the non-AMD platform, like NVIDIA.
-
-### Build the HIPCC runtime
-
-```shell
-cd "$HIPCC_DIR"
-mkdir -p build; cd build
-cmake ..
-make -j4
-```
-
-### Build HIP on the AMD platform
-
-#### Build HIP runtime
-Commands to build HIP runtime on the AMD platform are as following. The option 'HIP_PLATFORM=amd' should be defined.
-
-```shell
-cd "$CLR_DIR"
-mkdir -p build; cd build
-cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=amd -DCMAKE_PREFIX_PATH="/opt/rocm/" -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIPCC_BIN_DIR=$HIPCC_DIR/build -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF ..
-make -j$(nproc)
-sudo make install
-```
-
-Note, if `CMAKE_INSTALL_PREFIX` is not specified, hip runtime will be installed to `<ROCM_PATH>/hip`.
-By default, release version of HIP is built.
-
-
-#### Default paths and environment variables
-
-   * By default HIP looks for HSA in `<ROCM_PATH>/hsa` (can be overridden by setting `HSA_PATH` environment variable).
-   * By default HIP is installed into `<ROCM_PATH>/hip`.
-   * By default HIP looks for clang in `<ROCM_PATH>/llvm/bin` (can be overridden by setting `HIP_CLANG_PATH` environment variable)
-   * By default HIP looks for device library in `<ROCM_PATH>/lib` (can be overridden by setting `DEVICE_LIB_PATH` environment variable).
-   * Optionally, consider adding `<ROCM_PATH>/bin` to your `PATH` to make it easier to use the tools.
-   * Optionally, set `HIPCC_VERBOSE=7` to output the command line for compilation.
-
-#### Generate profiling header after adding/changing a HIP API
-
-When you add or change a HIP API, you must generate a new `hip_prof_str.h` header. ROCm tools like ROCProfiler and ROCTracer use this header to track HIP APIs.
-To generate the header after your change, use the tool `hip_prof_gen.py` present in `hipamd/src`.
-
-Usage:
-
-`hip_prof_gen.py [-v] <input HIP API .h file> <patched srcs path> <previous output> [<output>]`
-
-Flags:
-
-  * -v - verbose messages
-  * -r - process source directory recursively
-  * -t - API types matching check
-  * --priv - private API check
-  * -e - on error exit mode
-  * -p - HIP_INIT_API macro patching mode
-
-Example Usage:
-```shell
-hip_prof_gen.py -v -p -t --priv <hip>/include/hip/hip_runtime_api.h \
-<hipamd>/src <hipamd>/include/hip/amd_detail/hip_prof_str.h         \
-<hipamd>/include/hip/amd_detail/hip_prof_str.h.new
-```
-
-#### Build HIP tests
-
-HIP catch tests, with the newly architectured Catch2, are officially separated from the HIP project. The HIP catch tests are moved to the HIP tests repository and can be built using the  instructions in the following sections.
-
-##### Get HIP tests source code
-
-```shell
-git clone -b "$ROCM_BRANCH" https://github.com/ROCm-Developer-Tools/hip-tests.git
-```
-##### Build HIP tests from source
-
-```shell
-export HIPTESTS_DIR="$(readlink -f hip-tests)"
-cd "$HIPTESTS_DIR"
-mkdir -p build; cd build
-cmake ../catch -DHIP_PLATFORM=amd -DHIP_PATH=$CLR_DIR/build/install # or any path where HIP is installed, for example, /opt/rocm.
-make -j$(nproc) build_tests # build tests
-cd build/catch_tests && ctest # to run all tests.
-```
-HIP catch tests are built under the folder $HIPTESTS_DIR/build.
-Note that when using ctest, you can use different ctest options, for example, to run all tests with the keyword hipMemset,
-```
-ctest -R hipMemset
-```
-Use the below option will print test failures for failed tests,
-```
-ctest --output-on-failure
-```
-For more information, refer to https://cmake.org/cmake/help/v3.5/manual/ctest.1.html.
-
-To run any single catch test, the following is an example,
-
-```shell
-cd $HIPTESTS_DIR/build/catch_tests/unit/texture
-./TextureTest
-```
-
-##### Build HIP Catch2 standalone tests
-
-HIP Catch2 supports building standalone tests, for example,
-
-```shell
-cd "$HIPTESTS_DIR"
-hipcc $HIPTESTS_DIR/catch/unit/memory/hipPointerGetAttributes.cc -I ./catch/include ./catch/hipTestMain/standalone_main.cc -I ./catch/external/Catch2 -o hipPointerGetAttributes
-./hipPointerGetAttributes
-...
-
-All tests passed
-```
-
-### Build HIP on the NVIDIA platform
-
-
-#### Build HIP runtime
-Commands to build HIP on the NVIDIA platform are as following. The options 'HIPNV_DIR=$HIP_OTHER/hipnv' and 'HIP_PLATFORM=nvidia' should be defined.
-
-```shell
-cd "$CLR_DIR"
-mkdir -p build; cd build
-cmake -DHIP_COMMON_DIR=$HIP_DIR -DCMAKE_PREFIX_PATH="/opt/rocm/" -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIPCC_BIN_DIR=$HIPCC_DIR/build -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF -DHIPNV_DIR=$HIP_OTHER/hipnv -DHIP_PLATFORM=nvidia ..
-make -j$(nproc)
-sudo make install
-```
-
-#### Build HIP tests
-Build HIP tests commands on NVIDIA platform are basically the same as AMD, except set `-DHIP_PLATFORM=nvidia`.
-
-## Run HIP
-
-Compile and run the [square sample](https://github.com/ROCm/hip-tests/tree/rocm-6.0.x/samples/0_Intro/square).
-
diff --git a/docs/developer_guide/logging.md b/docs/developer_guide/logging.md
deleted file mode 100644
index 417b2988ef..0000000000
--- a/docs/developer_guide/logging.md
+++ /dev/null
@@ -1,233 +0,0 @@
-# Logging Mechanisms
-
-HIP provides a logging mechanism, which is a convenient way of printing
-important information so as to trace HIP API and runtime codes during the
-execution of HIP application.
-It assists HIP development team in the development of HIP runtime, and is useful
-for HIP application developers as well.
-Depending on the setting of logging level and logging mask, HIP logging will
-print different kinds of information, for different types of functionalities
-such as HIP APIs, executed kernels, queue commands and queue contents, etc.
-
-## HIP Logging Level:
-
-By default, HIP logging is disabled, it can be enabled via the `AMD_LOG_LEVEL`
-environment variable.
-The value controls the logging level. The levels are defined as:
-
-```cpp
-enum LogLevel {
-  LOG_NONE    = 0,
-  LOG_ERROR   = 1,
-  LOG_WARNING = 2,
-  LOG_INFO    = 3,
-  LOG_DEBUG   = 4
-};
-```
-
-## HIP Logging Mask:
-
-Logging mask is designed to print types of functionalities during the execution
-of HIP application.
-It can be set as one of the following values,
-
-```cpp
-enum LogMask {
-  LOG_API       = 1,      //!< (0x1)     API call
-  LOG_CMD       = 2,      //!< (0x2)     Kernel and Copy Commands and Barriers
-  LOG_WAIT      = 4,      //!< (0x4)     Synchronization and waiting for commands to finish
-  LOG_AQL       = 8,      //!< (0x8)     Decode and display AQL packets
-  LOG_QUEUE     = 16,     //!< (0x10)    Queue commands and queue contents
-  LOG_SIG       = 32,     //!< (0x20)    Signal creation, allocation, pool
-  LOG_LOCK      = 64,     //!< (0x40)    Locks and thread-safety code.
-  LOG_KERN      = 128,    //!< (0x80)    Kernel creations and arguments, etc.
-  LOG_COPY      = 256,    //!< (0x100)   Copy debug
-  LOG_COPY2     = 512,    //!< (0x200)   Detailed copy debug
-  LOG_RESOURCE  = 1024,   //!< (0x400)   Resource allocation, performance-impacting events.
-  LOG_INIT      = 2048,   //!< (0x800)   Initialization and shutdown
-  LOG_MISC      = 4096,   //!< (0x1000)  Misc debug, not yet classified
-  LOG_AQL2      = 8192,   //!< (0x2000)  Show raw bytes of AQL packet
-  LOG_CODE      = 16384,  //!< (0x4000)  Show code creation debug
-  LOG_CMD2      = 32768,  //!< (0x8000)  More detailed command info, including barrier commands
-  LOG_LOCATION  = 65536,  //!< (0x10000) Log message location
-  LOG_MEM       = 131072, //!< (0x20000) Memory allocation
-  LOG_MEM_POOL  = 262144, //!< (0x40000) Memory pool allocation, including memory in graphs
-  LOG_ALWAYS    = -1      //!< (0xFFFFFFFF) Log always even mask flag is zero
-};
-```
-
-Once `AMD_LOG_LEVEL` is set, logging mask is set as default with the value
-`-1`.
-However, for different purpose of logging functionalities, logging mask can be
-defined as well via environment variable `AMD_LOG_MASK`
-
-## HIP Logging command:
-
-To pring HIP logging information, the function is defined as
-```cpp
-#define ClPrint(level, mask, format, ...)                                       \
-  do {                                                                          \
-    if (AMD_LOG_LEVEL >= level) {                                               \
-      if (AMD_LOG_MASK & mask || mask == amd::LOG_ALWAYS) {                     \
-        if (AMD_LOG_MASK & amd::LOG_LOCATION) {                                 \
-          amd::log_printf(level, __FILENAME__, __LINE__, format, ##__VA_ARGS__);\
-        } else {                                                                \
-          amd::log_printf(level, "", 0, format, ##__VA_ARGS__);                 \
-        }                                                                       \
-      }                                                                         \
-    }                                                                           \
-  } while (false)
-```
-
-So in HIP code, call `ClPrint()` function with proper input varibles as needed,
-for example,
-```cpp
-ClPrint(amd::LOG_INFO, amd::LOG_INIT, "Initializing HSA stack.");
-```
-
-## HIP Logging Example:
-
-Below is an example to enable HIP logging and get logging information during execution of hipinfo on Linux,
-
-```console
-user@user-test:~/hip/bin$ export AMD_LOG_LEVEL=4
-user@user-test:~/hip/bin$ ./hipinfo
-
-:3:rocdevice.cpp            :444 : 115921848303 us: [pid:177158 tid:0x7f941a0d5a80] Initializing HSA stack.
-:3:comgrctx.cpp             :33  : 115921854454 us: [pid:177158 tid:0x7f941a0d5a80] Loading COMGR library.
-:3:rocdevice.cpp            :210 : 115921854490 us: [pid:177158 tid:0x7f941a0d5a80] Numa selects cpu agent[0]=0xcc4ef0(fine=0xbbffe0,coarse=0xcc53d0) for gpu agent=0xcc59c0 CPU<->GPU XGMI=0
-:3:rocdevice.cpp            :1680: 115921854758 us: [pid:177158 tid:0x7f941a0d5a80] Gfx Major/Minor/Stepping: 10/3/1
-:3:rocdevice.cpp            :1682: 115921854764 us: [pid:177158 tid:0x7f941a0d5a80] HMM support: 1, XNACK: 0, Direct host access: 0
-:3:rocdevice.cpp            :1684: 115921854766 us: [pid:177158 tid:0x7f941a0d5a80] Max SDMA Read Mask: 0x3, Max SDMA Write Mask: 0x3
-:4:rocdevice.cpp            :2063: 115921854812 us: [pid:177158 tid:0x7f941a0d5a80] Allocate hsa host memory 0x7f941a179000, size 0x38
-:4:rocdevice.cpp            :2063: 115921855244 us: [pid:177158 tid:0x7f941a0d5a80] Allocate hsa host memory 0x7f930c400000, size 0x101000
-:4:rocdevice.cpp            :2063: 115921856057 us: [pid:177158 tid:0x7f941a0d5a80] Allocate hsa host memory 0x7f930c200000, size 0x101000
-:4:runtime.cpp              :83  : 115921856451 us: [pid:177158 tid:0x7f941a0d5a80] init
-:3:hip_context.cpp          :48  : 115921856457 us: [pid:177158 tid:0x7f941a0d5a80] Direct Dispatch: 1
-:3:hip_device_runtime.cpp   :546 : 115921856476 us: [pid:177158 tid:0x7f941a0d5a80]  hipGetDeviceCount ( 0x7ffc69af52e4 ) 
-:3:hip_device_runtime.cpp   :548 : 115921856479 us: [pid:177158 tid:0x7f941a0d5a80] hipGetDeviceCount: Returned hipSuccess : 
-:3:hip_device_runtime.cpp   :561 : 115921856484 us: [pid:177158 tid:0x7f941a0d5a80]  hipSetDevice ( 0 ) 
-:3:hip_device_runtime.cpp   :565 : 115921856488 us: [pid:177158 tid:0x7f941a0d5a80] hipSetDevice: Returned hipSuccess : 
---------------------------------------------------------------------------------
-device#                           0
-:3:hip_device.cpp           :381 : 115921856498 us: [pid:177158 tid:0x7f941a0d5a80]  hipGetDeviceProperties ( 0x7ffc69af4fa0, 0 ) 
-:3:hip_device.cpp           :383 : 115921856502 us: [pid:177158 tid:0x7f941a0d5a80] hipGetDeviceProperties: Returned hipSuccess : 
-Name:                             AMD Radeon RX 6700 XT
-pciBusID:                         3
-pciDeviceID:                      0
-pciDomainID:                      0
-multiProcessorCount:              20
-maxThreadsPerMultiProcessor:      2048
-isMultiGpuBoard:                  0
-clockRate:                        2855 Mhz
-memoryClockRate:                  1000 Mhz
-memoryBusWidth:                   192
-totalGlobalMem:                   11.98 GB
-totalConstMem:                    2147483647
-sharedMemPerBlock:                64.00 KB
-canMapHostMemory:                 1
-regsPerBlock:                     65536
-warpSize:                         32
-l2CacheSize:                      3145728
-computeMode:                      0
-maxThreadsPerBlock:               1024
-maxThreadsDim.x:                  1024
-maxThreadsDim.y:                  1024
-maxThreadsDim.z:                  1024
-maxGridSize.x:                    2147483647
-maxGridSize.y:                    65536
-maxGridSize.z:                    65536
-major:                            10
-minor:                            3
-concurrentKernels:                1
-cooperativeLaunch:                1
-cooperativeMultiDeviceLaunch:     1
-isIntegrated:                     0
-maxTexture1D:                     16384
-maxTexture2D.width:               16384
-maxTexture2D.height:              16384
-maxTexture3D.width:               16384
-maxTexture3D.height:              16384
-maxTexture3D.depth:               8192
-isLargeBar:                       0
-asicRevision:                     0
-maxSharedMemoryPerMultiProcessor: 64.00 KB
-clockInstructionRate:             1000.00 Mhz
-...
-gcnArchName:                      gfx1031
-:3:hip_device_runtime.cpp   :546 : 115921856613 us: [pid:177158 tid:0x7f941a0d5a80]  hipGetDeviceCount ( 0x7ffc69af4f8c ) 
-:3:hip_device_runtime.cpp   :548 : 115921856616 us: [pid:177158 tid:0x7f941a0d5a80] hipGetDeviceCount: Returned hipSuccess : 
-:3:hip_peer.cpp             :176 : 115921856625 us: [pid:177158 tid:0x7f941a0d5a80]  hipDeviceCanAccessPeer ( 0x7ffc69af4f90, 0, 0 ) 
-:3:hip_peer.cpp             :177 : 115921856628 us: [pid:177158 tid:0x7f941a0d5a80] hipDeviceCanAccessPeer: Returned hipSuccess : 
-peers:
-:3:hip_peer.cpp             :176 : 115921856633 us: [pid:177158 tid:0x7f941a0d5a80]  hipDeviceCanAccessPeer ( 0x7ffc69af4f90, 0, 0 ) 
-:3:hip_peer.cpp             :177 : 115921856636 us: [pid:177158 tid:0x7f941a0d5a80] hipDeviceCanAccessPeer: Returned hipSuccess : 
-non-peers:                        device#0 
-
-:3:hip_memory.cpp           :764 : 115921856649 us: [pid:177158 tid:0x7f941a0d5a80]  hipMemGetInfo ( 0x7ffc69af4f90, 0x7ffc69af4f98 ) 
-:3:hip_memory.cpp           :788 : 115921856654 us: [pid:177158 tid:0x7f941a0d5a80] hipMemGetInfo: Returned hipSuccess : 
-memInfo.total:                    11.98 GB
-memInfo.free:                     11.86 GB (99%)
-```
-
-On Windows, AMD_LOG_LEVEL can be set via environment variable from advanced system setting, or from Command prompt run as administrator, as shown below as an example, which shows some debug log information calling backend runtime on Windows.
-```
-C:\hip\bin>set AMD_LOG_LEVEL=4
-C:\hip\bin>hipinfo
-:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\comgrctx.cpp:33  : 605413686305 us: 29864: [tid:0x9298] Loading COMGR library.
-:4:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\platform\runtime.cpp:83  : 605413869411 us: 29864: [tid:0x9298] init
-:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_context.cpp:47  : 605413869502 us: 29864: [tid:0x9298] Direct Dispatch: 0
-:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device_runtime.cpp:543 : 605413870553 us: 29864: [tid:0x9298] hipGetDeviceCount: Returned hipSuccess :
-:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device_runtime.cpp:556 : 605413870631 us: 29864: [tid:0x9298] ←[32m hipSetDevice ( 0 ) ←[0m
-:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device_runtime.cpp:561 : 605413870848 us: 29864: [tid:0x9298] hipSetDevice: Returned hipSuccess :
---------------------------------------------------------------------------------
-device#                           0
-:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device.cpp:346 : 605413871623 us: 29864: [tid:0x9298] ←[32m hipGetDeviceProperties ( 0000008AEBEFF8C8, 0 ) ←[0m
-:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device.cpp:348 : 605413871695 us: 29864: [tid:0x9298] hipGetDeviceProperties: Returned hipSuccess :
-Name:                             AMD Radeon(TM) Graphics
-pciBusID:                         3
-pciDeviceID:                      0
-pciDomainID:                      0
-multiProcessorCount:              7
-maxThreadsPerMultiProcessor:      2560
-isMultiGpuBoard:                  0
-clockRate:                        1600 Mhz
-memoryClockRate:                  1333 Mhz
-memoryBusWidth:                   0
-totalGlobalMem:                   12.06 GB
-totalConstMem:                    2147483647
-sharedMemPerBlock:                64.00 KB
-...
-gcnArchName:                      gfx90c:xnack-
-:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device_runtime.cpp:541 : 605413924779 us: 29864: [tid:0x9298] ←[32m hipGetDeviceCount ( 0000008AEBEFF8A4 ) ←[0m
-:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device_runtime.cpp:543 : 605413925075 us: 29864: [tid:0x9298] hipGetDeviceCount: Returned hipSuccess :
-peers:                            :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_peer.cpp:176 : 605413928643 us: 29864: [tid:0x9298] ←[32m hipDeviceCanAccessPeer ( 0000008AEBEFF890, 0, 0 ) ←[0m
-:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_peer.cpp:177 : 605413928743 us: 29864: [tid:0x9298] hipDeviceCanAccessPeer: Returned hipSuccess :
-non-peers:                        :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_peer.cpp:176 : 605413930830 us: 29864: [tid:0x9298] ←[32m hipDeviceCanAccessPeer ( 0000008AEBEFF890, 0, 0 ) ←[0m
-:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_peer.cpp:177 : 605413930882 us: 29864: [tid:0x9298] hipDeviceCanAccessPeer: Returned hipSuccess :
-device#0
-...
-:4:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\pal\palmemory.cpp:430 : 605414517802 us: 29864: [tid:0x9298] Free-:     8000 bytes, VM[ 3007c8000,  3007d0000]
-:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\devprogram.cpp:2979: 605414517893 us: 29864: [tid:0x9298] For Init/Fini: Kernel Name: __amd_rocclr_copyBufferToImage
-:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\devprogram.cpp:2979: 605414518259 us: 29864: [tid:0x9298] For Init/Fini: Kernel Name: __amd_rocclr_copyBuffer
-...
-:4:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\pal\palmemory.cpp:206 : 605414523422 us: 29864: [tid:0x9298] Alloc: 100000 bytes, ptr[00000003008D0000-00000003009D0000], obj[00000003007D0000-00000003047D0000]
-:4:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\pal\palmemory.cpp:206 : 605414523767 us: 29864: [tid:0x9298] Alloc: 100000 bytes, ptr[00000003009D0000-0000000300AD0000], obj[00000003007D0000-00000003047D0000]
-:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_memory.cpp:681 : 605414524092 us: 29864: [tid:0x9298] hipMemGetInfo: Returned hipSuccess :
-memInfo.total:                    12.06 GB
-memInfo.free:                     11.93 GB (99%)
-```
-
-## HIP Logging Tips:
-
-- HIP logging works for both release and debug version of HIP application.
-- Logging function with different logging level can be called in the code as
-  needed.
-- Information with logging level less than AMD_LOG_LEVEL will be printed.
-- If need to save the HIP logging output information in a file, just define the
-  file at the command when run the application at the terminal, for example,
-
-```console
-user@user-test:~/hip/bin$ ./hipinfo > ~/hip_log.txt
-```
-
diff --git a/docs/doxygen/Doxyfile b/docs/doxygen/Doxyfile
index a08f1740cb..6211ad601c 100644
--- a/docs/doxygen/Doxyfile
+++ b/docs/doxygen/Doxyfile
@@ -1,4 +1,4 @@
-# Doxyfile 1.8.17
+# Doxyfile 1.9.7
 
 # This file describes the settings to be used by the documentation system
 # doxygen (www.doxygen.org) for a project.
@@ -12,6 +12,16 @@
 # For lists, items can also be appended using:
 # TAG += value [value, ...]
 # Values that contain spaces should be placed between quotes (\" \").
+#
+# Note:
+#
+# Use doxygen to compare the used configuration file with the template
+# configuration file:
+# doxygen -x [configFile]
+# Use doxygen to compare the used configuration file with the template
+# configuration file without replacing the environment variables or CMake type
+# replacement variables:
+# doxygen -x_noenv [configFile]
 
 #---------------------------------------------------------------------------
 # Project related configuration options
@@ -60,16 +70,28 @@ PROJECT_LOGO           =
 
 OUTPUT_DIRECTORY       = .
 
-# If the CREATE_SUBDIRS tag is set to YES then doxygen will create 4096 sub-
-# directories (in 2 levels) under the output directory of each output format and
-# will distribute the generated files over these directories. Enabling this
+# If the CREATE_SUBDIRS tag is set to YES then doxygen will create up to 4096
+# sub-directories (in 2 levels) under the output directory of each output format
+# and will distribute the generated files over these directories. Enabling this
 # option can be useful when feeding doxygen a huge amount of source files, where
 # putting all generated files in the same directory would otherwise causes
-# performance problems for the file system.
+# performance problems for the file system. Adapt CREATE_SUBDIRS_LEVEL to
+# control the number of sub-directories.
 # The default value is: NO.
 
 CREATE_SUBDIRS         = NO
 
+# Controls the number of sub-directories that will be created when
+# CREATE_SUBDIRS tag is set to YES. Level 0 represents 16 directories, and every
+# level increment doubles the number of directories, resulting in 4096
+# directories at level 8 which is the default and also the maximum value. The
+# sub-directories are organized in 2 levels, the first level always has a fixed
+# number of 16 directories.
+# Minimum value: 0, maximum value: 8, default value: 8.
+# This tag requires that the tag CREATE_SUBDIRS is set to YES.
+
+CREATE_SUBDIRS_LEVEL   = 8
+
 # If the ALLOW_UNICODE_NAMES tag is set to YES, doxygen will allow non-ASCII
 # characters to appear in the names of generated files. If set to NO, non-ASCII
 # characters will be escaped, for example _xE3_x81_x84 will be used for Unicode
@@ -81,26 +103,18 @@ ALLOW_UNICODE_NAMES    = NO
 # The OUTPUT_LANGUAGE tag is used to specify the language in which all
 # documentation generated by doxygen is written. Doxygen will use this
 # information to generate all constant output in the proper language.
-# Possible values are: Afrikaans, Arabic, Armenian, Brazilian, Catalan, Chinese,
-# Chinese-Traditional, Croatian, Czech, Danish, Dutch, English (United States),
-# Esperanto, Farsi (Persian), Finnish, French, German, Greek, Hungarian,
-# Indonesian, Italian, Japanese, Japanese-en (Japanese with English messages),
-# Korean, Korean-en (Korean with English messages), Latvian, Lithuanian,
-# Macedonian, Norwegian, Persian (Farsi), Polish, Portuguese, Romanian, Russian,
-# Serbian, Serbian-Cyrillic, Slovak, Slovene, Spanish, Swedish, Turkish,
-# Ukrainian and Vietnamese.
+# Possible values are: Afrikaans, Arabic, Armenian, Brazilian, Bulgarian,
+# Catalan, Chinese, Chinese-Traditional, Croatian, Czech, Danish, Dutch, English
+# (United States), Esperanto, Farsi (Persian), Finnish, French, German, Greek,
+# Hindi, Hungarian, Indonesian, Italian, Japanese, Japanese-en (Japanese with
+# English messages), Korean, Korean-en (Korean with English messages), Latvian,
+# Lithuanian, Macedonian, Norwegian, Persian (Farsi), Polish, Portuguese,
+# Romanian, Russian, Serbian, Serbian-Cyrillic, Slovak, Slovene, Spanish,
+# Swedish, Turkish, Ukrainian and Vietnamese.
 # The default value is: English.
 
 OUTPUT_LANGUAGE        = English
 
-# The OUTPUT_TEXT_DIRECTION tag is used to specify the direction in which all
-# documentation generated by doxygen is written. Doxygen will use this
-# information to generate all generated output in the proper direction.
-# Possible values are: None, LTR, RTL and Context.
-# The default value is: None.
-
-OUTPUT_TEXT_DIRECTION  = None
-
 # If the BRIEF_MEMBER_DESC tag is set to YES, doxygen will include brief member
 # descriptions after the members that are listed in the file and class
 # documentation (similar to Javadoc). Set to NO to disable this.
@@ -227,6 +241,14 @@ QT_AUTOBRIEF           = NO
 
 MULTILINE_CPP_IS_BRIEF = NO
 
+# By default Python docstrings are displayed as preformatted text and doxygen's
+# special commands cannot be used. By setting PYTHON_DOCSTRING to NO the
+# doxygen's special commands can be used and the contents of the docstring
+# documentation blocks is shown as doxygen documentation.
+# The default value is: YES.
+
+PYTHON_DOCSTRING       = YES
+
 # If the INHERIT_DOCS tag is set to YES then an undocumented member inherits the
 # documentation from any documented member that it re-implements.
 # The default value is: YES.
@@ -250,25 +272,19 @@ TAB_SIZE               = 4
 # the documentation. An alias has the form:
 # name=value
 # For example adding
-# "sideeffect=@par Side Effects:\n"
+# "sideeffect=@par Side Effects:^^"
 # will allow you to put the command \sideeffect (or @sideeffect) in the
 # documentation, which will result in a user-defined paragraph with heading
-# "Side Effects:". You can put \n's in the value part of an alias to insert
-# newlines (in the resulting output). You can put ^^ in the value part of an
-# alias to insert a newline as if a physical newline was in the original file.
-# When you need a literal { or } or , in the value part of an alias you have to
-# escape them by means of a backslash (\), this can lead to conflicts with the
-# commands \{ and \} for these it is advised to use the version @{ and @} or use
-# a double escape (\\{ and \\})
+# "Side Effects:". Note that you cannot put \n's in the value part of an alias
+# to insert newlines (in the resulting output). You can put ^^ in the value part
+# of an alias to insert a newline as if a physical newline was in the original
+# file. When you need a literal { or } or , in the value part of an alias you
+# have to escape them by means of a backslash (\), this can lead to conflicts
+# with the commands \{ and \} for these it is advised to use the version @{ and
+# @} or use a double escape (\\{ and \\})
 
 ALIASES                =
 
-# This tag can be used to specify a number of word-keyword mappings (TCL only).
-# A mapping has the form "name=value". For example adding "class=itcl::class"
-# will allow you to use the command class in the itcl::class meaning.
-
-TCL_SUBST              =
-
 # Set the OPTIMIZE_OUTPUT_FOR_C tag to YES if your project consists of C sources
 # only. Doxygen will then generate output that is more tailored for C. For
 # instance, some of the names that are used will be different. The list of all
@@ -310,18 +326,21 @@ OPTIMIZE_OUTPUT_SLICE  = NO
 # extension. Doxygen has a built-in mapping, but you can override or extend it
 # using this tag. The format is ext=language, where ext is a file extension, and
 # language is one of the parsers supported by doxygen: IDL, Java, JavaScript,
-# Csharp (C#), C, C++, D, PHP, md (Markdown), Objective-C, Python, Slice,
-# Fortran (fixed format Fortran: FortranFixed, free formatted Fortran:
+# Csharp (C#), C, C++, Lex, D, PHP, md (Markdown), Objective-C, Python, Slice,
+# VHDL, Fortran (fixed format Fortran: FortranFixed, free formatted Fortran:
 # FortranFree, unknown formatted Fortran: Fortran. In the later case the parser
 # tries to guess whether the code is fixed or free formatted code, this is the
-# default for Fortran type files), VHDL, tcl. For instance to make doxygen treat
-# .inc files as Fortran files (default is PHP), and .f files as C (default is
-# Fortran), use: inc=Fortran f=C.
+# default for Fortran type files). For instance to make doxygen treat .inc files
+# as Fortran files (default is PHP), and .f files as C (default is Fortran),
+# use: inc=Fortran f=C.
 #
 # Note: For files without extension you can use no_extension as a placeholder.
 #
 # Note that for custom extensions you also need to set FILE_PATTERNS otherwise
-# the files are not read by doxygen.
+# the files are not read by doxygen. When specifying no_extension you should add
+# * to the FILE_PATTERNS.
+#
+# Note: see also the list of default file extension mappings.
 
 EXTENSION_MAPPING      =
 
@@ -344,6 +363,17 @@ MARKDOWN_SUPPORT       = YES
 
 TOC_INCLUDE_HEADINGS   = 5
 
+# The MARKDOWN_ID_STYLE tag can be used to specify the algorithm used to
+# generate identifiers for the Markdown headings. Note: Every identifier is
+# unique.
+# Possible values are: DOXYGEN (use a fixed 'autotoc_md' string followed by a
+# sequence number starting at 0) and GITHUB (use the lower case version of the
+# title with any whitespace replaced by '-' and punctuation characters removed).
+# The default value is: DOXYGEN.
+# This tag requires that the tag MARKDOWN_SUPPORT is set to YES.
+
+MARKDOWN_ID_STYLE      = DOXYGEN
+
 # When enabled doxygen tries to link words that correspond to documented
 # classes, or namespaces to their corresponding documentation. Such a link can
 # be prevented in individual cases by putting a % sign in front of the word or
@@ -360,7 +390,7 @@ AUTOLINK_SUPPORT       = YES
 # diagrams that involve STL classes more complete and accurate.
 # The default value is: NO.
 
-BUILTIN_STL_SUPPORT    =
+BUILTIN_STL_SUPPORT    = NO
 
 # If you use Microsoft's C++/CLI language, you should set this option to YES to
 # enable parsing support.
@@ -455,6 +485,27 @@ TYPEDEF_HIDES_STRUCT   = YES
 
 LOOKUP_CACHE_SIZE      = 0
 
+# The NUM_PROC_THREADS specifies the number of threads doxygen is allowed to use
+# during processing. When set to 0 doxygen will base this on the number of
+# cores available in the system. You can set it explicitly to a value larger
+# than 0 to get more control over the balance between CPU load and processing
+# speed. At this moment only the input processing can be done using multiple
+# threads. Since this is still an experimental feature the default is set to 1,
+# which effectively disables parallel processing. Please report any issues you
+# encounter. Generating dot graphs in parallel is controlled by the
+# DOT_NUM_THREADS setting.
+# Minimum value: 0, maximum value: 32, default value: 1.
+
+NUM_PROC_THREADS       = 1
+
+# If the TIMESTAMP tag is set different from NO then each generated page will
+# contain the date or date and time when the page was generated. Setting this to
+# NO can help when comparing the output of multiple runs.
+# Possible values are: YES, NO, DATETIME and DATE.
+# The default value is: NO.
+
+TIMESTAMP              = NO
+
 #---------------------------------------------------------------------------
 # Build related configuration options
 #---------------------------------------------------------------------------
@@ -518,6 +569,13 @@ EXTRACT_LOCAL_METHODS  = NO
 
 EXTRACT_ANON_NSPACES   = NO
 
+# If this flag is set to YES, the name of an unnamed parameter in a declaration
+# will be determined by the corresponding definition. By default unnamed
+# parameters remain unnamed in the output.
+# The default value is: YES.
+
+RESOLVE_UNNAMED_PARAMS = YES
+
 # If the HIDE_UNDOC_MEMBERS tag is set to YES, doxygen will hide all
 # undocumented members inside documented classes or files. If set to NO these
 # members will be included in the various overviews, but no documentation
@@ -529,7 +587,8 @@ HIDE_UNDOC_MEMBERS     = NO
 # If the HIDE_UNDOC_CLASSES tag is set to YES, doxygen will hide all
 # undocumented classes that are normally visible in the class hierarchy. If set
 # to NO, these classes will be included in the various overviews. This option
-# has no effect if EXTRACT_ALL is enabled.
+# will also hide undocumented C++ concepts if enabled. This option has no effect
+# if EXTRACT_ALL is enabled.
 # The default value is: NO.
 
 HIDE_UNDOC_CLASSES     = NO
@@ -555,12 +614,20 @@ HIDE_IN_BODY_DOCS      = NO
 
 INTERNAL_DOCS          = NO
 
-# If the CASE_SENSE_NAMES tag is set to NO then doxygen will only generate file
-# names in lower-case letters. If set to YES, upper-case letters are also
-# allowed. This is useful if you have classes or files whose names only differ
-# in case and if your file system supports case sensitive file names. Windows
-# (including Cygwin) ands Mac users are advised to set this option to NO.
-# The default value is: system dependent.
+# With the correct setting of option CASE_SENSE_NAMES doxygen will better be
+# able to match the capabilities of the underlying filesystem. In case the
+# filesystem is case sensitive (i.e. it supports files in the same directory
+# whose names only differ in casing), the option must be set to YES to properly
+# deal with such files in case they appear in the input. For filesystems that
+# are not case sensitive the option should be set to NO to properly deal with
+# output files written for symbols that only differ in casing, such as for two
+# classes, one named CLASS and the other named Class, and to also support
+# references to files without having to specify the exact matching casing. On
+# Windows (including Cygwin) and MacOS, users should typically set this option
+# to NO, whereas on Linux or other Unix flavors it should typically be set to
+# YES.
+# Possible values are: SYSTEM, NO and YES.
+# The default value is: SYSTEM.
 
 CASE_SENSE_NAMES       = NO
 
@@ -578,6 +645,12 @@ HIDE_SCOPE_NAMES       = NO
 
 HIDE_COMPOUND_REFERENCE= NO
 
+# If the SHOW_HEADERFILE tag is set to YES then the documentation for a class
+# will show which file needs to be included to use the class.
+# The default value is: YES.
+
+SHOW_HEADERFILE        = YES
+
 # If the SHOW_INCLUDE_FILES tag is set to YES then doxygen will put a list of
 # the files that are included by a file in the documentation of that file.
 # The default value is: YES.
@@ -735,7 +808,8 @@ FILE_VERSION_FILTER    =
 # output files in an output format independent way. To create the layout file
 # that represents doxygen's defaults, run doxygen with the -l option. You can
 # optionally specify a file name after the option, if omitted DoxygenLayout.xml
-# will be used as the name of the layout file.
+# will be used as the name of the layout file. See also section "Changing the
+# layout of pages" for information.
 #
 # Note that if you run doxygen from a directory containing a file called
 # DoxygenLayout.xml, doxygen will parse it automatically even if the LAYOUT_FILE
@@ -781,24 +855,50 @@ WARNINGS               = YES
 WARN_IF_UNDOCUMENTED   = YES
 
 # If the WARN_IF_DOC_ERROR tag is set to YES, doxygen will generate warnings for
-# potential errors in the documentation, such as not documenting some parameters
-# in a documented function, or documenting parameters that don't exist or using
-# markup commands wrongly.
+# potential errors in the documentation, such as documenting some parameters in
+# a documented function twice, or documenting parameters that don't exist or
+# using markup commands wrongly.
 # The default value is: YES.
 
 WARN_IF_DOC_ERROR      = YES
 
+# If WARN_IF_INCOMPLETE_DOC is set to YES, doxygen will warn about incomplete
+# function parameter documentation. If set to NO, doxygen will accept that some
+# parameters have no documentation without warning.
+# The default value is: YES.
+
+WARN_IF_INCOMPLETE_DOC = YES
+
 # This WARN_NO_PARAMDOC option can be enabled to get warnings for functions that
 # are documented, but have no documentation for their parameters or return
-# value. If set to NO, doxygen will only warn about wrong or incomplete
-# parameter documentation, but not about the absence of documentation. If
-# EXTRACT_ALL is set to YES then this flag will automatically be disabled.
+# value. If set to NO, doxygen will only warn about wrong parameter
+# documentation, but not about the absence of documentation. If EXTRACT_ALL is
+# set to YES then this flag will automatically be disabled. See also
+# WARN_IF_INCOMPLETE_DOC.
 # The default value is: NO.
 
 WARN_NO_PARAMDOC       = NO
 
+# If WARN_IF_UNDOC_ENUM_VAL option is set to YES, doxygen will warn about
+# undocumented enumeration values. If set to NO, doxygen will accept
+# undocumented enumeration values. If EXTRACT_ALL is set to YES then this flag
+# will automatically be disabled.
+# The default value is: NO.
+
+WARN_IF_UNDOC_ENUM_VAL = NO
+
 # If the WARN_AS_ERROR tag is set to YES then doxygen will immediately stop when
-# a warning is encountered.
+# a warning is encountered. If the WARN_AS_ERROR tag is set to FAIL_ON_WARNINGS
+# then doxygen will continue running as if WARN_AS_ERROR tag is set to NO, but
+# at the end of the doxygen process doxygen will return with a non-zero status.
+# If the WARN_AS_ERROR tag is set to FAIL_ON_WARNINGS_PRINT then doxygen behaves
+# like FAIL_ON_WARNINGS but in case no WARN_LOGFILE is defined doxygen will not
+# write the warning messages in between other messages but write them at the end
+# of a run, in case a WARN_LOGFILE is defined the warning messages will,
+# besides being in the defined file, also be shown at the end of a run, unless
+# the WARN_LOGFILE is defined as - i.e. standard output (stdout) in that case
+# the behavior will remain as with the setting FAIL_ON_WARNINGS.
+# Possible values are: NO, YES, FAIL_ON_WARNINGS and FAIL_ON_WARNINGS_PRINT.
 # The default value is: NO.
 
 WARN_AS_ERROR          = NO
@@ -809,13 +909,27 @@ WARN_AS_ERROR          = NO
 # and the warning text. Optionally the format may contain $version, which will
 # be replaced by the version of the file (if it could be obtained via
 # FILE_VERSION_FILTER)
+# See also: WARN_LINE_FORMAT
 # The default value is: $file:$line: $text.
 
 WARN_FORMAT            = "$file:$line: $text"
 
+# In the $text part of the WARN_FORMAT command it is possible that a reference
+# to a more specific place is given. To make it easier to jump to this place
+# (outside of doxygen) the user can define a custom "cut" / "paste" string.
+# Example:
+# WARN_LINE_FORMAT = "'vi $file +$line'"
+# See also: WARN_FORMAT
+# The default value is: at line $line of file $file.
+
+WARN_LINE_FORMAT       = "at line $line of file $file"
+
 # The WARN_LOGFILE tag can be used to specify a file to which warning and error
 # messages should be written. If left blank the output is written to standard
-# error (stderr).
+# error (stderr). In case the file specified cannot be opened for writing the
+# warning and error messages are written to standard error. When - is specified
+# as the file, the warning and error messages are written to standard output
+# (stdout).
 
 WARN_LOGFILE           =
 
@@ -831,17 +945,29 @@ WARN_LOGFILE           =
 
 INPUT                  = mainpage.md \
                          ../../include/hip \
-                         ../../../clr/hipamd/include/hip/amd_detail/amd_hip_gl_interop.h
+                         ../../../clr/hipamd/include/hip/amd_detail/amd_hip_gl_interop.h \
+                         ../../../llvm-project/clang/lib/Headers/__clang_hip_math.h
 
 # This tag can be used to specify the character encoding of the source files
 # that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses
 # libiconv (or the iconv built into libc) for the transcoding. See the libiconv
-# documentation (see: https://www.gnu.org/software/libiconv/) for the list of
-# possible encodings.
+# documentation (see:
+# https://www.gnu.org/software/libiconv/) for the list of possible encodings.
+# See also: INPUT_FILE_ENCODING
 # The default value is: UTF-8.
 
 INPUT_ENCODING         = UTF-8
 
+# This tag can be used to specify the character encoding of the source files
+# that doxygen parses. The INPUT_FILE_ENCODING tag can be used to specify
+# character encoding on a per file pattern basis. Doxygen will compare the file
+# name with each pattern and apply the encoding (instead of the default
+# INPUT_ENCODING) if there is a match. The character encodings are a list of the
+# form: pattern=encoding (like *.php=ISO-8859-1). See INPUT_ENCODING for further
+# information on supported encodings.
+
+INPUT_FILE_ENCODING    =
+
 # If the value of the INPUT tag contains directories, you can use the
 # FILE_PATTERNS tag to specify one or more wildcard patterns (like *.cpp and
 # *.h) to filter out the source-files in the directories.
@@ -850,12 +976,14 @@ INPUT_ENCODING         = UTF-8
 # need to set EXTENSION_MAPPING for the extension otherwise the files are not
 # read by doxygen.
 #
+# Note the list of default checked file patterns might differ from the list of
+# default file extension mappings.
+#
 # If left blank the following patterns are tested:*.c, *.cc, *.cxx, *.cpp,
 # *.c++, *.java, *.ii, *.ixx, *.ipp, *.i++, *.inl, *.idl, *.ddl, *.odl, *.h,
-# *.hh, *.hxx, *.hpp, *.h++, *.cs, *.d, *.php, *.php4, *.php5, *.phtml, *.inc,
-# *.m, *.markdown, *.md, *.mm, *.dox (to be provided as doxygen C comment),
-# *.doc (to be provided as doxygen C comment), *.txt (to be provided as doxygen
-# C comment), *.py, *.pyw, *.f90, *.f95, *.f03, *.f08, *.f, *.for, *.tcl, *.vhd,
+# *.hh, *.hxx, *.hpp, *.h++, *.l, *.cs, *.d, *.php, *.php4, *.php5, *.phtml,
+# *.inc, *.m, *.markdown, *.md, *.mm, *.dox (to be provided as doxygen C
+# comment), *.py, *.pyw, *.f90, *.f95, *.f03, *.f08, *.f18, *.f, *.for, *.vhd,
 # *.vhdl, *.ucf, *.qsf and *.ice.
 
 FILE_PATTERNS          = *.c \
@@ -933,10 +1061,7 @@ EXCLUDE_PATTERNS       =
 # (namespaces, classes, functions, etc.) that should be excluded from the
 # output. The symbol name can be a fully qualified name, a word, or if the
 # wildcard * is used, a substring. Examples: ANamespace, AClass,
-# AClass::ANamespace, ANamespace::*Test
-#
-# Note that the wildcards are matched against the file with absolute path, so to
-# exclude all test directories use the pattern */test/*
+# ANamespace::AClass, ANamespace::*Test
 
 EXCLUDE_SYMBOLS        =
 
@@ -981,6 +1106,11 @@ IMAGE_PATH             =
 # code is scanned, but not when the output code is generated. If lines are added
 # or removed, the anchors will not be placed correctly.
 #
+# Note that doxygen will use the data processed and written to standard output
+# for further processing, therefore nothing else, like debug statements or used
+# commands (so in case of a Windows batch file always use @echo OFF), should be
+# written to standard output.
+#
 # Note that for custom extensions or not directly supported extensions you also
 # need to set EXTENSION_MAPPING for the extension otherwise the files are not
 # properly processed by doxygen.
@@ -1022,6 +1152,15 @@ FILTER_SOURCE_PATTERNS =
 
 USE_MDFILE_AS_MAINPAGE =
 
+# The Fortran standard specifies that for fixed formatted Fortran code all
+# characters from position 72 are to be considered as comment. A common
+# extension is to allow longer lines before the automatic comment starts. The
+# setting FORTRAN_COMMENT_AFTER will also make it possible that longer lines can
+# be processed before the automatic comment starts.
+# Minimum value: 7, maximum value: 10000, default value: 72.
+
+FORTRAN_COMMENT_AFTER  = 72
+
 #---------------------------------------------------------------------------
 # Configuration options related to source browsing
 #---------------------------------------------------------------------------
@@ -1109,15 +1248,23 @@ USE_HTAGS              = NO
 VERBATIM_HEADERS       = YES
 
 # If the CLANG_ASSISTED_PARSING tag is set to YES then doxygen will use the
-# clang parser (see: http://clang.llvm.org/) for more accurate parsing at the
-# cost of reduced performance. This can be particularly helpful with template
-# rich C++ code for which doxygen's built-in parser lacks the necessary type
-# information.
+# clang parser (see:
+# http://clang.llvm.org/) for more accurate parsing at the cost of reduced
+# performance. This can be particularly helpful with template rich C++ code for
+# which doxygen's built-in parser lacks the necessary type information.
 # Note: The availability of this option depends on whether or not doxygen was
 # generated with the -Duse_libclang=ON option for CMake.
 # The default value is: NO.
 
-CLANG_ASSISTED_PARSING = NO
+CLANG_ASSISTED_PARSING = YES
+
+# If the CLANG_ASSISTED_PARSING tag is set to YES and the CLANG_ADD_INC_PATHS
+# tag is set to YES then doxygen will add the directory of each input to the
+# include path.
+# The default value is: YES.
+# This tag requires that the tag CLANG_ASSISTED_PARSING is set to YES.
+
+CLANG_ADD_INC_PATHS    = YES
 
 # If clang assisted parsing is enabled you can provide the compiler with command
 # line options that you would normally use when invoking the compiler. Note that
@@ -1128,10 +1275,13 @@ CLANG_ASSISTED_PARSING = NO
 CLANG_OPTIONS          =
 
 # If clang assisted parsing is enabled you can provide the clang parser with the
-# path to the compilation database (see:
-# http://clang.llvm.org/docs/HowToSetupToolingForLLVM.html) used when the files
-# were built. This is equivalent to specifying the "-p" option to a clang tool,
-# such as clang-check. These options will then be passed to the parser.
+# path to the directory containing a file called compile_commands.json. This
+# file is the compilation database (see:
+# http://clang.llvm.org/docs/HowToSetupToolingForLLVM.html) containing the
+# options used when the source files were built. This is equivalent to
+# specifying the -p option to a clang tool, such as clang-check. These options
+# will then be passed to the parser. Any options specified with CLANG_OPTIONS
+# will be added as well.
 # Note: The availability of this option depends on whether or not doxygen was
 # generated with the -Duse_libclang=ON option for CMake.
 
@@ -1148,17 +1298,11 @@ CLANG_DATABASE_PATH    =
 
 ALPHABETICAL_INDEX     = YES
 
-# The COLS_IN_ALPHA_INDEX tag can be used to specify the number of columns in
-# which the alphabetical index list will be split.
-# Minimum value: 1, maximum value: 20, default value: 5.
-# This tag requires that the tag ALPHABETICAL_INDEX is set to YES.
-
-COLS_IN_ALPHA_INDEX    = 5
-
-# In case all classes in a project start with a common prefix, all classes will
-# be put under the same header in the alphabetical index. The IGNORE_PREFIX tag
-# can be used to specify a prefix (or a list of prefixes) that should be ignored
-# while generating the index headers.
+# The IGNORE_PREFIX tag can be used to specify a prefix (or a list of prefixes)
+# that should be ignored while generating the index headers. The IGNORE_PREFIX
+# tag works for class, function and member names. The entity will be placed in
+# the alphabetical list under the first letter of the entity name that remains
+# after removing the prefix.
 # This tag requires that the tag ALPHABETICAL_INDEX is set to YES.
 
 IGNORE_PREFIX          =
@@ -1237,7 +1381,12 @@ HTML_STYLESHEET        = ../_doxygen/stylesheet.css
 # Doxygen will copy the style sheet files to the output directory.
 # Note: The order of the extra style sheet files is of importance (e.g. the last
 # style sheet in the list overrules the setting of the previous ones in the
-# list). For an example see the documentation.
+# list).
+# Note: Since the styling of scrollbars can currently not be overruled in
+# Webkit/Chromium, the styling will be left out of the default doxygen.css if
+# one or more extra stylesheets have been specified. So if scrollbar
+# customization is desired it has to be added explicitly. For an example see the
+# documentation.
 # This tag requires that the tag GENERATE_HTML is set to YES.
 
 HTML_EXTRA_STYLESHEET  = ../_doxygen/extra_stylesheet.css
@@ -1252,9 +1401,22 @@ HTML_EXTRA_STYLESHEET  = ../_doxygen/extra_stylesheet.css
 
 HTML_EXTRA_FILES       =
 
+# The HTML_COLORSTYLE tag can be used to specify if the generated HTML output
+# should be rendered with a dark or light theme.
+# Possible values are: LIGHT always generate light mode output, DARK always
+# generate dark mode output, AUTO_LIGHT automatically set the mode according to
+# the user preference, use light mode if no preference is set (the default),
+# AUTO_DARK automatically set the mode according to the user preference, use
+# dark mode if no preference is set and TOGGLE allow the user to switch between
+# light and dark mode via a button.
+# The default value is: AUTO_LIGHT.
+# This tag requires that the tag GENERATE_HTML is set to YES.
+
+HTML_COLORSTYLE        = AUTO_LIGHT
+
 # The HTML_COLORSTYLE_HUE tag controls the color of the HTML output. Doxygen
 # will adjust the colors in the style sheet and background images according to
-# this color. Hue is specified as an angle on a colorwheel, see
+# this color. Hue is specified as an angle on a color-wheel, see
 # https://en.wikipedia.org/wiki/Hue for more information. For instance the value
 # 0 represents red, 60 is yellow, 120 is green, 180 is cyan, 240 is blue, 300
 # purple, and 360 is red again.
@@ -1264,7 +1426,7 @@ HTML_EXTRA_FILES       =
 HTML_COLORSTYLE_HUE    = 220
 
 # The HTML_COLORSTYLE_SAT tag controls the purity (or saturation) of the colors
-# in the HTML output. For a value of 0 the output will use grayscales only. A
+# in the HTML output. For a value of 0 the output will use gray-scales only. A
 # value of 255 will produce the most vivid colors.
 # Minimum value: 0, maximum value: 255, default value: 100.
 # This tag requires that the tag GENERATE_HTML is set to YES.
@@ -1282,15 +1444,6 @@ HTML_COLORSTYLE_SAT    = 100
 
 HTML_COLORSTYLE_GAMMA  = 80
 
-# If the HTML_TIMESTAMP tag is set to YES then the footer of each generated HTML
-# page will contain the date and time when the page was generated. Setting this
-# to YES can help to show when doxygen was last run and thus if the
-# documentation is up to date.
-# The default value is: NO.
-# This tag requires that the tag GENERATE_HTML is set to YES.
-
-HTML_TIMESTAMP         = NO
-
 # If the HTML_DYNAMIC_MENUS tag is set to YES then the generated HTML
 # documentation will contain a main index with vertical navigation menus that
 # are dynamically created via JavaScript. If disabled, the navigation index will
@@ -1325,10 +1478,11 @@ HTML_INDEX_NUM_ENTRIES = 100
 
 # If the GENERATE_DOCSET tag is set to YES, additional index files will be
 # generated that can be used as input for Apple's Xcode 3 integrated development
-# environment (see: https://developer.apple.com/xcode/), introduced with OSX
-# 10.5 (Leopard). To create a documentation set, doxygen will generate a
-# Makefile in the HTML output directory. Running make will produce the docset in
-# that directory and running make install will install the docset in
+# environment (see:
+# https://developer.apple.com/xcode/), introduced with OSX 10.5 (Leopard). To
+# create a documentation set, doxygen will generate a Makefile in the HTML
+# output directory. Running make will produce the docset in that directory and
+# running make install will install the docset in
 # ~/Library/Developer/Shared/Documentation/DocSets so that Xcode will find it at
 # startup. See https://developer.apple.com/library/archive/featuredarticles/Doxy
 # genXcode/_index.html for more information.
@@ -1345,6 +1499,13 @@ GENERATE_DOCSET        = NO
 
 DOCSET_FEEDNAME        = "Doxygen generated docs"
 
+# This tag determines the URL of the docset feed. A documentation feed provides
+# an umbrella under which multiple documentation sets from a single provider
+# (such as a company or product suite) can be grouped.
+# This tag requires that the tag GENERATE_DOCSET is set to YES.
+
+DOCSET_FEEDURL         =
+
 # This tag specifies a string that should uniquely identify the documentation
 # set bundle. This should be a reverse domain-name style string, e.g.
 # com.mycompany.MyDocSet. Doxygen will append .docset to the name.
@@ -1370,8 +1531,12 @@ DOCSET_PUBLISHER_NAME  = Publisher
 # If the GENERATE_HTMLHELP tag is set to YES then doxygen generates three
 # additional HTML index files: index.hhp, index.hhc, and index.hhk. The
 # index.hhp is a project file that can be read by Microsoft's HTML Help Workshop
-# (see: https://www.microsoft.com/en-us/download/details.aspx?id=21138) on
-# Windows.
+# on Windows. In the beginning of 2021 Microsoft took the original page, with
+# a.o. the download links, offline (the HTML help workshop was already many
+# years in maintenance mode). You can download the HTML help workshop from the
+# web archives at Installation executable (see:
+# http://web.archive.org/web/20160201063255/http://download.microsoft.com/downlo
+# ad/0/A/9/0A939EF6-E31C-430F-A3DF-DFAE7960D564/htmlhelp.exe).
 #
 # The HTML Help Workshop contains a compiler that can convert all HTML output
 # generated by doxygen into a single compiled HTML file (.chm). Compiled HTML
@@ -1401,7 +1566,7 @@ CHM_FILE               =
 HHC_LOCATION           =
 
 # The GENERATE_CHI flag controls if a separate .chi index file is generated
-# (YES) or that it should be included in the master .chm file (NO).
+# (YES) or that it should be included in the main .chm file (NO).
 # The default value is: NO.
 # This tag requires that the tag GENERATE_HTMLHELP is set to YES.
 
@@ -1428,6 +1593,16 @@ BINARY_TOC             = NO
 
 TOC_EXPAND             = NO
 
+# The SITEMAP_URL tag is used to specify the full URL of the place where the
+# generated documentation will be placed on the server by the user during the
+# deployment of the documentation. The generated sitemap is called sitemap.xml
+# and placed in the directory specified by HTML_OUTPUT. In case no SITEMAP_URL
+# is specified no sitemap is generated. For information about the sitemap
+# protocol see https://www.sitemaps.org
+# This tag requires that the tag GENERATE_HTML is set to YES.
+
+SITEMAP_URL            =
+
 # If the GENERATE_QHP tag is set to YES and both QHP_NAMESPACE and
 # QHP_VIRTUAL_FOLDER are set, an additional index file will be generated that
 # can be used as input for Qt's qhelpgenerator to generate a Qt Compressed Help
@@ -1446,7 +1621,8 @@ QCH_FILE               =
 
 # The QHP_NAMESPACE tag specifies the namespace to use when generating Qt Help
 # Project output. For more information please see Qt Help Project / Namespace
-# (see: https://doc.qt.io/archives/qt-4.8/qthelpproject.html#namespace).
+# (see:
+# https://doc.qt.io/archives/qt-4.8/qthelpproject.html#namespace).
 # The default value is: org.doxygen.Project.
 # This tag requires that the tag GENERATE_QHP is set to YES.
 
@@ -1454,8 +1630,8 @@ QHP_NAMESPACE          = org.doxygen.Project
 
 # The QHP_VIRTUAL_FOLDER tag specifies the namespace to use when generating Qt
 # Help Project output. For more information please see Qt Help Project / Virtual
-# Folders (see: https://doc.qt.io/archives/qt-4.8/qthelpproject.html#virtual-
-# folders).
+# Folders (see:
+# https://doc.qt.io/archives/qt-4.8/qthelpproject.html#virtual-folders).
 # The default value is: doc.
 # This tag requires that the tag GENERATE_QHP is set to YES.
 
@@ -1463,16 +1639,16 @@ QHP_VIRTUAL_FOLDER     = doc
 
 # If the QHP_CUST_FILTER_NAME tag is set, it specifies the name of a custom
 # filter to add. For more information please see Qt Help Project / Custom
-# Filters (see: https://doc.qt.io/archives/qt-4.8/qthelpproject.html#custom-
-# filters).
+# Filters (see:
+# https://doc.qt.io/archives/qt-4.8/qthelpproject.html#custom-filters).
 # This tag requires that the tag GENERATE_QHP is set to YES.
 
 QHP_CUST_FILTER_NAME   =
 
 # The QHP_CUST_FILTER_ATTRS tag specifies the list of the attributes of the
 # custom filter to add. For more information please see Qt Help Project / Custom
-# Filters (see: https://doc.qt.io/archives/qt-4.8/qthelpproject.html#custom-
-# filters).
+# Filters (see:
+# https://doc.qt.io/archives/qt-4.8/qthelpproject.html#custom-filters).
 # This tag requires that the tag GENERATE_QHP is set to YES.
 
 QHP_CUST_FILTER_ATTRS  =
@@ -1484,9 +1660,9 @@ QHP_CUST_FILTER_ATTRS  =
 
 QHP_SECT_FILTER_ATTRS  =
 
-# The QHG_LOCATION tag can be used to specify the location of Qt's
-# qhelpgenerator. If non-empty doxygen will try to run qhelpgenerator on the
-# generated .qhp file.
+# The QHG_LOCATION tag can be used to specify the location (absolute path
+# including file name) of Qt's qhelpgenerator. If non-empty doxygen will try to
+# run qhelpgenerator on the generated .qhp file.
 # This tag requires that the tag GENERATE_QHP is set to YES.
 
 QHG_LOCATION           =
@@ -1529,16 +1705,28 @@ DISABLE_INDEX          = NO
 # to work a browser that supports JavaScript, DHTML, CSS and frames is required
 # (i.e. any modern browser). Windows users are probably better off using the
 # HTML help feature. Via custom style sheets (see HTML_EXTRA_STYLESHEET) one can
-# further fine-tune the look of the index. As an example, the default style
-# sheet generated by doxygen has an example that shows how to put an image at
-# the root of the tree instead of the PROJECT_NAME. Since the tree basically has
-# the same information as the tab index, you could consider setting
-# DISABLE_INDEX to YES when enabling this option.
+# further fine tune the look of the index (see "Fine-tuning the output"). As an
+# example, the default style sheet generated by doxygen has an example that
+# shows how to put an image at the root of the tree instead of the PROJECT_NAME.
+# Since the tree basically has the same information as the tab index, you could
+# consider setting DISABLE_INDEX to YES when enabling this option.
 # The default value is: NO.
 # This tag requires that the tag GENERATE_HTML is set to YES.
 
 GENERATE_TREEVIEW      = NO
 
+# When both GENERATE_TREEVIEW and DISABLE_INDEX are set to YES, then the
+# FULL_SIDEBAR option determines if the side bar is limited to only the treeview
+# area (value NO) or if it should extend to the full height of the window (value
+# YES). Setting this to YES gives a layout similar to
+# https://docs.readthedocs.io with more room for contents, but less room for the
+# project logo, title, and description. If either GENERATE_TREEVIEW or
+# DISABLE_INDEX is set to NO, this option has no effect.
+# The default value is: NO.
+# This tag requires that the tag GENERATE_HTML is set to YES.
+
+FULL_SIDEBAR           = NO
+
 # The ENUM_VALUES_PER_LINE tag can be used to set the number of enum values that
 # doxygen will group on one line in the generated HTML documentation.
 #
@@ -1563,6 +1751,24 @@ TREEVIEW_WIDTH         = 250
 
 EXT_LINKS_IN_WINDOW    = NO
 
+# If the OBFUSCATE_EMAILS tag is set to YES, doxygen will obfuscate email
+# addresses.
+# The default value is: YES.
+# This tag requires that the tag GENERATE_HTML is set to YES.
+
+OBFUSCATE_EMAILS       = YES
+
+# If the HTML_FORMULA_FORMAT option is set to svg, doxygen will use the pdf2svg
+# tool (see https://github.com/dawbarton/pdf2svg) or inkscape (see
+# https://inkscape.org) to generate formulas as SVG images instead of PNGs for
+# the HTML output. These images will generally look nicer at scaled resolutions.
+# Possible values are: png (the default) and svg (looks nicer but requires the
+# pdf2svg or inkscape tool).
+# The default value is: png.
+# This tag requires that the tag GENERATE_HTML is set to YES.
+
+HTML_FORMULA_FORMAT    = png
+
 # Use this tag to change the font size of LaTeX formulas included as images in
 # the HTML documentation. When you change the font size after a successful
 # doxygen run you need to manually remove any form_*.png images from the HTML
@@ -1572,17 +1778,6 @@ EXT_LINKS_IN_WINDOW    = NO
 
 FORMULA_FONTSIZE       = 10
 
-# Use the FORMULA_TRANSPARENT tag to determine whether or not the images
-# generated for formulas are transparent PNGs. Transparent PNGs are not
-# supported properly for IE 6.0, but are supported on all modern browsers.
-#
-# Note that when changing this option you need to delete any form_*.png files in
-# the HTML output directory before the changes have effect.
-# The default value is: YES.
-# This tag requires that the tag GENERATE_HTML is set to YES.
-
-FORMULA_TRANSPARENT    = YES
-
 # The FORMULA_MACROFILE can contain LaTeX \newcommand and \renewcommand commands
 # to create new LaTeX commands to be used in formulas as building blocks. See
 # the section "Including formulas" for details.
@@ -1600,11 +1795,29 @@ FORMULA_MACROFILE      =
 
 USE_MATHJAX            = YES
 
+# With MATHJAX_VERSION it is possible to specify the MathJax version to be used.
+# Note that the different versions of MathJax have different requirements with
+# regards to the different settings, so it is possible that also other MathJax
+# settings have to be changed when switching between the different MathJax
+# versions.
+# Possible values are: MathJax_2 and MathJax_3.
+# The default value is: MathJax_2.
+# This tag requires that the tag USE_MATHJAX is set to YES.
+
+MATHJAX_VERSION        = MathJax_2
+
 # When MathJax is enabled you can set the default output format to be used for
-# the MathJax output. See the MathJax site (see:
-# http://docs.mathjax.org/en/latest/output.html) for more details.
+# the MathJax output. For more details about the output format see MathJax
+# version 2 (see:
+# http://docs.mathjax.org/en/v2.7-latest/output.html) and MathJax version 3
+# (see:
+# http://docs.mathjax.org/en/latest/web/components/output.html).
 # Possible values are: HTML-CSS (which is slower, but has the best
-# compatibility), NativeMML (i.e. MathML) and SVG.
+# compatibility. This is the name for MathJax version 2, for MathJax version 3
+# this will be translated into chtml), NativeMML (i.e. MathML. Only supported
+# for MathJax 2. For MathJax version 3 chtml will be used instead.), chtml (This
+# is the name for MathJax version 3, for MathJax version 2 this will be
+# translated into HTML-CSS) and SVG.
 # The default value is: HTML-CSS.
 # This tag requires that the tag USE_MATHJAX is set to YES.
 
@@ -1617,22 +1830,29 @@ MATHJAX_FORMAT         = HTML-CSS
 # MATHJAX_RELPATH should be ../mathjax. The default value points to the MathJax
 # Content Delivery Network so you can quickly see the result without installing
 # MathJax. However, it is strongly recommended to install a local copy of
-# MathJax from https://www.mathjax.org before deployment.
-# The default value is: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/.
+# MathJax from https://www.mathjax.org before deployment. The default value is:
+# - in case of MathJax version 2: https://cdn.jsdelivr.net/npm/mathjax@2
+# - in case of MathJax version 3: https://cdn.jsdelivr.net/npm/mathjax@3
 # This tag requires that the tag USE_MATHJAX is set to YES.
 
 MATHJAX_RELPATH        = http://cdn.mathjax.org/mathjax/latest
 
 # The MATHJAX_EXTENSIONS tag can be used to specify one or more MathJax
 # extension names that should be enabled during MathJax rendering. For example
+# for MathJax version 2 (see
+# https://docs.mathjax.org/en/v2.7-latest/tex.html#tex-and-latex-extensions):
 # MATHJAX_EXTENSIONS = TeX/AMSmath TeX/AMSsymbols
+# For example for MathJax version 3 (see
+# http://docs.mathjax.org/en/latest/input/tex/extensions/index.html):
+# MATHJAX_EXTENSIONS = ams
 # This tag requires that the tag USE_MATHJAX is set to YES.
 
 MATHJAX_EXTENSIONS     =
 
 # The MATHJAX_CODEFILE tag can be used to specify a file with javascript pieces
 # of code that will be used on startup of the MathJax code. See the MathJax site
-# (see: http://docs.mathjax.org/en/latest/output.html) for more details. For an
+# (see:
+# http://docs.mathjax.org/en/v2.7-latest/output.html) for more details. For an
 # example see the documentation.
 # This tag requires that the tag USE_MATHJAX is set to YES.
 
@@ -1679,7 +1899,8 @@ SERVER_BASED_SEARCH    = NO
 #
 # Doxygen ships with an example indexer (doxyindexer) and search engine
 # (doxysearch.cgi) which are based on the open source search engine library
-# Xapian (see: https://xapian.org/).
+# Xapian (see:
+# https://xapian.org/).
 #
 # See the section "External Indexing and Searching" for details.
 # The default value is: NO.
@@ -1692,8 +1913,9 @@ EXTERNAL_SEARCH        = NO
 #
 # Doxygen ships with an example indexer (doxyindexer) and search engine
 # (doxysearch.cgi) which are based on the open source search engine library
-# Xapian (see: https://xapian.org/). See the section "External Indexing and
-# Searching" for details.
+# Xapian (see:
+# https://xapian.org/). See the section "External Indexing and Searching" for
+# details.
 # This tag requires that the tag SEARCHENGINE is set to YES.
 
 SEARCHENGINE_URL       =
@@ -1802,29 +2024,31 @@ PAPER_TYPE             = a4
 
 EXTRA_PACKAGES         =
 
-# The LATEX_HEADER tag can be used to specify a personal LaTeX header for the
-# generated LaTeX document. The header should contain everything until the first
-# chapter. If it is left blank doxygen will generate a standard header. See
-# section "Doxygen usage" for information on how to let doxygen write the
-# default header to a separate file.
+# The LATEX_HEADER tag can be used to specify a user-defined LaTeX header for
+# the generated LaTeX document. The header should contain everything until the
+# first chapter. If it is left blank doxygen will generate a standard header. It
+# is highly recommended to start with a default header using
+# doxygen -w latex new_header.tex new_footer.tex new_stylesheet.sty
+# and then modify the file new_header.tex. See also section "Doxygen usage" for
+# information on how to generate the default header that doxygen normally uses.
 #
-# Note: Only use a user-defined header if you know what you are doing! The
-# following commands have a special meaning inside the header: $title,
-# $datetime, $date, $doxygenversion, $projectname, $projectnumber,
-# $projectbrief, $projectlogo. Doxygen will replace $title with the empty
-# string, for the replacement values of the other commands the user is referred
-# to HTML_HEADER.
+# Note: Only use a user-defined header if you know what you are doing!
+# Note: The header is subject to change so you typically have to regenerate the
+# default header when upgrading to a newer version of doxygen. The following
+# commands have a special meaning inside the header (and footer): For a
+# description of the possible markers and block names see the documentation.
 # This tag requires that the tag GENERATE_LATEX is set to YES.
 
 LATEX_HEADER           =
 
-# The LATEX_FOOTER tag can be used to specify a personal LaTeX footer for the
-# generated LaTeX document. The footer should contain everything after the last
-# chapter. If it is left blank doxygen will generate a standard footer. See
+# The LATEX_FOOTER tag can be used to specify a user-defined LaTeX footer for
+# the generated LaTeX document. The footer should contain everything after the
+# last chapter. If it is left blank doxygen will generate a standard footer. See
 # LATEX_HEADER for more information on how to generate a default footer and what
-# special commands can be used inside the footer.
-#
-# Note: Only use a user-defined footer if you know what you are doing!
+# special commands can be used inside the footer. See also section "Doxygen
+# usage" for information on how to generate the default footer that doxygen
+# normally uses. Note: Only use a user-defined footer if you know what you are
+# doing!
 # This tag requires that the tag GENERATE_LATEX is set to YES.
 
 LATEX_FOOTER           =
@@ -1857,18 +2081,26 @@ LATEX_EXTRA_FILES      =
 
 PDF_HYPERLINKS         = YES
 
-# If the USE_PDFLATEX tag is set to YES, doxygen will use pdflatex to generate
-# the PDF file directly from the LaTeX files. Set this option to YES, to get a
-# higher quality PDF documentation.
+# If the USE_PDFLATEX tag is set to YES, doxygen will use the engine as
+# specified with LATEX_CMD_NAME to generate the PDF file directly from the LaTeX
+# files. Set this option to YES, to get a higher quality PDF documentation.
+#
+# See also section LATEX_CMD_NAME for selecting the engine.
 # The default value is: YES.
 # This tag requires that the tag GENERATE_LATEX is set to YES.
 
 USE_PDFLATEX           = YES
 
-# If the LATEX_BATCHMODE tag is set to YES, doxygen will add the \batchmode
-# command to the generated LaTeX files. This will instruct LaTeX to keep running
-# if errors occur, instead of asking the user for help. This option is also used
-# when generating formulas in HTML.
+# The LATEX_BATCHMODE tag signals the behavior of LaTeX in case of an error.
+# Possible values are: NO same as ERROR_STOP, YES same as BATCH, BATCH In batch
+# mode nothing is printed on the terminal, errors are scrolled as if <return> is
+# hit at every error; missing files that TeX tries to input or request from
+# keyboard input (\read on a not open input stream) cause the job to abort,
+# NON_STOP In nonstop mode the diagnostic message will appear on the terminal,
+# but there is no possibility of user interaction just like in batch mode,
+# SCROLL In scroll mode, TeX will stop only for missing files to input or if
+# keyboard input is necessary and ERROR_STOP In errorstop mode, TeX will stop at
+# each error, asking for user intervention.
 # The default value is: NO.
 # This tag requires that the tag GENERATE_LATEX is set to YES.
 
@@ -1881,16 +2113,6 @@ LATEX_BATCHMODE        = NO
 
 LATEX_HIDE_INDICES     = NO
 
-# If the LATEX_SOURCE_CODE tag is set to YES then doxygen will include source
-# code with syntax highlighting in the LaTeX output.
-#
-# Note that which sources are shown also depends on other settings such as
-# SOURCE_BROWSER.
-# The default value is: NO.
-# This tag requires that the tag GENERATE_LATEX is set to YES.
-
-LATEX_SOURCE_CODE      = NO
-
 # The LATEX_BIB_STYLE tag can be used to specify the style to use for the
 # bibliography, e.g. plainnat, or ieeetr. See
 # https://en.wikipedia.org/wiki/BibTeX and \cite for more info.
@@ -1899,14 +2121,6 @@ LATEX_SOURCE_CODE      = NO
 
 LATEX_BIB_STYLE        = plain
 
-# If the LATEX_TIMESTAMP tag is set to YES then the footer of each generated
-# page will contain the date and time when the page was generated. Setting this
-# to NO can help when comparing the output of multiple runs.
-# The default value is: NO.
-# This tag requires that the tag GENERATE_LATEX is set to YES.
-
-LATEX_TIMESTAMP        = NO
-
 # The LATEX_EMOJI_DIRECTORY tag is used to specify the (relative or absolute)
 # path from which the emoji images will be read. If a relative path is entered,
 # it will be relative to the LATEX_OUTPUT directory. If left blank the
@@ -1971,16 +2185,6 @@ RTF_STYLESHEET_FILE    =
 
 RTF_EXTENSIONS_FILE    =
 
-# If the RTF_SOURCE_CODE tag is set to YES then doxygen will include source code
-# with syntax highlighting in the RTF output.
-#
-# Note that which sources are shown also depends on other settings such as
-# SOURCE_BROWSER.
-# The default value is: NO.
-# This tag requires that the tag GENERATE_RTF is set to YES.
-
-RTF_SOURCE_CODE        = NO
-
 #---------------------------------------------------------------------------
 # Configuration options related to the man page output
 #---------------------------------------------------------------------------
@@ -2077,21 +2281,12 @@ GENERATE_DOCBOOK       = NO
 
 DOCBOOK_OUTPUT         = docbook
 
-# If the DOCBOOK_PROGRAMLISTING tag is set to YES, doxygen will include the
-# program listings (including syntax highlighting and cross-referencing
-# information) to the DOCBOOK output. Note that enabling this will significantly
-# increase the size of the DOCBOOK output.
-# The default value is: NO.
-# This tag requires that the tag GENERATE_DOCBOOK is set to YES.
-
-DOCBOOK_PROGRAMLISTING = NO
-
 #---------------------------------------------------------------------------
 # Configuration options for the AutoGen Definitions output
 #---------------------------------------------------------------------------
 
 # If the GENERATE_AUTOGEN_DEF tag is set to YES, doxygen will generate an
-# AutoGen Definitions (see http://autogen.sourceforge.net/) file that captures
+# AutoGen Definitions (see https://autogen.sourceforge.net/) file that captures
 # the structure of the code including all documentation. Note that this feature
 # is still experimental and incomplete at the moment.
 # The default value is: NO.
@@ -2161,7 +2356,7 @@ MACRO_EXPANSION        = YES
 # The default value is: NO.
 # This tag requires that the tag ENABLE_PREPROCESSING is set to YES.
 
-EXPAND_ONLY_PREDEF     = YES
+EXPAND_ONLY_PREDEF     = NO
 
 # If the SEARCH_INCLUDES tag is set to YES, the include files in the
 # INCLUDE_PATH will be searched if a #include is found.
@@ -2172,7 +2367,8 @@ SEARCH_INCLUDES        = NO
 
 # The INCLUDE_PATH tag can be used to specify one or more directories that
 # contain include files that are not input files but should be processed by the
-# preprocessor.
+# preprocessor. Note that the INCLUDE_PATH is not recursive, so the setting of
+# RECURSIVE has no effect here.
 # This tag requires that the tag SEARCH_INCLUDES is set to YES.
 
 INCLUDE_PATH           =
@@ -2194,7 +2390,9 @@ INCLUDE_FILE_PATTERNS  =
 # This tag requires that the tag ENABLE_PREPROCESSING is set to YES.
 
 PREDEFINED             = __HIP_PLATFORM_AMD__ \
-                         __dparm(x)=
+                         __HIP__ \
+                         __dparm(x)= \
+                         __attribute__(x)=
 
 # If the MACRO_EXPANSION and EXPAND_ONLY_PREDEF tags are set to YES then this
 # tag can be used to specify a list of macro names that should be expanded. The
@@ -2262,25 +2460,9 @@ EXTERNAL_GROUPS        = YES
 EXTERNAL_PAGES         = YES
 
 #---------------------------------------------------------------------------
-# Configuration options related to the dot tool
+# Configuration options related to diagram generator tools
 #---------------------------------------------------------------------------
 
-# If the CLASS_DIAGRAMS tag is set to YES, doxygen will generate a class diagram
-# (in HTML and LaTeX) for classes with base or super classes. Setting the tag to
-# NO turns the diagrams off. Note that this option also works with HAVE_DOT
-# disabled, but it is recommended to install and use dot, since it yields more
-# powerful graphs.
-# The default value is: YES.
-
-CLASS_DIAGRAMS         = NO
-
-# You can include diagrams made with dia in doxygen documentation. Doxygen will
-# then run dia to produce the diagram and insert it in the documentation. The
-# DIA_PATH tag allows you to specify the directory where the dia binary resides.
-# If left empty dia is assumed to be found in the default search path.
-
-DIA_PATH               =
-
 # If set to YES the inheritance and collaboration graphs will hide inheritance
 # and usage relations if the target is undocumented or is not a class.
 # The default value is: YES.
@@ -2289,10 +2471,10 @@ HIDE_UNDOC_RELATIONS   = YES
 
 # If you set the HAVE_DOT tag to YES then doxygen will assume the dot tool is
 # available from the path. This tool is part of Graphviz (see:
-# http://www.graphviz.org/), a graph visualization toolkit from AT&T and Lucent
+# https://www.graphviz.org/), a graph visualization toolkit from AT&T and Lucent
 # Bell Labs. The other options in this section have no effect if this option is
 # set to NO
-# The default value is: YES.
+# The default value is: NO.
 
 HAVE_DOT               = YES
 
@@ -2306,35 +2488,52 @@ HAVE_DOT               = YES
 
 DOT_NUM_THREADS        = 0
 
-# When you want a differently looking font in the dot files that doxygen
-# generates you can specify the font name using DOT_FONTNAME. You need to make
-# sure dot is able to find the font, which can be done by putting it in a
-# standard location or by setting the DOTFONTPATH environment variable or by
-# setting DOT_FONTPATH to the directory containing the font.
-# The default value is: Helvetica.
+# DOT_COMMON_ATTR is common attributes for nodes, edges and labels of
+# subgraphs. When you want a differently looking font in the dot files that
+# doxygen generates you can specify fontname, fontcolor and fontsize attributes.
+# For details please see <a href=https://graphviz.org/doc/info/attrs.html>Node,
+# Edge and Graph Attributes specification</a> You need to make sure dot is able
+# to find the font, which can be done by putting it in a standard location or by
+# setting the DOTFONTPATH environment variable or by setting DOT_FONTPATH to the
+# directory containing the font. Default graphviz fontsize is 14.
+# The default value is: fontname=Helvetica,fontsize=10.
+# This tag requires that the tag HAVE_DOT is set to YES.
+
+DOT_COMMON_ATTR        = "fontname=Helvetica,fontsize=10"
+
+# DOT_EDGE_ATTR is concatenated with DOT_COMMON_ATTR. For elegant style you can
+# add 'arrowhead=open, arrowtail=open, arrowsize=0.5'. <a
+# href=https://graphviz.org/doc/info/arrows.html>Complete documentation about
+# arrows shapes.</a>
+# The default value is: labelfontname=Helvetica,labelfontsize=10.
 # This tag requires that the tag HAVE_DOT is set to YES.
 
-DOT_FONTNAME           = Helvetica
+DOT_EDGE_ATTR          = "labelfontname=Helvetica,labelfontsize=10"
 
-# The DOT_FONTSIZE tag can be used to set the size (in points) of the font of
-# dot graphs.
-# Minimum value: 4, maximum value: 24, default value: 10.
+# DOT_NODE_ATTR is concatenated with DOT_COMMON_ATTR. For view without boxes
+# around nodes set 'shape=plain' or 'shape=plaintext' <a
+# href=https://www.graphviz.org/doc/info/shapes.html>Shapes specification</a>
+# The default value is: shape=box,height=0.2,width=0.4.
 # This tag requires that the tag HAVE_DOT is set to YES.
 
-DOT_FONTSIZE           = 10
+DOT_NODE_ATTR          = "shape=box,height=0.2,width=0.4"
 
-# By default doxygen will tell dot to use the default font as specified with
-# DOT_FONTNAME. If you specify a different font using DOT_FONTNAME you can set
-# the path where dot can find it using this tag.
+# You can set the path where dot can find the font specified with fontname in
+# DOT_COMMON_ATTR and other dot attributes.
 # This tag requires that the tag HAVE_DOT is set to YES.
 
 DOT_FONTPATH           =
 
-# If the CLASS_GRAPH tag is set to YES then doxygen will generate a graph for
-# each documented class showing the direct and indirect inheritance relations.
-# Setting this tag to YES will force the CLASS_DIAGRAMS tag to NO.
+# If the CLASS_GRAPH tag is set to YES or GRAPH or BUILTIN then doxygen will
+# generate a graph for each documented class showing the direct and indirect
+# inheritance relations. In case the CLASS_GRAPH tag is set to YES or GRAPH and
+# HAVE_DOT is enabled as well, then dot will be used to draw the graph. In case
+# the CLASS_GRAPH tag is set to YES and HAVE_DOT is disabled or if the
+# CLASS_GRAPH tag is set to BUILTIN, then the built-in generator will be used.
+# If the CLASS_GRAPH tag is set to TEXT the direct and indirect inheritance
+# relations will be shown as texts / links.
+# Possible values are: NO, YES, TEXT, GRAPH and BUILTIN.
 # The default value is: YES.
-# This tag requires that the tag HAVE_DOT is set to YES.
 
 CLASS_GRAPH            = YES
 
@@ -2348,7 +2547,8 @@ CLASS_GRAPH            = YES
 COLLABORATION_GRAPH    = YES
 
 # If the GROUP_GRAPHS tag is set to YES then doxygen will generate a graph for
-# groups, showing the direct groups dependencies.
+# groups, showing the direct groups dependencies. See also the chapter Grouping
+# in the manual.
 # The default value is: YES.
 # This tag requires that the tag HAVE_DOT is set to YES.
 
@@ -2371,10 +2571,32 @@ UML_LOOK               = NO
 # but if the number exceeds 15, the total amount of fields shown is limited to
 # 10.
 # Minimum value: 0, maximum value: 100, default value: 10.
-# This tag requires that the tag HAVE_DOT is set to YES.
+# This tag requires that the tag UML_LOOK is set to YES.
 
 UML_LIMIT_NUM_FIELDS   = 10
 
+# If the DOT_UML_DETAILS tag is set to NO, doxygen will show attributes and
+# methods without types and arguments in the UML graphs. If the DOT_UML_DETAILS
+# tag is set to YES, doxygen will add type and arguments for attributes and
+# methods in the UML graphs. If the DOT_UML_DETAILS tag is set to NONE, doxygen
+# will not generate fields with class member information in the UML graphs. The
+# class diagrams will look similar to the default class diagrams but using UML
+# notation for the relationships.
+# Possible values are: NO, YES and NONE.
+# The default value is: NO.
+# This tag requires that the tag UML_LOOK is set to YES.
+
+DOT_UML_DETAILS        = NO
+
+# The DOT_WRAP_THRESHOLD tag can be used to set the maximum number of characters
+# to display on a single line. If the actual line length exceeds this threshold
+# significantly it will be wrapped across multiple lines. Some heuristics are
+# applied to avoid ugly line breaks.
+# Minimum value: 0, maximum value: 1000, default value: 17.
+# This tag requires that the tag HAVE_DOT is set to YES.
+
+DOT_WRAP_THRESHOLD     = 17
+
 # If the TEMPLATE_RELATIONS tag is set to YES then the inheritance and
 # collaboration graphs will show the relations between templates and their
 # instances.
@@ -2441,16 +2663,21 @@ GRAPHICAL_HIERARCHY    = YES
 
 DIRECTORY_GRAPH        = YES
 
+# The DIR_GRAPH_MAX_DEPTH tag can be used to limit the maximum number of levels
+# of child directories generated in directory dependency graphs by dot.
+# Minimum value: 1, maximum value: 25, default value: 1.
+# This tag requires that the tag DIRECTORY_GRAPH is set to YES.
+
+DIR_GRAPH_MAX_DEPTH    = 1
+
 # The DOT_IMAGE_FORMAT tag can be used to set the image format of the images
 # generated by dot. For an explanation of the image formats see the section
 # output formats in the documentation of the dot tool (Graphviz (see:
-# http://www.graphviz.org/)).
+# https://www.graphviz.org/)).
 # Note: If you choose svg you need to set HTML_FILE_EXTENSION to xhtml in order
 # to make the SVG files visible in IE 9+ (other browsers do not have this
 # requirement).
-# Possible values are: png, png:cairo, png:cairo:cairo, png:cairo:gd, png:gd,
-# png:gd:gd, jpg, jpg:cairo, jpg:cairo:gd, jpg:gd, jpg:gd:gd, gif, gif:cairo,
-# gif:cairo:gd, gif:gd, gif:gd:gd, svg, png:gd, png:gd:gd, png:cairo,
+# Possible values are: png, jpg, gif, svg, png:gd, png:gd:gd, png:cairo,
 # png:cairo:gd, png:cairo:cairo, png:cairo:gdiplus, png:gdiplus and
 # png:gdiplus:gdiplus.
 # The default value is: png.
@@ -2483,11 +2710,12 @@ DOT_PATH               =
 
 DOTFILE_DIRS           =
 
-# The MSCFILE_DIRS tag can be used to specify one or more directories that
-# contain msc files that are included in the documentation (see the \mscfile
-# command).
+# You can include diagrams made with dia in doxygen documentation. Doxygen will
+# then run dia to produce the diagram and insert it in the documentation. The
+# DIA_PATH tag allows you to specify the directory where the dia binary resides.
+# If left empty dia is assumed to be found in the default search path.
 
-MSCFILE_DIRS           =
+DIA_PATH               =
 
 # The DIAFILE_DIRS tag can be used to specify one or more directories that
 # contain dia files that are included in the documentation (see the \diafile
@@ -2496,10 +2724,10 @@ MSCFILE_DIRS           =
 DIAFILE_DIRS           =
 
 # When using plantuml, the PLANTUML_JAR_PATH tag should be used to specify the
-# path where java can find the plantuml.jar file. If left blank, it is assumed
-# PlantUML is not used or called during a preprocessing step. Doxygen will
-# generate a warning when it encounters a \startuml command in this case and
-# will not generate output for the diagram.
+# path where java can find the plantuml.jar file or to the filename of jar file
+# to be used. If left blank, it is assumed PlantUML is not used or called during
+# a preprocessing step. Doxygen will generate a warning when it encounters a
+# \startuml command in this case and will not generate output for the diagram.
 
 PLANTUML_JAR_PATH      =
 
@@ -2537,18 +2765,6 @@ DOT_GRAPH_MAX_NODES    = 50
 
 MAX_DOT_GRAPH_DEPTH    = 0
 
-# Set the DOT_TRANSPARENT tag to YES to generate images with a transparent
-# background. This is disabled by default, because dot on Windows does not seem
-# to support this out of the box.
-#
-# Warning: Depending on the platform used, enabling this option may lead to
-# badly anti-aliased labels on the edges of a graph (i.e. they become hard to
-# read).
-# The default value is: NO.
-# This tag requires that the tag HAVE_DOT is set to YES.
-
-DOT_TRANSPARENT        = NO
-
 # Set the DOT_MULTI_TARGETS tag to YES to allow dot to generate multiple output
 # files in one run (i.e. multiple -o and -T options on the command line). This
 # makes dot run faster, but since only newer versions of dot (>1.8.10) support
@@ -2561,14 +2777,34 @@ DOT_MULTI_TARGETS      = NO
 # If the GENERATE_LEGEND tag is set to YES doxygen will generate a legend page
 # explaining the meaning of the various boxes and arrows in the dot generated
 # graphs.
+# Note: This tag requires that UML_LOOK isn't set, i.e. the doxygen internal
+# graphical representation for inheritance and collaboration diagrams is used.
 # The default value is: YES.
 # This tag requires that the tag HAVE_DOT is set to YES.
 
 GENERATE_LEGEND        = YES
 
-# If the DOT_CLEANUP tag is set to YES, doxygen will remove the intermediate dot
+# If the DOT_CLEANUP tag is set to YES, doxygen will remove the intermediate
 # files that are used to generate the various graphs.
+#
+# Note: This setting is not only used for dot files but also for msc temporary
+# files.
 # The default value is: YES.
-# This tag requires that the tag HAVE_DOT is set to YES.
 
 DOT_CLEANUP            = YES
+
+# You can define message sequence charts within doxygen comments using the \msc
+# command. If the MSCGEN_TOOL tag is left empty (the default), then doxygen will
+# use a built-in version of mscgen tool to produce the charts. Alternatively,
+# the MSCGEN_TOOL tag can also specify the name of an external tool. For
+# instance, specifying prog as the value, doxygen will call the tool as prog -T
+# <outfile_format> -o <outputfile> <inputfile>. The external tool should support
+# output file formats "png", "eps", "svg", and "ismap".
+
+MSCGEN_TOOL            =
+
+# The MSCFILE_DIRS tag can be used to specify one or more directories that
+# contain msc files that are included in the documentation (see the \mscfile
+# command).
+
+MSCFILE_DIRS           =
diff --git a/docs/how-to/debugging.rst b/docs/how-to/debugging.rst
new file mode 100644
index 0000000000..7d6165d01c
--- /dev/null
+++ b/docs/how-to/debugging.rst
@@ -0,0 +1,381 @@
+.. meta::
+   :description: How to debug using HIP.
+   :keywords: AMD, ROCm, HIP, debugging, ltrace, ROCgdb, WinDbg
+
+*************************************************************************
+Debugging with HIP
+*************************************************************************
+
+Tools for debugging HIP applications include *ltrace* and *ROCgdb*. External tools are available and can be found
+online. For example, if you're using Windows, you can use *Microsoft Visual Studio* and *WinDbg*.
+
+You can trace and debug your code using the following tools and techniques.
+
+Tracing
+================================================
+
+You can use tracing to quickly observe the flow of an application before reviewing the detailed
+information provided by a command-line debugger. Tracing can be used to identify issues ranging
+from accidental API calls to calls made on a critical path.
+
+ltrace is a standard Linux tool that provides a message to ``stderr`` on every dynamic library call. You
+can use ltrace to visualize the runtime behavior of the entire ROCm software stack.
+
+Here's a simple command-line example that uses ltrace to trace HIP APIs and output:
+
+.. code:: console
+
+    $ ltrace -C -e "hip*" ./hipGetChanDesc
+    hipGetChanDesc->hipCreateChannelDesc(0x7ffdc4b66860, 32, 0, 0) = 0x7ffdc4b66860
+    hipGetChanDesc->hipMallocArray(0x7ffdc4b66840, 0x7ffdc4b66860, 8, 8) = 0
+    hipGetChanDesc->hipGetChannelDesc(0x7ffdc4b66848, 0xa63990, 5, 1) = 0
+    hipGetChanDesc->hipFreeArray(0xa63990, 0, 0x7f8c7fe13778, 0x7ffdc4b66848) = 0
+    PASSED!
+    +++ exited (status 0) +++
+
+
+Here's another example that uses ltrace to trace only HSA APIs and output:
+
+.. code:: console
+
+    $ ltrace -C -e "hsa*" ./hipGetChanDesc
+    libamdhip64.so.4->hsa_init(0, 0x7fff325a69d0, 0x9c80e0, 0 <unfinished ...>
+    libhsa-runtime64.so.1->hsaKmtOpenKFD(0x7fff325a6590, 0x9c38c0, 0, 1) = 0
+    libhsa-runtime64.so.1->hsaKmtGetVersion(0x7fff325a6608, 0, 0, 0) = 0
+    libhsa-runtime64.so.1->hsaKmtReleaseSystemProperties(3, 0x80084b01, 0, 0) = 0
+    libhsa-runtime64.so.1->hsaKmtAcquireSystemProperties(0x7fff325a6610, 0, 0, 1) = 0
+    libhsa-runtime64.so.1->hsaKmtGetNodeProperties(0, 0x7fff325a66a0, 0, 0) = 0
+    libhsa-runtime64.so.1->hsaKmtGetNodeMemoryProperties(0, 1, 0x9c42b0, 0x936012) = 0
+    ...
+    <... hsaKmtCreateEvent resumed> )                = 0
+    libhsa-runtime64.so.1->hsaKmtAllocMemory(0, 4096, 64, 0x7fff325a6690) = 0
+    libhsa-runtime64.so.1->hsaKmtMapMemoryToGPUNodes(0x7f1202749000, 4096, 0x7fff325a6690, 0) = 0
+    libhsa-runtime64.so.1->hsaKmtCreateEvent(0x7fff325a6700, 0, 0, 0x7fff325a66f0) = 0
+    libhsa-runtime64.so.1->hsaKmtAllocMemory(1, 0x100000000, 576, 0x7fff325a67d8) = 0
+    libhsa-runtime64.so.1->hsaKmtAllocMemory(0, 8192, 64, 0x7fff325a6790) = 0
+    libhsa-runtime64.so.1->hsaKmtMapMemoryToGPUNodes(0x7f120273c000, 8192, 0x7fff325a6790, 0) = 0
+    libhsa-runtime64.so.1->hsaKmtAllocMemory(0, 4096, 4160, 0x7fff325a6450) = 0
+    libhsa-runtime64.so.1->hsaKmtMapMemoryToGPUNodes(0x7f120273a000, 4096, 0x7fff325a6450, 0) = 0
+    libhsa-runtime64.so.1->hsaKmtSetTrapHandler(1, 0x7f120273a000, 4096, 0x7f120273c000) = 0
+    <... hsa_init resumed> )                         = 0
+    libamdhip64.so.4->hsa_system_get_major_extension_table(513, 1, 24, 0x7f1202597930) = 0
+    libamdhip64.so.4->hsa_iterate_agents(0x7f120171f050, 0, 0x7fff325a67f8, 0 <unfinished ...>
+    libamdhip64.so.4->hsa_agent_get_info(0x94f110, 17, 0x7fff325a67e8, 0) = 0
+    libamdhip64.so.4->hsa_amd_agent_iterate_memory_pools(0x94f110, 0x7f1201722816, 0x7fff325a67f0, 0x7f1201722816 <unfinished ...>
+    libamdhip64.so.4->hsa_amd_memory_pool_get_info(0x9c7fb0, 0, 0x7fff325a6744, 0x7fff325a67f0) = 0
+    libamdhip64.so.4->hsa_amd_memory_pool_get_info(0x9c7fb0, 1, 0x7fff325a6748, 0x7f1200d82df4) = 0
+    ...
+    <... hsa_amd_agent_iterate_memory_pools resumed> ) = 0
+    libamdhip64.so.4->hsa_agent_get_info(0x9dbf30, 17, 0x7fff325a67e8, 0) = 0
+    <... hsa_iterate_agents resumed> )               = 0
+    libamdhip64.so.4->hsa_agent_get_info(0x9dbf30, 0, 0x7fff325a6850, 3) = 0
+    libamdhip64.so.4->hsa_agent_get_info(0x9dbf30, 0xa000, 0x9e7cd8, 0) = 0
+    libamdhip64.so.4->hsa_agent_iterate_isas(0x9dbf30, 0x7f1201720411, 0x7fff325a6760, 0x7f1201720411) = 0
+    libamdhip64.so.4->hsa_isa_get_info_alt(0x94e7c8, 0, 0x7fff325a6728, 1) = 0
+    libamdhip64.so.4->hsa_isa_get_info_alt(0x94e7c8, 1, 0x9e7f90, 0) = 0
+    libamdhip64.so.4->hsa_agent_get_info(0x9dbf30, 4, 0x9e7ce8, 0) = 0
+    ...
+    <... hsa_amd_memory_pool_allocate resumed> )     = 0
+    libamdhip64.so.4->hsa_ext_image_create(0x9dbf30, 0xa1c4c8, 0x7f10f2800000, 3 <unfinished ...>
+    libhsa-runtime64.so.1->hsaKmtAllocMemory(0, 4096, 64, 0x7fff325a6740) = 0
+    libhsa-runtime64.so.1->hsaKmtQueryPointerInfo(0x7f1202736000, 0x7fff325a65e0, 0, 0) = 0
+    libhsa-runtime64.so.1->hsaKmtMapMemoryToGPUNodes(0x7f1202736000, 4096, 0x7fff325a66e8, 0) = 0
+    <... hsa_ext_image_create resumed> )             = 0
+    libamdhip64.so.4->hsa_ext_image_destroy(0x9dbf30, 0x7f1202736000, 0x9dbf30, 0 <unfinished ...>
+    libhsa-runtime64.so.1->hsaKmtUnmapMemoryToGPU(0x7f1202736000, 0x7f1202736000, 4096, 0x9c8050) = 0
+    libhsa-runtime64.so.1->hsaKmtFreeMemory(0x7f1202736000, 4096, 0, 0) = 0
+    <... hsa_ext_image_destroy resumed> )            = 0
+    libamdhip64.so.4->hsa_amd_memory_pool_free(0x7f10f2800000, 0x7f10f2800000, 256, 0x9e76f0) = 0
+    PASSED!
+
+Debugging
+================================================
+
+You can use ROCgdb to debug HIP applications.
+
+ROCgdb is the ROCm source-level debugger for Linux, based on the GNU Project debugger (GDB). It is
+the equivalent of cuda-gdb and can be used with debugger frontends such as Eclipse, VS Code, or gdb-dashboard.
+For details, see https://github.com/ROCm-Developer-Tools/ROCgdb.
+
+The following example shows how to use ROCgdb to run and debug a HIP application. ROCgdb is installed with the ROCm package in the ``/opt/rocm/bin`` folder.
+
+.. code:: console
+
+    $ export PATH=$PATH:/opt/rocm/bin
+    $ rocgdb ./hipTexObjPitch
+    GNU gdb (rocm-dkms-no-npi-hipclang-6549) 10.1
+    Copyright (C) 2020 Free Software Foundation, Inc.
+    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
+    ...
+    For bug reporting instructions, please see:
+    <https://github.com/ROCm-Developer-Tools/ROCgdb/issues>.
+    Find the GDB manual and other documentation resources online at:
+        <http://www.gnu.org/software/gdb/documentation/>.
+
+    ...
+    Reading symbols from ./hipTexObjPitch...
+    (gdb) break main
+    Breakpoint 1 at 0x4013d1: file /home/test/hip/tests/src/texture/hipTexObjPitch.cpp, line 98.
+    (gdb) run
+    Starting program: /home/test/hip/build/directed_tests/texture/hipTexObjPitch
+    [Thread debugging using libthread_db enabled]
+    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
+
+    Breakpoint 1, main ()
+        at /home/test/hip/tests/src/texture/hipTexObjPitch.cpp:98
+    98	    texture2Dtest<float>();
+    (gdb) c
+
+Debugging HIP applications
+--------------------------------------------------------------------------------------------
+
+The following Linux example shows how to get useful information from the debugger while running a
+simple memory copy test that causes a segmentation fault.
+
+.. code:: console
+
+    test: simpleTest2<?> numElements=4194304 sizeElements=4194304 bytes
+    Segmentation fault (core dumped)
+
+    (gdb) run
+    Starting program: /home/test/hipamd/build/directed_tests/runtimeApi/memory/hipMemcpy_simple
+    [Thread debugging using libthread_db enabled]
+    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
+
+    Breakpoint 1, main (argc=1, argv=0x7fffffffdea8)
+        at /home/test/hip/tests/src/runtimeApi/memory/hipMemcpy_simple.cpp:147
+    147     int main(int argc, char* argv[]) {
+    (gdb) c
+    Continuing.
+    [New Thread 0x7ffff64c4700 (LWP 146066)]
+
+    Thread 1 "hipMemcpy_simpl" received signal SIGSEGV, Segmentation fault.
+    0x000000000020f78e in simpleTest2<float> (numElements=4194304, usePinnedHost=true)
+        at /home/test/hip/tests/src/runtimeApi/memory/hipMemcpy_simple.cpp:104
+    104             A_h1[i] = 3.14f + 1000 * i;
+    (gdb) bt
+    #0  0x000000000020f78e in simpleTest2<float> (numElements=4194304, usePinnedHost=true)
+        at /home/test/hip/tests/src/runtimeApi/memory/hipMemcpy_simple.cpp:104
+    #1  0x000000000020e96c in main (argc=<optimized out>, argv=<optimized out>)
+        at /home/test/hip/tests/src/runtimeApi/memory/hipMemcpy_simple.cpp:163
+    (gdb) info thread
+    Id   Target Id                                            Frame
+    * 1    Thread 0x7ffff64c5880 (LWP 146060) "hipMemcpy_simpl" 0x000000000020f78e in simpleTest2<float> (numElements=4194304, usePinnedHost=true)
+        at /home/test/hip/tests/src/runtimeApi/memory/hipMemcpy_simple.cpp:104
+    2    Thread 0x7ffff64c4700 (LWP 146066) "hipMemcpy_simpl" 0x00007ffff6b0850b in ioctl
+        () from /lib/x86_64-linux-gnu/libc.so.6
+    (gdb) thread 2
+    [Switching to thread 2 (Thread 0x7ffff64c4700 (LWP 146066))]
+    #0  0x00007ffff6b0850b in ioctl () from /lib/x86_64-linux-gnu/libc.so.6
+    (gdb) bt
+    #0  0x00007ffff6b0850b in ioctl () from /lib/x86_64-linux-gnu/libc.so.6
+    #1  0x00007ffff6604568 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
+    #2  0x00007ffff65fe73a in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
+    #3  0x00007ffff659e4d6 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
+    #4  0x00007ffff65807de in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
+    #5  0x00007ffff65932a2 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
+    #6  0x00007ffff654f547 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
+    #7  0x00007ffff7f76609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
+    #8  0x00007ffff6b13293 in clone () from /lib/x86_64-linux-gnu/libc.so.6
+    (gdb) thread 1
+    [Switching to thread 1 (Thread 0x7ffff64c5880 (LWP 146060))]
+    #0  0x000000000020f78e in simpleTest2<float> (numElements=4194304, usePinnedHost=true)
+        at /home/test/hip/tests/src/runtimeApi/memory/hipMemcpy_simple.cpp:104
+    104             A_h1[i] = 3.14f + 1000 * i;
+    (gdb) bt
+    #0  0x000000000020f78e in simpleTest2<float> (numElements=4194304, usePinnedHost=true)
+        at /home/test/hip/tests/src/runtimeApi/memory/hipMemcpy_simple.cpp:104
+    #1  0x000000000020e96c in main (argc=<optimized out>, argv=<optimized out>)
+        at /home/test/hip/tests/src/runtimeApi/memory/hipMemcpy_simple.cpp:163
+    (gdb)
+    ...
+
+Debugging HIP applications using Windows tools can be more informative than on Linux. Windows
+tools provide more visibility into the code being debugged, which makes it easier to inspect variables,
+watch multiple details, and examine call stacks.
+
+Useful environment variables
+===================================================
+
+HIP provides environment variables that allow HIP, hip-clang, or HSA drivers to disable certain features
+and optimizations. These are not intended for production, but can be useful to diagnose
+synchronization problems in the application (or driver).
+
+Some of the more widely used environment variables are described in this section. These are
+supported on the Linux ROCm path and Windows.
+
+Kernel enqueue serialization
+---------------------------------------------------------------------------------
+
+You can control kernel command serialization from the host:
+
+``AMD_SERIALIZE_KERNEL``, for serializing kernel enqueue:
+ | ``AMD_SERIALIZE_KERNEL=1``: Wait for completion before enqueue
+ | ``AMD_SERIALIZE_KERNEL=2``: Wait for completion after enqueue
+ | ``AMD_SERIALIZE_KERNEL=3``: Both
+
+Or
+
+``AMD_SERIALIZE_COPY``, for serializing copies:
+ | ``AMD_SERIALIZE_COPY=1``: Wait for completion before enqueue
+ | ``AMD_SERIALIZE_COPY=2``: Wait for completion after enqueue
+ | ``AMD_SERIALIZE_COPY=3``: Both
+
+With these settings, the HIP runtime waits for the GPU to become idle before or after each GPU
+command, depending on the value.
+
+Making device visible
+---------------------------------------------------------------------------------
+
+For systems with multiple devices, you can choose to make only certain device(s) visible to HIP using
+``HIP_VISIBLE_DEVICES`` (or ``CUDA_VISIBLE_DEVICES`` on an NVIDIA platform). Once enabled, HIP can
+only view devices that have indices present in the sequence. For example:
+
+.. code:: console
+
+    $ export HIP_VISIBLE_DEVICES=0,1
+
+Or in the application:
+
+.. code:: cpp
+
+    if (totalDeviceNum > 2) {
+        setenv("HIP_VISIBLE_DEVICES", "0,1,2", 1);
+        assert(getDeviceNumber(false) == 3);
+        // ...
+    }
+
+Dump code object
+---------------------------------------------------------------------------------
+
+To analyze compiler-related issues, you can dump the code object by setting
+``GPU_DUMP_CODE_OBJECT``.
+
+HSA-related environment variables (Linux)
+-----------------------------------------------------------------------------------------------
+
+HSA provides environment variables that help analyze issues in drivers or hardware.
+
+* To isolate issues with hardware copy engines, you can use ``HSA_ENABLE_SDMA``.
+
+    ``HSA_ENABLE_SDMA=0`` causes host-to-device and device-to-host copies to use compute shader
+    blit kernels, rather than the dedicated DMA copy engines. Compute shader copies have low latency
+    (typically < 5 us) and can achieve approximately 80% of the bandwidth of the DMA copy engine.
+
+* To diagnose interrupt storm issues in the driver, you can use ``HSA_ENABLE_INTERRUPT``.
+
+    ``HSA_ENABLE_INTERRUPT=0`` causes completion signals to be detected with memory-based
+    polling, rather than interrupts.
+
+HIP environment variable summary
+-----------------------------------------------------------------------------------------------
+
+Here are some of the more commonly used environment variables:
+
+.. list-table::
+
+    * - **Environment variable**
+      - **Default value**
+      - **Usage**
+    * - | AMD_LOG_LEVEL
+        | Enable HIP logging at different levels
+      - 0
+      - | 0: Disable log
+        | 1: Enable log on error level
+        | 2: Enable log on warning and below levels
+        | 3: Enable log on information and below levels
+        | 4: Enable log on debug and below levels
+
+    * - | AMD_LOG_MASK
+        | Select the functionality types that HIP logs
+      - 0x7FFFFFFF
+      - | 0x1: Log API calls
+        | 0x2: Kernel and copy commands and barriers
+        | 0x4: Synchronization and waiting for commands to finish
+        | 0x8: Decode and display AQL packets
+        | 0x10: Queue commands and queue contents
+        | 0x20: Signal creation, allocation, pool
+        | 0x40: Locks and thread-safety code
+        | 0x80: Kernel creations and arguments
+        | 0x100: Copy debug
+        | 0x200: Detailed copy debug
+        | 0x400: Resource allocation, performance-impacting events
+        | 0x800: Initialization and shutdown
+        | 0x1000: Misc debug, not yet classified
+        | 0x2000: Show raw bytes of AQL packet
+        | 0x4000: Show code creation debug
+        | 0x8000: More detailed command info, including barrier commands
+        | 0x10000: Log message location
+        | 0xFFFFFFFF: Log always even mask flag is zero
+    * - | HIP_VISIBLE_DEVICES (or CUDA_VISIBLE_DEVICES)
+        | Only devices whose index is present in the sequence are visible to HIP
+      -
+      - 0,1,2: Depending on the number of devices on the system
+
+    * - | GPU_DUMP_CODE_OBJECT
+        | Dump code object
+      - 0
+      - | 0: Disable
+        | 1: Enable
+
+    * - | AMD_SERIALIZE_KERNEL
+        | Serialize kernel enqueue
+      - 0
+      - | 1: Wait for completion before enqueue
+        | 2: Wait for completion after enqueue
+        | 3: Both
+
+    * - | AMD_SERIALIZE_COPY
+        | Serialize copies
+      - 0
+      - | 1: Wait for completion before enqueue
+        | 2: Wait for completion after enqueue
+        | 3: Both
+
+    * - | HIP_HOST_COHERENT
+        | Coherent memory in hipHostMalloc
+      - 0
+      - | 0: Memory is not coherent between host and GPU
+        | 1: Memory is coherent with host
+
+    * - | AMD_DIRECT_DISPATCH
+        | Enable direct kernel dispatch (currently for Linux; under development for Windows)
+      - 1
+      - | 0: Disable
+        | 1: Enable
+
+    * - | GPU_MAX_HW_QUEUES
+        | The maximum number of hardware queues allocated per device
+      - 4
+      - The variable controls how many independent hardware queues HIP runtime can create per process,
+        per device. If an application allocates more HIP streams than this number, then HIP runtime reuses
+        the same hardware queues for the new streams in a round-robin manner. Note that this maximum
+        number does not apply to hardware queues that are created for CU-masked HIP streams, or
+        cooperative queues for HIP Cooperative Groups (single queue per device).
+
+General debugging tips
+======================================================
+
+* ``gdb --args`` can be used to pass the executable and arguments to gdb.
+* You can set environment variables (``set env``) from within GDB on Linux (note that this command
+  doesn't use an equal sign (=)):
+
+  .. code:: bash
+
+      (gdb) set env AMD_SERIALIZE_KERNEL 3
+
+* The GDB backtrace shows a path in the runtime. This is because a fault is caught by the runtime, but
+  it is generated by an asynchronous command running on the GPU.
+* To determine the true location of a fault, you can force the kernels to run synchronously by setting
+  the environment variables ``AMD_SERIALIZE_KERNEL=3`` and ``AMD_SERIALIZE_COPY=3``. This
+  forces HIP runtime to wait for the kernel to finish running before returning. If the fault occurs when
+  a kernel is running, you can see the code that launched the kernel inside the backtrace. The thread
+  that's causing the issue is typically the one inside ``libhsa-runtime64.so``.
+* VM faults inside kernels can be caused by:
+
+  * Incorrect code (e.g., a for loop that extends past array boundaries)
+  * Memory issues, such as invalid kernel arguments (null pointers, unregistered host pointers, bad
+    pointers)
+  * Synchronization issues
+  * Compiler issues (incorrect code generation from the compiler)
+  * Runtime issues
diff --git a/docs/how-to/logging.rst b/docs/how-to/logging.rst
new file mode 100644
index 0000000000..e8dd2f3061
--- /dev/null
+++ b/docs/how-to/logging.rst
@@ -0,0 +1,235 @@
+.. meta::
+   :description: HIP provides a logging mechanism that allows you to trace HIP API calls and runtime code
+      when running a HIP application.
+   :keywords: AMD, ROCm, HIP, logging
+
+**********************************************************
+Generate logs
+**********************************************************
+
+HIP provides a logging mechanism that allows you to trace HIP API calls and runtime code when running a
+HIP application. These logs are useful to users and developers, and the HIP development team also uses
+them to improve the HIP runtime.
+
+By adjusting the logging settings and logging mask, you can get different types of information for
+different functionalities, such as HIP APIs, executed kernels, queue commands, and queue contents.
+Refer to the following sections for examples.
+
+.. tip::
+  Logging works for the release and debug versions of HIP. To save the logging output to a file, redirect it when you run the application from the command line. For example:
+
+  ..  code-block:: bash
+
+    user@user-test:~/hip/bin$ ./hipinfo > ~/hip_log.txt
+
+Logging level
+======================================
+
+HIP logging is disabled by default. You can enable it via the ``AMD_LOG_LEVEL`` environment variable.
+The value of this variable controls your logging level. Levels are defined as follows:
+
+..  code-block:: cpp
+
+  enum LogLevel {
+    LOG_NONE    = 0,
+    LOG_ERROR   = 1,
+    LOG_WARNING = 2,
+    LOG_INFO    = 3,
+    LOG_DEBUG   = 4
+  };
+
+.. tip::
+  You can call a logging function with different logging levels. All messages at or below the level set for
+  ``AMD_LOG_LEVEL`` are printed.
+
+Logging mask
+======================================
+
+The logging mask selects which types of functionality are logged when you're running a HIP application.
+Once you set ``AMD_LOG_LEVEL``, the logging mask is set as the default value (``0x7FFFFFFF``). You can
+change this to any of the valid values:
+
+..  code-block:: cpp
+
+  enum LogMask {
+    LOG_API       = 0x00000001, //!< API call
+    LOG_CMD       = 0x00000002, //!< Kernel and Copy Commands and Barriers
+    LOG_WAIT      = 0x00000004, //!< Synchronization and waiting for commands to finish
+    LOG_AQL       = 0x00000008, //!< Decode and display AQL packets
+    LOG_QUEUE     = 0x00000010, //!< Queue commands and queue contents
+    LOG_SIG       = 0x00000020, //!< Signal creation, allocation, pool
+    LOG_LOCK      = 0x00000040, //!< Locks and thread-safety code.
+    LOG_KERN      = 0x00000080, //!< kernel creations and arguments, etc.
+    LOG_COPY      = 0x00000100, //!< Copy debug
+    LOG_COPY2     = 0x00000200, //!< Detailed copy debug
+    LOG_RESOURCE  = 0x00000400, //!< Resource allocation, performance-impacting events.
+    LOG_INIT      = 0x00000800, //!< Initialization and shutdown
+    LOG_MISC      = 0x00001000, //!< misc debug, not yet classified
+    LOG_AQL2      = 0x00002000, //!< Show raw bytes of AQL packet
+    LOG_CODE      = 0x00004000, //!< Show code creation debug
+    LOG_CMD2      = 0x00008000, //!< More detailed command info, including barrier commands
+    LOG_LOCATION  = 0x00010000, //!< Log message location
+    LOG_ALWAYS    = 0xFFFFFFFF, //!< Log always even mask flag is zero
+  };
+
+You can also define the logging mask via the ``AMD_LOG_MASK`` environment variable.
+
+Logging command
+======================================
+
+HIP uses the following macro to print logging information:
+
+..  code-block:: cpp
+
+  #define ClPrint(level, mask, format, ...)                                       \
+    do {                                                                          \
+      if (AMD_LOG_LEVEL >= level) {                                               \
+        if (AMD_LOG_MASK & mask || mask == amd::LOG_ALWAYS) {                     \
+          if (AMD_LOG_MASK & amd::LOG_LOCATION) {                                 \
+            amd::log_printf(level, __FILENAME__, __LINE__, format, ##__VA_ARGS__);\
+          } else {                                                                \
+            amd::log_printf(level, "", 0, format, ##__VA_ARGS__);                 \
+          }                                                                       \
+        }                                                                         \
+      }                                                                           \
+    } while (false)
+
+
+In HIP code, call the ``ClPrint()`` macro with the desired arguments. For example:
+
+..  code-block:: cpp
+
+  ClPrint(amd::LOG_INFO, amd::LOG_INIT, "Initializing HSA stack.");
+
+
+Logging examples
+======================================
+
+On **Linux**, you can enable HIP logging and retrieve logging information when you run ``hipinfo``.
+
+..  code-block:: console
+
+  user@user-test:~/hip/bin$ export AMD_LOG_LEVEL=4
+  user@user-test:~/hip/bin$ ./hipinfo
+
+  :3:rocdevice.cpp            :453 : 23647210092: Initializing HSA stack.
+  :3:comgrctx.cpp             :33  : 23647639336: Loading COMGR library.
+  :3:rocdevice.cpp            :203 : 23647687108: Numa select cpu agent[0]=0x13407c0(fine=0x13409a0,coarse=0x1340ad0) for gpu agent=0x1346150
+  :4:runtime.cpp              :82  : 23647698669: init
+  :3:hip_device_runtime.cpp   :473 : 23647698869: 5617 : [7fad295dd840] hipGetDeviceCount: Returned hipSuccess
+  :3:hip_device_runtime.cpp   :502 : 23647698990: 5617 : [7fad295dd840] hipSetDevice ( 0 )
+  :3:hip_device_runtime.cpp   :507 : 23647699042: 5617 : [7fad295dd840] hipSetDevice: Returned hipSuccess
+  --------------------------------------------------------------------------------
+  device#                           0
+  :3:hip_device.cpp           :150 : 23647699276: 5617 : [7fad295dd840] hipGetDeviceProperties ( 0x7ffdbe7db730, 0 )
+  :3:hip_device.cpp           :237 : 23647699335: 5617 : [7fad295dd840] hipGetDeviceProperties: Returned hipSuccess
+  Name:                             Device 7341
+  pciBusID:                         3
+  pciDeviceID:                      0
+  pciDomainID:                      0
+  multiProcessorCount:              11
+  maxThreadsPerMultiProcessor:      2560
+  isMultiGpuBoard:                  0
+  clockRate:                        1900 Mhz
+  memoryClockRate:                  875 Mhz
+  memoryBusWidth:                   0
+  clockInstructionRate:             1000 Mhz
+  totalGlobalMem:                   7.98 GB
+  maxSharedMemoryPerMultiProcessor: 64.00 KB
+  totalConstMem:                    8573157376
+  sharedMemPerBlock:                64.00 KB
+  canMapHostMemory:                 1
+  regsPerBlock:                     0
+  warpSize:                         32
+  l2CacheSize:                      0
+  computeMode:                      0
+  maxThreadsPerBlock:               1024
+  maxThreadsDim.x:                  1024
+  maxThreadsDim.y:                  1024
+  maxThreadsDim.z:                  1024
+  maxGridSize.x:                    2147483647
+  maxGridSize.y:                    2147483647
+  maxGridSize.z:                    2147483647
+  major:                            10
+  minor:                            12
+  concurrentKernels:                1
+  cooperativeLaunch:                0
+  cooperativeMultiDeviceLaunch:     0
+  arch.hasGlobalInt32Atomics:       1
+  ...
+  gcnArch:                          1012
+  isIntegrated:                     0
+  maxTexture1D:                     65536
+  maxTexture2D.width:               16384
+  maxTexture2D.height:              16384
+  maxTexture3D.width:               2048
+  maxTexture3D.height:              2048
+  maxTexture3D.depth:               2048
+  isLargeBar:                       0
+  :3:hip_device_runtime.cpp   :471 : 23647701557: 5617 : [7fad295dd840] hipGetDeviceCount ( 0x7ffdbe7db714 )
+  :3:hip_device_runtime.cpp   :473 : 23647701608: 5617 : [7fad295dd840] hipGetDeviceCount: Returned hipSuccess
+  :3:hip_peer.cpp             :76  : 23647701731: 5617 : [7fad295dd840] hipDeviceCanAccessPeer ( 0x7ffdbe7db728, 0, 0 )
+  :3:hip_peer.cpp             :60  : 23647701784: 5617 : [7fad295dd840] canAccessPeer: Returned hipSuccess
+  :3:hip_peer.cpp             :77  : 23647701831: 5617 : [7fad295dd840] hipDeviceCanAccessPeer: Returned hipSuccess
+  peers:
+  :3:hip_peer.cpp             :76  : 23647701921: 5617 : [7fad295dd840] hipDeviceCanAccessPeer ( 0x7ffdbe7db728, 0, 0 )
+  :3:hip_peer.cpp             :60  : 23647701965: 5617 : [7fad295dd840] canAccessPeer: Returned hipSuccess
+  :3:hip_peer.cpp             :77  : 23647701998: 5617 : [7fad295dd840] hipDeviceCanAccessPeer: Returned hipSuccess
+  non-peers:                        device#0
+
+  :3:hip_memory.cpp           :345 : 23647702191: 5617 : [7fad295dd840] hipMemGetInfo ( 0x7ffdbe7db718, 0x7ffdbe7db720 )
+  :3:hip_memory.cpp           :360 : 23647702243: 5617 : [7fad295dd840] hipMemGetInfo: Returned hipSuccess
+  memInfo.total:                    7.98 GB
+  memInfo.free:                     7.98 GB (100%)
+
+
+On **Windows**, you can set the ``AMD_LOG_LEVEL`` environment variable from the advanced system
+settings or the command prompt (when run as administrator). The following example shows debug log
+information when calling the backend runtime.
+
+..  code-block:: bash
+
+  C:\hip\bin>set AMD_LOG_LEVEL=4
+  C:\hip\bin>hipinfo
+  :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\comgrctx.cpp:33  : 605413686305 us: 29864: [tid:0x9298] Loading COMGR library.
+  :4:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\platform\runtime.cpp:83  : 605413869411 us: 29864: [tid:0x9298] init
+  :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_context.cpp:47  : 605413869502 us: 29864: [tid:0x9298] Direct Dispatch: 0
+  :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device_runtime.cpp:543 : 605413870553 us: 29864: [tid:0x9298] hipGetDeviceCount: Returned hipSuccess :
+  :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device_runtime.cpp:556 : 605413870631 us: 29864: [tid:0x9298] ←[32m hipSetDevice ( 0 ) ←[0m
+  :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device_runtime.cpp:561 : 605413870848 us: 29864: [tid:0x9298] hipSetDevice: Returned hipSuccess :
+  --------------------------------------------------------------------------------
+  device#                           0
+  :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device.cpp:346 : 605413871623 us: 29864: [tid:0x9298] ←[32m hipGetDeviceProperties ( 0000008AEBEFF8C8, 0 ) ←[0m
+  :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device.cpp:348 : 605413871695 us: 29864: [tid:0x9298] hipGetDeviceProperties: Returned hipSuccess :
+  Name:                             AMD Radeon(TM) Graphics
+  pciBusID:                         3
+  pciDeviceID:                      0
+  pciDomainID:                      0
+  multiProcessorCount:              7
+  maxThreadsPerMultiProcessor:      2560
+  isMultiGpuBoard:                  0
+  clockRate:                        1600 Mhz
+  memoryClockRate:                  1333 Mhz
+  memoryBusWidth:                   0
+  totalGlobalMem:                   12.06 GB
+  totalConstMem:                    2147483647
+  sharedMemPerBlock:                64.00 KB
+  ...
+  gcnArchName:                      gfx90c:xnack-
+  :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device_runtime.cpp:541 : 605413924779 us: 29864: [tid:0x9298] ←[32m hipGetDeviceCount ( 0000008AEBEFF8A4 ) ←[0m
+  :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device_runtime.cpp:543 : 605413925075 us: 29864: [tid:0x9298] hipGetDeviceCount: Returned hipSuccess :
+  peers:                            :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_peer.cpp:176 : 605413928643 us: 29864: [tid:0x9298] ←[32m hipDeviceCanAccessPeer ( 0000008AEBEFF890, 0, 0 ) ←[0m
+  :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_peer.cpp:177 : 605413928743 us: 29864: [tid:0x9298] hipDeviceCanAccessPeer: Returned hipSuccess :
+  non-peers:                        :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_peer.cpp:176 : 605413930830 us: 29864: [tid:0x9298] ←[32m hipDeviceCanAccessPeer ( 0000008AEBEFF890, 0, 0 ) ←[0m
+  :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_peer.cpp:177 : 605413930882 us: 29864: [tid:0x9298] hipDeviceCanAccessPeer: Returned hipSuccess :
+  device#0
+  ...
+  :4:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\pal\palmemory.cpp:430 : 605414517802 us: 29864: [tid:0x9298] Free-:     8000 bytes, VM[ 3007c8000,  3007d0000]
+  :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\devprogram.cpp:2979: 605414517893 us: 29864: [tid:0x9298] For Init/Fini: Kernel Name: __amd_rocclr_copyBufferToImage
+  :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\devprogram.cpp:2979: 605414518259 us: 29864: [tid:0x9298] For Init/Fini: Kernel Name: __amd_rocclr_copyBuffer
+  ...
+  :4:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\pal\palmemory.cpp:206 : 605414523422 us: 29864: [tid:0x9298] Alloc: 100000 bytes, ptr[00000003008D0000-00000003009D0000], obj[00000003007D0000-00000003047D0000]
+  :4:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\pal\palmemory.cpp:206 : 605414523767 us: 29864: [tid:0x9298] Alloc: 100000 bytes, ptr[00000003009D0000-0000000300AD0000], obj[00000003007D0000-00000003047D0000]
+  :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_memory.cpp:681 : 605414524092 us: 29864: [tid:0x9298] hipMemGetInfo: Returned hipSuccess :
+  memInfo.total:                    12.06 GB
+  memInfo.free:                     11.93 GB (99%)
diff --git a/docs/how_to_guides/debugging.md b/docs/how_to_guides/debugging.md
deleted file mode 100644
index 2860740681..0000000000
--- a/docs/how_to_guides/debugging.md
+++ /dev/null
@@ -1,301 +0,0 @@
-# HIP Debugging
-There are some techniques provided in HIP for developers to trace and debug codes during execution, this section describes some details and practical suggestions on debugging.
-
-## Debugging tools
-
-### Using ltrace
-ltrace is a standard linux tool which provides a message to stderr on every dynamic library call.
-Since ROCr and the ROCt (the ROC thunk, which is the thin user-space interface to the ROC kernel driver) are both dynamic libraries, this provides an easy way to trace the activity in these libraries.
-Tracing can be a powerful way to quickly observe the flow of the application before diving into the details with a command-line debugger.
-ltrace is a helpful tool to visualize the runtime behavior of the entire ROCm software stack.
-The trace can also show performance issues related to accidental calls to expensive API calls on the critical path.
-
-Here's a simple sample with command-line to trace hip APIs and output:
-
-```console
-$ ltrace -C -e "hip*" ./hipGetChanDesc
-hipGetChanDesc->hipCreateChannelDesc(0x7ffdc4b66860, 32, 0, 0) = 0x7ffdc4b66860
-hipGetChanDesc->hipMallocArray(0x7ffdc4b66840, 0x7ffdc4b66860, 8, 8) = 0
-hipGetChanDesc->hipGetChannelDesc(0x7ffdc4b66848, 0xa63990, 5, 1) = 0
-hipGetChanDesc->hipFreeArray(0xa63990, 0, 0x7f8c7fe13778, 0x7ffdc4b66848) = 0
-PASSED!
-+++ exited (status 0) +++
-```
-
-Another sample below with command-line only trace hsa APIs and output:
-
-```console
-$ ltrace -C -e "hsa*" ./hipGetChanDesc
-libamdhip64.so.4->hsa_init(0, 0x7fff325a69d0, 0x9c80e0, 0 <unfinished ...>
-libhsa-runtime64.so.1->hsaKmtOpenKFD(0x7fff325a6590, 0x9c38c0, 0, 1) = 0
-libhsa-runtime64.so.1->hsaKmtGetVersion(0x7fff325a6608, 0, 0, 0) = 0
-libhsa-runtime64.so.1->hsaKmtReleaseSystemProperties(3, 0x80084b01, 0, 0) = 0
-libhsa-runtime64.so.1->hsaKmtAcquireSystemProperties(0x7fff325a6610, 0, 0, 1) = 0
-libhsa-runtime64.so.1->hsaKmtGetNodeProperties(0, 0x7fff325a66a0, 0, 0) = 0
-libhsa-runtime64.so.1->hsaKmtGetNodeMemoryProperties(0, 1, 0x9c42b0, 0x936012) = 0
-...
-<... hsaKmtCreateEvent resumed> )                = 0
-libhsa-runtime64.so.1->hsaKmtAllocMemory(0, 4096, 64, 0x7fff325a6690) = 0
-libhsa-runtime64.so.1->hsaKmtMapMemoryToGPUNodes(0x7f1202749000, 4096, 0x7fff325a6690, 0) = 0
-libhsa-runtime64.so.1->hsaKmtCreateEvent(0x7fff325a6700, 0, 0, 0x7fff325a66f0) = 0
-libhsa-runtime64.so.1->hsaKmtAllocMemory(1, 0x100000000, 576, 0x7fff325a67d8) = 0
-libhsa-runtime64.so.1->hsaKmtAllocMemory(0, 8192, 64, 0x7fff325a6790) = 0
-libhsa-runtime64.so.1->hsaKmtMapMemoryToGPUNodes(0x7f120273c000, 8192, 0x7fff325a6790, 0) = 0
-libhsa-runtime64.so.1->hsaKmtAllocMemory(0, 4096, 4160, 0x7fff325a6450) = 0
-libhsa-runtime64.so.1->hsaKmtMapMemoryToGPUNodes(0x7f120273a000, 4096, 0x7fff325a6450, 0) = 0
-libhsa-runtime64.so.1->hsaKmtSetTrapHandler(1, 0x7f120273a000, 4096, 0x7f120273c000) = 0
-<... hsa_init resumed> )                         = 0
-libamdhip64.so.4->hsa_system_get_major_extension_table(513, 1, 24, 0x7f1202597930) = 0
-libamdhip64.so.4->hsa_iterate_agents(0x7f120171f050, 0, 0x7fff325a67f8, 0 <unfinished ...>
-libamdhip64.so.4->hsa_agent_get_info(0x94f110, 17, 0x7fff325a67e8, 0) = 0
-libamdhip64.so.4->hsa_amd_agent_iterate_memory_pools(0x94f110, 0x7f1201722816, 0x7fff325a67f0, 0x7f1201722816 <unfinished ...>
-libamdhip64.so.4->hsa_amd_memory_pool_get_info(0x9c7fb0, 0, 0x7fff325a6744, 0x7fff325a67f0) = 0
-libamdhip64.so.4->hsa_amd_memory_pool_get_info(0x9c7fb0, 1, 0x7fff325a6748, 0x7f1200d82df4) = 0
-...
-<... hsa_amd_agent_iterate_memory_pools resumed> ) = 0
-libamdhip64.so.4->hsa_agent_get_info(0x9dbf30, 17, 0x7fff325a67e8, 0) = 0
-<... hsa_iterate_agents resumed> )               = 0
-libamdhip64.so.4->hsa_agent_get_info(0x9dbf30, 0, 0x7fff325a6850, 3) = 0
-libamdhip64.so.4->hsa_agent_get_info(0x9dbf30, 0xa000, 0x9e7cd8, 0) = 0
-libamdhip64.so.4->hsa_agent_iterate_isas(0x9dbf30, 0x7f1201720411, 0x7fff325a6760, 0x7f1201720411) = 0
-libamdhip64.so.4->hsa_isa_get_info_alt(0x94e7c8, 0, 0x7fff325a6728, 1) = 0
-libamdhip64.so.4->hsa_isa_get_info_alt(0x94e7c8, 1, 0x9e7f90, 0) = 0
-libamdhip64.so.4->hsa_agent_get_info(0x9dbf30, 4, 0x9e7ce8, 0) = 0
-...
-<... hsa_amd_memory_pool_allocate resumed> )     = 0
-libamdhip64.so.4->hsa_ext_image_create(0x9dbf30, 0xa1c4c8, 0x7f10f2800000, 3 <unfinished ...>
-libhsa-runtime64.so.1->hsaKmtAllocMemory(0, 4096, 64, 0x7fff325a6740) = 0
-libhsa-runtime64.so.1->hsaKmtQueryPointerInfo(0x7f1202736000, 0x7fff325a65e0, 0, 0) = 0
-libhsa-runtime64.so.1->hsaKmtMapMemoryToGPUNodes(0x7f1202736000, 4096, 0x7fff325a66e8, 0) = 0
-<... hsa_ext_image_create resumed> )             = 0
-libamdhip64.so.4->hsa_ext_image_destroy(0x9dbf30, 0x7f1202736000, 0x9dbf30, 0 <unfinished ...>
-libhsa-runtime64.so.1->hsaKmtUnmapMemoryToGPU(0x7f1202736000, 0x7f1202736000, 4096, 0x9c8050) = 0
-libhsa-runtime64.so.1->hsaKmtFreeMemory(0x7f1202736000, 4096, 0, 0) = 0
-<... hsa_ext_image_destroy resumed> )            = 0
-libamdhip64.so.4->hsa_amd_memory_pool_free(0x7f10f2800000, 0x7f10f2800000, 256, 0x9e76f0) = 0
-PASSED!
-```
-
-### Using ROCgdb
-HIP developers on ROCm can use AMD's ROCgdb for debugging and profiling.
-ROCgdb is the ROCm source-level debugger for Linux, based on GDB, the GNU source-level debugger, equivalent of cuda-gdb, can be used with debugger frontends, such as eclipse, vscode, or gdb-dashboard.
-For details, see (https://github.com/ROCm-Developer-Tools/ROCgdb).
-
-Below is a sample how to use ROCgdb run and debug HIP application, rocgdb is installed with ROCM package in the folder /opt/rocm/bin.
-
-```console
-$ export PATH=$PATH:/opt/rocm/bin
-$ rocgdb ./hipTexObjPitch
-GNU gdb (rocm-dkms-no-npi-hipclang-6549) 10.1
-Copyright (C) 2020 Free Software Foundation, Inc.
-License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
-...
-For bug reporting instructions, please see:
-<https://github.com/ROCm-Developer-Tools/ROCgdb/issues>.
-Find the GDB manual and other documentation resources online at:
-    <http://www.gnu.org/software/gdb/documentation/>.
-
-...
-Reading symbols from ./hipTexObjPitch...
-(gdb) break main
-Breakpoint 1 at 0x4013d1: file /home/<your_awesome_name>/hip-tests/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp, line 56.
-(gdb) run
-Starting program: MatrixTranspose
-[Thread debugging using libthread_db enabled]
-Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
-
-Breakpoint 1, main ()
-    at MatrixTranspose.cpp:56
-56	    int main() {
-(gdb) c
-
-```
-
-### Other Debugging Tools
-There are also other debugging tools available online developers can google and choose the one best suits the debugging requirements. For example, Microsoft Visual Studio and Windgb tools are options on Windows.
-
-## Debugging HIP Applications
-
-Below is an example on Linux to show how to get useful information from the debugger while running a simple hip application, which caused an issue of segmentation fault.
-
-Simple HIP Program:
-
-```cpp
-#include <hip/hip_runtime.h>
-#include <iostream>
-#include <vector>
-
-__global__ void kernel_add(int* a, int b) {
-  int i = threadIdx.x;
-  a[i] += b;
-}
-
-int main() {
-  constexpr size_t size = 1024;
-  int* ptr;
-  hipMalloc(&ptr, sizeof(int) * size);
-  hipMemset(ptr, 0, sizeof(int) * size);
-  std::vector<int> input(size, 0);
-  size_t i = 100;
-  std::for_each(input.begin(), input.end(), [&](int& a) { a = i; });
-  hipMemcpy(ptr, input.data(), sizeof(int) * size, hipMemcpyHostToDevice);
-  kernel_add<<<1, size>>>(ptr, 10);
-  std::vector<int> output = input;
-  hipMemcpy(output.data(), ptr, sizeof(int) * size, hipMemcpyDeviceToHost);
-  std::cout << ((std::all_of(output.begin(), output.end(), [&](int a) { return a == (i + 10); }))
-                    ? "passed"
-                    : "failed")
-            << std::endl;
-  hipFree(ptr);
-}
-```
-
-Compile and run command:
-
-```console
-hipcc app.cpp -ggdb -o app
-rocgdb ./app
-```
-
-```console
-(gdb) b main
-Breakpoint 1 at 0x21275e: file app.cpp, line 14.
-
-(gdb) run
-Starting program: /home/<your_awesome_name>/app
-warning: os_agent_id 31475: `Device 1002:164e' architecture not supported.
-[Thread debugging using libthread_db enabled]
-Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
-
-Breakpoint 1, hipMalloc<int> (devPtr=0x7fffffffe098, size=4096) at /opt/rocm/include/hip/hip_runtime_api.h:8487
-8487        return hipMalloc((void**)devPtr, size);
-
-(gdb) bt
-#0  hipMalloc<int> (devPtr=0x7fffffffe098, size=4096) at /opt/rocm/include/hip/hip_runtime_api.h:8487
-#1  main () at app.cpp:14
-
-(gdb) n
-[New Thread 0x7fffeb7ff640 (LWP 1524879)]
-[New Thread 0x7fffeaffe640 (LWP 1524880)]
-[Thread 0x7fffeaffe640 (LWP 1524880) exited]
-main () at app.cpp:15
-15        hipMemset(ptr, 0, sizeof(int) * size);
-
-(gdb) info threads
-  Id   Target Id                                 Frame
-* 1    Thread 0x7ffff7e6ba80 (LWP 1524135) "app" main () at app.cpp:15
-  2    Thread 0x7fffeb7ff640 (LWP 1524879) "app" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
-
-(gdb) thread 2
-[Switching to thread 2 (Thread 0x7fffeb7ff640 (LWP 1524879))]
-#0  __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
-36      ../sysdeps/unix/sysv/linux/ioctl.c: No such file or directory.
-
-(gdb) bt
-#0  __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
-#1  0x00007fffeb8fda80 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
-#2  0x00007fffeb8f6912 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
-#3  0x00007fffeb883021 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
-#4  0x00007fffeb85e026 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
-#5  0x00007fffeb874b6a in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
-#6  0x00007fffeb828fdb in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
-#7  0x00007ffff5c94b43 in start_thread (arg=<optimised out>) at ./nptl/pthread_create.c:442
-#8  0x00007ffff5d26a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
-...
-```
-
-A complete guide to `rocgdb` can be found [here](https://rocm.docs.amd.com/projects/ROCgdb/en/latest/).
-
-On Windows, debugging HIP applications on IDE like Microsoft Visual Studio tools, are more informative and visible to debug codes, inspect  variables, watch multiple details and examine the call stacks.
-
-## Useful Environment Variables
-
-HIP provides some environment variables which allow HIP, hip-clang, or HSA driver on Linux to disable some feature or optimization.
-These are not intended for production but can be useful diagnose synchronization problems in the application (or driver).
-
-Some of the most useful environment variables are described here. They are supported on the ROCm path on Linux and Windows as well.
-
-### Kernel Enqueue Serialization
-Developers can control kernel command serialization from the host using the environment variable,
-
-AMD_SERIALIZE_KERNEL, for serializing kernel enqueue.
- AMD_SERIALIZE_KERNEL = 1, Wait for completion before enqueue,
- AMD_SERIALIZE_KERNEL = 2, Wait for completion after enqueue,
- AMD_SERIALIZE_KERNEL = 3, Both.
-
-Or
-AMD_SERIALIZE_COPY, for serializing copies.
-
- AMD_SERIALIZE_COPY = 1, Wait for completion before enqueue,
- AMD_SERIALIZE_COPY = 2, Wait for completion after enqueue,
- AMD_SERIALIZE_COPY = 3, Both.
-
-So HIP runtime can wait for GPU idle before/after any GPU command depending on the environment setting.
-
-### Making Device visible
-For system with multiple devices, it's possible to make only certain device(s) visible to HIP via setting environment variable,
-HIP_VISIBLE_DEVICES(or CUDA_VISIBLE_DEVICES on Nvidia platform), only devices whose index is present in the sequence are visible to HIP.
-
-For example,
-```console
-$ HIP_VISIBLE_DEVICES=0,1
-```
-
-or in the application,
-```cpp
-if (totalDeviceNum > 2) {
-  setenv("HIP_VISIBLE_DEVICES", "0,1,2", 1);
-  assert(getDeviceNumber(false) == 3);
-  ... ...
-}
-```
-
-### Dump code object
-Developers can dump code object to analyze compiler related issues via setting environment variable,
-GPU_DUMP_CODE_OBJECT
-
-### HSA related environment variables on Linux
-On Linux with open source, HSA provides some environment variables help to analyze issues in driver or hardware, for example,
-
-HSA_ENABLE_SDMA=0
-It causes host-to-device and device-to-host copies to use compute shader blit kernels rather than the dedicated DMA copy engines.
-Compute shader copies have low latency (typically < 5us) and can achieve approximately 80% of the bandwidth of the DMA copy engine.
-This environment variable is useful to isolate issues with the hardware copy engines.
-
-HSA_ENABLE_INTERRUPT=0
-Causes completion signals to be detected with memory-based polling rather than interrupts.
-This environment variable can be useful to diagnose interrupt storm issues in the driver.
-
-### Summary of environment variables in HIP
-
-The following is the summary of the most useful environment variables in HIP.
-
-| **Environment variable**                                                                                       | **Default value** | **Usage** |
-| ---------------------------------------------------------------------------------------------------------------| ----------------- | --------- |
-| AMD_LOG_LEVEL <br><sub> Enable HIP log on different Level. </sub> |  0  | 0: Disable log. <br> 1: Enable log on error level. <br> 2: Enable log on warning and below levels. <br> 0x3: Enable log on information and below levels. <br> 0x4: Decode and display AQL packets. |
-| AMD_LOG_MASK <br><sub> Enable HIP log on different Level. </sub> |  0x7FFFFFFF  | 0x1: Log API calls. <br> 0x02: Kernel and Copy Commands and Barriers. <br> 0x4: Synchronization and waiting for commands to finish. <br> 0x8: Enable log on information and below levels. <br> 0x20: Queue commands and queue contents. <br> 0x40:Signal creation, allocation, pool. <br> 0x80: Locks and thread-safety code. <br> 0x100: Copy debug. <br> 0x200: Detailed copy debug. <br> 0x400: Resource allocation, performance-impacting events. <br> 0x800: Initialization and shutdown. <br> 0x1000: Misc debug, not yet classified. <br> 0x2000: Show raw bytes of AQL packet. <br> 0x4000: Show code creation debug. <br> 0x8000: More detailed command info, including barrier commands. <br> 0x10000: Log message location. <br> 0xFFFFFFFF: Log always even mask flag is zero. |
-| HIP_VISIBLE_DEVICES(or CUDA_VISIBLE_DEVICES) <br><sub> Only devices whose index is present in the sequence are visible to HIP. </sub> |   | 0,1,2: Depending on the number of devices on the system.  |
-| GPU_DUMP_CODE_OBJECT <br><sub> Dump code object. </sub> |  0  | 0: Disable. <br> 1: Enable. |
-| AMD_SERIALIZE_KERNEL <br><sub> Serialize kernel enqueue. </sub> |  0  | 1: Wait for completion before enqueue. <br> 2: Wait for completion after enqueue. <br> 3: Both. |
-| AMD_SERIALIZE_COPY <br><sub> Serialize copies. </sub> |  0  | 1: Wait for completion before enqueue. <br> 2: Wait for completion after enqueue. <br> 3: Both. |
-| HIP_HOST_COHERENT <br><sub> Coherent memory in hipHostMalloc. </sub> |  0  |  0: memory is not coherent between host and GPU. <br> 1: memory is coherent with host. |
-| AMD_DIRECT_DISPATCH <br><sub> Enable direct kernel dispatch (Currently for Linux, under development on Windows). </sub> | 1  | 0: Disable. <br> 1: Enable. |
-| GPU_MAX_HW_QUEUES <br><sub> The maximum number of hardware queues allocated per device. </sub> | 4  | The variable controls how many independent hardware queues HIP runtime can create per process, per device. If application allocates more HIP streams than this number, then HIP runtime will reuse the same hardware queues for the new streams in round robin manner. Please note, this maximum number does not apply to either hardware queues that are created for CU masked HIP streams, or cooperative queue for HIP Cooperative Groups (there is only one single queue per device). |
-| HIP_LAUNCH_BLOCKING <br><sub> Used for serialization on kernel execution. </sub> | 0 | 0: Disable. Kernel executes normally. <br> 1: Enable. Serializes kernel enqueue, behaves the same as AMD_SERIALIZE_KERNEL. |
-
-## General Debugging Tips
-- 'gdb --args' can be used to conveniently pass the executable and arguments to gdb.
-- From inside GDB on Linux, you can set environment variables "set env".  Note the command does not use an '=' sign:
-
-```
-(gdb) set env AMD_SERIALIZE_KERNEL 3
-```
-- The fault will be caught by the runtime but was actually generated by an asynchronous command running on the GPU. So, the GDB backtrace will show a path in the runtime.
-- To determine the true location of the fault, force the kernels to execute synchronously by seeing the environment variables AMD_SERIALIZE_KERNEL=3 AMD_SERIALIZE_COPY=3.  This will force HIP runtime to wait for the kernel to finish executing before retuning.  If the fault occurs during the execution of a kernel, you can see the code which launched the kernel inside the backtrace.  A bit of guesswork is required to determine which thread is actually causing the issue - typically it will the thread which is waiting inside the `libhsa-runtime64.so`.
-- VM faults inside kernels can be caused by:
-   - incorrect code (ie a for loop which extends past array boundaries),
-   - memory issues  - kernel arguments which are invalid (null pointers, unregistered host pointers, bad pointers),
-   - synchronization issues,
-   - compiler issues (incorrect code generation from the compiler),
-   - runtime issues.
-
diff --git a/docs/how_to_guides/install.md b/docs/how_to_guides/install.md
deleted file mode 100644
index dc496c02a8..0000000000
--- a/docs/how_to_guides/install.md
+++ /dev/null
@@ -1,53 +0,0 @@
-# Installing HIP
-
-HIP can be installed either on AMD ROCm platform with HIP-Clang compiler, or a CUDA platform with nvcc installed.
-
-Note: The version definition for the HIP runtime is different from CUDA. Users can use hipRuntimeGerVersion function, on the AMD platform it returns the HIP runtime version, while on the NVIDIA platform, it returns the CUDA runtime version. There is no mapping or correlation between the HIP version and CUDA version.
-
-## Prerequisites
-On AMD platform, see Prerequisite Actions in ROCm_Installation_Guide (https://docs.amd.com/) for the release.
-On NVIDIA platform, check system requirements in NVIDIA CUDA Installation Guide at https://docs.nvidia.com/cuda/cuda-installation-guide-linux/.
-
-
-## AMD Platform
-
-HIP is part of ROCM packages, it will be automatically installed following the ROCm Installation Guide on AMD public documentation site (https://docs.amd.com/) for the coresponding ROCm release.
-
-By default, HIP is installed into /opt/rocm/hip.
-
-
-## NVIDIA Platform
-
-
-* Install Nvidia Driver
-```
-sudo apt-get install ubuntu-drivers-common && sudo ubuntu-drivers autoinstall
-sudo reboot
-```
-Or download the latest cuda-toolkit at https://developer.nvidia.com/cuda-downloads
-The driver will be installed automatically.
-
-* Add the ROCm package server to your system as per the OS-specific guide in ROCm Installation Guide (https://docs.amd.com/) for the release.
-* Install the "hip-runtime-nvidia" and "hip-dev" package. This will install CUDA SDK and the HIP porting layer.
-```
-apt-get install hip-runtime-nvidia hip-dev
-```
-* Default paths:
-   * By default HIP looks for CUDA SDK in /usr/local/cuda.
-   * By default HIP is installed into /opt/rocm/hip.
-   * Optionally, consider adding /opt/rocm/bin to your path to make it easier to use the tools.
-
-
-# Verify your installation
-
-Run hipconfig (instructions below assume default installation path):
-```
-/opt/rocm/bin/hipconfig --full
-```
-
-# How to build HIP from source
-
-Developers can build HIP from source on either AMD or NVIDIA platforms, see
-detailed instructions at  [building HIP from source](../developer_guide/build.md).
-
-
diff --git a/docs/index.md b/docs/index.md
index bcc9d7fdb9..0fe79b64f2 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,4 +1,4 @@
-# HIP Documentation
+# HIP documentation
 
 HIP is a C++ runtime API and kernel language that allows developers to create
 portable applications for AMD and NVIDIA GPUs from single source code.
@@ -8,30 +8,58 @@ portable applications for AMD and NVIDIA GPUs from single source code.
 ::::{grid} 1 1 2 2
 :gutter: 1
 
-:::{grid-item-card} User Guide
-- {doc}`/user_guide/programming_manual`
-- {doc}`/user_guide/hip_rtc`
-- {doc}`/user_guide/faq`
+:::{grid-item-card} Install
+
+* [Install HIP](./install/install.rst)
+* [Build HIP from source](./install/build.rst)
+
+:::
+
+:::{grid-item-card} Reference
+
+* {doc}`/doxygen/html/index`
+* [Deprecated APIs](./reference/deprecated_api_list.rst)
+
 :::
 
-:::{grid-item-card} How to Guides
-- {doc}`/how_to_guides/install`
-- {doc}`/how_to_guides/debugging`
+:::{grid-item-card} How to
+
+* [Debug with HIP](./how-to/debugging.rst)
+* [Generate logs](./how-to/logging.rst)
+
 :::
 
+:::{grid-item-card} Conceptual
+
+:::
+
+::::
+
+## Legacy documentation
+
+These documents have not yet been ported over to the Diátaxis framework.
+
+::::{grid} 1 1 2 2
+:gutter: 1
+
 :::{grid-item-card} Reference
-- {doc}`/doxygen/html/index`
-  - {doc}`/doxygen/html/modules`
-- {doc}`/reference/kernel_language`
-- {doc}`/reference/math_api`
-- {doc}`/reference/terms`
-- {doc}`/reference/deprecated_api_list`
+
+* [C++ kernel language](./old/reference/kernel_language.rst)
+* [Table Comparing Syntax for Different Compute APIs](./old/reference/terms.md)
+
 :::
 
-:::{grid-item-card} Developer Guide
-- {doc}`/developer_guide/build`
-- {doc}`/developer_guide/logging`
-- {doc}`/developer_guide/contributing`
+:::{grid-item-card} User Guide
+
+* [HIP Porting Guide](./old/user_guide/hip_porting_guide.md)
+* [HIP Porting Driver API Guide](./old/user_guide/hip_porting_driver_api.md)
+
 :::
 
 ::::
+
+We welcome collaboration! If you’d like to contribute to our documentation, you can find instructions
+on our {doc}`Contribute to ROCm docs <rocm:contribute/index>` page. Known issues are listed on
+[GitHub](https://github.com/RadeonOpenCompute/ROCm/labels/Verified%20Issue).
+
+If you want to contribute to the HIP project, refer to our [Contributor guidelines](./about/contributing.md).
diff --git a/docs/install/build.rst b/docs/install/build.rst
new file mode 100644
index 0000000000..ebd1b84a8d
--- /dev/null
+++ b/docs/install/build.rst
@@ -0,0 +1,256 @@
+*******************************************
+Build HIP from source
+*******************************************
+
+Building the HIP runtime
+==========================================================
+
+Set the repository branch using the variable: ``ROCM_BRANCH``. For example, for ROCm 6.1, use:
+
+.. code:: shell
+
+   export ROCM_BRANCH=rocm-6.1.x
+
+.. tab-set::
+
+   .. tab-item:: AMD
+      :sync: amd
+
+      #. Get HIP source code.
+
+         .. note::
+            Starting in ROCm 5.6, CLR is a new repository that includes the former ROCclr, HIPAMD and
+            OpenCL repositories. OpenCL provides headers that the ROCclr runtime depends on.
+
+         .. note::
+            Starting in ROCm 6.1, a new repository ``hipother`` is added to ROCm, which is branched out from HIP.
+            ``hipother`` provides files required to support the HIP back-end implementation on some non-AMD platforms,
+            like NVIDIA.
+
+         .. code:: shell
+
+            git clone -b "$ROCM_BRANCH" https://github.com/ROCm/clr.git
+            git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip.git
+            git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hipother.git
+            git clone -b "$ROCM_BRANCH" https://github.com/ROCm/HIPCC.git
+
+         CLR (Compute Language Runtimes) includes ROCclr, a virtual device interface that various AMD runtimes interact with.
+         HIPAMD provides the implementation specifically for HIP on the AMD platform.
+         OpenCL provides headers that the ROCclr runtime currently depends on.
+         hipother provides headers and the implementation specifically for non-AMD HIP platforms, like NVIDIA.
+
+      #. Set the environment variables.
+
+         .. code:: shell
+
+            export CLR_DIR="$(readlink -f clr)"
+            export HIP_DIR="$(readlink -f hip)"
+            export HIP_OTHER="$(readlink -f hipother)"
+            export HIPCC_DIR="$(readlink -f HIPCC)"
+
+      #. Build the HIPCC compiler driver.
+
+         .. code:: shell
+
+            cd "$HIPCC_DIR"
+            mkdir -p build; cd build
+            cmake ..
+            make -j4
+
+      #. Build HIP.
+
+         .. code:: shell
+
+            cd "$CLR_DIR"
+            mkdir -p build; cd build
+            cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=amd -DCMAKE_PREFIX_PATH="/opt/rocm/" -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIPCC_BIN_DIR=$HIPCC_DIR/build -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF ..
+
+            make -j$(nproc)
+            sudo make install
+
+         .. note::
+
+            If you don't specify ``CMAKE_INSTALL_PREFIX``, the HIP runtime is installed at
+            ``<ROCM_PATH>/hip``. The default version of HIP is the latest release.
+
+         Default paths and environment variables:
+
+            * HIP is installed into ``<ROCM_PATH>/hip``. This can be overridden by setting the ``HIP_PATH``
+              environment variable.
+            * HSA is in ``<ROCM_PATH>/hsa``. This can be overridden by setting the ``HSA_PATH``
+              environment variable.
+            * Clang is in ``<ROCM_PATH>/llvm/bin``. This can be overridden by setting the
+              ``HIP_CLANG_PATH`` environment variable.
+            * The device library is in ``<ROCM_PATH>/lib``. This can be overridden by setting the
+              ``DEVICE_LIB_PATH`` environment variable.
+            * Optionally, you can add ``<ROCM_PATH>/bin`` to your ``PATH``, which can make it easier to
+              use the tools.
+            * Optionally, you can set ``HIPCC_VERBOSE=7`` to output the command line for compilation.
+
+         After you run the ``make install`` command, make sure ``HIP_PATH`` points to ``$PWD/install/hip``.
+
+         #. Generate a profiling header after adding/changing a HIP API.
+
+            When you add or change a HIP API, you may need to generate a new ``hip_prof_str.h`` header.
+            This header is used by ROCm tools, such as ``rocprofiler`` and ``roctracer``, to track HIP APIs.
+
+            To generate the header after your change, use the ``hip_prof_gen.py`` tool located in
+            ``hipamd/src``.
+
+            Usage:
+
+            .. code:: shell
+
+               hip_prof_gen.py [-v] <input HIP API .h file> <patched srcs path> <previous output> [<output>]
+
+            Flags:
+
+               * ``-v``: Verbose messages
+               * ``-r``: Process source directory recursively
+               * ``-t``: API types matching check
+               * ``--priv``: Private API check
+               * ``-e``: On error exit mode
+               * ``-p``: ``HIP_INIT_API`` macro patching mode
+
+            Example usage:
+
+            .. code:: shell
+
+               hip_prof_gen.py -v -p -t --priv <hip>/include/hip/hip_runtime_api.h \
+               <hipamd>/src <hipamd>/include/hip/amd_detail/hip_prof_str.h \
+               <hipamd>/include/hip/amd_detail/hip_prof_str.h.new
+
+   .. tab-item:: NVIDIA
+      :sync: nvidia
+
+      #. Get the HIP source code.
+
+         .. code:: shell
+
+            git clone -b "$ROCM_BRANCH" https://github.com/ROCm/clr.git
+            git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip.git
+            git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hipother.git
+            git clone -b "$ROCM_BRANCH" https://github.com/ROCm/HIPCC.git
+
+      #. Set the environment variables.
+
+         .. code:: shell
+
+            export CLR_DIR="$(readlink -f clr)"
+            export HIP_DIR="$(readlink -f hip)"
+            export HIP_OTHER="$(readlink -f hipother)"
+            export HIPCC_DIR="$(readlink -f HIPCC)"
+
+      #. Build the HIPCC compiler driver.
+
+         .. code:: shell
+
+            cd "$HIPCC_DIR"
+            mkdir -p build; cd build
+            cmake ..
+            make -j4
+
+      #. Build HIP.
+
+         .. code:: shell
+
+            cd "$CLR_DIR"
+            mkdir -p build; cd build
+            cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=nvidia -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIPCC_BIN_DIR=$HIPCC_DIR/build -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF ..
+            make -j$(nproc)
+            sudo make install
+
+Build HIP tests
+=================================================
+
+.. tab-set::
+
+   .. tab-item:: AMD
+      :sync: amd
+
+      * Build HIP directed tests.
+
+         .. code:: shell
+
+            sudo make install
+            make -j$(nproc) build_tests
+
+         By default, all HIP directed tests are built and generated in
+         ``$CLR_DIR/build/hipamd/directed_tests``.
+
+         * Run all HIP ``directed_tests``.
+
+            .. code:: shell
+
+               ctest
+
+            or
+
+            .. code:: shell
+
+               make test
+
+
+         * Build and run a single directed test.
+
+            .. code:: shell
+
+               make directed_tests.texture.hipTexObjPitch
+               cd $CLR_DIR/build/hipamd/directed_tests/texture
+               ./hipTexObjPitch
+
+         .. note::
+            The integrated HIP directed tests will be deprecated in a future release.
+
+      * Build HIP catch tests.
+
+         HIP catch tests are separate from the HIP project and use Catch2.
+
+         * Get HIP tests source code.
+
+            .. code:: shell
+
+               git clone -b "$ROCM_BRANCH" https://github.com/ROCm-Developer-Tools/hip-tests.git
+
+         * Build HIP tests from source.
+
+            .. code:: shell
+
+               export HIPTESTS_DIR="$(readlink -f hip-tests)"
+               cd "$HIPTESTS_DIR"
+               mkdir -p build; cd build
+               export HIP_PATH=$CLR_DIR/build/install  # or any path where HIP is installed, for example /opt/rocm
+               cmake ../catch/ -DHIP_PLATFORM=amd
+               make -j$(nproc) build_tests
+               ctest # run tests
+
+            HIP catch tests are built in ``$HIPTESTS_DIR/build``.
+
+            To run any single catch test, use this example:
+
+            .. code:: shell
+
+               cd $HIPTESTS_DIR/build/catch_tests/unit/texture
+               ./TextureTest
+
+         * Build a HIP Catch2 standalone test.
+
+            .. code:: shell
+
+               cd "$HIPTESTS_DIR"
+               hipcc $HIPTESTS_DIR/catch/unit/memory/hipPointerGetAttributes.cc \
+               -I ./catch/include ./catch/hipTestMain/standalone_main.cc \
+               -I ./catch/external/Catch2 -o hipPointerGetAttributes
+               ./hipPointerGetAttributes
+               ...
+
+               All tests passed
+
+   .. tab-item:: NVIDIA
+      :sync: nvidia
+
+      The commands to build HIP tests on an NVIDIA platform are the same as on an AMD platform.
+      However, you must first set ``-DHIP_PLATFORM=nvidia``.
+
+      * Run HIP. Compile and run the
+        `square sample <https://github.com/ROCm-Developer-Tools/hip-tests/tree/rocm-5.5.x/samples/0_Intro/square>`_.
diff --git a/docs/install/install.rst b/docs/install/install.rst
new file mode 100644
index 0000000000..44b0962693
--- /dev/null
+++ b/docs/install/install.rst
@@ -0,0 +1,79 @@
+*******************************************
+Install HIP
+*******************************************
+
+HIP can be installed on AMD (ROCm with HIP-Clang) and NVIDIA (CUDA with nvcc) platforms.
+
+Note: The version definition for the HIP runtime is different from CUDA. On an AMD platform, the
+``hipRuntimeGetVersion`` function returns the HIP runtime version; on an NVIDIA platform, this function
+returns the CUDA runtime version.
+
+Prerequisites
+=======================================
+
+.. tab-set::
+
+   .. tab-item:: AMD
+      :sync: amd
+
+      Refer to the Prerequisites section in the ROCm install guides:
+
+         * :doc:`rocm:/install/linux/install`
+         * :doc:`rocm:/install/windows/install`
+
+   .. tab-item:: NVIDIA
+      :sync: nvidia
+
+      Check the system requirements in the
+      `NVIDIA CUDA Installation Guide <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/>`_.
+
+Installation
+=======================================
+
+.. tab-set::
+
+   .. tab-item:: AMD
+      :sync: amd
+
+      HIP is automatically installed during the ROCm installation. If you haven't yet installed ROCm, you
+      can find installation instructions here:
+
+         * :doc:`rocm:/install/linux/install`
+         * :doc:`rocm:/install/windows/install`
+
+      By default, HIP is installed into ``/opt/rocm/hip``.
+
+   .. tab-item:: NVIDIA
+      :sync: nvidia
+
+      #. Install the NVIDIA driver.
+
+         .. code:: shell
+
+            sudo apt-get install ubuntu-drivers-common && sudo ubuntu-drivers autoinstall
+            sudo reboot
+
+         Alternatively, you can download the latest
+         `CUDA Toolkit <https://developer.nvidia.com/cuda-downloads>`_.
+
+      #. Install the ``hip-runtime-nvidia`` and ``hip-dev`` packages. This installs the CUDA SDK and HIP
+         porting layer.
+
+         .. code:: shell
+
+            sudo apt-get install hip-runtime-nvidia hip-dev
+
+         The default paths are:
+
+         * CUDA SDK: ``/usr/local/cuda``
+         * HIP: ``/opt/rocm/hip``
+
+         You can optionally add ``/opt/rocm/bin`` to your path, which can make it easier to use the tools.
+
+Verify your installation
+==========================================================
+
+Run ``hipconfig`` in your installation path.
+
+.. code:: shell
+
+   /opt/rocm/bin/hipconfig --full
diff --git a/docs/reference/glossary.md b/docs/old/glossary.md
similarity index 100%
rename from docs/reference/glossary.md
rename to docs/old/glossary.md
diff --git a/docs/old/reference/kernel_language.rst b/docs/old/reference/kernel_language.rst
new file mode 100644
index 0000000000..f6d3c4ce63
--- /dev/null
+++ b/docs/old/reference/kernel_language.rst
@@ -0,0 +1,1117 @@
+.. meta::
+  :description: This chapter describes the built-in variables and functions that are accessible from the
+                HIP kernel. It's intended for users who are familiar with CUDA kernel syntax and want to
+                learn how HIP differs from CUDA.
+  :keywords: AMD, ROCm, HIP, CUDA, c++ language extensions, HIP functions
+
+********************************************************************************
+C++ Language Extensions
+********************************************************************************
+
+HIP provides a C++ syntax that is suitable for compiling most code that commonly appears in
+compute kernels (classes, namespaces, operator overloading, and templates). HIP also defines other
+language features that are designed to target accelerators, such as:
+
+* A kernel-launch syntax that uses standard C++ (this resembles a function call and is portable to all
+  HIP targets)
+* Short-vector headers that can serve on a host or device
+* Math functions that resemble those in ``math.h``, which is included with standard C++ compilers
+* Built-in functions for accessing specific GPU hardware capabilities
+
+.. note::
+
+  This chapter describes the built-in variables and functions that are accessible from the HIP kernel. It's
+  intended for users who are familiar with CUDA kernel syntax and want to learn how HIP differs from
+  CUDA.
+
+Features are labeled with one of the following keywords:
+
+* **Supported**: HIP supports the feature with a CUDA-equivalent function
+* **Not supported**: HIP does not support the feature
+* **Under development**: The feature is under development and not yet available
+
+Function-type qualifiers
+========================================================
+
+``__device__``
+-----------------------------------------------------------------------
+
+Supported ``__device__`` functions are:
+
+  * Run on the device
+  * Called from the device only
+
+You can combine ``__device__`` with the host keyword (:ref:`host_attr`).
+
+``__global__``
+-----------------------------------------------------------------------
+
+Supported ``__global__`` functions are:
+
+  * Run on the device
+  * Called (launched) from the host
+
+HIP ``__global__`` functions must have a ``void`` return type. The first parameter in a HIP ``__global__``
+function must have the type ``hipLaunchParm``. Refer to :ref:`kernel-launch-example` to see usage.
+
+HIP doesn't support dynamic parallelism, which means that you can't call ``__global__`` functions from
+the device.
+
+.. _host_attr:
+
+``__host__``
+-----------------------------------------------------------------------
+
+Supported ``__host__`` functions are:
+
+  * Run on the host
+  * Called from the host
+
+You can combine ``__host__`` with ``__device__``; in this case, the function compiles for the host and the
+device. Note that these functions can't use the HIP grid coordinate functions (e.g., ``threadIdx.x``). If
+you need to use HIP grid coordinate functions, you can pass the necessary coordinate information as
+an argument.
+
+You can't combine ``__host__`` with ``__global__``.
+
+HIP parses the ``__noinline__`` and ``__forceinline__`` keywords and converts them into the appropriate
+Clang attributes.
+
+Calling ``__global__`` functions
+=============================================================
+
+``__global__`` functions are often referred to as *kernels*. When you call a global function, you're
+*launching a kernel*. When launching a kernel, you must specify a run configuration that includes the
+grid and block dimensions. The run configuration can also include other information for the launch,
+such as the amount of additional shared memory to allocate and the stream where you want to run the
+kernel.
+
+HIP introduces a standard C++ calling convention (``hipLaunchKernelGGL``) to pass the run
+configuration to the kernel. However, you can also use the CUDA ``<<< >>>`` syntax.
+
+When using ``hipLaunchKernelGGL``, your first five parameters must be:
+
+  * **symbol kernelName**: The name of the kernel you want to launch. To support template kernels
+    that contain ``","``, use the ``HIP_KERNEL_NAME`` macro (HIPIFY tools insert this automatically).
+  * **dim3 gridDim**: 3D-grid dimensions that specify the number of blocks to launch.
+  * **dim3 blockDim**: 3D-block dimensions that specify the number of threads in each block.
+  * **size_t dynamicShared**: The amount of additional shared memory that you want to allocate
+    when launching the kernel (see :ref:`shared-variable-type`).
+  * **hipStream_t**: The stream where you want to run the kernel. A value of ``0`` corresponds to the
+    NULL stream (see :ref:`synchronization-functions`).
+
+You can include your kernel arguments after these parameters.
+
+.. code:: cpp
+
+  // Example hipLaunchKernelGGL pseudocode:
+  __global__ void MyKernel(hipLaunchParm lp, float *A, float *B, float *C, size_t N)
+  {
+  ...
+  }
+
+  MyKernel<<<dim3(gridDim), dim3(groupDim), 0, 0>>> (a,b,c,n);
+
+  // Alternatively, you can launch the kernel using:
+  // hipLaunchKernelGGL(MyKernel, dim3(gridDim), dim3(groupDim), 0/*dynamicShared*/, 0/*stream*/, a, b, c, n);
+
+You can use HIPIFY tools to convert CUDA launch syntax to ``hipLaunchKernelGGL``. This includes the
+conversion of optional ``<<< >>>`` arguments into the five required ``hipLaunchKernelGGL``
+parameters.
+
+.. note::
+
+  HIP doesn't support dimension sizes of :math:`gridDim * blockDim \ge 2^{32}` when launching a kernel.
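+
+As a rough sketch of how a launch configuration can be sized with this limit in mind, the following
+host-only C++ helpers round the block count up so that every element is covered, then check the
+product against :math:`2^{32}`. The names ``blocksFor`` and ``fitsLaunchLimit`` are hypothetical
+helpers used for illustration; they aren't part of the HIP API.
+
+.. code:: cpp
+
+  #include <cstdint>
+
+  // Hypothetical helper: number of blocks needed to cover n elements (rounds up).
+  constexpr std::uint64_t blocksFor(std::uint64_t n, std::uint64_t blockSize)
+  {
+    return (n + blockSize - 1) / blockSize;
+  }
+
+  // Hypothetical helper: true if gridDim * blockDim stays below 2^32 in one dimension.
+  constexpr bool fitsLaunchLimit(std::uint64_t gridDim, std::uint64_t blockDim)
+  {
+    return gridDim * blockDim < (std::uint64_t{1} << 32);
+  }
+
+  // 1,000,000 elements with 256 threads per block need 3907 blocks.
+  static_assert(blocksFor(1000000, 256) == 3907, "block count rounds up");
+  static_assert(fitsLaunchLimit(3907, 256), "within the launch limit");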
+
+.. _kernel-launch-example:
+
+Kernel launch example
+==========================================================
+
+.. code:: cpp
+
+  // Example showing a device function
+  __device__ __host__ // <- compile for both device and host
+  float PlusOne(float x)
+  {
+    return x + 1.0f;
+  }
+
+  __global__
+  void
+  MyKernel (hipLaunchParm lp, /*lp parm for execution configuration */
+            const float *a, const float *b, float *c, unsigned N)
+  {
+    unsigned gid = blockIdx.x * blockDim.x + threadIdx.x; // <- coordinate index functions
+    if (gid < N) {
+      c[gid] = a[gid] + PlusOne(b[gid]);
+    }
+  }
+  void callMyKernel()
+  {
+    float *a, *b, *c; // initialization not shown...
+    unsigned N = 1000000;
+    const unsigned blockSize = 256;
+
+    MyKernel<<<dim3((N + blockSize - 1) / blockSize), dim3(blockSize), 0, 0>>> (a,b,c,N);
+    // Alternatively, kernel can be launched by
+    // hipLaunchKernelGGL(MyKernel, dim3((N + blockSize - 1) / blockSize), dim3(blockSize), 0, 0,  a,b,c,N);
+  }
+
+Variable type qualifiers
+========================================================
+
+``__constant__``
+-----------------------------------------------------------------------------
+
+The host writes constant memory before launching the kernel. This memory is read-only from the GPU
+while the kernel is running. The functions for accessing constant memory are:
+
+* ``hipGetSymbolAddress()``
+* ``hipGetSymbolSize()``
+* ``hipMemcpyToSymbol()``
+* ``hipMemcpyToSymbolAsync()``
+* ``hipMemcpyFromSymbol()``
+* ``hipMemcpyFromSymbolAsync()``
+
+.. _shared-variable-type:
+
+``__shared__``
+-----------------------------------------------------------------------------
+
+To allow the host to dynamically allocate shared memory, declare the array as ``extern __shared__`` in
+the kernel and specify its size as a launch parameter.
+
+.. note::
+  Prior to the HIP-Clang compiler, dynamic shared memory had to be declared using the
+  ``HIP_DYNAMIC_SHARED`` macro in order to ensure accuracy. This is because using static shared
+  memory in the same kernel could've resulted in overlapping memory ranges and data-races. The
+  HIP-Clang compiler provides support for ``extern`` shared declarations, so ``HIP_DYNAMIC_SHARED``
+  is no longer required.
+
+``__managed__``
+-----------------------------------------------------------------------------
+
+Managed memory, including the ``__managed__`` keyword, is supported in HIP combined host/device
+compilation.
+
+``__restrict__``
+-----------------------------------------------------------------------------
+
+``__restrict__`` tells the compiler that the associated memory pointer does not alias with any other pointer
+in the kernel or function. This can help the compiler generate better code. In most use cases, every
+pointer argument should use this keyword in order to achieve the benefit.
+
+Built-in variables
+====================================================
+
+Coordinate built-ins
+-----------------------------------------------------------------------------
+
+The kernel uses coordinate built-ins (``thread*``, ``block*``, ``grid*``) to determine the coordinate index
+and bounds for the active work item.
+
+Built-ins are defined in ``amd_hip_runtime.h``, rather than being implicitly defined by the compiler.
+
+Coordinate variable definitions for built-ins are the same for HIP and CUDA. For example: ``threadIdx.x``,
+``blockIdx.y``, and ``gridDim.y``. The products ``gridDim.x * blockDim.x``, ``gridDim.y * blockDim.y``, and
+``gridDim.z * blockDim.z`` are always less than ``2^32``.
+
+Coordinate built-ins are implemented as structures for improved performance. When used with
+``printf``, they must be explicitly cast to integer types.
+
+warpSize
+-----------------------------------------------------------------------------
+The ``warpSize`` variable type is ``int``. It contains the warp size (in threads) for the target device.
+Use ``warpSize`` only in device functions, where it helps you write portable wave-aware code.
+
+.. note::
+  NVIDIA devices return 32 for this variable; AMD devices return 64 for gfx9 and 32 for gfx10
+  and above.
+
+Vector types
+====================================================
+
+The following vector types are defined in ``hip_runtime.h``. They are not automatically provided by the
+compiler.
+
+Short vector types
+--------------------------------------------------------------------------------------------
+
+Short vector types derive from basic integer and floating-point types. These structures are defined in
+``hip_vector_types.h``. The first, second, third, and fourth components of the vector are defined by the
+``x``, ``y``, ``z``, and ``w`` fields, respectively. All short vector types support a constructor function of the
+form ``make_<type_name>()``. For example, ``float4 make_float4(float x, float y, float z, float w)`` creates
+a vector with type ``float4`` and value ``(x,y,z,w)``.
+
+HIP supports the following short vector formats:
+
+* Signed Integers:
+
+  * ``char1``, ``char2``, ``char3``, ``char4``
+  * ``short1``, ``short2``, ``short3``, ``short4``
+  * ``int1``, ``int2``, ``int3``, ``int4``
+  * ``long1``, ``long2``, ``long3``, ``long4``
+  * ``longlong1``, ``longlong2``, ``longlong3``, ``longlong4``
+
+* Unsigned Integers:
+
+  * ``uchar1``, ``uchar2``, ``uchar3``, ``uchar4``
+  * ``ushort1``, ``ushort2``, ``ushort3``, ``ushort4``
+  * ``uint1``, ``uint2``, ``uint3``, ``uint4``
+  * ``ulong1``, ``ulong2``, ``ulong3``, ``ulong4``
+  * ``ulonglong1``, ``ulonglong2``, ``ulonglong3``, ``ulonglong4``
+
+* Floating Points:
+
+  * ``float1``, ``float2``, ``float3``, ``float4``
+  * ``double1``, ``double2``, ``double3``, ``double4``
+
+.. _dim3:
+
+dim3
+--------------------------------------------------------------------------------------------
+
+``dim3`` is a three-dimensional integer vector type that is commonly used to specify grid and group
+dimensions.
+
+The dim3 constructor accepts between zero and three arguments. By default, it initializes unspecified
+dimensions to 1.
+
+.. code:: cpp
+
+  typedef struct dim3 {
+    uint32_t x;
+    uint32_t y;
+    uint32_t z;
+
+    dim3(uint32_t _x=1, uint32_t _y=1, uint32_t _z=1) : x(_x), y(_y), z(_z) {}
+  } dim3;
+
+
+Memory fence instructions
+====================================================
+
+HIP supports ``__threadfence()`` and ``__threadfence_block()``. If you're using ``__threadfence_system()`` in
+the HIP-Clang path, you can use the following workaround:
+
+#. Build HIP with the ``HIP_COHERENT_HOST_ALLOC`` environment variable enabled.
+#. Modify kernels that use ``__threadfence_system()`` as follows:
+
+   * Ensure the kernel operates only on fine-grained system memory, which should be allocated with
+     ``hipHostMalloc()``.
+   * Remove ``memcpy`` for all allocated fine-grained system memory regions.
+
+.. _synchronization-functions:
+
+Synchronization functions
+====================================================
+The ``__syncthreads()`` built-in function is supported in HIP. The ``__syncthreads_count(int)``,
+``__syncthreads_and(int)``, and ``__syncthreads_or(int)`` functions are under development.
+
+Math functions
+====================================================
+HIP-Clang supports a set of math operations that are callable from the device. These are described in
+the following sections.
+
+.. doxygengroup:: Math
+   :inner:
+
+Texture functions
+===============================================
+
+The supported texture functions are listed in
+`texture_fetch_functions.h <https://github.com/ROCm-Developer-Tools/HIP/blob/main/include/hip/hcc_detail/texture_fetch_functions.h>`_
+and `texture_indirect_functions.h <https://github.com/ROCm-Developer-Tools/HIP/blob/main/include/hip/hcc_detail/texture_indirect_functions.h>`_.
+
+Texture functions are not supported on some devices. In device code, the ``__HIP_NO_IMAGE_SUPPORT``
+macro is defined as ``1`` when texture functions are not supported. In host runtime code, you can
+query the ``hipDeviceAttributeImageSupport`` attribute to check whether texture functions are
+supported.
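+
+A minimal host-side sketch of this query is shown below. The helper name ``deviceSupportsTextures``
+is hypothetical, and treating a failed query as "not supported" is an assumption made for this
+example rather than documented HIP behavior.
+
+.. code:: cpp
+
+  #include <hip/hip_runtime.h>
+
+  // Hypothetical helper: returns true if the given device reports image (texture) support.
+  bool deviceSupportsTextures(int deviceId)
+  {
+    int imageSupport = 0;
+    if (hipDeviceGetAttribute(&imageSupport, hipDeviceAttributeImageSupport, deviceId) != hipSuccess) {
+      return false; // assumption: treat a failed query as "not supported"
+    }
+    return imageSupport != 0;
+  }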
+
+Surface functions
+===============================================
+
+Surface functions are not supported.
+
+Timer functions
+===============================================
+
+To read a high-resolution timer from the device, HIP provides the following built-in functions:
+
+* Returning the incremental counter value for every clock cycle on a device:
+
+  .. code:: cpp
+
+    clock_t clock()
+    long long int clock64()
+
+  The difference between the values that are returned represents the cycles used.
+
+* Returning the wall clock count at a constant frequency on the device:
+
+  .. code:: cpp
+
+    long long int wall_clock64()
+
+  The wall clock frequency can be queried in HIP application code through the
+  ``hipDeviceAttributeWallClockRate`` device attribute. For example:
+
+  .. code:: cpp
+
+    int wallClkRate = 0; //in kilohertz
+    HIPCHECK(hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId));
+
+  Where ``hipDeviceAttributeWallClockRate`` is a device attribute. Note that wall clock frequency is a
+  per-device attribute.
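+
+Because the rate is reported in kilohertz, dividing a tick difference by it gives milliseconds. The
+following host-only sketch shows that arithmetic; ``ticksToMilliseconds`` is a hypothetical helper,
+and it assumes that both readings come from ``wall_clock64()`` on the same device.
+
+.. code:: cpp
+
+  // Hypothetical helper: converts the difference of two wall_clock64() readings to milliseconds.
+  // wallClkRateKHz is the value reported for hipDeviceAttributeWallClockRate (in kilohertz),
+  // so ticks / kHz yields milliseconds.
+  constexpr double ticksToMilliseconds(long long startTicks, long long endTicks, int wallClkRateKHz)
+  {
+    return static_cast<double>(endTicks - startTicks) / wallClkRateKHz;
+  }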
+
+Atomic functions
+===============================================
+
+Atomic functions perform read-modify-write (RMW) operations on values that reside in global or shared
+memory. No other device or thread can observe or modify the memory location during an atomic
+operation. If multiple instructions from different devices or threads target the same memory location,
+the instructions are serialized in an undefined order.
+
+To support system scope atomic operations, you can use the HIP APIs that contain the ``_system`` suffix.
+For example:
+
+* ``atomicAnd``: This function is atomic and coherent within the GPU device running the function
+* ``atomicAnd_system``: This function extends the atomic operation from the GPU device to other CPUs
+  and GPU devices in the system
+
+HIP supports the following atomic operations.
+
+.. list-table::
+
+    * - **Function**
+      - **Supported in HIP**
+      - **Supported in CUDA**
+
+    * - int atomicAdd(int* address, int val)
+      - ✓
+      - ✓
+
+    * - int atomicAdd_system(int* address, int val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicAdd(unsigned int* address,unsigned int val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicAdd_system(unsigned int* address, unsigned int val)
+      - ✓
+      - ✓
+
+    * - unsigned long long atomicAdd(unsigned long long* address,unsigned long long val)
+      - ✓
+      - ✓
+
+    * - unsigned long long atomicAdd_system(unsigned long long* address, unsigned long long val)
+      - ✓
+      - ✓
+
+    * - float atomicAdd(float* address, float val)
+      - ✓
+      - ✓
+
+    * - float atomicAdd_system(float* address, float val)
+      - ✓
+      - ✓
+
+    * - double atomicAdd(double* address, double val)
+      - ✓
+      - ✓
+
+    * - double atomicAdd_system(double* address, double val)
+      - ✓
+      - ✓
+
+    * - float unsafeAtomicAdd(float* address, float val)
+      - ✓
+      - ✗
+
+    * - float safeAtomicAdd(float* address, float val)
+      - ✓
+      - ✗
+
+    * - double unsafeAtomicAdd(double* address, double val)
+      - ✓
+      - ✗
+
+    * - double safeAtomicAdd(double* address, double val)
+      - ✓
+      - ✗
+
+    * - int atomicSub(int* address, int val)
+      - ✓
+      - ✓
+
+    * - int atomicSub_system(int* address, int val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicSub(unsigned int* address,unsigned int val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicSub_system(unsigned int* address, unsigned int val)
+      - ✓
+      - ✓
+
+    * - int atomicExch(int* address, int val)
+      - ✓
+      - ✓
+
+    * - int atomicExch_system(int* address, int val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicExch(unsigned int* address,unsigned int val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicExch_system(unsigned int* address, unsigned int val)
+      - ✓
+      - ✓
+
+    * - unsigned long long atomicExch(unsigned long long int* address,unsigned long long int val)
+      - ✓
+      - ✓
+
+    * - unsigned long long atomicExch_system(unsigned long long* address, unsigned long long val)
+      - ✓
+      - ✓
+
+    * - float atomicExch(float* address, float val)
+      - ✓
+      - ✓
+
+    * - int atomicMin(int* address, int val)
+      - ✓
+      - ✓
+
+    * - int atomicMin_system(int* address, int val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicMin(unsigned int* address,unsigned int val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicMin_system(unsigned int* address, unsigned int val)
+      - ✓
+      - ✓
+
+    * - unsigned long long atomicMin(unsigned long long* address,unsigned long long val)
+      - ✓
+      - ✓
+
+    * - int atomicMax(int* address, int val)
+      - ✓
+      - ✓
+
+    * - int atomicMax_system(int* address, int val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicMax(unsigned int* address,unsigned int val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicMax_system(unsigned int* address, unsigned int val)
+      - ✓
+      - ✓
+
+    * - unsigned long long atomicMax(unsigned long long* address,unsigned long long val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicInc(unsigned int* address)
+      - ✗
+      - ✓
+
+    * - unsigned int atomicDec(unsigned int* address)
+      - ✗
+      - ✓
+
+    * - int atomicCAS(int* address, int compare, int val)
+      - ✓
+      - ✓
+
+    * - int atomicCAS_system(int* address, int compare, int val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicCAS(unsigned int* address,unsigned int compare,unsigned int val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicCAS_system(unsigned int* address, unsigned int compare, unsigned int val)
+      - ✓
+      - ✓
+
+    * - unsigned long long atomicCAS(unsigned long long* address,unsigned long long compare,unsigned long long val)
+      - ✓
+      - ✓
+
+    * - unsigned long long atomicCAS_system(unsigned long long* address, unsigned long long compare, unsigned long long val)
+      - ✓
+      - ✓
+
+    * - int atomicAnd(int* address, int val)
+      - ✓
+      - ✓
+
+    * - int atomicAnd_system(int* address, int val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicAnd(unsigned int* address,unsigned int val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicAnd_system(unsigned int* address, unsigned int val)
+      - ✓
+      - ✓
+
+    * - unsigned long long atomicAnd(unsigned long long* address,unsigned long long val)
+      - ✓
+      - ✓
+
+    * - unsigned long long atomicAnd_system(unsigned long long* address, unsigned long long val)
+      - ✓
+      - ✓
+
+    * - int atomicOr(int* address, int val)
+      - ✓
+      - ✓
+
+    * - int atomicOr_system(int* address, int val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicOr(unsigned int* address,unsigned int val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicOr_system(unsigned int* address, unsigned int val)
+      - ✓
+      - ✓
+
+    * - unsigned long long atomicOr(unsigned long long int* address,unsigned long long val)
+      - ✓
+      - ✓
+
+    * - unsigned long long atomicOr_system(unsigned long long* address, unsigned long long val)
+      - ✓
+      - ✓
+
+    * - int atomicXor(int* address, int val)
+      - ✓
+      - ✓
+
+    * - int atomicXor_system(int* address, int val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicXor(unsigned int* address,unsigned int val)
+      - ✓
+      - ✓
+
+    * - unsigned int atomicXor_system(unsigned int* address, unsigned int val)
+      - ✓
+      - ✓
+
+    * - unsigned long long atomicXor(unsigned long long* address,unsigned long long val)
+      - ✓
+      - ✓
+
+    * - unsigned long long atomicXor_system(unsigned long long* address, unsigned long long val)
+      - ✓
+      - ✓
+
+Unsafe floating-point atomic RMW operations
+----------------------------------------------------------------------------------------------------------------
+Some HIP devices support fast atomic RMW operations on floating-point values. For example,
+``atomicAdd`` on single- or double-precision floating-point values may generate a hardware RMW
+instruction that is faster than emulating the atomic operation using an atomic compare-and-swap
+(CAS) loop.
+
+On some devices, fast atomic RMW instructions can produce results that differ from the same
+functions implemented with atomic CAS loops. For example, some devices will use different rounding
+or denormal modes, and some devices produce incorrect answers if fast floating-point atomic RMW
+instructions target fine-grained memory allocations.
+
+The HIP-Clang compiler offers a compile-time option, so you can choose fast--but potentially
+unsafe--atomic instructions for your code. On devices that support these instructions, you can include
+the ``-munsafe-fp-atomics`` option. This flag indicates to the compiler that all floating-point atomic
+function calls are allowed to use an unsafe version, if one exists. For example, on some devices, this
+flag indicates to the compiler that no floating-point ``atomicAdd`` function can target fine-grained
+memory.
+
+If you want to avoid using unsafe floating-point atomic RMW operations, you can use the
+``-mno-unsafe-fp-atomics`` option. Note that the compiler default is to not produce unsafe
+floating-point atomic RMW instructions, so the ``-mno-unsafe-fp-atomics`` option is not necessarily
+required. However, passing this option to the compiler is good practice.
+
+When you pass ``-munsafe-fp-atomics`` or ``-mno-unsafe-fp-atomics`` to the compiler's command line,
+the option is applied globally for the entire compilation. Note that if some of the atomic RMW function
+calls cannot safely use the faster floating-point atomic RMW instructions, you must use
+``-mno-unsafe-fp-atomics`` in order to ensure that your atomic RMW function calls produce correct
+results.
+
+HIP has four extra functions that you can use to more precisely control which floating-point atomic
+RMW functions produce unsafe atomic RMW instructions:
+
+* ``float unsafeAtomicAdd(float* address, float val)`` and
+  ``double unsafeAtomicAdd(double* address, double val)``: Always produce fast atomic RMW
+  instructions on devices that have them, even when ``-mno-unsafe-fp-atomics`` is used
+* ``float safeAtomicAdd(float* address, float val)`` and
+  ``double safeAtomicAdd(double* address, double val)``: Always produce safe atomic RMW
+  operations, even when ``-munsafe-fp-atomics`` is used
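+
+As an illustration, a device function might mix both variants and keep the fast path for data that
+is known to be safe for it. This is only a sketch: the helper name ``accumulate`` and its parameters
+are hypothetical, and it assumes that ``fineGrainedSum`` points to fine-grained memory while
+``coarseSum`` does not.
+
+.. code:: cpp
+
+  #include <hip/hip_runtime.h>
+
+  // Hypothetical helper: adds v to two running sums, choosing the atomic variant per target.
+  __device__ void accumulate(float* coarseSum, float* fineGrainedSum, float v)
+  {
+    // Uses the fast hardware RMW instruction where the device has one,
+    // regardless of the -mno-unsafe-fp-atomics setting.
+    unsafeAtomicAdd(coarseSum, v);
+
+    // Always takes the safe path, regardless of the -munsafe-fp-atomics setting.
+    safeAtomicAdd(fineGrainedSum, v);
+  }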
+
+.. _warp-cross-lane:
+
+Warp cross-lane functions
+========================================================
+
+Warp cross-lane functions operate across all lanes in a warp. The hardware guarantees that all warp
+lanes are run in lockstep, meaning that additional synchronization is unnecessary. The instructions
+don't use shared memory.
+
+Note that NVIDIA and AMD devices have different warp sizes. You can use ``warpSize`` built-ins in your
+portable code to query the warp size.
+
+.. tip::
+  Be sure to review HIP code generated from the CUDA path to ensure that it doesn't assume a
+  ``waveSize`` of 32. "Wave-aware" code that assumes a ``waveSize`` of 32 can run on a wave-64
+  machine, but it only utilizes half of the machine's resources.
+
+To get the default warp size of a GPU device, use ``hipGetDeviceProperties`` in your host functions.
+
+.. code:: cpp
+
+  hipDeviceProp_t props;
+  hipGetDeviceProperties(&props, deviceID);
+  int w = props.warpSize;
+  // implement portable algorithm based on w (rather than assume 32 or 64)
+
+Only use ``warpSize`` built-ins in device functions, and don't assume ``warpSize`` to be a compile-time
+constant.
+
+Note that assembly kernels may be built for a warp size that is different from the default.
+
+Warp vote and ballot functions
+-------------------------------------------------------------------------------------------------------------
+
+.. code:: cpp
+
+  int __all(int predicate)
+  int __any(int predicate)
+  uint64_t __ballot(int predicate)
+
+Threads in a warp are referred to as *lanes* and are numbered from 0 to :math:`warpSize - 1`. Each
+warp lane contributes a 1-bit value (the predicate), which is efficiently broadcast to all lanes in
+the warp.
+
+The 32-bit int predicate from each lane reduces to a 1-bit value of 0 ``(predicate = 0)`` or 1
+``(predicate != 0)``. To get a summary view of the predicates that are contributed by other warp lanes, you
+can use:
+
+* ``__any()``: Returns 1 if any warp lane contributes a nonzero predicate, otherwise it returns 0
+* ``__all()``: Returns 1 if all warp lanes contribute nonzero predicates, otherwise it returns 0
+
+To determine if the target platform supports the any/all instruction, you can use the ``hasWarpVote``
+device property or the ``HIP_ARCH_HAS_WARP_VOTE`` compiler definition.
+
+HIP's ``__ballot`` function provides a bit mask that contains the 1-bit predicate value from each lane.
+The nth bit of this result contains the 1 bit contributed by the nth warp lane. Note that ``__ballot``
+supports a 64-bit return value (versus CUDA's 32 bits). Code ported from CUDA should support these
+larger warp sizes.
+
+To determine if the target platform supports the ballot instruction, you can use the ``hasWarpBallot``
+device property or the ``HIP_ARCH_HAS_WARP_BALLOT`` compiler definition.
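+
+The 64-bit return value matters when counting votes. The sketch below counts the lanes that
+contribute a nonzero predicate; ``countActiveVotes`` is a hypothetical helper name, and the example
+assumes a platform that supports the ballot instruction.
+
+.. code:: cpp
+
+  #include <hip/hip_runtime.h>
+
+  // Hypothetical helper: number of lanes in the warp whose predicate is nonzero.
+  __device__ int countActiveVotes(int predicate)
+  {
+    // Keep the full 64-bit mask; storing it in a 32-bit variable would drop
+    // lanes 32 to 63 on devices with a warp size of 64.
+    unsigned long long mask = __ballot(predicate);
+    return __popcll(mask);
+  }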
+
+Warp shuffle functions
+-------------------------------------------------------------------------------------------------------------
+
+The default width is ``warpSize`` (see :ref:`warp-cross-lane`). Half-float shuffles are not supported.
+
+.. code:: cpp
+
+  int   __shfl      (int var,   int srcLane, int width=warpSize);
+  float __shfl      (float var, int srcLane, int width=warpSize);
+  int   __shfl_up   (int var,   unsigned int delta, int width=warpSize);
+  float __shfl_up   (float var, unsigned int delta, int width=warpSize);
+  int   __shfl_down (int var,   unsigned int delta, int width=warpSize);
+  float __shfl_down (float var, unsigned int delta, int width=warpSize);
+  int   __shfl_xor  (int var,   int laneMask, int width=warpSize);
+  float __shfl_xor  (float var, int laneMask, int width=warpSize);
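+
+A common use of these functions is a warp-level reduction. The following sketch sums a value across
+a warp without hard-coding the warp size; ``warpReduceSum`` is a hypothetical helper name, and the
+example assumes that all lanes of the warp call it together.
+
+.. code:: cpp
+
+  #include <hip/hip_runtime.h>
+
+  // Hypothetical helper: sums `value` across the lanes of a warp.
+  // After the loop, lane 0 holds the total; the other lanes hold partial sums.
+  __device__ float warpReduceSum(float value)
+  {
+    // Start from warpSize / 2 instead of a hard-coded 16 so that the loop
+    // works for both 32-wide and 64-wide warps.
+    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
+      value += __shfl_down(value, offset);
+    }
+    return value;
+  }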
+
+Cooperative groups functions
+==============================================================
+
+You can use cooperative groups to synchronize groups of threads. Cooperative groups also provide a
+way of communicating between groups of threads at a granularity that is different from the block.
+
+HIP supports the following kernel language cooperative groups types and functions:
+
+.. list-table::
+
+    * - **Function**
+      - **Supported in HIP**
+      - **Supported in CUDA**
+
+    * - void thread_group.sync();
+      - ✓
+      - ✓
+
+    * - unsigned thread_group.size();
+      - ✓
+      - ✓
+
+    * - unsigned thread_group.thread_rank()
+      - ✓
+      - ✓
+
+    * - bool thread_group.is_valid();
+      - ✓
+      - ✓
+
+    * - grid_group this_grid()
+      - ✓
+      - ✓
+
+    * - void grid_group.sync()
+      - ✓
+      - ✓
+
+    * - unsigned grid_group.size()
+      - ✓
+      - ✓
+
+    * - unsigned grid_group.thread_rank()
+      - ✓
+      - ✓
+
+    * - bool grid_group.is_valid()
+      - ✓
+      - ✓
+
+    * - multi_grid_group this_multi_grid()
+      - ✓
+      - ✓
+
+    * - void multi_grid_group.sync()
+      - ✓
+      - ✓
+
+    * - unsigned multi_grid_group.size()
+      - ✓
+      - ✓
+
+    * - unsigned multi_grid_group.thread_rank()
+      - ✓
+      - ✓
+
+    * - bool multi_grid_group.is_valid()
+      - ✓
+      - ✓
+
+    * - unsigned multi_grid_group.num_grids()
+      - ✓
+      - ✓
+
+    * - unsigned multi_grid_group.grid_rank()
+      - ✓
+      - ✓
+
+    * - thread_block this_thread_block()
+      - ✓
+      - ✓
+
+    * - void thread_block.sync()
+      - ✓
+      - ✓
+
+    * - unsigned thread_block.size()
+      - ✓
+      - ✓
+
+    * - unsigned thread_block.thread_rank()
+      - ✓
+      - ✓
+
+    * - bool thread_block.is_valid()
+      - ✓
+      - ✓
+
+    * - dim3 thread_block.group_index()
+      - ✓
+      - ✓
+
+    * - dim3 thread_block.thread_index()
+      - ✓
+      - ✓
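+
+As a minimal sketch of how the ``thread_block`` functions above fit together, the following kernel
+reverses the elements handled by each block. The kernel name ``reverse_in_block`` and the tile size of
+256 are assumptions made for illustration. The sketch also assumes one-dimensional blocks of at most
+256 threads and an array length that is a multiple of the block size.
+
+.. code:: cpp
+
+  #include <hip/hip_runtime.h>
+  #include <hip/hip_cooperative_groups.h>
+
+  namespace cg = cooperative_groups;
+
+  // Hypothetical kernel: reverses the slice of data owned by each block.
+  // Assumes 1-D blocks with blockDim.x <= 256.
+  __global__ void reverse_in_block(int* data) {
+    __shared__ int tile[256];
+
+    cg::thread_block block = cg::this_thread_block();
+    const unsigned rank = block.thread_rank();  // linear rank of this thread in the block
+    const unsigned size = block.size();         // number of threads in the block
+    const unsigned base = blockIdx.x * size;    // first element owned by this block
+
+    tile[rank] = data[base + rank];
+    block.sync();  // every write to tile completes before any thread reads it
+    data[base + rank] = tile[size - 1 - rank];
+  }
+
+Here ``block.sync()`` acts as a block-wide barrier, so no thread reads ``tile`` before all threads in
+the block have written their element.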
+
+Warp matrix functions
+============================================================
+
+Warp matrix functions allow a warp to cooperatively operate on small matrices that have elements
+spread over lanes in an unspecified manner.
+
+HIP does not support kernel language warp matrix types or functions.
+
+.. list-table::
+
+    * - **Function**
+      - **Supported in HIP**
+      - **Supported in CUDA**
+
+    * - void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned lda)
+      - ✗
+      - ✓
+
+    * - void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned lda, layout_t layout)
+      - ✗
+      - ✓
+
+    * - void store_matrix_sync(T* mptr, fragment<...> &a, unsigned lda, layout_t layout)
+      - ✗
+      - ✓
+
+    * - void fill_fragment(fragment<...> &a, const T &value)
+      - ✗
+      - ✓
+
+    * - void mma_sync(fragment<...> &d, const fragment<...> &a, const fragment<...> &b, const fragment<...> &c, bool sat)
+      - ✗
+      - ✓
+
+Independent thread scheduling
+============================================================
+
+Certain architectures that support CUDA allow threads to progress independently of each other. This
+independent thread scheduling makes intra-warp synchronization possible.
+
+HIP does not support this type of scheduling.
+
+Profiler Counter Function
+============================================================
+
+The CUDA ``__prof_trigger()`` instruction is not supported.
+
+Assert
+============================================================
+
+The ``assert`` function is supported in HIP. It is used for debugging: when the input expression
+evaluates to zero, execution stops.
+
+.. code:: cpp
+
+  void assert(int input)
+
+There are two implementations of ``assert``, depending on where it is used:
+
+- The host version, which is defined in ``assert.h``.
+- The device version, which is implemented in ``hip/hip_runtime.h``.
+
+Include ``assert.h`` to use ``assert`` in host code. For ``assert`` to work in both device and host
+functions, include ``hip/hip_runtime.h``.
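+
+As a short illustration of the device version, the sketch below checks a precondition inside a
+kernel. The kernel name ``check_positive`` and its parameters are hypothetical and are not part of the
+HIP API.
+
+.. code:: cpp
+
+  #include <hip/hip_runtime.h>
+  #include <cassert>
+
+  // Hypothetical kernel: stops execution if any of the first n values is not positive.
+  __global__ void check_positive(const int* values, int n) {
+    const int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (i < n) {
+      assert(values[i] > 0);  // evaluates to zero (fails) for non-positive input
+    }
+  }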
+
+Printf
+============================================================
+
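+The ``printf`` function is supported in HIP.
+The following simple example prints information from a kernel.
+
+.. code:: cpp
+
+  #include <hip/hip_runtime.h>
+
+  __global__ void run_printf() { printf("Hello World\n"); }
+
+  int main() {
+    run_printf<<<dim3(1), dim3(1), 0, 0>>>();
+    // Wait for the kernel to finish so that its output is flushed before the program exits.
+    return hipDeviceSynchronize() == hipSuccess ? 0 : 1;
+  }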
+
+
+Device-Side Dynamic Global Memory Allocation
+============================================================
+
+Device-side dynamic global memory allocation is under development. HIP now includes a preliminary
+implementation of ``malloc`` and ``free`` that can be called from device functions.
+
+``__launch_bounds__``
+============================================================
+
+GPU multiprocessors have a fixed pool of resources (primarily registers and shared memory) that are shared by the actively running warps. Using more resources can increase the instructions per cycle (IPC) of the kernel, but reduces the resources available for other warps and limits the number of warps that can run simultaneously. Thus, GPUs have a complex relationship between resource usage and performance.
+
+``__launch_bounds__`` allows the application to provide usage hints that influence the resources (primarily registers) used by the generated code. It is a function attribute that must be attached to a ``__global__`` function:
+
+.. code:: cpp
+
+  __global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_WARPS_PER_EXECUTION_UNIT)
+  MyKernel(hipGridLaunch lp, ...)
+  ...
+
+``__launch_bounds__`` supports two parameters:
+
+- ``MAX_THREADS_PER_BLOCK``: The programmer guarantees that the kernel will be launched with no more
+  than ``MAX_THREADS_PER_BLOCK`` threads per block. (On NVCC, this maps to the ``.maxntid`` PTX
+  directive.) If no launch bounds are specified, ``MAX_THREADS_PER_BLOCK`` is the maximum block size
+  supported by the device (typically 1024 or larger). Specifying a ``MAX_THREADS_PER_BLOCK`` value less
+  than the maximum effectively allows the compiler to use more resources than a default unconstrained
+  compilation that supports all possible block sizes at launch time. The threads-per-block value is the
+  product ``blockDim.x * blockDim.y * blockDim.z``.
+- ``MIN_WARPS_PER_EXECUTION_UNIT``: Directs the compiler to minimize resource usage so that the
+  requested number of warps can be simultaneously active on a multiprocessor. Since active warps
+  compete for the same fixed pool of resources, the compiler must reduce the resources required by each
+  warp (primarily registers). ``MIN_WARPS_PER_EXECUTION_UNIT`` is optional and defaults to 1 if not
+  specified. Specifying a ``MIN_WARPS_PER_EXECUTION_UNIT`` greater than the default of 1 effectively
+  constrains the compiler's resource usage.
+
+When a kernel is launched with HIP APIs, for example ``hipModuleLaunchKernel()``, HIP validates that the
+kernel dimension size is not larger than the specified launch bounds. If the bounds are exceeded, HIP
+returns a launch failure. If ``AMD_LOG_LEVEL`` is set to a proper value (for details, refer to
+``docs/markdown/hip_logging.md``), the error log message includes the launch parameters (kernel
+dimension size), the launch bounds, and the name of the faulting kernel. This information helps to
+identify the faulting kernel and to debug the failure.
+
+Compiler Impact
+--------------------------------------------------------------------------------------------
+
+The compiler uses these parameters as follows:
+
+- The compiler uses the hints only to manage register usage, and does not automatically reduce shared
+  memory or other resources.
+- Compilation fails if the compiler cannot generate a kernel that meets the requirements of the
+  specified launch bounds.
+- From ``MAX_THREADS_PER_BLOCK``, the compiler derives the maximum number of warps per block that can be
+  used at launch time. Values of ``MAX_THREADS_PER_BLOCK`` less than the default allow the compiler to
+  use a larger pool of registers: each warp uses registers, and this hint constrains the launch to a
+  warps-per-block size that is less than the maximum.
+- From ``MIN_WARPS_PER_EXECUTION_UNIT``, the compiler derives a maximum number of registers that can be
+  used by the kernel (to meet the required number of simultaneously active blocks). If
+  ``MIN_WARPS_PER_EXECUTION_UNIT`` is 1, the kernel can use all registers supported by the
+  multiprocessor.
+- The compiler ensures that the number of registers used in the kernel is less than both allowed
+  maximums, typically by spilling registers (to shared or global memory) or by using more instructions.
+- The compiler may use heuristics to increase register usage, or may simply be able to avoid spilling.
+  ``MAX_THREADS_PER_BLOCK`` is particularly useful in these cases, since it allows the compiler to use
+  more registers and avoids situations where the compiler constrains the register usage (potentially
+  spilling) to meet the requirements of a large block size that is never used at launch time.
+
+CU and EU Definitions
+--------------------------------------------------------------------------------------------
+
+A compute unit (CU) is responsible for executing the waves of a work-group. It is composed of one or more execution units (EU) which are responsible for executing waves. An EU can have enough resources to maintain the state of more than one executing wave. This allows an EU to hide latency by switching between waves in a similar way to simultaneous multithreading on a CPU. In order to allow the state for multiple waves to fit on an EU, the resources used by a single wave have to be limited. Limiting such resources can allow greater latency hiding, but can result in having to spill some register state to memory. This attribute allows an advanced developer to tune the number of waves that are capable of fitting within the resources of an EU. It can be used to ensure at least a certain number will fit to help hide latency, and can also be used to ensure no more than a certain number will fit to limit cache thrashing.
+
+Porting from CUDA ``__launch_bounds__``
+--------------------------------------------------------------------------------------------
+
+CUDA defines a ``__launch_bounds__`` attribute that is also designed to control occupancy:
+
+.. code:: cpp
+
+  __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)
+
+The second parameter of ``__launch_bounds__`` must be converted to the format used by
+``__hip_launch_bounds``, which uses warps and execution units rather than blocks and multiprocessors
+(HIPIFY tools perform this conversion automatically):
+
+.. code:: cpp
+
+  MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK) / 32
+
+The key differences in the interface are:
+
+- Warps (rather than blocks): The developer is trying to tell the compiler to control resource
+  utilization to guarantee some number of active warps per EU for latency hiding. Specifying active
+  warps in terms of blocks appears to hide the micro-architectural details of the warp size, but makes
+  the interface more confusing, since the developer ultimately needs to compute the number of warps to
+  obtain the desired level of control.
+- Execution units (rather than multiprocessors): The use of execution units rather than multiprocessors
+  provides support for architectures with multiple execution units per multiprocessor. For example, the
+  AMD GCN architecture has 4 execution units per multiprocessor. The ``hipDeviceProps`` structure has a
+  field ``executionUnitsPerMultiprocessor``.
+
+Platform-specific coding techniques such as ``#ifdef`` can be used to specify different launch bounds for NVCC and HIP-Clang platforms, if desired.
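+
+To make the conversion formula concrete, the following sketch works through one case at compile time.
+The values 256 and 2 are hypothetical and chosen only for illustration; the constant names mirror the
+parameter names used in this section.
+
+.. code:: cpp
+
+  // Hypothetical CUDA launch bounds, used only to illustrate the arithmetic.
+  constexpr int MAX_THREADS_PER_BLOCK = 256;
+  constexpr int MIN_BLOCKS_PER_MULTIPROCESSOR = 2;
+
+  // Conversion from blocks per multiprocessor to warps, using the formula above
+  // (the divisor 32 is the warp size assumed by that formula).
+  constexpr int MIN_WARPS_PER_EXECUTION_UNIT =
+      (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK) / 32;
+
+  static_assert(MIN_WARPS_PER_EXECUTION_UNIT == 16, "(2 * 256) / 32 == 16");
+
+  int main() { return 0; }
+
+Two blocks of 256 threads amount to 512 threads, which is 16 warps of 32 threads.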
+
+maxregcount
+--------------------------------------------------------------------------------------------
+
+Unlike nvcc, HIP-Clang does not support the ``--maxregcount`` option. Instead, users are encouraged to use the ``hip_launch_bounds`` directive, since its parameters are more intuitive and portable than
+micro-architecture details like registers, and the directive allows per-kernel control rather than control over an entire file. ``hip_launch_bounds`` works on both HIP-Clang and nvcc targets.
+
+Asynchronous Functions
+============================================================
+
+Memory stream
+--------------------------------------------------------------------------------------------
+
+.. doxygengroup:: Stream
+   :content-only:
+
+.. doxygengroup:: StreamO
+   :content-only:
+
+Peer to peer
+--------------------------------------------------------------------------------------------
+
+.. doxygengroup:: PeerToPeer
+   :content-only:
+
+Memory management
+--------------------------------------------------------------------------------------------
+
+.. doxygengroup:: Memory
+   :content-only:
+
+External Resource Interoperability
+--------------------------------------------------------------------------------------------
+
+.. doxygengroup:: External
+   :content-only:
+
+Register Keyword
+============================================================
+
+The ``register`` keyword is deprecated in C++, and is silently ignored by both nvcc and HIP-Clang. You can pass the ``-Wdeprecated-register`` option to enable the compiler warning message.
+
+Pragma Unroll
+============================================================
+
+Unrolling with a bound that is known at compile time is supported. For example:
+
+.. code:: cpp
+
+  #pragma unroll 16 /* hint to compiler to unroll next loop by 16 */
+  for (int i=0; i<16; i++) ...
+
+.. code:: cpp
+
+  #pragma unroll 1  /* tell compiler to never unroll the loop */
+  for (int i=0; i<16; i++) ...
+
+.. code:: cpp
+
+  #pragma unroll /* hint to compiler to completely unroll next loop. */
+  for (int i=0; i<16; i++) ...
+
+In-Line Assembly
+============================================================
+
+GCN ISA in-line assembly is supported. For example:
+
+.. code:: cpp
+
+  asm volatile ("v_mac_f32_e32 %0, %2, %3" : "=v" (out[i]) : "0"(out[i]), "v" (a), "v" (in[i]));
+
+In this example:
+
+- The GCN ISA is inserted into the kernel using the ``asm()`` assembler statement.
+- The ``volatile`` keyword is used so that the optimizers do not change the number of volatile
+  operations or change their order of execution relative to other volatile operations.
+- ``v_mac_f32_e32`` is the GCN instruction. For more information, refer to the
+  `AMD GCN3 ISA architecture manual <http://gpuopen.com/compute-product/amd-gcn3-isa-architecture-manual/>`_.
+- The index of each operand is given by ``%`` followed by the operand's position in the ordered list
+  of operands.
+- ``"v"`` is the constraint code (target-specific for AMDGPU) for a 32-bit VGPR register. For more
+  information, refer to the
+  `Supported Constraint Code List for AMDGPU <https://llvm.org/docs/LangRef.html#supported-constraint-code-list>`_.
+- Output constraints are specified by an ``"="`` prefix, as shown above (``"=v"``). This indicates that
+  the assembly writes to this operand, and the operand is then made available as a return value of the
+  ``asm`` expression. Input constraints do not have a prefix, just the constraint code. The constraint
+  string ``"0"`` says to use the register assigned to the output as an input as well (it being the 0th
+  constraint).
+
+C++ Support
+============================================================
+
+The following C++ features are not supported:
+
+- Run-time type information (RTTI)
+- Try/catch
+
+Virtual functions are not supported if objects containing virtual function tables are passed between
+GPUs of different offload architectures, for example between gfx906 and gfx1030. Otherwise, virtual
+functions are supported.
+
+Kernel Compilation
+============================================================
+
+``hipcc`` supports compiling C++/HIP kernels to binary code objects.
+The binary file format is ``.co``, which stands for code object. The following command builds the code object using ``hipcc``.
+
+.. code:: bash
+
+  hipcc --genco --offload-arch=[TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]
+
+  [TARGET GPU] = GPU architecture
+  [INPUT FILE] = Name of the file containing kernels
+  [OUTPUT FILE] = Name of the generated code object file
+
+.. note::
+  When using binary code objects, the number of arguments to the kernel differs between the HIP-Clang and NVCC paths. Refer to the sample in ``samples/0_Intro/module_api`` for the differences in the arguments to be passed to the kernel.
+
+gfx-arch-specific-kernel
+============================================================
+
+The Clang-defined ``__gfx*__`` macros can be used to execute gfx-architecture-specific code inside the kernel. Refer to the sample ``14_gpu_arch`` in ``samples/2_Cookbook``.
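+
+As a rough sketch of this technique (not the code of the referenced sample), the kernel below selects
+a code path per target architecture at compile time. The kernel name ``report_arch`` and the values it
+writes are hypothetical; ``__gfx906__`` and ``__gfx1030__`` are used only as example targets.
+
+.. code:: cpp
+
+  #include <hip/hip_runtime.h>
+
+  // Hypothetical kernel: writes a value identifying the architecture it was compiled for.
+  __global__ void report_arch(int* out) {
+  #if defined(__gfx906__)
+    *out = 906;   // code path compiled only for gfx906
+  #elif defined(__gfx1030__)
+    *out = 1030;  // code path compiled only for gfx1030
+  #else
+    *out = 0;     // fallback for any other target
+  #endif
+  }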
diff --git a/docs/reference/terms.md b/docs/old/reference/terms.md
similarity index 100%
rename from docs/reference/terms.md
rename to docs/old/reference/terms.md
diff --git a/docs/user_guide/faq.md b/docs/old/user_guide/faq.md
similarity index 100%
rename from docs/user_guide/faq.md
rename to docs/old/user_guide/faq.md
diff --git a/docs/user_guide/hip_porting_driver_api.md b/docs/old/user_guide/hip_porting_driver_api.md
similarity index 100%
rename from docs/user_guide/hip_porting_driver_api.md
rename to docs/old/user_guide/hip_porting_driver_api.md
diff --git a/docs/user_guide/hip_porting_guide.md b/docs/old/user_guide/hip_porting_guide.md
similarity index 93%
rename from docs/user_guide/hip_porting_guide.md
rename to docs/old/user_guide/hip_porting_guide.md
index 078172de76..e2525237be 100644
--- a/docs/user_guide/hip_porting_guide.md
+++ b/docs/old/user_guide/hip_porting_guide.md
@@ -78,19 +78,26 @@ directory names.
 
 ### Library Equivalents
 
-| CUDA Library | ROCm Library | Comment |
-|------- | ---------   | -----   |
-| cuBLAS        |    rocBLAS     | Basic Linear Algebra Subroutines
-| cuFFT        |    rocFFT     | Fast Fourier Transfer Library
-| cuSPARSE     |    rocSPARSE   | Sparse BLAS  + SPMV
-| cuSolver     |    rocSOLVER   | Lapack library
-| AMG-X    |    rocALUTION   | Sparse iterative solvers and preconditioners with Geometric and Algebraic MultiGrid
-| Thrust    |    rocThrust | C++ parallel algorithms library
-| CUB     |    rocPRIM | Low Level Optimized Parallel Primitives
-| cuDNN    |    MIOpen | Deep learning Solver Library
-| cuRAND    |    rocRAND | Random Number Generator Library
-| EIGEN    |    EIGEN - HIP port | C++ template library for linear algebra: matrices, vectors, numerical solvers,
-| NCCL    |    RCCL  | Communications Primitives Library based on the MPI equivalents
+Most CUDA libraries have a corresponding ROCm library with similar functionality and APIs. However, ROCm also provides HIP marshalling libraries that greatly simplify the porting process because they more precisely reflect their CUDA counterparts and can be used with either the AMD or NVIDIA platforms (see "Identifying HIP Target Platform" below). There are a few notable exceptions:
+  - MIOpen does not have a marshalling library interface to ease porting from cuDNN.
+  - RCCL is a drop-in replacement for NCCL and implements the NCCL APIs.
+  - hipBLASLt does not have a ROCm library but can still target the NVIDIA platform, as needed.
+  - EIGEN's HIP support is part of the library.
+
+| CUDA Library | HIP Library | ROCm Library | Comment |
+|------------- | ----------- | ------------ | ------- |
+| cuBLAS       | hipBLAS     | rocBLAS      | Basic Linear Algebra Subroutines
+| cuBLASLt     | hipBLASLt   | N/A          | Basic Linear Algebra Subroutines, lightweight and new flexible API
+| cuFFT        | hipFFT      | rocFFT       | Fast Fourier Transform Library
+| cuSPARSE     | hipSPARSE   | rocSPARSE    | Sparse BLAS  + SPMV
+| cuSolver     | hipSOLVER   | rocSOLVER    | LAPACK library
+| AMG-X        | N/A         | rocALUTION   | Sparse iterative solvers and preconditioners with Geometric and Algebraic MultiGrid
+| Thrust       | N/A         | rocThrust    | C++ parallel algorithms library
+| CUB          | hipCUB      | rocPRIM      | Low Level Optimized Parallel Primitives
+| cuDNN        | N/A         | MIOpen       | Deep learning Solver Library
+| cuRAND       | hipRAND     | rocRAND      | Random Number Generator Library
+| EIGEN        | EIGEN       | N/A          | C++ template library for linear algebra: matrices, vectors, numerical solvers
+| NCCL         | N/A         | RCCL         | Communications Primitives Library based on the MPI equivalents
 
 
 
diff --git a/docs/user_guide/hip_rtc.md b/docs/old/user_guide/hip_rtc.md
similarity index 100%
rename from docs/user_guide/hip_rtc.md
rename to docs/old/user_guide/hip_rtc.md
diff --git a/docs/user_guide/programming_manual.md b/docs/old/user_guide/programming_manual.md
similarity index 100%
rename from docs/user_guide/programming_manual.md
rename to docs/old/user_guide/programming_manual.md
diff --git a/docs/reference/deprecated_api_list.md b/docs/reference/deprecated_api_list.md
deleted file mode 100644
index 9821782284..0000000000
--- a/docs/reference/deprecated_api_list.md
+++ /dev/null
@@ -1,88 +0,0 @@
-# HIP Deprecated Runtime Functions
-
-
-## HIP Context Management APIs
-
-CUDA supports cuCtx API, the Driver API that defines "Context" and "Devices" as separate entities. Contexts contain a single device, and a device can theoretically have multiple contexts. HIP initially added limited support for these API to facilitate easy porting from existing driver codes. These API are marked as deprecated now since there are better alternate interface (such as hipSetDevice or the stream API) to achieve the required functions.
-### hipCtxCreate
-### hipCtxDestroy
-### hipCtxPopCurrent
-### hipCtxPushCurrent
-### hipCtxSetCurrent
-### hipCtxGetCurrent
-### hipCtxGetDevice
-### hipCtxGetApiVersion
-### hipCtxGetCacheConfig
-### hipCtxSetCacheConfig
-### hipCtxSetSharedMemConfig
-### hipCtxGetSharedMemConfig
-### hipCtxSynchronize
-### hipCtxGetFlags
-### hipCtxEnablePeerAccess
-### hipCtxDisablePeerAccess
-### hipDevicePrimaryCtxGetState
-### hipDevicePrimaryCtxRelease
-### hipDevicePrimaryCtxRetain
-### hipDevicePrimaryCtxReset
-### hipDevicePrimaryCtxSetFlags
-
-
-## HIP Memory Management APIs
-
-### hipMallocHost
-Should use "hipHostMalloc" instead.
-
-### hipMemAllocHost
-Should use "hipHostMalloc" instead.
-
-### hipHostAlloc
-Should use "hipHostMalloc" instead.
-
-### hipFreeHost
-Should use "hipHostFree" instead.
-
-### hipMemcpyToArray
-### hipMemcpyFromArray
-
-
-## HIP Profiler Control APIs
-
-### hipProfilerStart
-Should use roctracer/rocTX instead
-
-### hipProfilerStop
-Should use roctracer/rocTX instead
-
-
-## HIP Texture Management APIs
-
-### hipGetTextureReference
-### hipTexRefSetAddressMode
-### hipTexRefSetArray
-### hipTexRefSetFilterMode
-### hipTexRefSetFlags
-### hipTexRefSetFormat
-### hipBindTexture
-### hipBindTexture2D
-### hipBindTextureToArray
-### hipGetTextureAlignmentOffset
-### hipUnbindTexture
-### hipTexRefGetAddress
-### hipTexRefGetAddressMode
-### hipTexRefGetFilterMode
-### hipTexRefGetFlags
-### hipTexRefGetFormat
-### hipTexRefGetMaxAnisotropy
-### hipTexRefGetMipmapFilterMode
-### hipTexRefGetMipmapLevelBias
-### hipTexRefGetMipmapLevelClamp
-### hipTexRefGetMipMappedArray
-### hipTexRefSetAddress
-### hipTexRefSetAddress2D
-### hipTexRefSetMaxAnisotropy
-### hipTexRefSetBorderColor
-### hipTexRefSetMipmapFilterMode
-### hipTexRefSetMipmapLevelBias
-### hipTexRefSetMipmapLevelClamp
-### hipTexRefSetMipmappedArray
-### hipBindTextureToMipmappedArray
diff --git a/docs/reference/deprecated_api_list.rst b/docs/reference/deprecated_api_list.rst
new file mode 100644
index 0000000000..ee6808f9d8
--- /dev/null
+++ b/docs/reference/deprecated_api_list.rst
@@ -0,0 +1,91 @@
+.. meta::
+   :description: HIP deprecated runtime API functions.
+   :keywords: AMD, ROCm, HIP, deprecated, API
+
+**********************************************************************************************
+HIP deprecated runtime API functions
+**********************************************************************************************
+
+Several of our API functions have been flagged for deprecation. Using the following functions results in
+errors and unexpected results, so we encourage you to update your code accordingly.
+
+Context management
+============================================================
+
+CUDA supports the cuCtx API, which is the driver API that defines "Context" and "Devices" as separate
+entities. A context contains a single device, and a device can theoretically have multiple contexts. HIP
+initially added limited support for these APIs in order to facilitate porting from existing driver code.
+These APIs are now marked as deprecated because there are better alternative interfaces (such as
+``hipSetDevice`` or the stream API) to achieve these functions.
+
+* ``hipCtxCreate``
+* ``hipCtxDestroy``
+* ``hipCtxPopCurrent``
+* ``hipCtxPushCurrent``
+* ``hipCtxSetCurrent``
+* ``hipCtxGetCurrent``
+* ``hipCtxGetDevice``
+* ``hipCtxGetApiVersion``
+* ``hipCtxGetCacheConfig``
+* ``hipCtxSetCacheConfig``
+* ``hipCtxSetSharedMemConfig``
+* ``hipCtxGetSharedMemConfig``
+* ``hipCtxSynchronize``
+* ``hipCtxGetFlags``
+* ``hipCtxEnablePeerAccess``
+* ``hipCtxDisablePeerAccess``
+* ``hipDevicePrimaryCtxGetState``
+* ``hipDevicePrimaryCtxRelease``
+* ``hipDevicePrimaryCtxRetain``
+* ``hipDevicePrimaryCtxReset``
+* ``hipDevicePrimaryCtxSetFlags``
+
+Memory management
+============================================================
+
+* ``hipMallocHost`` (replaced with ``hipHostMalloc``)
+* ``hipMemAllocHost`` (replaced with ``hipHostMalloc``)
+* ``hipHostAlloc`` (replaced with ``hipHostMalloc``)
+* ``hipFreeHost`` (replaced with ``hipHostFree``)
+* ``hipMemcpyToArray``
+* ``hipMemcpyFromArray``
+
+Profiler control
+============================================================
+
+* ``hipProfilerStart`` (use roctracer/rocTX)
+* ``hipProfilerStop`` (use roctracer/rocTX)
+
+
+Texture management
+============================================================
+
+* ``hipGetTextureReference``
+* ``hipTexRefSetAddressMode``
+* ``hipTexRefSetArray``
+* ``hipTexRefSetFilterMode``
+* ``hipTexRefSetFlags``
+* ``hipTexRefSetFormat``
+* ``hipBindTexture``
+* ``hipBindTexture2D``
+* ``hipBindTextureToArray``
+* ``hipGetTextureAlignmentOffset``
+* ``hipUnbindTexture``
+* ``hipTexRefGetAddress``
+* ``hipTexRefGetAddressMode``
+* ``hipTexRefGetFilterMode``
+* ``hipTexRefGetFlags``
+* ``hipTexRefGetFormat``
+* ``hipTexRefGetMaxAnisotropy``
+* ``hipTexRefGetMipmapFilterMode``
+* ``hipTexRefGetMipmapLevelBias``
+* ``hipTexRefGetMipmapLevelClamp``
+* ``hipTexRefGetMipMappedArray``
+* ``hipTexRefSetAddress``
+* ``hipTexRefSetAddress2D``
+* ``hipTexRefSetMaxAnisotropy``
+* ``hipTexRefSetBorderColor``
+* ``hipTexRefSetMipmapFilterMode``
+* ``hipTexRefSetMipmapLevelBias``
+* ``hipTexRefSetMipmapLevelClamp``
+* ``hipTexRefSetMipmappedArray``
diff --git a/docs/reference/kernel_language.md b/docs/reference/kernel_language.md
deleted file mode 100644
index 5d41ebffba..0000000000
--- a/docs/reference/kernel_language.md
+++ /dev/null
@@ -1,824 +0,0 @@
-# Kernel Language Syntax
-
-HIP provides a C++ syntax that is suitable for compiling most code that commonly appears in compute kernels, including classes, namespaces, operator overloading, templates and more. Additionally, it defines other language features designed specifically to target accelerators, such as the following:
-- A kernel-launch syntax that uses standard C++, resembles a function call and is portable to all HIP targets
-- Short-vector headers that can serve on a host or a device
-- Math functions resembling those in the "math.h" header included with standard C++ compilers
-- Built-in functions for accessing specific GPU hardware capabilities
-
-This section describes the built-in variables and functions accessible from the HIP kernel. It is intended for readers familiar with CUDA kernel syntax and wanting to understand how HIP is different from CUDA.
-
-Features are marked with one of the following keywords:
-- **Supported**---HIP supports the feature with a Cuda-equivalent function
-- **Not supported**---HIP does not support the feature
-- **Under development**---the feature is under development but not yet available
-
-## Function-Type Qualifiers
-### `__device__`
-Supported  `__device__` functions are
-  - Executed on the device
-  - Called from the device only
-
-The `__device__` keyword can combine with the host keyword (see {ref}`host_attr`).
-
-### `__global__`
-Supported `__global__` functions are
-  - Executed on the device
-  - Called ("launched") from the host
-
-HIP `__global__` functions must have a `void` return type, and the first parameter to a HIP `__global__` function must have the type `hipLaunchParm`. See [Kernel-Launch Example](#kernel-launch-example).
-
-HIP lacks dynamic-parallelism support, so `__global__ ` functions cannot be called from the device.
-
-### `__host__`
-Supported `__host__` functions are
-  - Executed on the host
-  - Called from the host
-
-`__host__` can combine with `__device__`, in which case the function compiles for both the host and device. These functions cannot use the HIP grid coordinate functions (for example, "threadIdx.x"). A possible workaround is to pass the necessary coordinate info as an argument to the function.
-
-`__host__` cannot combine with `__global__`.
-
-HIP parses the `__noinline__` and `__forceinline__` keywords and converts them to the appropriate Clang attributes.
-
-## Calling `__global__` Functions
-
-`__global__` functions are often referred to as *kernels,* and calling one is termed *launching the kernel.* These functions require the caller to specify an "execution configuration" that includes the grid and block dimensions. The execution configuration can also include other information for the launch, such as the amount of additional shared memory to allocate and the stream where the kernel should execute. HIP introduces a standard C++ calling convention to pass the execution configuration to the kernel in addition to the Cuda <<< >>> syntax. In HIP,
-- Kernels launch with either <<< >>> syntax or the "hipLaunchKernelGGL" function
-- The first five parameters to hipLaunchKernelGGL are the following:
-   - **symbol kernelName**: the name of the kernel to launch.  To support template kernels which contains "," use the HIP_KERNEL_NAME macro.   The hipify tools insert this automatically.
-   - **dim3 gridDim**: 3D-grid dimensions specifying the number of blocks to launch.
-   - **dim3 blockDim**: 3D-block dimensions specifying the number of threads in each block.
-   - **size_t dynamicShared**: amount of additional shared memory to allocate when launching the kernel (see [__shared__](#__shared__))
-   - **hipStream_t**: stream where the kernel should execute. A value of 0 corresponds to the NULL stream (see [Synchronization Functions](#synchronization-functions)).
-- Kernel arguments follow these first five parameters
-
-
-```
-// Example pseudo code introducing hipLaunchKernelGGL:
-__global__ MyKernel(hipLaunchParm lp, float *A, float *B, float *C, size_t N)
-{
-...
-}
-
-MyKernel<<<dim3(gridDim), dim3(groupDim), 0, 0>>> (a,b,c,n);
-// Alternatively, kernel can be launched by
-// hipLaunchKernelGGL(MyKernel, dim3(gridDim), dim3(groupDim), 0/*dynamicShared*/, 0/*stream), a, b, c, n);
-
-```
-
-The hipLaunchKernelGGL macro always starts with the five parameters specified above, followed by the kernel arguments. HIPIFY tools optionally convert Cuda launch syntax to hipLaunchKernelGGL, including conversion of optional arguments in <<< >>> to the five required hipLaunchKernelGGL parameters. The dim3 constructor accepts zero to three arguments and will by default initialize unspecified dimensions to 1. See [dim3](#dim3). The kernel uses the coordinate built-ins (thread*, block*, grid*) to determine coordinate index and coordinate bounds of the work item that's currently executing. See [Coordinate Built-Ins](#Coordinate-Built-Ins).
-
-Please note, HIP does not support kernel launch with total work items defined in dimension with size gridDim x blockDim >= 2^32.
-
-
-## Kernel-Launch Example
-```
-// Example showing device function, __device__ __host__
-// <- compile for both device and host
-float PlusOne(float x)
-{
-    return x + 1.0;
-}
-
-__global__
-void
-MyKernel (hipLaunchParm lp, /*lp parm for execution configuration */
-          const float *a, const float *b, float *c, unsigned N)
-{
-    unsigned gid = threadIdx.x; // <- coordinate index function
-    if (gid < N) {
-        c[gid] = a[gid] + PlusOne(b[gid]);
-    }
-}
-void callMyKernel()
-{
-    float *a, *b, *c; // initialization not shown...
-    unsigned N = 1000000;
-    const unsigned blockSize = 256;
-
-    MyKernel<<<dim3(gridDim), dim3(groupDim), 0, 0>>> (a,b,c,n);
-    // Alternatively, kernel can be launched by
-    // hipLaunchKernelGGL(MyKernel, dim3(N/blockSize), dim3(blockSize), 0, 0,  a,b,c,N);
-}
-```
-
-## Variable-Type Qualifiers
-
-### `__constant__`
-The `__constant__` keyword is supported. The host writes constant memory before launching the kernel; from the GPU, this memory is read-only during kernel execution. The functions for accessing constant memory (hipGetSymbolAddress(), hipGetSymbolSize(), hipMemcpyToSymbol(), hipMemcpyToSymbolAsync(), hipMemcpyFromSymbol(), hipMemcpyFromSymbolAsync()) are available.
-
-### `__shared__`
-The `__shared__` keyword is supported.
-
-`extern __shared__` allows the host to dynamically allocate shared memory and is specified as a launch parameter.
-Previously, it was essential to declare dynamic shared memory using the HIP_DYNAMIC_SHARED macro for accuracy, as using static shared memory in the same kernel could result in overlapping memory ranges and data-races.
-
-Now, the HIP-Clang compiler provides support for extern shared declarations, and the HIP_DYNAMIC_SHARED option is no longer required..
-
-### `__managed__`
-Managed memory, including the `__managed__` keyword, are supported in HIP combined host/device compilation.
-
-### `__restrict__`
-The `__restrict__` keyword tells the compiler that the associated memory pointer will not alias with any other pointer in the kernel or function.  This feature can help the compiler generate better code. In most cases, all pointer arguments must use this keyword to realize the benefit.
-
-
-## Built-In Variables
-
-### Coordinate Built-Ins
-Built-ins determine the coordinate of the active work item in the execution grid. They are defined in amd_hip_runtime.h (rather than being implicitly defined by the compiler).
-In HIP, the coordinate built-in variables are defined the same way as in CUDA, for instance:
-threadIdx.x, blockIdx.y, gridDim.y, etc.
-The products gridDim.x * blockDim.x, gridDim.y * blockDim.y and gridDim.z * blockDim.z are always less than 2^32.
-Coordinate built-ins are implemented as structures for better performance. When used with printf, they need to be explicitly cast to integer types.
-
-### warpSize
-The warpSize variable is of type int and contains the warp size (in threads) for the target device. Note that all current NVIDIA devices return 32 for this variable, and current AMD devices return 64 for gfx9 and 32 for gfx10 and above. The warpSize variable should only be used in device functions. Device code should use the warpSize built-in to develop portable wave-aware code.
-
-
-## Vector Types
-
-Note that these types are defined in hip_runtime.h and are not automatically provided by the compiler.
-
-### Short Vector Types
-Short vector types derive from the basic integer and floating-point types. They are structures defined in hip_vector_types.h. The first, second, third and fourth components of the vector are accessible through the `x`, `y`, `z` and `w` fields, respectively. All the short vector types support a constructor function of the form `make_<type_name>()`. For example, `float4 make_float4(float x, float y, float z, float w)` creates a vector of type `float4` and value `(x,y,z,w)`.
-
-HIP supports the following short vector formats:
-- Signed Integers:
-    - char1, char2, char3, char4
-    - short1, short2, short3, short4
-    - int1, int2, int3, int4
-    - long1, long2, long3, long4
-    - longlong1, longlong2, longlong3, longlong4
-- Unsigned Integers:
-    - uchar1, uchar2, uchar3, uchar4
-    - ushort1, ushort2, ushort3, ushort4
-    - uint1, uint2, uint3, uint4
-    - ulong1, ulong2, ulong3, ulong4
-    - ulonglong1, ulonglong2, ulonglong3, ulonglong4
-- Floating Points:
-    - float1, float2, float3, float4
-    - double1, double2, double3, double4
-
-### dim3
-dim3 is a three-dimensional integer vector type commonly used to specify grid and group dimensions. Unspecified dimensions are initialized to 1.
-```
-typedef struct dim3 {
-  uint32_t x;
-  uint32_t y;
-  uint32_t z;
-
-  dim3(uint32_t _x=1, uint32_t _y=1, uint32_t _z=1) : x(_x), y(_y), z(_z) {}
-} dim3;
-
-```
-
-## Memory-Fence Instructions
-HIP supports __threadfence() and __threadfence_block().
-
-HIP provides a workaround for __threadfence_system() under the HIP-Clang path.
-To enable the workaround, HIP should be built with the environment variable HIP_COHERENT_HOST_ALLOC enabled.
-In addition, the kernels that use __threadfence_system() should be modified as follows:
-- The kernel should only operate on fine-grained system memory, which should be allocated with hipHostMalloc().
-- Remove all memcpy calls for those allocated fine-grained system memory regions.
-
-## Synchronization Functions
-The __syncthreads() built-in function is supported in HIP. The __syncthreads_count(int), __syncthreads_and(int) and __syncthreads_or(int) functions are under development.  
-
-## Math Functions
-HIP-Clang supports a set of math operations callable from the device.
-
-### Single Precision Mathematical Functions
-Following is the list of supported single precision mathematical functions.
-
-| **Function** | **Supported on Host** | **Supported on Device** |
-| --- | --- | --- |
-| float acosf ( float  x ) <br><sub>Calculate the arc cosine of the input argument.</sub> | ✓ | ✓ |
-| float acoshf ( float  x ) <br><sub>Calculate the nonnegative arc hyperbolic cosine of the input argument.</sub> | ✓ | ✓ |
-| float asinf ( float  x ) <br><sub>Calculate the arc sine of the input argument.</sub> | ✓ | ✓ |
-| float asinhf ( float  x ) <br><sub>Calculate the arc hyperbolic sine of the input argument.</sub> | ✓ | ✓ |
-| float atan2f ( float  y, float  x ) <br><sub>Calculate the arc tangent of the ratio of first and second input arguments.</sub> | ✓ | ✓ |
-| float atanf ( float  x ) <br><sub>Calculate the arc tangent of the input argument.</sub> | ✓ | ✓ |
-| float atanhf ( float  x ) <br><sub>Calculate the arc hyperbolic tangent of the input argument.</sub> | ✓ | ✓ |
-| float cbrtf ( float  x ) <br><sub>Calculate the cube root of the input argument.</sub> | ✓ | ✓ |
-| float ceilf ( float  x ) <br><sub>Calculate ceiling of the input argument.</sub> | ✓ | ✓ |
-| float copysignf ( float  x, float  y ) <br><sub>Create value with given magnitude, copying sign of second value.</sub> | ✓ | ✓ |
-| float cosf ( float  x ) <br><sub>Calculate the cosine of the input argument.</sub> | ✓ | ✓ |
-| float coshf ( float  x ) <br><sub>Calculate the hyperbolic cosine of the input argument.</sub> | ✓ | ✓ |
-| float erfcf ( float  x ) <br><sub>Calculate the complementary error function of the input argument.</sub> | ✓ | ✓ |
-| float erff ( float  x ) <br><sub>Calculate the error function of the input argument.</sub> | ✓ | ✓ |
-| float exp10f ( float  x ) <br><sub>Calculate the base 10 exponential of the input argument.</sub> | ✓ | ✓ |
-| float exp2f ( float  x ) <br><sub>Calculate the base 2 exponential of the input argument.</sub> | ✓ | ✓ |
-| float expf ( float  x ) <br><sub>Calculate the base e exponential of the input argument.</sub> | ✓ | ✓ |
-| float expm1f ( float  x ) <br><sub>Calculate the base e exponential of the input argument, minus 1.</sub> | ✓ | ✓ |
-| float fabsf ( float  x ) <br><sub>Calculate the absolute value of its argument.</sub> | ✓ | ✓ |
-| float fdimf ( float  x, float  y ) <br><sub>Compute the positive difference between `x` and `y`.</sub> | ✓ | ✓ |
-| float floorf ( float  x ) <br><sub>Calculate the largest integer less than or equal to `x`.</sub> | ✓ | ✓ |
-| float fmaf ( float  x, float  y, float  z ) <br><sub>Compute `x × y + z` as a single operation.</sub> | ✓ | ✓ |
-| float fmaxf ( float  x, float  y ) <br><sub>Determine the maximum numeric value of the arguments.</sub> | ✓ | ✓ |
-| float fminf ( float  x, float  y ) <br><sub>Determine the minimum numeric value of the arguments.</sub> | ✓ | ✓ |
-| float fmodf ( float  x, float  y ) <br><sub>Calculate the floating-point remainder of `x / y`.</sub> | ✓ | ✓ |
-| float frexpf ( float  x, int* nptr ) <br><sub>Extract mantissa and exponent of a floating-point value.</sub> | ✓ | ✗ |
-| float hypotf ( float  x, float  y ) <br><sub>Calculate the square root of the sum of squares of two arguments.</sub> | ✓ | ✓ |
-| int ilogbf ( float  x ) <br><sub>Compute the unbiased integer exponent of the argument.</sub> | ✓ | ✓ |
-| __RETURN_TYPE[^f1] isfinite ( float  a ) <br><sub>Determine whether argument is finite.</sub> | ✓ | ✓ |
-| __RETURN_TYPE[^f1] isinf ( float  a ) <br><sub>Determine whether argument is infinite.</sub> | ✓ | ✓ |
-| __RETURN_TYPE[^f1] isnan ( float  a ) <br><sub>Determine whether argument is a NaN.</sub> | ✓ | ✓ |
-| float ldexpf ( float  x, int  exp ) <br><sub>Calculate the value of x ⋅ 2<sup>exp</sup>.</sub> | ✓ | ✓ |
-| float log10f ( float  x ) <br><sub>Calculate the base 10 logarithm of the input argument.</sub> | ✓ | ✓ |
-| float log1pf ( float  x ) <br><sub>Calculate the value of log<sub>e</sub>( 1 + x ).</sub> | ✓ | ✓ |
-| float logbf ( float  x ) <br><sub>Calculate the floating point representation of the exponent of the input argument.</sub> | ✓ | ✓ |
-| float log2f ( float  x ) <br><sub>Calculate the base 2 logarithm of the input argument.</sub> | ✓ | ✓ |
-| float logf ( float  x ) <br><sub>Calculate the natural logarithm of the input argument.</sub> | ✓ | ✓ |
-| float modff ( float  x, float* iptr ) <br><sub>Break down the input argument into fractional and integral parts.</sub> | ✓ | ✗ |
-| float nanf ( const char* tagp ) <br><sub>Returns "Not a Number" value.</sub> | ✗ | ✓ |
-| float nearbyintf ( float  x ) <br><sub>Round the input argument to the nearest integer.</sub> | ✓ | ✓ |
-| float powf ( float  x, float  y ) <br><sub>Calculate the value of first argument to the power of second argument.</sub> | ✓ | ✓ |
-| float remainderf ( float  x, float  y ) <br><sub>Compute single-precision floating-point remainder.</sub> | ✓ | ✓ |
-| float remquof ( float  x, float  y, int* quo ) <br><sub>Compute single-precision floating-point remainder and part of quotient.</sub> | ✓ | ✗ |
-| float roundf ( float  x ) <br><sub>Round to nearest integer value in floating-point.</sub> | ✓ | ✓ |
-| float scalbnf ( float  x, int  n ) <br><sub>Scale floating-point input by integer power of two.</sub> | ✓ | ✓ |
-| __RETURN_TYPE[^f1] signbit ( float  a ) <br><sub>Return the sign bit of the input.</sub> | ✓ | ✓ |
-| void sincosf ( float  x, float* sptr, float* cptr ) <br><sub>Calculate the sine and cosine of the first input argument.</sub> | ✓ | ✗ |
-| float sinf ( float  x ) <br><sub>Calculate the sine of the input argument.</sub> | ✓ | ✓ |
-| float sinhf ( float  x ) <br><sub>Calculate the hyperbolic sine of the input argument.</sub> | ✓ | ✓ |
-| float sqrtf ( float  x ) <br><sub>Calculate the square root of the input argument.</sub> | ✓ | ✓ |
-| float tanf ( float  x ) <br><sub>Calculate the tangent of the input argument.</sub> | ✓ | ✓ |
-| float tanhf ( float  x ) <br><sub>Calculate the hyperbolic tangent of the input argument.</sub> | ✓ | ✓ |
-| float truncf ( float  x ) <br><sub>Truncate input argument to the integral part.</sub> | ✓ | ✓ |
-| float tgammaf ( float  x ) <br><sub>Calculate the gamma function of the input argument.</sub> | ✓ | ✓ |
-| float erfcinvf ( float  y ) <br><sub>Calculate the inverse complementary error function of the input argument.</sub> | ✓ | ✓ |
-| float erfcxf ( float  x ) <br><sub>Calculate the scaled complementary error function of the input argument.</sub> | ✓ | ✓ |
-| float erfinvf ( float  y ) <br><sub>Calculate the inverse error function of the input argument.</sub> | ✓ | ✓ |
-| float fdividef ( float x, float  y ) <br><sub>Divide two floating point values.</sub> | ✓ | ✓ |
-| float frexpf ( float  x, int *nptr ) <br><sub>Extract mantissa and exponent of a floating-point value.</sub> | ✓ | ✓ |
-| float j0f ( float  x ) <br><sub>Calculate the value of the Bessel function of the first kind of order 0 for the input argument.</sub> | ✓ | ✓ |
-| float j1f ( float  x ) <br><sub>Calculate the value of the Bessel function of the first kind of order 1 for the input argument.</sub> | ✓ | ✓ |
-| float jnf ( int n, float  x ) <br><sub>Calculate the value of the Bessel function of the first kind of order n for the input argument.</sub> | ✓ | ✓ |
-| float lgammaf ( float  x ) <br><sub>Calculate the natural logarithm of the absolute value of the gamma function of the input argument.</sub> | ✓ | ✓ |
-| long long int llrintf ( float  x ) <br><sub>Round input to nearest integer value.</sub> | ✓ | ✓ |
-| long long int llroundf ( float  x ) <br><sub>Round to nearest integer value.</sub> | ✓ | ✓ |
-| long int lrintf ( float  x ) <br><sub>Round input to nearest integer value.</sub> | ✓ | ✓ |
-| long int lroundf ( float  x ) <br><sub>Round to nearest integer value.</sub> | ✓ | ✓ |
-| float modff ( float  x, float *iptr ) <br><sub>Break down the input argument into fractional and integral parts.</sub> | ✓ | ✓ |
-| float nextafterf ( float  x, float y ) <br><sub>Returns next representable single-precision floating-point value after argument.</sub> | ✓ | ✓ |
-| float norm3df ( float  a, float b, float c ) <br><sub>Calculate the square root of the sum of squares of three coordinates of the argument.</sub> | ✓ | ✓ |
-| float norm4df ( float  a, float b, float c, float d ) <br><sub>Calculate the square root of the sum of squares of four coordinates of the argument.</sub> | ✓ | ✓ |
-| float normcdff ( float  y ) <br><sub>Calculate the standard normal cumulative distribution function.</sub> | ✓ | ✓ |
-| float normcdfinvf ( float  y ) <br><sub>Calculate the inverse of the standard normal cumulative distribution function.</sub> | ✓ | ✓ |
-| float normf ( int dim, const float *a ) <br><sub>Calculate the square root of the sum of squares of any number of coordinates.</sub> | ✓ | ✓ |
-| float rcbrtf ( float x ) <br><sub>Calculate the reciprocal cube root function.</sub> | ✓ | ✓ |
-| float remquof ( float x, float y, int *quo ) <br><sub>Compute single-precision floating-point remainder and part of quotient.</sub> | ✓ | ✓ |
-| float rhypotf ( float x, float y ) <br><sub>Calculate one over the square root of the sum of squares of two arguments.</sub> | ✓ | ✓ |
-| float rintf ( float x ) <br><sub>Round input to nearest integer value in floating-point.</sub> | ✓ | ✓ |
-| float rnorm3df ( float  a, float b, float c ) <br><sub>Calculate one over the square root of the sum of squares of three coordinates of the argument.</sub> | ✓ | ✓ |
-| float rnorm4df ( float  a, float b, float c, float d ) <br><sub>Calculate one over the square root of the sum of squares of four coordinates of the argument.</sub> | ✓ | ✓ |
-| float rnormf ( int dim, const float *a ) <br><sub>Calculate the reciprocal of square root of the sum of squares of any number of coordinates.</sub> | ✓ | ✓ |
-| float scalblnf ( float x, long int n ) <br><sub>Scale floating-point input by integer power of two.</sub> | ✓ | ✓ |
-| void sincosf ( float x, float *sptr, float *cptr ) <br><sub>Calculate the sine and cosine of the first input argument.</sub> | ✓ | ✓ |
-| void sincospif ( float x, float *sptr, float *cptr ) <br><sub>Calculate the sine and cosine of the first input argument multiplied by PI.</sub> | ✓ | ✓ |
-| float y0f ( float  x ) <br><sub>Calculate the value of the Bessel function of the second kind of order 0 for the input argument.</sub> | ✓ | ✓ |
-| float y1f ( float  x ) <br><sub>Calculate the value of the Bessel function of the second kind of order 1 for the input argument.</sub> | ✓ | ✓ |
-| float ynf ( int n, float  x ) <br><sub>Calculate the value of the Bessel function of the second kind of order n for the input argument.</sub> | ✓ | ✓ |
-
-
-[^f1]: __RETURN_TYPE is dependent on compiler. It is usually 'int' for C compilers and 'bool' for C++ compilers.
-
-### Double Precision Mathematical Functions
-Following is the list of supported double precision mathematical functions.
-
-| **Function** | **Supported on Host** | **Supported on Device** |
-| --- | --- | --- |
-| double acos ( double  x ) <br><sub>Calculate the arc cosine of the input argument.</sub> | ✓ | ✓ |
-| double acosh ( double  x ) <br><sub>Calculate the nonnegative arc hyperbolic cosine of the input argument.</sub> | ✓ | ✓ |
-| double asin ( double  x ) <br><sub>Calculate the arc sine of the input argument.</sub> | ✓ | ✓ |
-| double asinh ( double  x ) <br><sub> Calculate the arc hyperbolic sine of the input argument.</sub> | ✓ | ✓ |
-| double atan ( double  x ) <br><sub>Calculate the arc tangent of the input argument.</sub> | ✓ | ✓ |
-| double atan2 ( double  y, double  x ) <br><sub>Calculate the arc tangent of the ratio of first and second input arguments.</sub> | ✓ | ✓ |
-| double atanh ( double  x ) <br><sub>Calculate the arc hyperbolic tangent of the input argument.</sub> | ✓ | ✓ |
-| double cbrt ( double  x ) <br><sub>Calculate the cube root of the input argument.</sub> | ✓ | ✓ |
-| double ceil ( double  x ) <br><sub>Calculate ceiling of the input argument.</sub> | ✓ | ✓ |
-| double copysign ( double  x, double  y ) <br><sub>Create value with given magnitude, copying sign of second value.</sub> | ✓ | ✓ |
-| double cos ( double  x ) <br><sub>Calculate the cosine of the input argument.</sub> | ✓ | ✓ |
-| double cosh ( double  x ) <br><sub>Calculate the hyperbolic cosine of the input argument.</sub> | ✓ | ✓ |
-| double erf ( double  x ) <br><sub>Calculate the error function of the input argument.</sub> | ✓ | ✓ |
-| double erfc ( double  x ) <br><sub>Calculate the complementary error function of the input argument.</sub> | ✓ | ✓ |
-| double exp ( double  x ) <br><sub>Calculate the base e exponential of the input argument.</sub> | ✓ | ✓ |
-| double exp10 ( double  x ) <br><sub>Calculate the base 10 exponential of the input argument.</sub> | ✓ | ✓ |
-| double exp2 ( double  x ) <br><sub>Calculate the base 2 exponential of the input argument.</sub> | ✓ | ✓ |
-| double expm1 ( double  x ) <br><sub>Calculate the base e exponential of the input argument, minus 1.</sub> | ✓ | ✓ |
-| double fabs ( double  x ) <br><sub>Calculate the absolute value of the input argument.</sub> | ✓ | ✓ |
-| double fdim ( double  x, double  y ) <br><sub>Compute the positive difference between `x` and `y`.</sub> | ✓ | ✓ |
-| double floor ( double  x ) <br><sub>Calculate the largest integer less than or equal to `x`.</sub> | ✓ | ✓ |
-| double fma ( double  x, double  y, double  z ) <br><sub>Compute `x × y + z` as a single operation.</sub> | ✓ | ✓ |
-| double fmax ( double  x, double  y ) <br><sub>Determine the maximum numeric value of the arguments.</sub> | ✓ | ✓ |
-| double fmin ( double  x, double  y ) <br><sub>Determine the minimum numeric value of the arguments.</sub> | ✓ | ✓ |
-| double fmod ( double  x, double  y ) <br><sub>Calculate the floating-point remainder of `x / y`.</sub> | ✓ | ✓ |
-| double frexp ( double  x, int* nptr ) <br><sub>Extract mantissa and exponent of a floating-point value.</sub> | ✓ | ✗ |
-| double hypot ( double  x, double  y ) <br><sub>Calculate the square root of the sum of squares of two arguments.</sub> | ✓ | ✓ |
-| int ilogb ( double  x ) <br><sub>Compute the unbiased integer exponent of the argument.</sub> | ✓ | ✓ |
-| __RETURN_TYPE[^f1] isfinite ( double  a ) <br><sub>Determine whether argument is finite.</sub> | ✓ | ✓ |
-| __RETURN_TYPE[^f1] isinf ( double  a ) <br><sub>Determine whether argument is infinite.</sub> | ✓ | ✓ |
-| __RETURN_TYPE[^f1] isnan ( double  a ) <br><sub>Determine whether argument is a NaN.</sub> | ✓ | ✓ |
-| double ldexp ( double  x, int  exp ) <br><sub>Calculate the value of x ⋅ 2<sup>exp</sup>.</sub> | ✓ | ✓ |
-| double log ( double  x ) <br><sub>Calculate the base e logarithm of the input argument.</sub> | ✓ | ✓ |
-| double log10 ( double  x ) <br><sub>Calculate the base 10 logarithm of the input argument.</sub> | ✓ | ✓ |
-| double log1p ( double  x ) <br><sub>Calculate the value of log<sub>e</sub>( 1 + x ).</sub> | ✓ | ✓ |
-| double log2 ( double  x ) <br><sub>Calculate the base 2 logarithm of the input argument.</sub> | ✓ | ✓ |
-| double logb ( double  x ) <br><sub>Calculate the floating point representation of the exponent of the input argument.</sub> | ✓ | ✓ |
-| double modf ( double  x, double* iptr ) <br><sub>Break down the input argument into fractional and integral parts.</sub> | ✓ | ✗ |
-| double nan ( const char* tagp ) <br><sub>Returns "Not a Number" value.</sub> | ✗ | ✓ |
-| double nearbyint ( double  x ) <br><sub>Round the input argument to the nearest integer.</sub> | ✓ | ✓ |
-| double pow ( double  x, double  y ) <br><sub>Calculate the value of first argument to the power of second argument.</sub> | ✓ | ✓ |
-| double remainder ( double  x, double  y ) <br><sub>Compute double-precision floating-point remainder.</sub> | ✓ | ✓ |
-| double remquo ( double  x, double  y, int* quo ) <br><sub>Compute double-precision floating-point remainder and part of quotient.</sub> | ✓ | ✗ |
-| double round ( double  x ) <br><sub>Round to nearest integer value in floating-point.</sub> | ✓ | ✓ |
-| double scalbn ( double  x, int  n ) <br><sub>Scale floating-point input by integer power of two.</sub> | ✓ | ✓ |
-| __RETURN_TYPE[^f1] signbit ( double  a ) <br><sub>Return the sign bit of the input.</sub> | ✓ | ✓ |
-| double sin ( double  x ) <br><sub>Calculate the sine of the input argument.</sub> | ✓ | ✓ |
-| void sincos ( double  x, double* sptr, double* cptr ) <br><sub>Calculate the sine and cosine of the first input argument.</sub> | ✓ | ✗ |
-| double sinh ( double  x ) <br><sub>Calculate the hyperbolic sine of the input argument.</sub> | ✓ | ✓ |
-| double sqrt ( double  x ) <br><sub>Calculate the square root of the input argument.</sub> | ✓ | ✓ |
-| double tan ( double  x ) <br><sub>Calculate the tangent of the input argument.</sub> | ✓ | ✓ |
-| double tanh ( double  x ) <br><sub>Calculate the hyperbolic tangent of the input argument.</sub> | ✓ | ✓ |
-| double tgamma ( double  x ) <br><sub>Calculate the gamma function of the input argument.</sub> | ✓ | ✓ |
-| double trunc ( double  x ) <br><sub>Truncate input argument to the integral part.</sub> | ✓ | ✓ |
-| double erfcinv ( double  y ) <br><sub>Calculate the inverse complementary error function of the input argument.</sub> | ✓ | ✓ |
-| double erfcx ( double  x ) <br><sub>Calculate the scaled complementary error function of the input argument.</sub> | ✓ | ✓ |
-| double erfinv ( double  y ) <br><sub>Calculate the inverse error function of the input argument.</sub> | ✓ | ✓ |
-| double frexp ( double  x, int *nptr ) <br><sub>Extract mantissa and exponent of a floating-point value.</sub> | ✓ | ✓ |
-| double j0 ( double  x ) <br><sub>Calculate the value of the Bessel function of the first kind of order 0 for the input argument.</sub> | ✓ | ✓ |
-| double j1 ( double  x ) <br><sub>Calculate the value of the Bessel function of the first kind of order 1 for the input argument.</sub> | ✓ | ✓ |
-| double jn ( int n, double  x ) <br><sub>Calculate the value of the Bessel function of the first kind of order n for the input argument.</sub> | ✓ | ✓ |
-| double lgamma ( double  x ) <br><sub>Calculate the natural logarithm of the absolute value of the gamma function of the input argument.</sub> | ✓ | ✓ |
-| long long int llrint ( double  x ) <br><sub>Round input to nearest integer value.</sub> | ✓ | ✓ |
-| long long int llround ( double  x ) <br><sub>Round to nearest integer value.</sub> | ✓ | ✓ |
-| long int lrint ( double  x ) <br><sub>Round input to nearest integer value.</sub> | ✓ | ✓ |
-| long int lround ( double  x ) <br><sub>Round to nearest integer value.</sub> | ✓ | ✓ |
-| double modf ( double  x, double *iptr ) <br><sub>Break down the input argument into fractional and integral parts.</sub> | ✓ | ✓ |
-| double nextafter ( double  x, double y ) <br><sub>Returns next representable double-precision floating-point value after argument.</sub> | ✓ | ✓ |
-| double norm3d ( double  a, double b, double c ) <br><sub>Calculate the square root of the sum of squares of three coordinates of the argument.</sub> | ✓ | ✓ |
-| double norm4d ( double  a, double b, double c, double d ) <br><sub>Calculate the square root of the sum of squares of four coordinates of the argument.</sub> | ✓ | ✓ |
-| double normcdf ( double  y ) <br><sub>Calculate the standard normal cumulative distribution function.</sub> | ✓ | ✓ |
-| double normcdfinv ( double  y ) <br><sub>Calculate the inverse of the standard normal cumulative distribution function.</sub> | ✓ | ✓ |
-| double rcbrt ( double x ) <br><sub>Calculate the reciprocal cube root function.</sub> | ✓ | ✓ |
-| double remquo ( double x, double y, int *quo ) <br><sub>Compute double-precision floating-point remainder and part of quotient.</sub> | ✓ | ✓ |
-| double rhypot ( double x, double y ) <br><sub>Calculate one over the square root of the sum of squares of two arguments.</sub> | ✓ | ✓ |
-| double rint ( double x ) <br><sub>Round input to nearest integer value in floating-point.</sub> | ✓ | ✓ |
-| double rnorm3d ( double a, double b, double c ) <br><sub>Calculate one over the square root of the sum of squares of three coordinates of the argument.</sub> | ✓ | ✓ |
-| double rnorm4d ( double a, double b, double c, double d ) <br><sub>Calculate one over the square root of the sum of squares of four coordinates of the argument.</sub> | ✓ | ✓ |
-| double rnorm ( int dim, const double *a ) <br><sub>Calculate the reciprocal of square root of the sum of squares of any number of coordinates.</sub> | ✓ | ✓ |
-| double scalbln ( double x, long int n ) <br><sub>Scale floating-point input by integer power of two.</sub> | ✓ | ✓ |
-| void sincos ( double x, double *sptr, double *cptr ) <br><sub>Calculate the sine and cosine of the first input argument.</sub> | ✓ | ✓ |
-| void sincospi ( double x, double *sptr, double *cptr ) <br><sub>Calculate the sine and cosine of the first input argument multiplied by PI.</sub> | ✓ | ✓ |
-| double y0 ( double  x ) <br><sub>Calculate the value of the Bessel function of the second kind of order 0 for the input argument.</sub> | ✓ | ✓ |
-| double y1 ( double  x ) <br><sub>Calculate the value of the Bessel function of the second kind of order 1 for the input argument.</sub> | ✓ | ✓ |
-| double yn ( int n, double  x ) <br><sub>Calculate the value of the Bessel function of the second kind of order n for the input argument.</sub> | ✓ | ✓ |
-
-### Integer Intrinsics
-Following is the list of supported integer intrinsics. Note that intrinsics are supported on device only.
-
-| **Function** |
-| --- |
-| unsigned int __brev ( unsigned int x ) <br><sub>Reverse the bit order of a 32 bit unsigned integer.</sub> |
-| unsigned long long int __brevll ( unsigned long long int x ) <br><sub>Reverse the bit order of a 64 bit unsigned integer. </sub> |
-| int __clz ( int  x ) <br><sub>Return the number of consecutive high-order zero bits in a 32 bit integer.</sub> |
-| unsigned int __clz(unsigned int x) <br><sub>Return the number of consecutive high-order zero bits in 32 bit unsigned integer.</sub> |
-| int __clzll ( long long int x ) <br><sub>Count the number of consecutive high-order zero bits in a 64 bit integer.</sub> |
-| unsigned int __clzll(long long int x) <br><sub>Return the number of consecutive high-order zero bits in 64 bit signed integer.</sub> |
-| unsigned int __ffs(unsigned int x) <br><sub>Find the position of least significant bit set to 1 in a 32 bit unsigned integer.[^f3]</sub> |
-| unsigned int __ffs(int x) <br><sub>Find the position of least significant bit set to 1 in a 32 bit signed integer.</sub> |
-| unsigned int __ffsll(unsigned long long int x) <br><sub>Find the position of least significant bit set to 1 in a 64 bit unsigned integer.[^f3]</sub> |
-| unsigned int __ffsll(long long int x) <br><sub>Find the position of least significant bit set to 1 in a 64 bit signed integer.</sub> |
-| unsigned int __popc ( unsigned int x ) <br><sub>Count the number of bits that are set to 1 in a 32 bit integer.</sub> |
-| unsigned int __popcll ( unsigned long long int x )<br><sub>Count the number of bits that are set to 1 in a 64 bit integer.</sub> |
-| int __mul24 ( int x, int y )<br><sub>Multiply two 24bit integers.</sub> |
-| unsigned int __umul24 ( unsigned int x, unsigned int y )<br><sub>Multiply two 24bit unsigned integers.</sub> |
-
-[^f3]: The HIP-Clang implementation of __ffs() and __ffsll() contains code to add a constant +1 to produce the ffs result format.
-For cases where this overhead is not acceptable and the programmer is willing to specialize for the platform,
-HIP-Clang provides __lastbit_u32_u32(unsigned int input) and __lastbit_u32_u64(unsigned long long int input).
-The index returned by __lastbit_ instructions starts at -1, while for ffs the index starts at 0.
-
-### Floating-point Intrinsics
-Following is the list of supported floating-point intrinsics. Note that intrinsics are supported on device only.
-
-| **Function** |
-| --- |
-| float __cosf ( float  x ) <br><sub>Calculate the fast approximate cosine of the input argument.</sub> |
-| float __expf ( float  x ) <br><sub>Calculate the fast approximate base e exponential of the input argument.</sub> |
-| float __frsqrt_rn ( float  x ) <br><sub>Compute `1 / √x` in round-to-nearest-even mode.</sub> |
-| float __fsqrt_rn ( float  x ) <br><sub>Compute `√x` in round-to-nearest-even mode.</sub> |
-| float __log10f ( float  x ) <br><sub>Calculate the fast approximate base 10 logarithm of the input argument.</sub> |
-| float __log2f ( float  x ) <br><sub>Calculate the fast approximate base 2 logarithm of the input argument.</sub> |
-| float __logf ( float  x ) <br><sub>Calculate the fast approximate base e logarithm of the input argument.</sub> |
-| float __powf ( float  x, float  y ) <br><sub>Calculate the fast approximate of x<sup>y</sup>.</sub> |
-| float __sinf ( float  x ) <br><sub>Calculate the fast approximate sine of the input argument.</sub> |
-| float __tanf ( float  x ) <br><sub>Calculate the fast approximate tangent of the input argument.</sub> |
-| double __dsqrt_rn ( double  x ) <br><sub>Compute `√x` in round-to-nearest-even mode.</sub> |
-
-## Texture Functions
-The supported texture functions are listed in the header files [texture_fetch_functions.h](https://github.com/ROCm-Developer-Tools/HIP/blob/main/include/hip/amd_detail/texture_fetch_functions.h) and [texture_indirect_functions.h](https://github.com/ROCm-Developer-Tools/HIP/blob/main/include/hip/amd_detail/texture_indirect_functions.h).
-
-Texture functions are not supported on some devices.
-In device code, checking whether the macro __HIP_NO_IMAGE_SUPPORT == 1 shows whether texture functions are not supported.
-In host runtime code, the attribute hipDeviceAttributeImageSupport can be queried to check whether texture functions are supported.
-
-## Surface Functions
-Surface functions are not supported.
-
-## Timer Functions
-HIP provides the following built-in functions for reading a high-resolution timer from the device.
-```
-clock_t clock()
-long long int clock64()
-```
-Returns the value of a counter that is incremented every clock cycle on the device. The difference between two returned values gives the number of cycles elapsed.
-
-```
-long long int wall_clock64()
-```
-Returns the wall clock count at a constant frequency on the device. The frequency can be queried in HIP application code through the hipDeviceAttributeWallClockRate device attribute, for example:
-```
-int wallClkRate = 0; //in kilohertz
-HIPCHECK(hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId));
-```
-Here, hipDeviceAttributeWallClockRate is a device attribute.
-Note that the wall clock frequency is a per-device attribute.
-
-
-## Atomic Functions
-
-Atomic functions execute as read-modify-write operations on values residing in global or shared memory. No other device or thread can observe or modify the memory location during an atomic operation. If multiple instructions from different devices or threads target the same memory location, the instructions are serialized in an undefined order.
-
-HIP adds new APIs with the `_system` suffix to support system-scope atomic operations. For example, the `atomicAnd` function is atomic and coherent within the GPU device executing the function, while `atomicAnd_system` extends the atomic operation to system scope, from the GPU device to other CPUs and GPU devices in the system.
-
-HIP supports the following atomic operations.
-
-| **Function**                                                                                                         |  **Supported in HIP** |  **Supported in CUDA** |
-| -------------------------------------------------------------------------------------------------------------------- | --------------------- | ---------------------- |
-| int atomicAdd(int* address, int val)                                                                                 |  ✓                    |  ✓                     |
-| int atomicAdd_system(int* address, int val)                                                                          |  ✓                    |  ✓                     |
-| unsigned int atomicAdd(unsigned int* address,unsigned int val)                                                       |  ✓                    |  ✓                     |
-| unsigned int atomicAdd_system(unsigned int* address, unsigned int val)                                               |  ✓                    |  ✓                     |
-| unsigned long long atomicAdd(unsigned long long* address,unsigned long long val)                                     |  ✓                    |  ✓                     |
-| unsigned long long atomicAdd_system(unsigned long long* address, unsigned long long val)                             |  ✓                    |  ✓                     |
-| float atomicAdd(float* address, float val)                                                                           |  ✓                    |  ✓                     |
-| float atomicAdd_system(float* address, float val)                                                                    |  ✓                    |  ✓                     |
-| double atomicAdd(double* address, double val)                                                                        |  ✓                    |  ✓                     |
-| double atomicAdd_system(double* address, double val)                                                                 |  ✓                    |  ✓                     |
-| float unsafeAtomicAdd(float* address, float val)                                                                     |  ✓                    |  ✗                     |
-| float safeAtomicAdd(float* address, float val)                                                                       |  ✓                    |  ✗                     |
-| double unsafeAtomicAdd(double* address, double val)                                                                  |  ✓                    |  ✗                     |
-| double safeAtomicAdd(double* address, double val)                                                                    |  ✓                    |  ✗                     |
-| int atomicSub(int* address, int val)                                                                                 |  ✓                    |  ✓                     |
-| int atomicSub_system(int* address, int val)                                                                          |  ✓                    |  ✓                     |
-| unsigned int atomicSub(unsigned int* address,unsigned int val)                                                       |  ✓                    |  ✓                     |
-| unsigned int atomicSub_system(unsigned int* address, unsigned int val)                                               |  ✓                    |  ✓                     |
-| int atomicExch(int* address, int val)                                                                                |  ✓                    |  ✓                     |
-| int atomicExch_system(int* address, int val)                                                                         |  ✓                    |  ✓                     |
-| unsigned int atomicExch(unsigned int* address,unsigned int val)                                                      |  ✓                    |  ✓                     |
-| unsigned int atomicExch_system(unsigned int* address, unsigned int val)                                              |  ✓                    |  ✓                     |
-| unsigned long long atomicExch(unsigned long long int* address,unsigned long long int val)                            |  ✓                    |  ✓                     |
-| unsigned long long atomicExch_system(unsigned long long* address, unsigned long long val)                            |  ✓                    |  ✓                     |
-| unsigned long long atomicExch_system(unsigned long long* address, unsigned long long val)                            |  ✓                    |  ✓                     |
-| float atomicExch(float* address, float val)                                                                          |  ✓                    |  ✓                     |
-| int atomicMin(int* address, int val)                                                                                 |  ✓                    |  ✓                     |
-| int atomicMin_system(int* address, int val)                                                                          |  ✓                    |  ✓                     |
-| unsigned int atomicMin(unsigned int* address,unsigned int val)                                                       |  ✓                    |  ✓                     |
-| unsigned int atomicMin_system(unsigned int* address, unsigned int val)                                               |  ✓                    |  ✓                     |
-| unsigned long long atomicMin(unsigned long long* address,unsigned long long val)                                     |  ✓                    |  ✓                     |
-| int atomicMax(int* address, int val)                                                                                 |  ✓                    |  ✓                     |
-| int atomicMax_system(int* address, int val)                                                                          |  ✓                    |  ✓                     |
-| unsigned int atomicMax(unsigned int* address,unsigned int val)                                                       |  ✓                    |  ✓                     |
-| unsigned int atomicMax_system(unsigned int* address, unsigned int val)                                               |  ✓                    |  ✓                     |
-| unsigned long long atomicMax(unsigned long long* address,unsigned long long val)                                     |  ✓                    |  ✓                     |
-| unsigned int atomicInc(unsigned int* address)                                                                        |  ✗                    |  ✓                     |
-| unsigned int atomicDec(unsigned int* address)                                                                        |  ✗                    |  ✓                     |
-| int atomicCAS(int* address, int compare, int val)                                                                    |  ✓                    |  ✓                     |
-| int atomicCAS_system(int* address, int compare, int val)                                                             |  ✓                    |  ✓                     |
-| unsigned int atomicCAS(unsigned int* address,unsigned int compare,unsigned int val)                                  |  ✓                    |  ✓                     |
-| unsigned int atomicCAS_system(unsigned int* address, unsigned int compare, unsigned int val)                         |  ✓                    |  ✓                     |
-| unsigned long long atomicCAS(unsigned long long* address,unsigned long long compare,unsigned long long val)          |  ✓                    |  ✓                     |
-| unsigned long long atomicCAS_system(unsigned long long* address, unsigned long long compare, unsigned long long val) |  ✓                    |  ✓                     |
-| int atomicAnd(int* address, int val)                                                                                 |  ✓                    |  ✓                     |
-| int atomicAnd_system(int* address, int val)                                                                          |  ✓                    |  ✓                     |
-| unsigned int atomicAnd(unsigned int* address,unsigned int val)                                                       |  ✓                    |  ✓                     |
-| unsigned int atomicAnd_system(unsigned int* address, unsigned int val)                                               |  ✓                    |  ✓                     |
-| unsigned long long atomicAnd(unsigned long long* address,unsigned long long val)                                     |  ✓                    |  ✓                     |
-| unsigned long long atomicAnd_system(unsigned long long* address, unsigned long long val)                             |  ✓                    |  ✓                     |
-| int atomicOr(int* address, int val)                                                                                  |  ✓                    |  ✓                     |
-| int atomicOr_system(int* address, int val)                                                                           |  ✓                    |  ✓                     |
-| unsigned int atomicOr(unsigned int* address,unsigned int val)                                                        |  ✓                    |  ✓                     |
-| unsigned int atomicOr_system(unsigned int* address, unsigned int val)                                                |  ✓                    |  ✓                     |
-| unsigned int atomicOr_system(unsigned int* address, unsigned int val)                                                |  ✓                    |  ✓                     |
-| unsigned long long atomicOr(unsigned long long int* address,unsigned long long val)                                  |  ✓                    |  ✓                     |
-| unsigned long long atomicOr_system(unsigned long long* address, unsigned long long val)                              |  ✓                    |  ✓                     |
-| int atomicXor(int* address, int val)                                                                                 |  ✓                    |  ✓                     |
-| int atomicXor_system(int* address, int val)                                                                          |  ✓                    |  ✓                     |
-| unsigned int atomicXor(unsigned int* address,unsigned int val)                                                       |  ✓                    |  ✓                     |
-| unsigned int atomicXor_system(unsigned int* address, unsigned int val)                                               |  ✓                    |  ✓                     |
-| unsigned long long atomicXor(unsigned long long* address,unsigned long long val)                                     |  ✓                    |  ✓                     |
-| unsigned long long atomicXor_system(unsigned long long* address, unsigned long long val)                             |  ✓                    |  ✓                     |
-
-### Unsafe Floating-Point Atomic RMW Operations
-
-Some HIP devices support fast atomic read-modify-write (RMW) operations on floating-point values.
-For example, `atomicAdd` on single- or double-precision floating-point values may generate a hardware RMW instruction that is faster than emulating the atomic operation using an atomic compare-and-swap (CAS) loop.
-
-On some devices, these fast atomic RMW instructions can produce different results when compared with the same functions implemented with atomic CAS loops.
-For example, some devices will produce incorrect answers if a fast atomic floating-point RMW instruction targets fine-grained memory allocations.
-As another example, some devices will use different rounding or denormal modes when using fast atomic floating-point RMW instructions.
-
-As such, the HIP-Clang compiler offers a compile-time option for users to choose whether their code will use the fast, potentially unsafe, atomic instructions.
-On devices that support these fast, but unsafe, floating-point atomic RMW instructions, the compiler option `-munsafe-fp-atomics` will allow the compiler to generate them when it sees appropriate atomic RMW function calls.
-By passing the `-munsafe-fp-atomics` flag to the compiler, the user is indicating that all floating-point atomic function calls are allowed to use an unsafe version if one exists.
-For instance, on some devices, this flag indicates to the compiler that no floating-point `atomicAdd` function targets fine-grained memory.
-
-If the user instead compiles with `-mno-unsafe-fp-atomics`, the user is telling the compiler to never use a floating-point atomic RMW that may not be safe.
-The compiler will default to not producing unsafe floating-point atomic RMW instructions, so the `-mno-unsafe-fp-atomics` compilation option is not strictly necessary.
-Explicitly passing this flag to the compiler is good practice, however.
-
-When either of the two options described above, `-munsafe-fp-atomics` or `-mno-unsafe-fp-atomics`, is passed on the compiler's command line, it applies globally to that entire compilation.
-If only a subset of the atomic RMW function calls could safely use the faster floating-point atomic RMW instructions, the developer would instead need to compile with `-mno-unsafe-fp-atomics` in order to ensure the remaining atomic RMW function calls produce correct results.
-Towards this end, HIP has four extra functions to help developers more precisely control which floating-point atomic RMW functions produce unsafe atomic RMW instructions:
-
-- `float unsafeAtomicAdd(float* address, float val)`
-- `double unsafeAtomicAdd(double* address, double val)`
-   - These functions will always produce fast atomic RMW instructions on devices that have them, even when `-mno-unsafe-fp-atomics` is set
-
-- `float safeAtomicAdd(float* address, float val)`
-- `double safeAtomicAdd(double* address, double val)`
-   - These functions will always produce safe atomic RMW operations, even when `-munsafe-fp-atomics` is set
-
-(warp_cross_lane_functions)=
-## Warp Cross-Lane Functions
-
-Warp cross-lane functions operate across all lanes in a warp. The hardware guarantees that all warp lanes will execute in lockstep, so additional synchronization is unnecessary, and the instructions use no shared memory.
-
-Note that NVIDIA and AMD devices have different warp sizes, so portable code should use the `warpSize` built-in to query the warp size. Hipified code from the CUDA path requires careful review to ensure it doesn't assume a warp size of 32. "Wave-aware" code that assumes a warp size of 32 will run on a wave-64 machine, but it will utilize only half of the machine resources. The `warpSize` built-in should only be used in device functions, and its value depends on the GPU architecture. Users should not assume `warpSize` to be a compile-time constant. Host functions should use `hipGetDeviceProperties` to get the default warp size of a GPU device:
-
-```
-hipDeviceProp_t props;
-hipGetDeviceProperties(&props, deviceID);
-int w = props.warpSize;
-// implement portable algorithm based on w (rather than assume 32 or 64)
-```
-
-Note that assembly kernels may be built for a warp size which is different than the default warp size.
-
-### Warp Vote and Ballot Functions
-
-```
-int __all(int predicate)
-int __any(int predicate)
-uint64_t __ballot(int predicate)
-```
-
-Threads in a warp are referred to as *lanes* and are numbered from 0 to `warpSize - 1`. For these functions, each warp lane contributes a 1-bit value (the predicate), which is efficiently broadcast to all lanes in the warp. The 32-bit int predicate from each lane reduces to a 1-bit value: 0 (predicate = 0) or 1 (predicate != 0). `__any` and `__all` provide a summary view of the predicates that the warp lanes contribute:
-
-- `__any()` returns 1 if any warp lane contributes a nonzero predicate, or 0 otherwise
-- `__all()` returns 1 if all warp lanes contribute nonzero predicates, or 0 otherwise
-
-Applications can test whether the target platform supports the any/all instruction using the `hasWarpVote` device property or the HIP_ARCH_HAS_WARP_VOTE compiler define.
-
-`__ballot` provides a bit mask containing the 1-bit predicate value from each lane. The nth bit of the result contains the 1 bit contributed by the nth warp lane. Note that HIP's `__ballot` function supports a 64-bit return value (compared with Cuda's 32 bits). Code ported from Cuda should support the larger warp sizes that the HIP version of this instruction supports. Applications can test whether the target platform supports the ballot instruction using the `hasWarpBallot` device property or the HIP_ARCH_HAS_WARP_BALLOT compiler define.
-
-
-### Warp Shuffle Functions
-
-Half-float shuffles are not supported. The default width is warpSize; see [Warp Cross-Lane Functions](#warp-cross-lane-functions). Applications should not assume the warpSize is 32 or 64.
-
-```
-int   __shfl      (int var,   int srcLane, int width=warpSize);
-float __shfl      (float var, int srcLane, int width=warpSize);
-int   __shfl_up   (int var,   unsigned int delta, int width=warpSize);
-float __shfl_up   (float var, unsigned int delta, int width=warpSize);
-int   __shfl_down (int var,   unsigned int delta, int width=warpSize);
-float __shfl_down (float var, unsigned int delta, int width=warpSize);
-int   __shfl_xor  (int var,   int laneMask, int width=warpSize);
-float __shfl_xor  (float var, int laneMask, int width=warpSize);
-
-```
-
-## Cooperative Groups Functions
-
-Cooperative groups is a mechanism for forming and communicating between groups of threads at
-a granularity different than the block.  This feature was introduced in Cuda 9.
-
-HIP supports the following kernel language cooperative groups types or functions.
-
-
-| **Function** | **Supported in HIP** | **Supported in CUDA** |
-| --- | --- | --- |
-| `void thread_group.sync();` | ✓ | ✓ |
-| `unsigned thread_group.size();` | ✓ | ✓ |
-| `unsigned thread_group.thread_rank()` | ✓ | ✓ |
-| `bool thread_group.is_valid();` | ✓ | ✓ |
-| `grid_group this_grid()` | ✓ | ✓ |
-| `void grid_group.sync()` | ✓ | ✓ |
-| `unsigned grid_group.size()` | ✓ | ✓ |
-| `unsigned grid_group.thread_rank()` | ✓ | ✓ |
-| `bool grid_group.is_valid()` | ✓ | ✓ |
-| `multi_grid_group this_multi_grid()` | ✓ | ✓ |
-| `void multi_grid_group.sync()` | ✓ | ✓ |
-| `unsigned multi_grid_group.size()` | ✓ | ✓ |
-| `unsigned multi_grid_group.thread_rank()` | ✓ | ✓ |
-| `bool multi_grid_group.is_valid()` | ✓ | ✓ |
-| `unsigned multi_grid_group.num_grids()` | ✓ | ✓ |
-| `unsigned multi_grid_group.grid_rank()` | ✓ | ✓ |
-| `thread_block this_thread_block()` | ✓ | ✓ |
-| `multi_grid_group this_multi_grid()` | ✓ | ✓ |
-| `void multi_grid_group.sync()` | ✓ | ✓ |
-| `void thread_block.sync()` | ✓ | ✓ |
-| `unsigned thread_block.size()` | ✓ | ✓ |
-| `unsigned thread_block.thread_rank()` | ✓ | ✓ |
-| `bool thread_block.is_valid()` | ✓ | ✓ |
-| `dim3 thread_block.group_index()` | ✓ | ✓ |
-| `dim3 thread_block.thread_index()` | ✓ | ✓ |
-
-## Warp Matrix Functions
-
-Warp matrix functions allow a warp to cooperatively operate on small matrices
-whose elements are spread over the lanes in an unspecified manner.  This feature
-was introduced in Cuda 9.
-
-HIP does not support any of the kernel language warp matrix
-types or functions.
-
-| **Function** | **Supported in HIP** | **Supported in CUDA** |
-| --- | --- | --- |
-| `void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned lda)` | | ✓ |
-| `void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned lda, layout_t layout)` | | ✓ |
-| `void store_matrix_sync(T* mptr, fragment<...> &a,  unsigned lda, layout_t layout)` | | ✓ |
-| `void fill_fragment(fragment<...> &a, const T &value)` | | ✓ |
-| `void mma_sync(fragment<...> &d, const fragment<...> &a, const fragment<...> &b, const fragment<...> &c , bool sat)` | | ✓ |
-
-## Independent Thread Scheduling
-
-The hardware support for independent thread scheduling introduced in certain architectures
-supporting Cuda allows threads to progress independently of each other and enables
-intra-warp synchronizations that were previously not allowed.
-
-HIP does not support this type of scheduling.
-
-## Profiler Counter Function
-
-The Cuda `__prof_trigger()` instruction is not supported.
-
-## Assert
-
-The assert function is supported in HIP.
-The assert function is used for debugging: when the input expression evaluates to zero, execution is stopped.
-```
-void assert(int input)
-```
-
-There are two implementations of assert, depending on the use scenario:
-- The host version of assert, which is defined in assert.h.
-- The device version of assert, which is implemented in hip/hip_runtime.h.
-Users need to include assert.h to use assert. For assert to work in both device and host functions, users need to include "hip/hip_runtime.h".
-
-## Printf
-
-Printf function is supported in HIP.
-The following is a simple example to print information in the kernel.
-
-```
-#include <hip/hip_runtime.h>
-
-__global__ void run_printf() { printf("Hello World\n"); }
-
-int main() {
-  run_printf<<<dim3(1), dim3(1), 0, 0>>>();
-}
-```
-
-## Device-Side Dynamic Global Memory Allocation
-
-Device-side dynamic global memory allocation is under development.  HIP now includes a preliminary
-implementation of malloc and free that can be called from device functions.
-
-## `__launch_bounds__`
-
-
-GPU multiprocessors have a fixed pool of resources (primarily registers and shared memory) which are shared by the actively running warps. Using more resources can increase the IPC of the kernel but reduces the resources available for other warps and limits the number of warps that can be simultaneously running. Thus GPUs have a complex relationship between resource usage and performance.
-
-`__launch_bounds__` allows the application to provide usage hints that influence the resources (primarily registers) used by the generated code.  It is a function attribute that must be attached to a `__global__` function:
-
-```
-__global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_WARPS_PER_EXECUTION_UNIT)
-MyKernel(hipGridLaunch lp, ...)
-...
-```
-
-`__launch_bounds__` supports two parameters:
-- MAX_THREADS_PER_BLOCK - The programmer guarantees that the kernel will be launched with no more than MAX_THREADS_PER_BLOCK threads per block. (On NVCC this maps to the .maxntid PTX directive.) If no launch_bounds is specified, MAX_THREADS_PER_BLOCK is the maximum block size supported by the device (typically 1024 or larger). Specifying MAX_THREADS_PER_BLOCK less than the maximum effectively allows the compiler to use more resources than a default unconstrained compilation that supports all possible block sizes at launch time.
-The threads-per-block value is the product blockDim.x * blockDim.y * blockDim.z.
-- MIN_WARPS_PER_EXECUTION_UNIT - Directs the compiler to minimize resource usage so that the requested number of warps can be simultaneously active on a multiprocessor. Since active warps compete for the same fixed pool of resources, the compiler must reduce the resources required by each warp (primarily registers). MIN_WARPS_PER_EXECUTION_UNIT is optional and defaults to 1 if not specified. Specifying a MIN_WARPS_PER_EXECUTION_UNIT greater than the default 1 effectively constrains the compiler's resource usage.
-
-When a kernel is launched with HIP APIs, for example hipModuleLaunchKernel(), HIP validates that the kernel dimension size is not larger than the specified launch_bounds.
-If the bound is exceeded, HIP returns a launch failure. If AMD_LOG_LEVEL is set to a proper value (for details, please refer to docs/markdown/hip_logging.md), detailed information is shown in the error log message, including the
-launch parameters of kernel dim size, the launch bounds, and the name of the faulting kernel. This helps identify the faulting kernel, and the kernel dim size and launch bounds values also assist in debugging such failures.
-
-### Compiler Impact
-The compiler uses these parameters as follows:
-- The compiler uses the hints only to manage register usage, and does not automatically reduce shared memory or other resources.
-- Compilation fails if the compiler cannot generate a kernel which meets the requirements of the specified launch bounds.
-- From MAX_THREADS_PER_BLOCK, the compiler derives the maximum number of warps/block that can be used at launch time.
-Values of MAX_THREADS_PER_BLOCK less than the default allow the compiler to use a larger pool of registers: each warp uses registers, and this hint constrains the launch to a warps/block size which is less than the maximum.
-- From MIN_WARPS_PER_EXECUTION_UNIT, the compiler derives a maximum number of registers that can be used by the kernel (to meet the required number of simultaneously active blocks).
-If MIN_WARPS_PER_EXECUTION_UNIT is 1, then the kernel can use all registers supported by the multiprocessor.
-- The compiler ensures that the number of registers used in the kernel is less than both allowed maximums, typically by spilling registers (to shared or global memory), or by using more instructions.
-- The compiler may use heuristics to increase register usage, or may simply be able to avoid spilling. MAX_THREADS_PER_BLOCK is particularly useful in these cases, since it allows the compiler to use more registers and avoid situations where the compiler constrains the register usage (potentially spilling) to meet the requirements of a large block size that is never used at launch time.
-
-
-### CU and EU Definitions
-A compute unit (CU) is responsible for executing the waves of a work-group. It is composed of one or more execution units (EU) which are responsible for executing waves. An EU can have enough resources to maintain the state of more than one executing wave. This allows an EU to hide latency by switching between waves in a similar way to simultaneous multithreading on a CPU. In order to allow the state for multiple waves to fit on an EU, the resources used by a single wave have to be limited. Limiting such resources can allow greater latency hiding, but can result in having to spill some register state to memory. This attribute allows an advanced developer to tune the number of waves that are capable of fitting within the resources of an EU. It can be used to ensure at least a certain number will fit to help hide latency, and can also be used to ensure no more than a certain number will fit to limit cache thrashing.
-
-### Porting from CUDA __launch_bounds
-CUDA defines a __launch_bounds which is also designed to control occupancy:
-```
-__launch_bounds(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)
-```
-
-- The second `__launch_bounds` parameter must be converted to the format used by `__hip_launch_bounds`, which uses warps and execution units rather than blocks and multiprocessors (this conversion is performed automatically by hipify tools).
-```
-MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK) / 32
-```
-
-The key differences in the interface are:
-- Warps (rather than blocks):
-The developer is trying to tell the compiler to control resource utilization to guarantee some amount of active Warps/EU for latency hiding.  Specifying active warps in terms of blocks appears to hide the micro-architectural details of the warp size, but makes the interface more confusing since the developer ultimately needs to compute the number of warps to obtain the desired level of control.
-- Execution Units  (rather than multiProcessor):
-The use of execution units rather than multiprocessors provides support for architectures with multiple execution units/multi-processor. For example, the AMD GCN architecture has 4 execution units per multiProcessor.  The hipDeviceProps has a field executionUnitsPerMultiprocessor.
-Platform-specific coding techniques such as #ifdef can be used to specify different launch_bounds for NVCC and HIP-Clang platforms, if desired.
-
-
-### maxregcount
-Unlike nvcc, HIP-Clang does not support the "--maxregcount" option.  Instead, users are encouraged to use the hip_launch_bounds directive since the parameters are more intuitive and portable than
-micro-architecture details like registers, and also the directive allows per-kernel control rather than an entire file.  hip_launch_bounds works on both HIP-Clang and nvcc targets.
-
-
-## Register Keyword
-The register keyword is deprecated in C++, and is silently ignored by both nvcc and HIP-Clang.  You can pass the option `-Wdeprecated-register` to make the compiler emit a warning message.
-
-## Pragma Unroll
-
-Unrolling with a bound that is known at compile time is supported.  For example:
-
-```
-#pragma unroll 16 /* hint to compiler to unroll next loop by 16 */
-for (int i=0; i<16; i++) ...
-```
-
-```
-#pragma unroll 1  /* tell compiler to never unroll the loop */
-for (int i=0; i<16; i++) ...
-```
-
-
-```
-#pragma unroll /* hint to compiler to completely unroll next loop. */
-for (int i=0; i<16; i++) ...
-```
-
-
-## In-Line Assembly
-
-GCN ISA in-line assembly is supported. For example:
-
-```
-asm volatile ("v_mac_f32_e32 %0, %2, %3" : "=v" (out[i]) : "0"(out[i]), "v" (a), "v" (in[i]));
-```
-
-The GCN ISA is inserted into the kernel using the `asm()` assembler statement.
-The `volatile` keyword is used so that the optimizers do not change the number of volatile operations or change their order of execution relative to other volatile operations.
-`v_mac_f32_e32` is the GCN instruction. For more information, refer to the [AMD GCN3 ISA architecture manual](http://gpuopen.com/compute-product/amd-gcn3-isa-architecture-manual/).
-Each operand is referenced by `%` followed by its position in the ordered list of operands.
-`"v"` is the constraint code (target-specific for AMDGPU) for a 32-bit VGPR register. For more information, refer to the [Supported Constraint Code List for AMDGPU](https://llvm.org/docs/LangRef.html#supported-constraint-code-list).
-Output constraints are specified by an `"="` prefix as shown above ("=v"). This indicates that the assembly will write to this operand, and the operand will then be made available as a return value of the asm expression. Input constraints do not have a prefix, just the constraint code. The constraint string `"0"` says to use the register assigned to the output as an input as well (it being the 0th constraint).
-
-## C++ Support
-The following C++ features are not supported:
-- Run-time-type information (RTTI)
-- Try/catch
-- Virtual functions
-Virtual functions are not supported if objects containing virtual function tables are passed between GPUs of different offload architectures, e.g. between gfx906 and gfx1030. Otherwise virtual functions are supported.
-
-## Kernel Compilation
-hipcc now supports compiling C++/HIP kernels to binary code objects.
-The file format for the binary is `.co`, which stands for Code Object. The following command builds the code object using `hipcc`.
-
-`hipcc --genco --offload-arch=[TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]`
-
-```
-[TARGET GPU] = GPU architecture
-[INPUT FILE] = Name of the file containing kernels
-[OUTPUT FILE] = Name of the generated code object file
-```
-
-Note: When using binary code objects, the number of arguments to the kernel is different on the HIP-Clang and NVCC paths. Refer to the sample in samples/0_Intro/module_api for differences in the arguments to be passed to the kernel.
-
-## gfx-arch-specific-kernel
-Clang-defined `__gfx*__` macros can be used to execute gfx-architecture-specific code inside the kernel. Refer to the sample 14_gpu_arch in samples/2_Cookbook.
diff --git a/docs/reference/math_api.md b/docs/reference/math_api.md
deleted file mode 100644
index 74d20384c3..0000000000
--- a/docs/reference/math_api.md
+++ /dev/null
@@ -1,3846 +0,0 @@
-# HIP MATH APIs Documentation
-HIP supports most of the device functions supported by CUDA. To find out whether a function is unsupported, search for the function and check its description.
-Note: This document is not human generated. Any changes to this file will be discarded. Please make changes to Python3 script docs/markdown/device_md_gen.py
-
-## For Developers
-If you add or fix a device function, make sure to add a signature of the function first and its definition later.
-For example, suppose you want to add `__device__ float __dotf(float4, float4)`, which computes a dot product of two 4-component float vectors.
-The way to add it to the header is:
-```cpp
-__device__ static float __dotf(float4, float4);
-/*Way down in the file....*/
-__device__ static inline float __dotf(float4 x, float4 y) {
- /*implementation*/
-}
-```
-
-This helps the Python script add the newly declared device function to the markdown documentation (it looks for functions with `;` at the end and `__device__` at the beginning).
-
-The next step is to add a description to the `deviceFuncDesc` dictionary in the Python script.
-For the example above, it can be written as:
-`deviceFuncDesc['__dotf'] = 'This function takes two 4-component float vectors and outputs their dot product'`
-
-### acosf
-```cpp
-__device__ float acosf(float x);
-
-```
-**Description:**  Returns the arc cosine of a floating-point input as a floating-point value.
-
-
-### acoshf
-```cpp
-__device__ float acoshf(float x);
-
-```
-**Description:**  Supported
-
-
-### asinf
-```cpp
-__device__ float asinf(float x);
-
-```
-**Description:**  Supported
-
-
-### asinhf
-```cpp
-__device__ float asinhf(float x);
-
-```
-**Description:**  Supported
-
-
-### atan2f
-```cpp
-__device__ float atan2f(float y, float x);
-
-```
-**Description:**  Supported
-
-
-### atanf
-```cpp
-__device__ float atanf(float x);
-
-```
-**Description:**  Supported
-
-
-### atanhf
-```cpp
-__device__ float atanhf(float x);
-
-```
-**Description:**  Supported
-
-
-### cbrtf
-```cpp
-__device__ float cbrtf(float x);
-
-```
-**Description:**  Supported
-
-
-### ceilf
-```cpp
-__device__ float ceilf(float x);
-
-```
-**Description:**  Supported
-
-
-### copysignf
-```cpp
-__device__ float copysignf(float x, float y);
-
-```
-**Description:**  Supported
-
-
-### cosf
-```cpp
-__device__ float cosf(float x);
-
-```
-**Description:**  Supported
-
-
-### coshf
-```cpp
-__device__ float coshf(float x);
-
-```
-**Description:**  Supported
-
-
-### cospif
-```cpp
-__device__ float cospif(float x);
-
-```
-**Description:**  Supported
-
-
-### cyl_bessel_i0f
-```cpp
-//__device__ float cyl_bessel_i0f(float x);
-
-```
-**Description:**  **NOT Supported**
-
-
-### cyl_bessel_i1f
-```cpp
-//__device__ float cyl_bessel_i1f(float x);
-
-```
-**Description:**  **NOT Supported**
-
-
-### erfcf
-```cpp
-__device__ float erfcf(float x);
-
-```
-**Description:**  Supported
-
-
-### erfcinvf
-```cpp
-__device__  float erfcinvf(float y);
-
-```
-**Description:**  Supported
-
-
-### erfcxf
-```cpp
-__device__ float erfcxf(float x);
-
-```
-**Description:**  Supported
-
-
-### erff
-```cpp
-__device__ float erff(float x);
-
-```
-**Description:**  Supported
-
-
-### erfinvf
-```cpp
-__device__ float erfinvf(float y);
-
-```
-**Description:**  Supported
-
-
-### exp10f
-```cpp
-__device__ float exp10f(float x);
-
-```
-**Description:**  Supported
-
-
-### exp2f
-```cpp
-__device__ float exp2f(float x);
-
-```
-**Description:**  Supported
-
-
-### expf
-```cpp
-__device__ float expf(float x);
-
-```
-**Description:**  Supported
-
-
-### expm1f
-```cpp
-__device__ float expm1f(float x);
-
-```
-**Description:**  Supported
-
-
-### fabsf
-```cpp
-__device__ float fabsf(float x);
-
-```
-**Description:**  Supported
-
-
-### fdimf
-```cpp
-__device__ float fdimf(float x, float y);
-
-```
-**Description:**  Supported
-
-
-### fdividef
-```cpp
-__device__ float fdividef(float x, float y);
-
-```
-**Description:**  Supported
-
-
-### floorf
-```cpp
-__device__ float floorf(float x);
-
-```
-**Description:**  Supported
-
-
-### fmaf
-```cpp
-__device__ float fmaf(float x, float y, float z);
-
-```
-**Description:**  Supported
-
-
-### fmaxf
-```cpp
-__device__ float fmaxf(float x, float y);
-
-```
-**Description:**  Supported
-
-
-### fminf
-```cpp
-__device__ float fminf(float x, float y);
-
-```
-**Description:**  Supported
-
-
-### fmodf
-```cpp
-__device__ float fmodf(float x, float y);
-
-```
-**Description:**  Supported
-
-
-### frexpf
-```cpp
-//__device__ float frexpf(float x, int* nptr);
-
-```
-**Description:**  **NOT Supported**
-
-
-### hypotf
-```cpp
-__device__ float hypotf(float x, float y);
-
-```
-**Description:**  Supported
-
-
-### ilogbf
-```cpp
-__device__ float ilogbf(float x);
-
-```
-**Description:**  Supported
-
-
-### isfinite
-```cpp
-__device__ int isfinite(float a);
-
-```
-**Description:**  Supported
-
-
-### isinf
-```cpp
-__device__ unsigned isinf(float a);
-
-```
-**Description:**  Supported
-
-
-### isnan
-```cpp
-__device__ unsigned isnan(float a);
-
-```
-**Description:**  Supported
-
-
-### j0f
-```cpp
-__device__ float j0f(float x);
-
-```
-**Description:**  Supported
-
-
-### j1f
-```cpp
-__device__ float j1f(float x);
-
-```
-**Description:**  Supported
-
-
-### jnf
-```cpp
-__device__ float jnf(int n, float x);
-
-```
-**Description:**  Supported
-
-
-### ldexpf
-```cpp
-__device__ float ldexpf(float x, int exp);
-
-```
-**Description:**  Supported
-
-
-### lgammaf
-```cpp
-//__device__ float lgammaf(float x);
-
-```
-**Description:**  **NOT Supported**
-
-
-### llrintf
-```cpp
-__device__ long long int llrintf(float x);
-
-```
-**Description:**  Supported
-
-
-### llroundf
-```cpp
-__device__ long long int llroundf(float x);
-
-```
-**Description:**  Supported
-
-
-### log10f
-```cpp
-__device__ float log10f(float x);
-
-```
-**Description:**  Supported
-
-
-### log1pf
-```cpp
-__device__ float log1pf(float x);
-
-```
-**Description:**  Supported
-
-
-### logbf
-```cpp
-__device__ float logbf(float x);
-
-```
-**Description:**  Supported
-
-
-### lrintf
-```cpp
-__device__ long int lrintf(float x);
-
-```
-**Description:**  Supported
-
-
-### lroundf
-```cpp
-__device__ long int lroundf(float x);
-
-```
-**Description:**  Supported
-
-
-### modff
-```cpp
-//__device__ float modff(float x, float *iptr);
-
-```
-**Description:**  **NOT Supported**
-
-
-### nanf
-```cpp
-__device__ float nanf(const char* tagp);
-
-```
-**Description:**  Supported
-
-
-### nearbyintf
-```cpp
-__device__ float nearbyintf(float x);
-
-```
-**Description:**  Supported
-
-
-### nextafterf
-```cpp
-//__device__ float nextafterf(float x, float y);
-
-```
-**Description:**  **NOT Supported**
-
-
-### norm3df
-```cpp
-__device__ float norm3df(float a, float b, float c);
-
-```
-**Description:**  Supported
-
-
-### norm4df
-```cpp
-__device__ float norm4df(float a, float b, float c, float d);
-
-```
-**Description:**  Supported
-
-
-### normcdff
-```cpp
-__device__ float normcdff(float y);
-
-```
-**Description:**  Supported
-
-
-### normcdfinvf
-```cpp
-__device__ float normcdfinvf(float y);
-
-```
-**Description:**  Supported
-
-
-### normf
-```cpp
-__device__ float normf(int dim, const float *a);
-
-```
-**Description:**  Supported
-
-
-### powf
-```cpp
-__device__ float powf(float x, float y);
-
-```
-**Description:**  Supported
-
-
-### rcbrtf
-```cpp
-__device__ float rcbrtf(float x);
-
-```
-**Description:**  Supported
-
-
-### remainderf
-```cpp
-__device__ float remainderf(float x, float y);
-
-```
-**Description:**  Supported
-
-
-### remquof
-```cpp
-__device__ float remquof(float x, float y, int *quo);
-
-```
-**Description:**  Supported
-
-
-### rhypotf
-```cpp
-__device__ float rhypotf(float x, float y);
-
-```
-**Description:**  Supported
-
-
-### rintf
-```cpp
-__device__ float rintf(float x);
-
-```
-**Description:**  Supported
-
-
-### rnorm3df
-```cpp
-__device__ float rnorm3df(float a, float b, float c);
-
-```
-**Description:**  Supported
-
-
-### rnorm4df
-```cpp
-__device__ float rnorm4df(float a, float b, float c, float d);
-
-```
-**Description:**  Supported
-
-
-### rnormf
-```cpp
-__device__ float rnormf(int dim, const float* a);
-
-```
-**Description:**  Supported
-
-
-### roundf
-```cpp
-__device__ float roundf(float x);
-
-```
-**Description:**  Supported
-
-
-### rsqrtf
-```cpp
-__device__ float rsqrtf(float x);
-
-```
-**Description:**  Supported
-
-
-### scalblnf
-```cpp
-__device__ float scalblnf(float x, long int n);
-
-```
-**Description:**  Supported
-
-
-### scalbnf
-```cpp
-__device__ float scalbnf(float x, int n);
-
-```
-**Description:**  Supported
-
-
-### signbit
-```cpp
-__device__ int signbit(float a);
-
-```
-**Description:**  Supported
-
-
-### sincosf
-```cpp
-__device__ void sincosf(float x, float *sptr, float *cptr);
-
-```
-**Description:**  Supported
-
-
-### sincospif
-```cpp
-__device__ void sincospif(float x, float *sptr, float *cptr);
-
-```
-**Description:**  Supported
-
-
-### sinf
-```cpp
-__device__ float sinf(float x);
-
-```
-**Description:**  Supported
-
-
-### sinhf
-```cpp
-__device__ float sinhf(float x);
-
-```
-**Description:**  Supported
-
-
-### sinpif
-```cpp
-__device__ float sinpif(float x);
-
-```
-**Description:**  Supported
-
-
-### sqrtf
-```cpp
-__device__ float sqrtf(float x);
-
-```
-**Description:**  Supported
-
-
-### tanf
-```cpp
-__device__ float tanf(float x);
-
-```
-**Description:**  Supported
-
-
-### tanhf
-```cpp
-__device__ float tanhf(float x);
-
-```
-**Description:**  Supported
-
-
-### tgammaf
-```cpp
-__device__ float tgammaf(float x);
-
-```
-**Description:**  Supported
-
-
-### truncf
-```cpp
-__device__ float truncf(float x);
-
-```
-**Description:**  Supported
-
-
-### y0f
-```cpp
-__device__ float y0f(float x);
-
-```
-**Description:**  Supported
-
-
-### y1f
-```cpp
-__device__ float y1f(float x);
-
-```
-**Description:**  Supported
-
-
-### ynf
-```cpp
-__device__ float ynf(int n, float x);
-
-```
-**Description:**  Supported
-
-
-### acos
-```cpp
-__device__ double acos(double x);
-
-```
-**Description:**  Supported
-
-
-### acosh
-```cpp
-__device__ double acosh(double x);
-
-```
-**Description:**  Supported
-
-
-### asin
-```cpp
-__device__ double asin(double x);
-
-```
-**Description:**  Supported
-
-
-### asinh
-```cpp
-__device__ double asinh(double x);
-
-```
-**Description:**  Supported
-
-
-### atan
-```cpp
-__device__ double atan(double x);
-
-```
-**Description:**  Supported
-
-
-### atan2
-```cpp
-__device__ double atan2(double y, double x);
-
-```
-**Description:**  Supported
-
-
-### atanh
-```cpp
-__device__ double atanh(double x);
-
-```
-**Description:**  Supported
-
-
-### cbrt
-```cpp
-__device__ double cbrt(double x);
-
-```
-**Description:**  Supported
-
-
-### ceil
-```cpp
-__device__ double ceil(double x);
-
-```
-**Description:**  Supported
-
-
-### copysign
-```cpp
-__device__ double copysign(double x, double y);
-
-```
-**Description:**  Supported
-
-
-### cos
-```cpp
-__device__ double cos(double x);
-
-```
-**Description:**  Supported
-
-
-### cosh
-```cpp
-__device__ double cosh(double x);
-
-```
-**Description:**  Supported
-
-
-### cospi
-```cpp
-__device__ double cospi(double x);
-
-```
-**Description:**  Supported
-
-
-### cyl_bessel_i0
-```cpp
-//__device__ double cyl_bessel_i0(double x);
-
-```
-**Description:**  **NOT Supported**
-
-
-### cyl_bessel_i1
-```cpp
-//__device__ double cyl_bessel_i1(double x);
-
-```
-**Description:**  **NOT Supported**
-
-
-### erf
-```cpp
-__device__ double erf(double x);
-
-```
-**Description:**  Supported
-
-
-### erfc
-```cpp
-__device__ double erfc(double x);
-
-```
-**Description:**  Supported
-
-
-### erfcinv
-```cpp
-__device__ double erfcinv(double y);
-
-```
-**Description:**  Supported
-
-
-### erfcx
-```cpp
-__device__ double erfcx(double x);
-
-```
-**Description:**  Supported
-
-
-### erfinv
-```cpp
-__device__ double erfinv(double x);
-
-```
-**Description:**  Supported
-
-
-### exp
-```cpp
-__device__ double exp(double x);
-
-```
-**Description:**  Supported
-
-
-### exp10
-```cpp
-__device__ double exp10(double x);
-
-```
-**Description:**  Supported
-
-
-### exp2
-```cpp
-__device__ double exp2(double x);
-
-```
-**Description:**  Supported
-
-
-### expm1
-```cpp
-__device__ double expm1(double x);
-
-```
-**Description:**  Supported
-
-
-### fabs
-```cpp
-__device__ double fabs(double x);
-
-```
-**Description:**  Supported
-
-
-### fdim
-```cpp
-__device__ double fdim(double x, double y);
-
-```
-**Description:**  Supported
-
-
-### floor
-```cpp
-__device__ double floor(double x);
-
-```
-**Description:**  Supported
-
-
-### fma
-```cpp
-__device__ double fma(double x, double y, double z);
-
-```
-**Description:**  Supported
-
-
-### fmax
-```cpp
-__device__ double fmax(double x, double y);
-
-```
-**Description:**  Supported
-
-
-### fmin
-```cpp
-__device__ double fmin(double x, double y);
-
-```
-**Description:**  Supported
-
-
-### fmod
-```cpp
-__device__ double fmod(double x, double y);
-
-```
-**Description:**  Supported
-
-
-### frexp
-```cpp
-//__device__ double frexp(double x, int *nptr);
-
-```
-**Description:**  **NOT Supported**
-
-
-### hypot
-```cpp
-__device__ double hypot(double x, double y);
-
-```
-**Description:**  Supported
-
-
-### ilogb
-```cpp
-__device__ double ilogb(double x);
-
-```
-**Description:**  Supported
-
-
-### isfinite
-```cpp
-__device__ int isfinite(double x);
-
-```
-**Description:**  Supported
-
-
-### isinf
-```cpp
-__device__ unsigned isinf(double x);
-
-```
-**Description:**  Supported
-
-
-### isnan
-```cpp
-__device__ unsigned isnan(double x);
-
-```
-**Description:**  Supported
-
-
-### j0
-```cpp
-__device__ double j0(double x);
-
-```
-**Description:**  Supported
-
-
-### j1
-```cpp
-__device__ double j1(double x);
-
-```
-**Description:**  Supported
-
-
-### jn
-```cpp
-__device__ double jn(int n, double x);
-
-```
-**Description:**  Supported
-
-
-### ldexp
-```cpp
-__device__ double ldexp(double x, int exp);
-
-```
-**Description:**  Supported
-
-
-### lgamma
-```cpp
-__device__ double lgamma(double x);
-
-```
-**Description:**  Supported
-
-
-### llrint
-```cpp
-__device__ long long llrint(double x);
-
-```
-**Description:**  Supported
-
-
-### llround
-```cpp
-__device__ long long llround(double x);
-
-```
-**Description:**  Supported
-
-
-### log
-```cpp
-__device__ double log(double x);
-
-```
-**Description:**  Supported
-
-
-### log10
-```cpp
-__device__ double log10(double x);
-
-```
-**Description:**  Supported
-
-
-### log1p
-```cpp
-__device__ double log1p(double x);
-
-```
-**Description:**  Supported
-
-
-### log2
-```cpp
-__device__ double log2(double x);
-
-```
-**Description:**  Supported
-
-
-### logb
-```cpp
-__device__ double logb(double x);
-
-```
-**Description:**  Supported
-
-
-### lrint
-```cpp
-__device__ long int lrint(double x);
-
-```
-**Description:**  Supported
-
-
-### lround
-```cpp
-__device__ long int lround(double x);
-
-```
-**Description:**  Supported
-
-
-### modf
-```cpp
-//__device__ double modf(double x, double *iptr);
-
-```
-**Description:**  **NOT Supported**
-
-
-### nan
-```cpp
-__device__ double nan(const char* tagp);
-
-```
-**Description:**  Supported
-
-
-### nearbyint
-```cpp
-__device__ double nearbyint(double x);
-
-```
-**Description:**  Supported
-
-
-### nextafter
-```cpp
-__device__ double nextafter(double x, double y);
-
-```
-**Description:**  Supported
-
-
-### norm
-```cpp
-__device__ double norm(int dim, const double* t);
-
-```
-**Description:**  Supported
-
-
-### norm3d
-```cpp
-__device__ double norm3d(double a, double b, double c);
-
-```
-**Description:**  Supported
-
-
-### norm4d
-```cpp
-__device__ double norm4d(double a, double b, double c, double d);
-
-```
-**Description:**  Supported
-
-
-### normcdf
-```cpp
-__device__ double normcdf(double y);
-
-```
-**Description:**  Supported
-
-
-### normcdfinv
-```cpp
-__device__ double normcdfinv(double y);
-
-```
-**Description:**  Supported
-
-
-### pow
-```cpp
-__device__ double pow(double x, double y);
-
-```
-**Description:**  Supported
-
-
-### rcbrt
-```cpp
-__device__ double rcbrt(double x);
-
-```
-**Description:**  Supported
-
-
-### remainder
-```cpp
-__device__ double remainder(double x, double y);
-
-```
-**Description:**  Supported
-
-
-### remquo
-```cpp
-//__device__ double remquo(double x, double y, int *quo);
-
-```
-**Description:**  **NOT Supported**
-
-
-### rhypot
-```cpp
-__device__ double rhypot(double x, double y);
-
-```
-**Description:**  Supported
-
-
-### rint
-```cpp
-__device__ double rint(double x);
-
-```
-**Description:**  Supported
-
-
-### rnorm
-```cpp
-__device__ double rnorm(int dim, const double* t);
-
-```
-**Description:**  Supported
-
-
-### rnorm3d
-```cpp
-__device__ double rnorm3d(double a, double b, double c);
-
-```
-**Description:**  Supported
-
-
-### rnorm4d
-```cpp
-__device__ double rnorm4d(double a, double b, double c, double d);
-
-```
-**Description:**  Supported
-
-
-### round
-```cpp
-__device__ double round(double x);
-
-```
-**Description:**  Supported
-
-
-### rsqrt
-```cpp
-__device__ double rsqrt(double x);
-
-```
-**Description:**  Supported
-
-
-### scalbln
-```cpp
-__device__ double scalbln(double x, long int n);
-
-```
-**Description:**  Supported
-
-
-### scalbn
-```cpp
-__device__ double scalbn(double x, int n);
-
-```
-**Description:**  Supported
-
-
-### signbit
-```cpp
-__device__ int signbit(double a);
-
-```
-**Description:**  Supported
-
-
-### sin
-```cpp
-__device__ double sin(double a);
-
-```
-**Description:**  Supported
-
-
-### sincos
-```cpp
-__device__ void sincos(double x, double *sptr, double *cptr);
-
-```
-**Description:**  Supported
-
-
-### sincospi
-```cpp
-__device__ void sincospi(double x, double *sptr, double *cptr);
-
-```
-**Description:**  Supported
-
-
-### sinh
-```cpp
-__device__ double sinh(double x);
-
-```
-**Description:**  Supported
-
-
-### sinpi
-```cpp
-__device__ double sinpi(double x);
-
-```
-**Description:**  Supported
-
-
-### sqrt
-```cpp
-__device__ double sqrt(double x);
-
-```
-**Description:**  Supported
-
-
-### tan
-```cpp
-__device__ double tan(double x);
-
-```
-**Description:**  Supported
-
-
-### tanh
-```cpp
-__device__ double tanh(double x);
-
-```
-**Description:**  Supported
-
-
-### tgamma
-```cpp
-__device__ double tgamma(double x);
-
-```
-**Description:**  Supported
-
-
-### trunc
-```cpp
-__device__ double trunc(double x);
-
-```
-**Description:**  Supported
-
-
-### y0
-```cpp
-__device__ double y0(double x);
-
-```
-**Description:**  Supported
-
-
-### y1
-```cpp
-__device__ double y1(double y);
-
-```
-**Description:**  Supported
-
-
-### yn
-```cpp
-__device__ double yn(int n, double x);
-
-```
-**Description:**  Supported
-
-
-### __cosf
-```cpp
-__device__  float __cosf(float x);
-
-```
-**Description:**  Supported
-
-
-### __exp10f
-```cpp
-__device__  float __exp10f(float x);
-
-```
-**Description:**  Supported
-
-
-### __expf
-```cpp
-__device__  float __expf(float x);
-
-```
-**Description:**  Supported
-
-
-### __fadd_rd
-```cpp
-__device__ static  float __fadd_rd(float x, float y);
-
-```
-**Description:**  Unsupported
-
-
-### __fadd_rn
-```cpp
-__device__ static  float __fadd_rn(float x, float y);
-
-```
-**Description:**  Supported
-
-
-### __fadd_ru
-```cpp
-__device__ static  float __fadd_ru(float x, float y);
-
-```
-**Description:**  Unsupported
-
-
-### __fadd_rz
-```cpp
-__device__ static  float __fadd_rz(float x, float y);
-
-```
-**Description:**  Unsupported
-
-
-### __fdiv_rd
-```cpp
-__device__ static  float __fdiv_rd(float x, float y);
-
-```
-**Description:**  Unsupported
-
-
-### __fdiv_rn
-```cpp
-__device__ static  float __fdiv_rn(float x, float y);
-
-```
-**Description:**  Supported
-
-
-### __fdiv_ru
-```cpp
-__device__ static  float __fdiv_ru(float x, float y);
-
-```
-**Description:**  Unsupported
-
-
-### __fdiv_rz
-```cpp
-__device__ static  float __fdiv_rz(float x, float y);
-
-```
-**Description:**  Unsupported
-
-
-### __fdividef
-```cpp
-__device__ static  float __fdividef(float x, float y);
-
-```
-**Description:**  Supported
-
-
-### __fmaf_rd
-```cpp
-__device__  float __fmaf_rd(float x, float y, float z);
-
-```
-**Description:**  Unsupported
-
-
-### __fmaf_rn
-```cpp
-__device__  float __fmaf_rn(float x, float y, float z);
-
-```
-**Description:**  Supported
-
-
-### __fmaf_ru
-```cpp
-__device__  float __fmaf_ru(float x, float y, float z);
-
-```
-**Description:**  Unsupported
-
-
-### __fmaf_rz
-```cpp
-__device__  float __fmaf_rz(float x, float y, float z);
-
-```
-**Description:**  Unsupported
-
-
-### __fmul_rd
-```cpp
-__device__ static  float __fmul_rd(float x, float y);
-
-```
-**Description:**  Unsupported
-
-
-### __fmul_rn
-```cpp
-__device__ static  float __fmul_rn(float x, float y);
-
-```
-**Description:**  Supported
-
-
-### __fmul_ru
-```cpp
-__device__ static  float __fmul_ru(float x, float y);
-
-```
-**Description:**  Unsupported
-
-
-### __fmul_rz
-```cpp
-__device__ static  float __fmul_rz(float x, float y);
-
-```
-**Description:**  Unsupported
-
-
-### __frcp_rd
-```cpp
-__device__  float __frcp_rd(float x);
-
-```
-**Description:**  Unsupported
-
-
-### __frcp_rn
-```cpp
-__device__  float __frcp_rn(float x);
-
-```
-**Description:**  Supported
-
-
-### __frcp_ru
-```cpp
-__device__  float __frcp_ru(float x);
-
-```
-**Description:**  Unsupported
-
-
-### __frcp_rz
-```cpp
-__device__  float __frcp_rz(float x);
-
-```
-**Description:**  Unsupported
-
-
-### __frsqrt_rn
-```cpp
-__device__  float __frsqrt_rn(float x);
-
-```
-**Description:**  Supported
-
-
-### __fsqrt_rd
-```cpp
-__device__  float __fsqrt_rd(float x);
-
-```
-**Description:**  Unsupported
-
-
-### __fsqrt_rn
-```cpp
-__device__  float __fsqrt_rn(float x);
-
-```
-**Description:**  Supported
-
-
-### __fsqrt_ru
-```cpp
-__device__  float __fsqrt_ru(float x);
-
-```
-**Description:**  Unsupported
-
-
-### __fsqrt_rz
-```cpp
-__device__  float __fsqrt_rz(float x);
-
-```
-**Description:**  Unsupported
-
-
-### __fsub_rd
-```cpp
-__device__ static  float __fsub_rd(float x, float y);
-
-```
-**Description:**  Unsupported
-
-
-### __fsub_rn
-```cpp
-__device__ static  float __fsub_rn(float x, float y);
-
-```
-**Description:**  Supported
-
-
-### __fsub_ru
-```cpp
-__device__ static  float __fsub_ru(float x, float y);
-
-```
-**Description:**  Unsupported
-
-
-### __fsub_rz
-```cpp
-__device__ static  float __fsub_rz(float x, float y);
-
-```
-**Description:**  Unsupported
-
-
-### __log10f
-```cpp
-__device__  float __log10f(float x);
-
-```
-**Description:**  Supported
-
-
-### __log2f
-```cpp
-__device__  float __log2f(float x);
-
-```
-**Description:**  Supported
-
-
-### __logf
-```cpp
-__device__  float __logf(float x);
-
-```
-**Description:**  Supported
-
-
-### __powf
-```cpp
-__device__  float __powf(float base, float exponent);
-
-```
-**Description:**  Supported
-
-
-### __saturatef
-```cpp
-__device__ static  float __saturatef(float x);
-
-```
-**Description:**  Supported
-
-
-### __sincosf
-```cpp
-__device__  void __sincosf(float x, float *s, float *c);
-
-```
-**Description:**  Supported
-
-
-### __sinf
-```cpp
-__device__  float __sinf(float x);
-
-```
-**Description:**  Supported
-
-
-### __tanf
-```cpp
-__device__  float __tanf(float x);
-
-```
-**Description:**  Supported
-
-
-### __dadd_rd
-```cpp
-__device__ static  double __dadd_rd(double x, double y);
-
-```
-**Description:**  Unsupported
-
-
-### __dadd_rn
-```cpp
-__device__ static  double __dadd_rn(double x, double y);
-
-```
-**Description:**  Supported
-
-
-### __dadd_ru
-```cpp
-__device__ static  double __dadd_ru(double x, double y);
-
-```
-**Description:**  Unsupported
-
-
-### __dadd_rz
-```cpp
-__device__ static  double __dadd_rz(double x, double y);
-
-```
-**Description:**  Unsupported
-
-
-### __ddiv_rd
-```cpp
-__device__ static  double __ddiv_rd(double x, double y);
-
-```
-**Description:**  Unsupported
-
-
-### __ddiv_rn
-```cpp
-__device__ static  double __ddiv_rn(double x, double y);
-
-```
-**Description:**  Supported
-
-
-### __ddiv_ru
-```cpp
-__device__ static  double __ddiv_ru(double x, double y);
-
-```
-**Description:**  Unsupported
-
-
-### __ddiv_rz
-```cpp
-__device__ static  double __ddiv_rz(double x, double y);
-
-```
-**Description:**  Unsupported
-
-
-### __dmul_rd
-```cpp
-__device__ static  double __dmul_rd(double x, double y);
-
-```
-**Description:**  Unsupported
-
-
-### __dmul_rn
-```cpp
-__device__ static  double __dmul_rn(double x, double y);
-
-```
-**Description:**  Supported
-
-
-### __dmul_ru
-```cpp
-__device__ static  double __dmul_ru(double x, double y);
-
-```
-**Description:**  Unsupported
-
-
-### __dmul_rz
-```cpp
-__device__ static  double __dmul_rz(double x, double y);
-
-```
-**Description:**  Unsupported
-
-
-### __drcp_rd
-```cpp
-__device__  double __drcp_rd(double x);
-
-```
-**Description:**  Unsupported
-
-
-### __drcp_rn
-```cpp
-__device__  double __drcp_rn(double x);
-
-```
-**Description:**  Supported
-
-
-### __drcp_ru
-```cpp
-__device__  double __drcp_ru(double x);
-
-```
-**Description:**  Unsupported
-
-
-### __drcp_rz
-```cpp
-__device__  double __drcp_rz(double x);
-
-```
-**Description:**  Unsupported
-
-
-### __dsqrt_rd
-```cpp
-__device__  double __dsqrt_rd(double x);
-
-```
-**Description:**  Unsupported
-
-
-### __dsqrt_rn
-```cpp
-__device__  double __dsqrt_rn(double x);
-
-```
-**Description:**  Supported
-
-
-### __dsqrt_ru
-```cpp
-__device__  double __dsqrt_ru(double x);
-
-```
-**Description:**  Unsupported
-
-
-### __dsqrt_rz
-```cpp
-__device__  double __dsqrt_rz(double x);
-
-```
-**Description:**  Unsupported
-
-
-### __dsub_rd
-```cpp
-__device__ static  double __dsub_rd(double x, double y);
-
-```
-**Description:**  Unsupported
-
-
-### __dsub_rn
-```cpp
-__device__ static  double __dsub_rn(double x, double y);
-
-```
-**Description:**  Supported
-
-
-### __dsub_ru
-```cpp
-__device__ static  double __dsub_ru(double x, double y);
-
-```
-**Description:**  Unsupported
-
-
-### __dsub_rz
-```cpp
-__device__ static  double __dsub_rz(double x, double y);
-
-```
-**Description:**  Unsupported
-
-
-### __fma_rd
-```cpp
-__device__  double __fma_rd(double x, double y, double z);
-
-```
-**Description:**  Unsupported
-
-
-### __fma_rn
-```cpp
-__device__  double __fma_rn(double x, double y, double z);
-
-```
-**Description:**  Supported
-
-
-### __fma_ru
-```cpp
-__device__  double __fma_ru(double x, double y, double z);
-
-```
-**Description:**  Unsupported
-
-
-### __fma_rz
-```cpp
-__device__  double __fma_rz(double x, double y, double z);
-
-```
-**Description:**  Unsupported
-
-
-### __brev
-```cpp
-__device__ unsigned int __brev( unsigned int x);
-
-```
-**Description:**  Supported
-
-
-### __brevll
-```cpp
-__device__ unsigned long long int __brevll( unsigned long long int x);
-
-```
-**Description:**  Supported
-
-
-### __byte_perm
-```cpp
-__device__ unsigned int __byte_perm(unsigned int x, unsigned int y, unsigned int s);
-
-```
-**Description:**  Supported
-
-
-### __clz
-```cpp
-__device__ unsigned int __clz(int x);
-
-```
-**Description:**  Supported
-
-
-### __clzll
-```cpp
-__device__ unsigned int __clzll(long long int x);
-
-```
-**Description:**  Supported
-
-
-### __ffs
-```cpp
-__device__ unsigned int __ffs(int x);
-
-```
-**Description:**  Supported
-
-
-### __ffsll
-```cpp
-__device__ unsigned int __ffsll(long long int x);
-
-```
-**Description:**  Supported
-
-
-### __hadd
-```cpp
-__device__ static unsigned int __hadd(int x, int y);
-
-```
-**Description:**  Supported
-
-
-### __mul24
-```cpp
-__device__ static int __mul24(int x, int y);
-
-```
-**Description:**  Supported
-
-
-### __mul64hi
-```cpp
-__device__ long long int __mul64hi(long long int x, long long int y);
-
-```
-**Description:**  Supported
-
-
-### __mulhi
-```cpp
-__device__ static int __mulhi(int x, int y);
-
-```
-**Description:**  Supported
-
-
-### __popc
-```cpp
-__device__ unsigned int __popc(unsigned int x);
-
-```
-**Description:**  Supported
-
-
-### __popcll
-```cpp
-__device__ unsigned int __popcll(unsigned long long int x);
-
-```
-**Description:**  Supported
-
-
-### __rhadd
-```cpp
-__device__ static int __rhadd(int x, int y);
-
-```
-**Description:**  Supported
-
-
-### __sad
-```cpp
-__device__ static unsigned int __sad(int x, int y, int z);
-
-```
-**Description:**  Supported
-
-
-### __uhadd
-```cpp
-__device__ static unsigned int __uhadd(unsigned int x, unsigned int y);
-
-```
-**Description:**  Supported
-
-
-### __umul24
-```cpp
-__device__ static int __umul24(unsigned int x, unsigned int y);
-
-```
-**Description:**  Supported
-
-
-### __umul64hi
-```cpp
-__device__ unsigned long long int __umul64hi(unsigned long long int x, unsigned long long int y);
-
-```
-**Description:**  Supported
-
-
-### __umulhi
-```cpp
-__device__ static unsigned int __umulhi(unsigned int x, unsigned int y);
-
-```
-**Description:**  Supported
-
-
-### __urhadd
-```cpp
-__device__ static unsigned int __urhadd(unsigned int x, unsigned int y);
-
-```
-**Description:**  Supported
-
-
-### __usad
-```cpp
-__device__ static unsigned int __usad(unsigned int x, unsigned int y, unsigned int z);
-
-```
-**Description:**  Supported
-
-
-### __double2float_rd
-```cpp
-__device__ float __double2float_rd(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2float_rn
-```cpp
-__device__ float __double2float_rn(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2float_ru
-```cpp
-__device__ float __double2float_ru(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2float_rz
-```cpp
-__device__ float __double2float_rz(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2hiint
-```cpp
-__device__ int __double2hiint(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2int_rd
-```cpp
-__device__ int __double2int_rd(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2int_rn
-```cpp
-__device__ int __double2int_rn(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2int_ru
-```cpp
-__device__ int __double2int_ru(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2int_rz
-```cpp
-__device__ int __double2int_rz(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2ll_rd
-```cpp
-__device__ long long int __double2ll_rd(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2ll_rn
-```cpp
-__device__ long long int __double2ll_rn(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2ll_ru
-```cpp
-__device__ long long int __double2ll_ru(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2ll_rz
-```cpp
-__device__ long long int __double2ll_rz(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2loint
-```cpp
-__device__ int __double2loint(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2uint_rd
-```cpp
-__device__ unsigned int __double2uint_rd(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2uint_rn
-```cpp
-__device__ unsigned int __double2uint_rn(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2uint_ru
-```cpp
-__device__ unsigned int __double2uint_ru(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2uint_rz
-```cpp
-__device__ unsigned int __double2uint_rz(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2ull_rd
-```cpp
-__device__ unsigned long long int __double2ull_rd(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2ull_rn
-```cpp
-__device__ unsigned long long int __double2ull_rn(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2ull_ru
-```cpp
-__device__ unsigned long long int __double2ull_ru(double x);
-
-```
-**Description:**  Supported
-
-
-### __double2ull_rz
-```cpp
-__device__ unsigned long long int __double2ull_rz(double x);
-
-```
-**Description:**  Supported
-
-
-### __double_as_longlong
-```cpp
-__device__ long long int __double_as_longlong(double x);
-
-```
-**Description:**  Supported
-
-
-### __float2half_rn
-```cpp
-__device__ unsigned short __float2half_rn(float x);
-
-```
-**Description:**  Supported
-
-
-### __half2float
-```cpp
-__device__ float __half2float(unsigned short);
-
-```
-**Description:**  Supported
-
-
-### __float2half_rn
-```cpp
-__device__ __half __float2half_rn(float x);
-
-```
-**Description:**  Supported
-
-
-### __half2float
-```cpp
-__device__ float __half2float(__half);
-
-```
-**Description:**  Supported
-
-
-### __float2int_rd
-```cpp
-__device__ int __float2int_rd(float x);
-
-```
-**Description:**  Supported
-
-
-### __float2int_rn
-```cpp
-__device__ int __float2int_rn(float x);
-
-```
-**Description:**  Supported
-
-
-### __float2int_ru
-```cpp
-__device__ int __float2int_ru(float x);
-
-```
-**Description:**  Supported
-
-
-### __float2int_rz
-```cpp
-__device__ int __float2int_rz(float x);
-
-```
-**Description:**  Supported
-
-
-### __float2ll_rd
-```cpp
-__device__ long long int __float2ll_rd(float x);
-
-```
-**Description:**  Supported
-
-
-### __float2ll_rn
-```cpp
-__device__ long long int __float2ll_rn(float x);
-
-```
-**Description:**  Supported
-
-
-### __float2ll_ru
-```cpp
-__device__ long long int __float2ll_ru(float x);
-
-```
-**Description:**  Supported
-
-
-### __float2ll_rz
-```cpp
-__device__ long long int __float2ll_rz(float x);
-
-```
-**Description:**  Supported
-
-
-### __float2uint_rd
-```cpp
-__device__ unsigned int __float2uint_rd(float x);
-
-```
-**Description:**  Supported
-
-
-### __float2uint_rn
-```cpp
-__device__ unsigned int __float2uint_rn(float x);
-
-```
-**Description:**  Supported
-
-
-### __float2uint_ru
-```cpp
-__device__ unsigned int __float2uint_ru(float x);
-
-```
-**Description:**  Supported
-
-
-### __float2uint_rz
-```cpp
-__device__ unsigned int __float2uint_rz(float x);
-
-```
-**Description:**  Supported
-
-
-### __float2ull_rd
-```cpp
-__device__ unsigned long long int __float2ull_rd(float x);
-
-```
-**Description:**  Supported
-
-
-### __float2ull_rn
-```cpp
-__device__ unsigned long long int __float2ull_rn(float x);
-
-```
-**Description:**  Supported
-
-
-### __float2ull_ru
-```cpp
-__device__ unsigned long long int __float2ull_ru(float x);
-
-```
-**Description:**  Supported
-
-
-### __float2ull_rz
-```cpp
-__device__ unsigned long long int __float2ull_rz(float x);
-
-```
-**Description:**  Supported
-
-
-### __float_as_int
-```cpp
-__device__ int __float_as_int(float x);
-
-```
-**Description:**  Supported
-
-
-### __float_as_uint
-```cpp
-__device__ unsigned int __float_as_uint(float x);
-
-```
-**Description:**  Supported
-
-
-### __hiloint2double
-```cpp
-__device__ double __hiloint2double(int hi, int lo);
-
-```
-**Description:**  Supported
-
-
-### __int2double_rn
-```cpp
-__device__ double __int2double_rn(int x);
-
-```
-**Description:**  Supported
-
-
-### __int2float_rd
-```cpp
-__device__ float __int2float_rd(int x);
-
-```
-**Description:**  Supported
-
-
-### __int2float_rn
-```cpp
-__device__ float __int2float_rn(int x);
-
-```
-**Description:**  Supported
-
-
-### __int2float_ru
-```cpp
-__device__ float __int2float_ru(int x);
-
-```
-**Description:**  Supported
-
-
-### __int2float_rz
-```cpp
-__device__ float __int2float_rz(int x);
-
-```
-**Description:**  Supported
-
-
-### __int_as_float
-```cpp
-__device__ float __int_as_float(int x);
-
-```
-**Description:**  Supported
-
-
-### __ll2double_rd
-```cpp
-__device__ double __ll2double_rd(long long int x);
-
-```
-**Description:**  Supported
-
-
-### __ll2double_rn
-```cpp
-__device__ double __ll2double_rn(long long int x);
-
-```
-**Description:**  Supported
-
-
-### __ll2double_ru
-```cpp
-__device__ double __ll2double_ru(long long int x);
-
-```
-**Description:**  Supported
-
-
-### __ll2double_rz
-```cpp
-__device__ double __ll2double_rz(long long int x);
-
-```
-**Description:**  Supported
-
-
-### __ll2float_rd
-```cpp
-__device__ float __ll2float_rd(long long int x);
-
-```
-**Description:**  Supported
-
-
-### __ll2float_rn
-```cpp
-__device__ float __ll2float_rn(long long int x);
-
-```
-**Description:**  Supported
-
-
-### __ll2float_ru
-```cpp
-__device__ float __ll2float_ru(long long int x);
-
-```
-**Description:**  Supported
-
-
-### __ll2float_rz
-```cpp
-__device__ float __ll2float_rz(long long int x);
-
-```
-**Description:**  Supported
-
-
-### __longlong_as_double
-```cpp
-__device__ double __longlong_as_double(long long int x);
-
-```
-**Description:**  Supported
-
-
-### __uint2double_rn
-```cpp
-__device__ double __uint2double_rn(int x);
-
-```
-**Description:**  Supported
-
-
-### __uint2float_rd
-```cpp
-__device__ float __uint2float_rd(unsigned int x);
-
-```
-**Description:**  Supported
-
-
-### __uint2float_rn
-```cpp
-__device__ float __uint2float_rn(unsigned int x);
-
-```
-**Description:**  Supported
-
-
-### __uint2float_ru
-```cpp
-__device__ float __uint2float_ru(unsigned int x);
-
-```
-**Description:**  Supported
-
-
-### __uint2float_rz
-```cpp
-__device__ float __uint2float_rz(unsigned int x);
-
-```
-**Description:**  Supported
-
-
-### __uint_as_float
-```cpp
-__device__ float __uint_as_float(unsigned int x);
-
-```
-**Description:**  Supported
-
-
-### __ull2double_rd
-```cpp
-__device__ double __ull2double_rd(unsigned long long int x);
-
-```
-**Description:**  Supported
-
-
-### __ull2double_rn
-```cpp
-__device__ double __ull2double_rn(unsigned long long int x);
-
-```
-**Description:**  Supported
-
-
-### __ull2double_ru
-```cpp
-__device__ double __ull2double_ru(unsigned long long int x);
-
-```
-**Description:**  Supported
-
-
-### __ull2double_rz
-```cpp
-__device__ double __ull2double_rz(unsigned long long int x);
-
-```
-**Description:**  Supported
-
-
-### __ull2float_rd
-```cpp
-__device__ float __ull2float_rd(unsigned long long int x);
-
-```
-**Description:**  Supported
-
-
-### __ull2float_rn
-```cpp
-__device__ float __ull2float_rn(unsigned long long int x);
-
-```
-**Description:**  Supported
-
-
-### __ull2float_ru
-```cpp
-__device__ float __ull2float_ru(unsigned long long int x);
-
-```
-**Description:**  Supported
-
-
-### __ull2float_rz
-```cpp
-__device__ float __ull2float_rz(unsigned long long int x);
-
-```
-**Description:**  Supported
-
-
-### __hadd
-```cpp
-__device__ static __half __hadd(const __half a, const __half b);
-
-```
-**Description:**  Supported
-
-
-### __hadd_sat
-```cpp
-__device__ static __half __hadd_sat(__half a, __half b);
-
-```
-**Description:**  Supported
-
-
-### __hfma
-```cpp
-__device__ static __half __hfma(__half a, __half b, __half c);
-
-```
-**Description:**  Supported
-
-
-### __hfma_sat
-```cpp
-__device__ static __half __hfma_sat(__half a, __half b, __half c);
-
-```
-**Description:**  Supported
-
-
-### __hmul
-```cpp
-__device__ static __half __hmul(__half a, __half b);
-
-```
-**Description:**  Supported
-
-
-### __hmul_sat
-```cpp
-__device__ static __half __hmul_sat(__half a, __half b);
-
-```
-**Description:**  Supported
-
-
-### __hneg
-```cpp
-__device__ static __half __hneg(__half a);
-
-```
-**Description:**  Supported
-
-
-### __hsub
-```cpp
-__device__ static __half __hsub(__half a, __half b);
-
-```
-**Description:**  Supported
-
-
-### __hsub_sat
-```cpp
-__device__ static __half __hsub_sat(__half a, __half b);
-
-```
-**Description:**  Supported
-
-
-### hdiv
-```cpp
-__device__ static __half hdiv(__half a, __half b);
-
-```
-**Description:**  Supported
-
-
-### __hadd2
-```cpp
-__device__ static __half2 __hadd2(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __hadd2_sat
-```cpp
-__device__ static __half2 __hadd2_sat(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __hfma2
-```cpp
-__device__ static __half2 __hfma2(__half2 a, __half2 b, __half2 c);
-
-```
-**Description:**  Supported
-
-
-### __hfma2_sat
-```cpp
-__device__ static __half2 __hfma2_sat(__half2 a, __half2 b, __half2 c);
-
-```
-**Description:**  Supported
-
-
-### __hmul2
-```cpp
-__device__ static __half2 __hmul2(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __hmul2_sat
-```cpp
-__device__ static __half2 __hmul2_sat(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __hsub2
-```cpp
-__device__ static __half2 __hsub2(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __hneg2
-```cpp
-__device__ static __half2 __hneg2(__half2 a);
-
-```
-**Description:**  Supported
-
-
-### __hsub2_sat
-```cpp
-__device__ static __half2 __hsub2_sat(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### h2div
-```cpp
-__device__ static __half2 h2div(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __heq
-```cpp
-__device__  bool __heq(__half a, __half b);
-
-```
-**Description:**  Supported
-
-
-### __hge
-```cpp
-__device__  bool __hge(__half a, __half b);
-
-```
-**Description:**  Supported
-
-
-### __hgt
-```cpp
-__device__  bool __hgt(__half a, __half b);
-
-```
-**Description:**  Supported
-
-
-### __hisinf
-```cpp
-__device__  bool __hisinf(__half a);
-
-```
-**Description:**  Supported
-
-
-### __hisnan
-```cpp
-__device__  bool __hisnan(__half a);
-
-```
-**Description:**  Supported
-
-
-### __hle
-```cpp
-__device__  bool __hle(__half a, __half b);
-
-```
-**Description:**  Supported
-
-
-### __hlt
-```cpp
-__device__  bool __hlt(__half a, __half b);
-
-```
-**Description:**  Supported
-
-
-### __hne
-```cpp
-__device__  bool __hne(__half a, __half b);
-
-```
-**Description:**  Supported
-
-
-### __hbeq2
-```cpp
-__device__  bool __hbeq2(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __hbge2
-```cpp
-__device__  bool __hbge2(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __hbgt2
-```cpp
-__device__  bool __hbgt2(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __hble2
-```cpp
-__device__  bool __hble2(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __hblt2
-```cpp
-__device__  bool __hblt2(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __hbne2
-```cpp
-__device__  bool __hbne2(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __heq2
-```cpp
-__device__  __half2 __heq2(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __hge2
-```cpp
-__device__  __half2 __hge2(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __hgt2
-```cpp
-__device__  __half2 __hgt2(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __hisnan2
-```cpp
-__device__  __half2 __hisnan2(__half2 a);
-
-```
-**Description:**  Supported
-
-
-### __hle2
-```cpp
-__device__  __half2 __hle2(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __hlt2
-```cpp
-__device__  __half2 __hlt2(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __hne2
-```cpp
-__device__  __half2 __hne2(__half2 a, __half2 b);
-
-```
-**Description:**  Supported
-
-
-### hceil
-```cpp
-__device__ static __half hceil(const __half h);
-
-```
-**Description:**  Supported
-
-
-### hcos
-```cpp
-__device__ static __half hcos(const __half h);
-
-```
-**Description:**  Supported
-
-
-### hexp
-```cpp
-__device__ static __half hexp(const __half h);
-
-```
-**Description:**  Supported
-
-
-### hexp10
-```cpp
-__device__ static __half hexp10(const __half h);
-
-```
-**Description:**  Supported
-
-
-### hexp2
-```cpp
-__device__ static __half hexp2(const __half h);
-
-```
-**Description:**  Supported
-
-
-### hfloor
-```cpp
-__device__ static __half hfloor(const __half h);
-
-```
-**Description:**  Supported
-
-
-### hlog
-```cpp
-__device__ static __half hlog(const __half h);
-
-```
-**Description:**  Supported
-
-
-### hlog10
-```cpp
-__device__ static __half hlog10(const __half h);
-
-```
-**Description:**  Supported
-
-
-### hlog2
-```cpp
-__device__ static __half hlog2(const __half h);
-
-```
-**Description:**  Supported
-
-
-### hrcp
-```cpp
-//__device__ static __half hrcp(const __half h);
-
-```
-**Description:**  **NOT Supported**
-
-
-### hrint
-```cpp
-__device__ static __half hrint(const __half h);
-
-```
-**Description:**  Supported
-
-
-### hsin
-```cpp
-__device__ static __half hsin(const __half h);
-
-```
-**Description:**  Supported
-
-
-### hsqrt
-```cpp
-__device__ static __half hsqrt(const __half a);
-
-```
-**Description:**  Supported
-
-
-### htrunc
-```cpp
-__device__ static __half htrunc(const __half a);
-
-```
-**Description:**  Supported
-
-
-### h2ceil
-```cpp
-__device__ static __half2 h2ceil(const __half2 h);
-
-```
-**Description:**  Supported
-
-
-### h2exp
-```cpp
-__device__ static __half2 h2exp(const __half2 h);
-
-```
-**Description:**  Supported
-
-
-### h2exp10
-```cpp
-__device__ static __half2 h2exp10(const __half2 h);
-
-```
-**Description:**  Supported
-
-
-### h2exp2
-```cpp
-__device__ static __half2 h2exp2(const __half2 h);
-
-```
-**Description:**  Supported
-
-
-### h2floor
-```cpp
-__device__ static __half2 h2floor(const __half2 h);
-
-```
-**Description:**  Supported
-
-
-### h2log
-```cpp
-__device__ static __half2 h2log(const __half2 h);
-
-```
-**Description:**  Supported
-
-
-### h2log10
-```cpp
-__device__ static __half2 h2log10(const __half2 h);
-
-```
-**Description:**  Supported
-
-
-### h2log2
-```cpp
-__device__ static __half2 h2log2(const __half2 h);
-
-```
-**Description:**  Supported
-
-
-### h2rcp
-```cpp
-__device__ static __half2 h2rcp(const __half2 h);
-
-```
-**Description:**  Supported
-
-
-### h2rsqrt
-```cpp
-__device__ static __half2 h2rsqrt(const __half2 h);
-
-```
-**Description:**  Supported
-
-
-### h2sin
-```cpp
-__device__ static __half2 h2sin(const __half2 h);
-
-```
-**Description:**  Supported
-
-
-### h2sqrt
-```cpp
-__device__ static __half2 h2sqrt(const __half2 h);
-
-```
-**Description:**  Supported
-
-
-### __float22half2_rn
-```cpp
-__device__  __half2 __float22half2_rn(const float2 a);
-
-```
-**Description:**  Supported
-
-
-### __float2half
-```cpp
-__device__  __half __float2half(const float a);
-
-```
-**Description:**  Supported
-
-
-### __float2half2_rn
-```cpp
-__device__  __half2 __float2half2_rn(const float a);
-
-```
-**Description:**  Supported
-
-
-### __float2half_rd
-```cpp
-__device__  __half __float2half_rd(const float a);
-
-```
-**Description:**  Supported
-
-
-### __float2half_rn
-```cpp
-__device__  __half __float2half_rn(const float a);
-
-```
-**Description:**  Supported
-
-
-### __float2half_ru
-```cpp
-__device__  __half __float2half_ru(const float a);
-
-```
-**Description:**  Supported
-
-
-### __float2half_rz
-```cpp
-__device__  __half __float2half_rz(const float a);
-
-```
-**Description:**  Supported
-
-
-### __floats2half2_rn
-```cpp
-__device__  __half2 __floats2half2_rn(const float a, const float b);
-
-```
-**Description:**  Supported
-
-
-### __half22float2
-```cpp
-__device__  float2 __half22float2(const __half2 a);
-
-```
-**Description:**  Supported
-
-
-### __half2float
-```cpp
-__device__  float __half2float(const __half a);
-
-```
-**Description:**  Supported
-
-
-### half2half2
-```cpp
-__device__  __half2 half2half2(const __half a);
-
-```
-**Description:**  Supported
-
-
-### __half2int_rd
-```cpp
-__device__  int __half2int_rd(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2int_rn
-```cpp
-__device__  int __half2int_rn(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2int_ru
-```cpp
-__device__  int __half2int_ru(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2int_rz
-```cpp
-__device__  int __half2int_rz(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2ll_rd
-```cpp
-__device__  long long int __half2ll_rd(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2ll_rn
-```cpp
-__device__  long long int __half2ll_rn(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2ll_ru
-```cpp
-__device__  long long int __half2ll_ru(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2ll_rz
-```cpp
-__device__  long long int __half2ll_rz(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2short_rd
-```cpp
-__device__  short __half2short_rd(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2short_rn
-```cpp
-__device__  short __half2short_rn(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2short_ru
-```cpp
-__device__  short __half2short_ru(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2short_rz
-```cpp
-__device__  short __half2short_rz(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2uint_rd
-```cpp
-__device__  unsigned int __half2uint_rd(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2uint_rn
-```cpp
-__device__  unsigned int __half2uint_rn(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2uint_ru
-```cpp
-__device__  unsigned int __half2uint_ru(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2uint_rz
-```cpp
-__device__  unsigned int __half2uint_rz(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2ull_rd
-```cpp
-__device__  unsigned long long int __half2ull_rd(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2ull_rn
-```cpp
-__device__  unsigned long long int __half2ull_rn(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2ull_ru
-```cpp
-__device__  unsigned long long int __half2ull_ru(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2ull_rz
-```cpp
-__device__  unsigned long long int __half2ull_rz(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2ushort_rd
-```cpp
-__device__  unsigned short int __half2ushort_rd(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2ushort_rn
-```cpp
-__device__  unsigned short int __half2ushort_rn(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2ushort_ru
-```cpp
-__device__  unsigned short int __half2ushort_ru(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half2ushort_rz
-```cpp
-__device__  unsigned short int __half2ushort_rz(__half h);
-
-```
-**Description:**  Supported
-
-
-### __half_as_short
-```cpp
-__device__  short int __half_as_short(const __half h);
-
-```
-**Description:**  Supported
-
-
-### __half_as_ushort
-```cpp
-__device__  unsigned short int __half_as_ushort(const __half h);
-
-```
-**Description:**  Supported
-
-
-### __halves2half2
-```cpp
-__device__  __half2 __halves2half2(const __half a, const __half b);
-
-```
-**Description:**  Supported
-
-
-### __high2float
-```cpp
-__device__  float __high2float(const __half2 a);
-
-```
-**Description:**  Supported
-
-
-### __high2half
-```cpp
-__device__  __half __high2half(const __half2 a);
-
-```
-**Description:**  Supported
-
-
-### __high2half2
-```cpp
-__device__  __half2 __high2half2(const __half2 a);
-
-```
-**Description:**  Supported
-
-
-### __highs2half2
-```cpp
-__device__  __half2 __highs2half2(const __half2 a, const __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __int2half_rd
-```cpp
-__device__  __half __int2half_rd(int i);
-
-```
-**Description:**  Supported
-
-
-### __int2half_rn
-```cpp
-__device__  __half __int2half_rn(int i);
-
-```
-**Description:**  Supported
-
-
-### __int2half_ru
-```cpp
-__device__  __half __int2half_ru(int i);
-
-```
-**Description:**  Supported
-
-
-### __int2half_rz
-```cpp
-__device__  __half __int2half_rz(int i);
-
-```
-**Description:**  Supported
-
-
-### __ll2half_rd
-```cpp
-__device__  __half __ll2half_rd(long long int i);
-
-```
-**Description:**  Supported
-
-
-### __ll2half_rn
-```cpp
-__device__  __half __ll2half_rn(long long int i);
-
-```
-**Description:**  Supported
-
-
-### __ll2half_ru
-```cpp
-__device__  __half __ll2half_ru(long long int i);
-
-```
-**Description:**  Supported
-
-
-### __ll2half_rz
-```cpp
-__device__  __half __ll2half_rz(long long int i);
-
-```
-**Description:**  Supported
-
-
-### __low2float
-```cpp
-__device__  float __low2float(const __half2 a);
-
-```
-**Description:**  Supported
-
-
-### __low2half
-```cpp
-__device__ __half __low2half(const __half2 a);
-
-```
-**Description:**  Supported
-
-
-### __low2half2
-```cpp
-__device__ __half2 __low2half2(const __half2 a, const __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __low2half2
-```cpp
-__device__ __half2 __low2half2(const __half2 a);
-
-```
-**Description:**  Supported
-
-
-### __lowhigh2highlow
-```cpp
-__device__ __half2 __lowhigh2highlow(const __half2 a);
-
-```
-**Description:**  Supported
-
-
-### __lows2half2
-```cpp
-__device__ __half2 __lows2half2(const __half2 a, const __half2 b);
-
-```
-**Description:**  Supported
-
-
-### __short2half_rd
-```cpp
-__device__  __half __short2half_rd(short int i);
-
-```
-**Description:**  Supported
-
-
-### __short2half_rn
-```cpp
-__device__  __half __short2half_rn(short int i);
-
-```
-**Description:**  Supported
-
-
-### __short2half_ru
-```cpp
-__device__  __half __short2half_ru(short int i);
-
-```
-**Description:**  Supported
-
-
-### __short2half_rz
-```cpp
-__device__  __half __short2half_rz(short int i);
-
-```
-**Description:**  Supported
-
-
-### __uint2half_rd
-```cpp
-__device__  __half __uint2half_rd(unsigned int i);
-
-```
-**Description:**  Supported
-
-
-### __uint2half_rn
-```cpp
-__device__  __half __uint2half_rn(unsigned int i);
-
-```
-**Description:**  Supported
-
-
-### __uint2half_ru
-```cpp
-__device__  __half __uint2half_ru(unsigned int i);
-
-```
-**Description:**  Supported
-
-
-### __uint2half_rz
-```cpp
-__device__  __half __uint2half_rz(unsigned int i);
-
-```
-**Description:**  Supported
-
-
-### __ull2half_rd
-```cpp
-__device__  __half __ull2half_rd(unsigned long long int i);
-
-```
-**Description:**  Supported
-
-
-### __ull2half_rn
-```cpp
-__device__  __half __ull2half_rn(unsigned long long int i);
-
-```
-**Description:**  Supported
-
-
-### __ull2half_ru
-```cpp
-__device__  __half __ull2half_ru(unsigned long long int i);
-
-```
-**Description:**  Supported
-
-
-### __ull2half_rz
-```cpp
-__device__  __half __ull2half_rz(unsigned long long int i);
-
-```
-**Description:**  Supported
-
-
-### __ushort2half_rd
-```cpp
-__device__  __half __ushort2half_rd(unsigned short int i);
-
-```
-**Description:**  Supported
-
-
-### __ushort2half_rn
-```cpp
-__device__  __half __ushort2half_rn(unsigned short int i);
-
-```
-**Description:**  Supported
-
-
-### __ushort2half_ru
-```cpp
-__device__  __half __ushort2half_ru(unsigned short int i);
-
-```
-**Description:**  Supported
-
-
-### __ushort2half_rz
-```cpp
-__device__  __half __ushort2half_rz(unsigned short int i);
-
-```
-**Description:**  Supported
-
-
-### __ushort_as_half
-```cpp
-__device__  __half __ushort_as_half(const unsigned short int i);
-
-```
-**Description:**  Supported
-
-
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index fc06dd0b8b..0d40e70c83 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -1,29 +1,46 @@
+# Anywhere {branch} is used, the branch name will be substituted.
+# These comments will also be removed.
+defaults:
+  numbered: False
+  maxdepth: 6
 root: index
 subtrees:
-- caption: User Guide
+
+- caption: Installation
   entries:
-    - file: user_guide/programming_manual
-    - file: user_guide/hip_rtc
-    - file: user_guide/faq
-    - file: user_guide/hip_porting_guide
-    - file: user_guide/hip_porting_driver_api
-- caption: How to Guides
-  entries:
-  - file: how_to_guides/install.md
-  - file: how_to_guides/debugging.md
+  - file: install/install.rst
+    title: Install HIP
+
 - caption: Reference
   entries:
-  - file: doxygen/html/index
-  - file: reference/kernel_language
-  - file: reference/math_api
-  - file: reference/terms
-  - file: reference/glossary
-  - file: reference/deprecated_api_list
-- caption: Developer Guide
+    - file: doxygen/html/index.html
+      title: API library
+
+- caption: How-to
+  entries:
+  - file: how-to/logging.rst
+    title: Logging
+  - file: how-to/debugging.rst
+    title: Debugging
+
+- caption: Conceptual
   entries:
-  - file: developer_guide/build
-  - file: developer_guide/logging
-  - file: developer_guide/contributing.md
-- caption: About
+  - file: conceptual/gpu-arch.md
+    title: GPU architectures
+
+
+
+- caption: Compatibility & support
   entries:
-  - file: license
+  - file: about/compatibility/linux-support.md
+    title: Linux (GPU & OS)
+  - file: about/compatibility/windows-support.md
+    title: Windows (GPU & OS)
+  - file: about/compatibility/3rd-party-support-matrix.md
+    title: Third-party
+  - file: about/compatibility/user-kernel-space-compat-matrix.md
+    title: User/kernel space support
+  - file: about/compatibility/docker-image-support-matrix.md
+    title: Docker
+  - file: about/compatibility/openmp.md
+    title: OpenMP
diff --git a/include/hip/hip_runtime_api.h b/include/hip/hip_runtime_api.h
index 74c14494b7..6521579496 100644
--- a/include/hip/hip_runtime_api.h
+++ b/include/hip/hip_runtime_api.h
@@ -2334,7 +2334,7 @@ hipError_t hipDrvGetErrorString(hipError_t hipError, const char** errorString);
  * Create a new asynchronous stream.  @p stream returns an opaque handle that can be used to
  * reference the newly created stream in subsequent hipStream* commands.  The stream is allocated on
  * the heap and will remain allocated even if the handle goes out-of-scope.  To release the memory
- * used by the stream, applicaiton must call hipStreamDestroy.
+ * used by the stream, the application must call hipStreamDestroy.
  *
  * @return #hipSuccess, #hipErrorInvalidValue
  *
@@ -2351,7 +2351,7 @@ hipError_t hipStreamCreate(hipStream_t* stream);
  * Create a new asynchronous stream.  @p stream returns an opaque handle that can be used to
  * reference the newly created stream in subsequent hipStream* commands.  The stream is allocated on
  * the heap and will remain allocated even if the handle goes out-of-scope.  To release the memory
- * used by the stream, applicaiton must call hipStreamDestroy. Flags controls behavior of the
+ * used by the stream, the application must call hipStreamDestroy. Flags controls behavior of the
  * stream.  See #hipStreamDefault, #hipStreamNonBlocking.
  *
  *
@@ -2369,7 +2369,7 @@ hipError_t hipStreamCreateWithFlags(hipStream_t* stream, unsigned int flags);
  * Create a new asynchronous stream with the specified priority.  @p stream returns an opaque handle
  * that can be used to reference the newly created stream in subsequent hipStream* commands.  The
  * stream is allocated on the heap and will remain allocated even if the handle goes out-of-scope.
- * To release the memory used by the stream, applicaiton must call hipStreamDestroy. Flags controls
+ * To release the memory used by the stream, the application must call hipStreamDestroy. Flags controls
  * behavior of the stream.  See #hipStreamDefault, #hipStreamNonBlocking.
  *
  *
@@ -2387,7 +2387,7 @@ hipError_t hipStreamCreateWithPriority(hipStream_t* stream, unsigned int flags,
  * and greatest stream priority respectively. Stream priorities follow a convention where lower numbers
  * imply greater priorities. The range of meaningful stream priorities is given by
  * [*greatestPriority, *leastPriority]. If the user attempts to create a stream with a priority value
- * that is outside the the meaningful range as specified by this API, the priority is automatically
+ * that is outside the meaningful range as specified by this API, the priority is automatically
  * clamped to within the valid range.
  */
 hipError_t hipDeviceGetStreamPriorityRange(int* leastPriority, int* greatestPriority);
@@ -2459,8 +2459,8 @@ hipError_t hipStreamSynchronize(hipStream_t stream);
  * All future work submitted to @p stream will wait until @p event reports completion before
  * beginning execution.
  *
- * This function only waits for commands in the current stream to complete.  Notably,, this function
- * does not impliciy wait for commands in the default stream to complete, even if the specified
+ * This function only waits for commands in the current stream to complete.  Notably, this function
+ * does not implicitly wait for commands in the default stream to complete, even if the specified
  * stream is created with hipStreamNonBlocking = 0.
  *
  * @see hipStreamCreate, hipStreamCreateWithFlags, hipStreamCreateWithPriority, hipStreamSynchronize, hipStreamDestroy
@@ -3308,7 +3308,7 @@ hipError_t hipStreamAttachMemAsync(hipStream_t stream,
  *
  * Inserts a memory allocation operation into @p stream.
  * A pointer to the allocated memory is returned immediately in *dptr.
- * The allocation must not be accessed until the the allocation operation completes.
+ * The allocation must not be accessed until the allocation operation completes.
  * The allocation comes from the memory pool associated with the stream's device.
  *
  * @note The default memory pool of a device contains device memory from that device.
@@ -3560,7 +3560,7 @@ hipError_t hipMemPoolDestroy(hipMemPool_t mem_pool);
  *
  * Inserts an allocation operation into @p stream.
  * A pointer to the allocated memory is returned immediately in @p dev_ptr.
- * The allocation must not be accessed until the the allocation operation completes.
+ * The allocation must not be accessed until the allocation operation completes.
  * The allocation comes from the specified memory pool.
  *
  * @note The specified memory pool may be from a device different than that of the specified @p stream.
@@ -6224,7 +6224,7 @@ hipError_t hipGetTextureAlignmentOffset(
 DEPRECATED(DEPRECATED_MSG)
 hipError_t hipUnbindTexture(const textureReference* tex);
 /**
- * @brief Gets the the address for a texture reference.
+ * @brief Gets the address for a texture reference.
  *
  * @param [out] dev_ptr  Pointer of device address.
  * @param [in] texRef  Pointer of texture reference.