Sync_Microbenchmark

Abstract

This work aims at characterizing the synchronization methods in CUDA. It mainly includes two parts:

  1. Non-primitive synchronization:
  • Implicit barrier, i.e., the overhead of launching a kernel, including the new kernel launch function introduced for Cooperative Groups.
  • Multi-GPU synchronization with OpenMP in a single node.
  2. Primitive synchronization methods introduced for Nvidia GPUs in CUDA 9.0:
  • Warp-level, thread-block-level, and grid-level synchronization.
  • Multi-grid synchronization for multi-GPU synchronization.

Also, for stable results, please avoid dynamic frequency scaling (e.g., fix the GPU clock frequency).

Requirements

CUDA 9.0. sm_60 for most of the measurements; sm_70 for measurements involving the sleep instruction.

Implicit Barrier

Compile

Compile directly with the Makefile in the Implicit_Barrier folder.

The sleep instruction is only available from sm_70 onward. We tried other instructions to control the kernel execution latency, but the results were less stable than with the sleep instruction, and the latencies were larger.
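
A minimal sketch (hypothetical names, not the repository's code) of a kernel whose execution latency is controlled with the sm_70+ __nanosleep() intrinsic:

    // Each thread suspends for roughly `ns` nanoseconds; the hardware may
    // resume the warp a little earlier or later, so the latency is approximate.
    __global__ void sleep_kernel(unsigned ns)
    {
        __nanosleep(ns);
    }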

Input explanation

  • ImplicitBarrier [gpu_count]
  • gpu_count is 2 by default. When testing multi-GPU latencies, the experiments iterate from 1 to gpu_count GPUs.

Output explanation

Null Kernel

  • method: the kernel launch method [traditional_launch|cooperative_launch|multi_cooperative_launch|single_omp_traditional_launch]
  • GPUCount: how many GPUs are involved
  • rep: number of times the launch function is called
  • blk: griddim
  • thrd: blockdim
  • m(clk) s(clk): mean and standard deviation measured with the CPU clock() function
  • m(sync) s(sync): mean and standard deviation of DeviceSynchronization()
  • m(laun) s(laun): latency of the launch functions (measured before DeviceSynchronization())
  • m(ttl) s(ttl): latency of the total kernel execution (measured after DeviceSynchronization())
  • m(avelaun) s(avelaun): mean and standard deviation of the average latency of each launch function, computed as m(laun)/rep
  • m(addl) s(addl): mean and standard deviation of the average additional latency of each additional launch. By default the code uses repeat counts of 1 and 128 and computes (m(ttl_128)-m(ttl_1))/(128-1)

By using the "additional latency", it is possible to eliminate the overhead of synchronization (which is not negligible relative to kernel overhead) and other unrelated parts. Details are explained in Use_Microbenchmark_To_Better_Understand_The_Overhead_Of_CUDA_Kernels__Poster_.pdf in the same folder.
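
As a rough illustration, here is a minimal sketch (hypothetical names, not the repository's code) of the measurement pattern described above: the launch calls are timed before synchronizing, and the total execution is timed after cudaDeviceSynchronize().

    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void null_kernel() {}

    int main()
    {
        const int rep = 128;
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < rep; ++i)
            null_kernel<<<1, 32>>>();                 // launch latency accumulates here
        auto t1 = std::chrono::steady_clock::now();   // t1 - t0 -> m(laun)
        cudaDeviceSynchronize();                      // wait until every kernel has finished
        auto t2 = std::chrono::steady_clock::now();   // t2 - t0 -> m(ttl)
        long long laun = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        long long ttl  = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t0).count();
        // Repeating with rep = 1 gives ttl_1; m(addl) ~ (ttl_128 - ttl_1) / 127.
        printf("launch: %lld ns, total: %lld ns\n", laun, ttl);
        return 0;
    }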

Sleep Kernel (fused sleep kernels that test kernel overhead when the kernel execution latency is long enough)

  • method: the kernel launch method [traditional_launch|cooperative_launch|multi_cooperative_launch|single_omp_traditional_launch]
  • GPUCount: how many GPUs are involved
  • rep: number of times the launch function is called, for both the basic kernel and the fused kernel
  • blk: griddim
  • thrd: blockdim
  • idea(wkld): the work unit. When fusing two kernels, the execution latency of the fused kernel should be twice idea(wkld)
  • m(wkld) s(wkld): the basic workload deduced from the measurements
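
For intuition, a hedged sketch (hypothetical names, not the repository's code) of the fusion idea: the fused kernel does the work of two basic kernels in one launch, so any measured time beyond twice the work unit can be attributed to per-kernel overhead.

    __global__ void basic_kernel(unsigned ns) { __nanosleep(ns); }                   // one work unit
    __global__ void fused_kernel(unsigned ns) { __nanosleep(ns); __nanosleep(ns); }  // two work units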

Workload Test

The output is the same as for the null kernel, with "sleep" acting as the "workload".

This test shows how the measured additional latency relates to the real kernel execution latency. The default sequence is (0 1 2 4 8 16 32 64 128) × 2000 ns. Below a certain point, increasing the kernel execution latency does not affect the additional latency caused by an additional kernel. (Imagine a pipeline: when the pipeline is full, the most time-consuming stage dominates the performance of the whole system. This analogy is only to aid understanding; we do not know whether such a "pipeline" actually exists, especially in the multi-grid launch case.)

Details are explained in the Use_Microbenchmark_To_Better_Understand_The_Overhead_Of_CUDA_Kernels__Poster_.pdf in the same folder.

Explicit Barrier

Compile

Compile directly with the Makefile in the Explicit_Barrier folder. Three executables will be created:

  • TestRepeat
  • BenchmarkIntraSM
  • BenchmarkInterSM

TestRepeat

Shows whether repeating a synchronization instruction influences its own measured performance.

The results show that the measurements of shuffle (shfl), block sync, and grid-level syncs become more accurate as the repeat count increases. For warp-level syncs, however, this does not happen, probably because the current implementation is software-based: repeating the instruction too many times causes instruction overflow, harming performance.
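
A minimal sketch (hypothetical kernel, not the repository's code) of timing a repeated synchronization instruction with the on-chip cycle counter:

    __global__ void repeat_syncwarp(long long *avg_cycles, int rep)
    {
        long long start = clock64();
        for (int i = 0; i < rep; ++i)   // a real benchmark would unroll this loop
            __syncwarp();               // the instruction under test
        long long stop = clock64();
        if (threadIdx.x == 0)
            *avg_cycles = (stop - start) / rep;   // -> m(ave_cycle)
    }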

Execution

./TestRepeat

Output
  • method: kernel function name
  • rep: number of times the instruction is repeated
  • blk: griddim
  • thrd: blockdim
  • tile: controls the group size of the coalesced group and the tiled group
  • m(cycle) s(cycle): mean and standard deviation of the total instruction execution (cycles)
  • m(ns) s(ns): mean and standard deviation of the total kernel execution time (not meaningful here; shown only for comparison)
  • m(ave_cycle) s(ave_cycle): mean and standard deviation of the average per-instruction latency (cycles)

BenchmarkIntraSM

Benchmark measurements that only need the clock within a single SM. They include:

  • Throughput of warp-level syncs and block sync
  • Latency of block sync for each possible group size

Execution

./BenchmarkIntraSM

Output
Latency
  • method: kernel function name
  • GPUcount: 1
  • rep: number of times the instruction is repeated
  • blk: griddim
  • thrd: blockdim
  • m(ave_cycle) s(ave_cycle): mean and standard deviation of the average per-instruction latency (cycles)
Throughput
  • method: kernel function name
  • GPUcount: 1
  • rep: number of times the instruction is repeated
  • blk: griddim
  • thrd: blockdim
  • tile: used to control the tile group size
  • m(ttl_latency): total synchronization latency
  • m(thrput): throughput (warps/cycle) computed from m(ttl_latency)

BenchmarkInterSM

Benchmark measurements that involve several SMs:

Latency of grid-level syncs
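
For reference, a minimal sketch (hypothetical kernel, not the repository's code) of a grid-level barrier with Cooperative Groups; such a kernel must be launched with cudaLaunchCooperativeKernel:

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void grid_sync_kernel(int rep)
    {
        cg::grid_group grid = cg::this_grid();
        for (int i = 0; i < rep; ++i)
            grid.sync();   // barrier across every thread block in the grid
    }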

Execution

./BenchmarkInterSM [gpu_count]

Output
  • method: kernel function name
  • GPUcount: number of GPUs involved
  • basicrep: number of instruction repetitions in the basic kernel
  • moredrep: number of instruction repetitions in the kernel with more repetitions
  • blk: griddim
  • thrd: blockdim
  • m(basic_ttl) s(basic_ttl): mean and standard deviation of the total kernel latency (ns) when executing the basic kernel
  • m(more_ttl) s(more_ttl): mean and standard deviation of the total kernel latency (ns) when executing the kernel with more repetitions
  • m(avginstru) s(avginstr): mean and standard deviation of the deduced average per-instruction latency (ns)

Side note:

CUDA 11.5 marked the multi-grid synchronization API as deprecated, but there seems to be no replacement implementation yet.
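
For context, a hedged sketch (hypothetical kernel, not the repository's code) of the deprecated multi-grid barrier; such a kernel must be launched on every participating device with cudaLaunchCooperativeKernelMultiDevice:

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void multi_grid_sync_kernel()
    {
        cg::multi_grid_group mg = cg::this_multi_grid();
        mg.sync();   // barrier across the grids launched on all participating GPUs
    }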

Citation

This research will be published at IPDPS 2020. Please cite:

Lingqi Zhang, Mohamed Wahib, Haoyu Zhang, Satoshi Matsuoka. A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs. In Proceedings of IPDPS 2020.
