CUDA Programming

Baiju Meswani edited this page Feb 15, 2023 · 3 revisions

CUDA programming basics

Understand the hardware
- Architecture Generations
  - P100: Pascal / sm_60
  - V100: Volta / sm_70
  - A100: Ampere / sm_80
- CUDA Core vs. Tensor Core
Programming model
- Thread
- Block
- Grid
- Stream
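
To make the four terms concrete, here is a minimal sketch, assuming a trivial element-wise kernel. The names `AddOne` and `RunAddOne` are hypothetical and chosen only for illustration. Each thread computes one global index from its block and thread coordinates, the grid is sized to cover all elements, and all work is queued on one stream.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread handles one element. Its global index combines the block's
// position in the grid with the thread's position inside the block.
__global__ void AddOne(float* data, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {  // the last block may be only partially used
    data[idx] += 1.0f;
  }
}

// Hypothetical helper: returns the number of wrong elements, or -1 on a CUDA error.
int RunAddOne(int n) {
  std::vector<float> host(n, 1.0f);
  const size_t bytes = n * sizeof(float);
  float* dev = nullptr;
  cudaStream_t stream;
  if (cudaStreamCreate(&stream) != cudaSuccess) return -1;
  if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1;

  // Copy in, launch, and copy out are all ordered on the same stream.
  cudaMemcpyAsync(dev, host.data(), bytes, cudaMemcpyHostToDevice, stream);
  const int threads_per_block = 256;
  const int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;
  AddOne<<<blocks_per_grid, threads_per_block, 0, stream>>>(dev, n);
  cudaMemcpyAsync(host.data(), dev, bytes, cudaMemcpyDeviceToHost, stream);

  // Block the host until everything queued on this stream has finished.
  cudaError_t status = cudaStreamSynchronize(stream);
  cudaFree(dev);
  cudaStreamDestroy(stream);
  if (status != cudaSuccess) return -1;

  int mismatches = 0;
  for (float v : host) {
    if (v != 2.0f) ++mismatches;
  }
  return mismatches;
}

int main() {
  printf("mismatches: %d\n", RunAddOne(1000));
  return 0;
}
```

The block size of 256 is an arbitrary but common choice; the rounding-up division guarantees the grid covers `n` even when `n` is not a multiple of the block size.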
Must-know functions
- cudaMalloc() and cudaFree()
- cudaMemcpy() vs. cudaMemcpyAsync()
- cudaMemset() vs. cudaMemsetAsync()
- cudaStreamSynchronize() vs. cudaDeviceSynchronize()
- cudaEventRecord() and cudaStreamWaitEvent()
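
The functions above are typically combined. The following is a sketch under stated assumptions, not a recommended pattern for any particular codebase: a hypothetical "producer" stream fills a buffer with `cudaMemsetAsync()`, records an event, and a hypothetical "consumer" stream waits on that event before copying the buffer back. Only the consumer stream is synchronized, so the host never calls `cudaDeviceSynchronize()`.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: returns the number of wrong bytes, or -1 on a CUDA error.
int RunEventDemo(int n) {
  unsigned char* dev = nullptr;
  cudaStream_t producer, consumer;
  cudaEvent_t ready;
  if (cudaStreamCreate(&producer) != cudaSuccess) return -1;
  if (cudaStreamCreate(&consumer) != cudaSuccess) return -1;
  // Timing is not needed here, so disable it to keep the event lightweight.
  if (cudaEventCreateWithFlags(&ready, cudaEventDisableTiming) != cudaSuccess) return -1;
  if (cudaMalloc(&dev, n) != cudaSuccess) return -1;

  // Producer: fill the buffer asynchronously, then mark the point where it is ready.
  cudaMemsetAsync(dev, 7, n, producer);
  cudaEventRecord(ready, producer);

  // Consumer: work queued after this call waits for the event on the device,
  // without blocking the host thread.
  cudaStreamWaitEvent(consumer, ready, 0);
  std::vector<unsigned char> host(n, 0);
  cudaMemcpyAsync(host.data(), dev, n, cudaMemcpyDeviceToHost, consumer);

  // Wait only for the stream whose result we need.
  cudaError_t status = cudaStreamSynchronize(consumer);

  cudaFree(dev);
  cudaEventDestroy(ready);
  cudaStreamDestroy(producer);
  cudaStreamDestroy(consumer);
  if (status != cudaSuccess) return -1;

  int mismatches = 0;
  for (unsigned char v : host) {
    if (v != 7) ++mismatches;
  }
  return mismatches;
}

int main() {
  printf("mismatches: %d\n", RunEventDemo(256));
  return 0;
}
```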

Common tricks

Avoid memcpy
Avoid unnecessary synchronization
Preprocess data on the CPU
When to use #pragma unroll?
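
As one illustration of the `#pragma unroll` question: unrolling tends to pay off when the trip count is small and known at compile time. The sketch below assumes a hypothetical kernel `Scale` in which the per-thread item count is a template parameter, so the compiler can fully unroll the loop. Whether it actually improves performance depends on the kernel and should be measured.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread processes kItemsPerThread elements, strided by blockDim.x so
// that neighboring threads touch neighboring addresses on every iteration.
template <int kItemsPerThread>
__global__ void Scale(float* data, float factor, int n) {
  int base = blockIdx.x * blockDim.x * kItemsPerThread + threadIdx.x;
  // The trip count is a compile-time constant, so the loop can be fully unrolled.
#pragma unroll
  for (int i = 0; i < kItemsPerThread; ++i) {
    int idx = base + i * blockDim.x;
    if (idx < n) {
      data[idx] *= factor;
    }
  }
}

// Hypothetical helper: returns the number of wrong elements, or -1 on a CUDA error.
int RunScale(int n) {
  constexpr int kItems = 4;
  const int threads = 128;
  const int per_block = threads * kItems;
  const int blocks = (n + per_block - 1) / per_block;

  std::vector<float> host(n, 1.0f);
  const size_t bytes = n * sizeof(float);
  float* dev = nullptr;
  if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1;
  cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
  Scale<kItems><<<blocks, threads>>>(dev, 3.0f, n);
  // A blocking cudaMemcpy on the default stream also waits for the kernel.
  cudaError_t status = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
  cudaFree(dev);
  if (status != cudaSuccess) return -1;

  int mismatches = 0;
  for (float v : host) {
    if (v != 3.0f) ++mismatches;
  }
  return mismatches;
}

int main() {
  printf("mismatches: %d\n", RunScale(1000));
  return 0;
}
```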

CUDA Kernel Examples

Easy: Dropout/DropGrad
Medium: SoftmaxCrossEntropyLoss(Grad)
Hard: LayerNormalization, ReduceSum, GatherGrad
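
For the "easy" case, a minimal Dropout/DropoutGrad sketch might look like the following. This is an illustration under stated assumptions, not the implementation used by any particular library: the kernel names, the per-thread Philox generator from the cuRAND device API, and the `RunDropout` helper are all choices made for this example. The forward kernel zeroes each element with probability `ratio` and rescales the rest by `1 / (1 - ratio)`; the gradient kernel reuses the saved mask.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Forward: keep each element with probability (1 - ratio), assuming ratio < 1.
__global__ void Dropout(const float* x, float* y, bool* mask, int n,
                        float ratio, unsigned long long seed) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= n) return;
  // One generator state per thread; the thread index selects the subsequence.
  curandStatePhilox4_32_10_t state;
  curand_init(seed, idx, 0, &state);
  bool keep = curand_uniform(&state) > ratio;  // uniform in (0, 1]
  mask[idx] = keep;
  y[idx] = keep ? x[idx] / (1.0f - ratio) : 0.0f;
}

// Backward: gradients flow only through the elements that were kept.
__global__ void DropoutGrad(const float* dy, const bool* mask, float* dx,
                            int n, float ratio) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= n) return;
  dx[idx] = mask[idx] ? dy[idx] / (1.0f - ratio) : 0.0f;
}

// Hypothetical helper: runs both kernels with x = dy = 1 and ratio = 0.5, so
// every output must be 0 or 2 and dx must equal y. Returns the number of
// violations, or -1 on a CUDA error.
int RunDropout(int n) {
  const float ratio = 0.5f;
  const size_t bytes = n * sizeof(float);
  std::vector<float> ones(n, 1.0f), y(n), dx(n);
  float *d_x = nullptr, *d_y = nullptr, *d_dx = nullptr;
  bool* d_mask = nullptr;
  if (cudaMalloc(&d_x, bytes) != cudaSuccess) return -1;
  if (cudaMalloc(&d_y, bytes) != cudaSuccess) return -1;
  if (cudaMalloc(&d_dx, bytes) != cudaSuccess) return -1;
  if (cudaMalloc(&d_mask, n * sizeof(bool)) != cudaSuccess) return -1;
  cudaMemcpy(d_x, ones.data(), bytes, cudaMemcpyHostToDevice);

  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  Dropout<<<blocks, threads>>>(d_x, d_y, d_mask, n, ratio, 1234ULL);
  // d_x still holds ones, so it doubles as the incoming gradient dy.
  DropoutGrad<<<blocks, threads>>>(d_x, d_mask, d_dx, n, ratio);
  cudaError_t s1 = cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost);
  cudaError_t s2 = cudaMemcpy(dx.data(), d_dx, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_x); cudaFree(d_y); cudaFree(d_dx); cudaFree(d_mask);
  if (s1 != cudaSuccess || s2 != cudaSuccess) return -1;

  int violations = 0;
  for (int i = 0; i < n; ++i) {
    bool valid_value = (y[i] == 0.0f || y[i] == 2.0f);
    if (!valid_value || dx[i] != y[i]) ++violations;
  }
  return violations;
}

int main() {
  printf("violations: %d\n", RunDropout(4096));
  return 0;
}
```

Initializing a generator state in every thread is simple but not cheap; it is used here only to keep the example short.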

Debugging CUDA kernels

printf() works inside CUDA code
Copy data back to the CPU with cudaMemcpy() for inspection
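
Both debugging approaches can be sketched in a few lines. The kernel `Square` and the helper `RunDebugDemo` below are hypothetical names for this example. The kernel prints from a single thread to keep the output readable, and the host then copies the whole result back so it can be checked or inspected in a regular debugger.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void Square(const int* in, int* out, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= n) return;
  out[idx] = in[idx] * in[idx];
  // Guard the printf so only one thread prints; otherwise every thread
  // in the grid emits a line.
  if (idx == 0) {
    printf("block %d thread %d: in=%d out=%d\n",
           blockIdx.x, threadIdx.x, in[idx], out[idx]);
  }
}

// Hypothetical helper: returns the number of wrong elements, or -1 on a CUDA error.
int RunDebugDemo(int n) {
  std::vector<int> in(n), out(n, 0);
  for (int i = 0; i < n; ++i) in[i] = i;
  const size_t bytes = n * sizeof(int);
  int *d_in = nullptr, *d_out = nullptr;
  if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
  if (cudaMalloc(&d_out, bytes) != cudaSuccess) return -1;
  cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

  const int threads = 128;
  const int blocks = (n + threads - 1) / threads;
  Square<<<blocks, threads>>>(d_in, d_out, n);
  // Device-side printf output is buffered; synchronizing flushes it to stdout.
  cudaError_t status = cudaDeviceSynchronize();

  // Copy the result to host memory, where it can be inspected freely.
  if (status == cudaSuccess) {
    status = cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
  }
  cudaFree(d_in);
  cudaFree(d_out);
  if (status != cudaSuccess) return -1;

  int mismatches = 0;
  for (int i = 0; i < n; ++i) {
    if (out[i] != i * i) ++mismatches;
  }
  return mismatches;
}

int main() {
  printf("mismatches: %d\n", RunDebugDemo(300));
  return 0;
}
```

The explicit `cudaDeviceSynchronize()` is acceptable here because this is debugging code; in production paths it is the kind of synchronization that should usually be avoided.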

Understanding IO-bound and compute-bound kernels
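
One common way to tell the two cases apart is to compare a kernel's arithmetic intensity with the ratio of the device's peak compute rate to its peak memory bandwidth. The symbols below ($I$, $P_{\text{peak}}$, $B_{\text{peak}}$) are notation assumed for this sketch, and the element-wise add is only an illustrative example.

```latex
% Arithmetic intensity: floating-point operations per byte moved to or from device memory.
I = \frac{\text{FLOPs performed}}{\text{bytes read} + \text{bytes written}}

% A kernel tends to be IO-bound when
I < \frac{P_{\text{peak}}}{B_{\text{peak}}}
% and compute-bound otherwise, where P_peak is the peak FLOP/s
% and B_peak is the peak memory bandwidth in bytes/s.

% Example: y_i = a_i + b_i in float32 performs 1 FLOP per element
% and moves 2 loads + 1 store of 4 bytes each:
I_{\text{add}} = \frac{1}{3 \times 4} = \frac{1}{12} \approx 0.083 \ \text{FLOP/byte}
```

An intensity this low is far below the compute-to-bandwidth ratio of a typical GPU, so an element-wise kernel like this is normally IO-bound, and reducing memory traffic helps more than reducing arithmetic.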