CUDA Programming

Baiju Meswani edited this page Feb 15, 2023 · 3 revisions

CUDA programming basics

  • Understand the hardware

    • Architecture Generations

      • P100: Pascal / sm_60
      • V100: Volta / sm_70
      • A100: Ampere / sm_80
    • CUDA Core vs. Tensor Core

  • Programming model

    • Thread
    • Block
    • Grid
    • Stream
  • Must-know functions

    • cudaMalloc() and cudaFree()
    • cudaMemcpy() vs. cudaMemcpyAsync()
    • cudaMemset() vs. cudaMemsetAsync()
    • cudaStreamSynchronize() vs. cudaDeviceSynchronize()
    • cudaEventRecord() and cudaStreamWaitEvent()
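
The programming model and the functions listed above can be tied together in one small sketch. It is an illustration under stated assumptions, not code from any particular project: `scale_kernel` and `scale_on_gpu` are hypothetical names, the block size of 256 is an arbitrary choice, and the host buffer is ordinary pageable memory. The kernel runs on one stream, an event marks its completion, and a second stream waits on that event before copying the result back.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per element: the global index combines block and thread indices.
__global__ void scale_kernel(float* data, float factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;  // guard: the last block may be partially full
}

// Hypothetical helper: multiplies every element of `host` by `factor` on the GPU.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t scale_on_gpu(std::vector<float>& host, float factor) {
  const int n = static_cast<int>(host.size());
  if (n == 0) return cudaSuccess;
  const size_t bytes = host.size() * sizeof(float);

  float* dev = nullptr;
  cudaStream_t compute = nullptr, copy = nullptr;
  cudaEvent_t kernel_done = nullptr;

  cudaError_t err = cudaStreamCreate(&compute);
  if (err == cudaSuccess) err = cudaStreamCreate(&copy);
  if (err == cudaSuccess) err = cudaEventCreate(&kernel_done);
  if (err == cudaSuccess) err = cudaMalloc(&dev, bytes);

  // Async copies are queued on a stream. With pageable host memory (as here)
  // the runtime may still stage the transfer synchronously.
  if (err == cudaSuccess)
    err = cudaMemcpyAsync(dev, host.data(), bytes, cudaMemcpyHostToDevice, compute);

  if (err == cudaSuccess) {
    const int threads = 256;                         // threads per block
    const int blocks = (n + threads - 1) / threads;  // blocks in the grid
    scale_kernel<<<blocks, threads, 0, compute>>>(dev, factor, n);
    err = cudaGetLastError();  // catches launch-configuration errors
  }

  // Record an event after the kernel, then make the `copy` stream wait for it.
  if (err == cudaSuccess) err = cudaEventRecord(kernel_done, compute);
  if (err == cudaSuccess) err = cudaStreamWaitEvent(copy, kernel_done, 0);
  if (err == cudaSuccess)
    err = cudaMemcpyAsync(host.data(), dev, bytes, cudaMemcpyDeviceToHost, copy);

  // Wait only for the stream whose result is needed, not the whole device.
  if (err == cudaSuccess) err = cudaStreamSynchronize(copy);

  cudaFree(dev);  // no-op when dev is nullptr
  if (kernel_done) cudaEventDestroy(kernel_done);
  if (copy) cudaStreamDestroy(copy);
  if (compute) cudaStreamDestroy(compute);
  return err;
}
```

Calling `scale_on_gpu(v, 2.0f)` on a vector holding 1, 2, 3 should leave it holding 2, 4, 6. Using `cudaStreamSynchronize()` rather than `cudaDeviceSynchronize()` keeps unrelated streams running.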

Common tricks

  • Avoid unnecessary memcpy between host and device
  • Avoid unnecessary synchronization
  • Preprocess data on the CPU
  • When to use #pragma unroll
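
Two of the tricks above can be sketched together. `#pragma unroll` tends to pay off when the loop trip count is a small compile-time constant, so the compiler can remove the loop overhead entirely. The names `kItemsPerThread`, `add_bias_unrolled`, and `add_bias_on_gpu` are hypothetical, and the values 4 and 128 are arbitrary. The host helper also omits an explicit synchronization call, because a blocking `cudaMemcpy()` on the default stream already runs after the preceding kernel.

```cuda
#include <cuda_runtime.h>

// Compile-time constant so the loop below can be fully unrolled (assumed value).
constexpr int kItemsPerThread = 4;

__global__ void add_bias_unrolled(float* data, float bias, int n) {
  // Neighbouring threads touch neighbouring elements in each iteration,
  // which keeps global memory accesses coalesced.
  const int base = blockIdx.x * blockDim.x * kItemsPerThread + threadIdx.x;
#pragma unroll
  for (int k = 0; k < kItemsPerThread; ++k) {
    const int i = base + k * blockDim.x;
    if (i < n) data[i] += bias;
  }
}

// Hypothetical helper: adds `bias` to each of the `n` floats in `host`.
cudaError_t add_bias_on_gpu(float* host, int n, float bias) {
  if (n <= 0) return cudaSuccess;
  const size_t bytes = static_cast<size_t>(n) * sizeof(float);
  float* dev = nullptr;

  cudaError_t err = cudaMalloc(&dev, bytes);
  if (err == cudaSuccess)
    err = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
  if (err == cudaSuccess) {
    const int threads = 128;
    const int per_block = threads * kItemsPerThread;
    const int blocks = (n + per_block - 1) / per_block;
    add_bias_unrolled<<<blocks, threads>>>(dev, bias, n);
    err = cudaGetLastError();
  }
  // No cudaDeviceSynchronize() here: the blocking copy below is ordered
  // after the kernel on the default stream, so an extra sync adds nothing.
  if (err == cudaSuccess)
    err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

  cudaFree(dev);  // no-op when dev is nullptr
  return err;
}
```

Whether unrolling actually helps depends on register pressure and the loop body, so it is worth measuring before and after.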

CUDA Kernel Examples

  • Easy: Dropout/DropGrad
  • Medium: SoftmaxCrossEntropyLoss(Grad)
  • Hard: LayerNormalization, ReduceSum, GatherGrad
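
As a rough illustration of what a reduction kernel involves, here is a minimal ReduceSum sketch. It is not the implementation this page refers to: it uses a plain shared-memory tree reduction per block followed by `atomicAdd`, and the names `reduce_sum_kernel`, `reduce_sum_on_gpu`, `kThreads`, and the cap of 64 blocks are all assumptions made for the example. Production kernels typically add warp shuffles and deterministic accumulation on top of this.

```cuda
#include <cuda_runtime.h>

// Threads per block; must be a power of two for the tree reduction below.
constexpr int kThreads = 256;

__global__ void reduce_sum_kernel(const float* in, float* out, int n) {
  __shared__ float partial[kThreads];
  const int tid = threadIdx.x;

  // Grid-stride loop: each thread accumulates a private partial sum.
  float sum = 0.0f;
  for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
    sum += in[i];
  partial[tid] = sum;
  __syncthreads();

  // Tree reduction in shared memory: halve the active threads each step.
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (tid < stride) partial[tid] += partial[tid + stride];
    __syncthreads();
  }

  // One atomic per block combines the block results (order is not deterministic).
  if (tid == 0) atomicAdd(out, partial[0]);
}

// Hypothetical helper: sums `n` floats from `host` into `*result`.
cudaError_t reduce_sum_on_gpu(const float* host, int n, float* result) {
  *result = 0.0f;
  if (n <= 0) return cudaSuccess;
  const size_t bytes = static_cast<size_t>(n) * sizeof(float);
  float* dev_in = nullptr;
  float* dev_out = nullptr;

  cudaError_t err = cudaMalloc(&dev_in, bytes);
  if (err == cudaSuccess) err = cudaMalloc(&dev_out, sizeof(float));
  if (err == cudaSuccess)
    err = cudaMemcpy(dev_in, host, bytes, cudaMemcpyHostToDevice);
  // The accumulator must start at zero because blocks add into it.
  if (err == cudaSuccess) err = cudaMemset(dev_out, 0, sizeof(float));
  if (err == cudaSuccess) {
    int blocks = (n + kThreads - 1) / kThreads;
    if (blocks > 64) blocks = 64;  // assumed cap; the grid-stride loop covers the rest
    reduce_sum_kernel<<<blocks, kThreads>>>(dev_in, dev_out, n);
    err = cudaGetLastError();
  }
  if (err == cudaSuccess)
    err = cudaMemcpy(result, dev_out, sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(dev_in);   // cudaFree(nullptr) is a no-op
  cudaFree(dev_out);
  return err;
}
```

Because the block results are combined with floating-point atomics, the summation order can vary between runs, which matters when bitwise reproducibility is required.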

Debugging CUDA kernels

  • printf() works inside CUDA kernels
  • Copy data to the CPU with cudaMemcpy() for inspection
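
Both debugging approaches can be sketched in a few lines. The kernel and helper names (`square_kernel`, `debug_square`) are hypothetical, and the print is limited to the first two threads on the assumption that output from every thread would be too noisy to read.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void square_kernel(const int* in, int* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = in[i] * in[i];
    // Device-side printf: restrict it to a few threads to keep output readable.
    if (i < 2) printf("thread %d: in=%d out=%d\n", i, in[i], out[i]);
  }
}

// Hypothetical helper: squares `n` ints on the GPU and copies them back
// into `host_out` so they can be inspected in a host debugger or printed.
cudaError_t debug_square(const int* host_in, int* host_out, int n) {
  if (n <= 0) return cudaSuccess;
  const size_t bytes = static_cast<size_t>(n) * sizeof(int);
  int* dev_in = nullptr;
  int* dev_out = nullptr;

  cudaError_t err = cudaMalloc(&dev_in, bytes);
  if (err == cudaSuccess) err = cudaMalloc(&dev_out, bytes);
  if (err == cudaSuccess)
    err = cudaMemcpy(dev_in, host_in, bytes, cudaMemcpyHostToDevice);
  if (err == cudaSuccess) {
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    square_kernel<<<blocks, threads>>>(dev_in, dev_out, n);
    err = cudaGetLastError();
  }
  // The blocking copy waits for the kernel; the device printf buffer is
  // flushed to stdout at synchronization points such as this one.
  if (err == cudaSuccess)
    err = cudaMemcpy(host_out, dev_out, bytes, cudaMemcpyDeviceToHost);

  cudaFree(dev_in);   // cudaFree(nullptr) is a no-op
  cudaFree(dev_out);
  return err;
}
```

Device-side printf output appears only after a synchronization point, so a kernel that crashes early may print nothing. In that case, copying intermediate buffers back to the host is usually the more reliable option.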

Understanding I/O-bound and compute-bound kernels
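
One common way to make this distinction concrete is the roofline model, introduced here as an assumed framing rather than something the page prescribes. The symbols are illustrative: $I$ is a kernel's arithmetic intensity, $P_{\text{peak}}$ the device's peak compute rate in FLOP/s, and $B_{\text{peak}}$ its peak memory bandwidth in bytes/s.

```latex
% Arithmetic intensity: work done per byte moved to or from global memory.
I = \frac{\text{FLOPs performed}}{\text{bytes transferred}}

% Attainable throughput is capped by whichever resource runs out first.
P_{\text{attainable}} \le \min\left(P_{\text{peak}},\; I \cdot B_{\text{peak}}\right)

% The kernel is I/O bound when I < P_peak / B_peak, and compute bound otherwise.

% Worked example: float32 element-wise add, c[i] = a[i] + b[i].
% Per element: 1 FLOP, two 4-byte loads and one 4-byte store.
I_{\text{add}} = \frac{1}{4 + 4 + 4} = \frac{1}{12} \approx 0.083\ \text{FLOP/byte}
```

An intensity this low sits far below the ratio $P_{\text{peak}} / B_{\text{peak}}$ of current GPUs, so element-wise kernels are typically I/O bound. Their speed depends on memory traffic (coalescing, fusing kernels to avoid extra passes) rather than on arithmetic.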