
Sort fails on Lovelace (sm8.9) GPUs #1874

Closed
xaellison opened this issue Apr 18, 2023 · 4 comments · Fixed by #1979
Labels: bug (Something isn't working)

Comments

@xaellison (Contributor)

Describe the bug

Both quicksort and bitonic sort fail non-deterministically on a Lovelace GPU.

To reproduce

Quick example:

julia> begin
       c = CUDA.rand(Float32, 1<<16+129)
       CUDA.@sync sort!(c)
       issorted(Array(c))
       end
false

I found this by running `] test CUDA` locally, so the test suite identifies specific failures:
- the "reduced block sizes" testset for quicksort
- the "bitonic sort" testset (link)

Quicksort passes everything except "reduced block sizes"; bitonic sort only sometimes passes.
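Since the failure is non-deterministic, one way to gauge how often it occurs is to sort many fresh arrays of the problematic size and count the bad results. A minimal sketch (the `failure_rate` helper is hypothetical, not part of CUDA.jl):

    using CUDA

    # Sort `trials` fresh random arrays and report the fraction that come
    # back unsorted; on an affected GPU this should be greater than zero.
    function failure_rate(n = (1 << 16) + 129; trials = 100)
        failures = 0
        for _ in 1:trials
            c = CUDA.rand(Float32, n)
            CUDA.@sync sort!(c)
            failures += !issorted(Array(c))
        end
        return failures / trials
    end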


Expected behavior

`sort!` should leave the array sorted, so `issorted(Array(c))` should return `true`.

Version info

Details on Julia:

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × AMD Ryzen 7 5800X 8-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
  Threads: 1 on 2 virtual cores
Environment:
  JULIA_CPU_THREADS = 2

Details on CUDA:

julia> CUDA.versioninfo()
CUDA runtime 12.1, artifact installation
CUDA driver 12.1
NVIDIA driver 531.61.0

Libraries:
- CUBLAS: 12.1.0
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.4
- CUSPARSE: 12.0.2
- CUPTI: 18.0.0
- NVML: 12.0.0+531.61

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce RTX 4090 (sm_89, 18.994 GiB / 23.988 GiB available)

Additional context
This is on branch: https://github.com/xaellison/CUDA.jl/tree/ae_support_sm_89

xaellison added the bug label Apr 18, 2023
@maleadt (Member) commented May 3, 2023

MWE:

    using CUDA

    # mostly zeros, with a handful of scattered negative values
    host = Int8[fill(0, 113)..., -12, 0, 0, 0, 0, 0, 0, 0, 0, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1]
    @show length(host)
    device = CuArray(host)
    sort!(device; alg=CUDA.BitonicSort)
    println(device)
    println(sort(host))
    issorted(Array(device))

This happens because on my Lovelace GPU `blocks_per_mp` is 24, while on an older Turing GPU it is 16. Hard-coding it to 24 reproduces the bug on older hardware, so this looks like a bug in the bitonic sort implementation.
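For reference, the differing value can be queried directly, using the same attribute call that appears in the sorting.jl excerpt below:

    using CUDA

    # Reports 24 on Lovelace (sm_89) and 16 on Turing, per the numbers above.
    blocks_per_mp = CUDA.attribute(device(),
        CUDA.DEVICE_ATTRIBUTE_MAX_BLOCKS_PER_MULTIPROCESSOR)
    @show blocks_per_mp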

@maleadt (Member) commented May 3, 2023

I'm also confused by that launch configuration logic; @xaellison, could you explain it? If you want to ensure enough blocks are launched, you should generally use the numbers returned by the launch configuration query (it returns both a maximum thread count and a minimum block count) instead of computing them yourself from device attributes.
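For context, that occupancy-driven pattern looks roughly like the following sketch; `my_kernel`, `args`, and `n` are hypothetical placeholders, not names from sorting.jl:

    using CUDA

    # Compile the kernel without launching it, then ask the occupancy API
    # for suggested bounds instead of deriving them from device attributes.
    kernel = @cuda launch=false my_kernel(args...)
    config = launch_configuration(kernel.fun)  # NamedTuple: (blocks, threads)

    # Use the returned values to size the actual launch.
    threads = min(n, config.threads)
    blocks = cld(n, threads)
    kernel(args...; threads, blocks)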

maleadt mentioned this issue May 3, 2023
@maleadt (Member) commented Jun 13, 2023

@xaellison Bump: got a minute to explain your reasoning behind the launch configuration?

CUDA.jl/src/sorting.jl

Lines 864 to 873 in a719eb3

# determine launch configuration
blocks_per_mp = if CUDA.driver_version() >= v"11.0"
    CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_BLOCKS_PER_MULTIPROCESSOR)
else
    16
end
blocks_per_mp = 16 # XXX: JuliaGPU/CUDA.jl#1874
threads = min(threads1, threads2)
min_pseudo_block = threads ÷ blocks_per_mp
log_threads = threads |> log2 |> Int

@xaellison (Contributor, Author)

Hey @maleadt, there are two things going on here:

  1. This launch config is complex because I assumed that comparator_kernel and comparator_small_kernel needed to have the same block size. I believe that can be relaxed, since comparator_kernel has no logic that requires block-level coordination.
  2. The sort is correct iff blocks_per_mp is a power of two: 4, 8, and 16 work, but 6, 12, and 24 do not. That holds on both my Turing and Lovelace machines, which seems like a bug (see the sketch below).

Hopefully I can take a closer look soon and fix this.
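Given point 2, one possible stopgap (my assumption, not necessarily the fix that landed in #1979) is to round the reported attribute down to the nearest power of two:

    using CUDA

    blocks_per_mp = CUDA.attribute(device(),
        CUDA.DEVICE_ATTRIBUTE_MAX_BLOCKS_PER_MULTIPROCESSOR)

    # prevpow(2, x) is Base Julia: the largest power of 2 <= x (24 -> 16),
    # restoring the power-of-two property the sort appears to rely on.
    blocks_per_mp = prevpow(2, blocks_per_mp)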
