
Sort fails on Lovelace (sm8.9) GPUs #1874

Closed
xaellison opened this issue Apr 18, 2023 · 4 comments · Fixed by #1979
Labels: bug (Something isn't working)

Comments

@xaellison (Contributor)

Describe the bug

Both quicksort and bitonic sort fail non-deterministically on a Lovelace GPU.

To reproduce

Quick example:

julia> begin
       c = CUDA.rand(Float32, 1<<16+129)
       CUDA.@sync sort!(c)
       issorted(Array(c))
       end
false

I found this by running `] test CUDA` locally, so the test suite identifies specific failures:
- the "reduced block sizes" testset for quicksort
- the "bitonic sort" testset (link)

Quicksort passes everything except "reduced block sizes"; bitonic sort only sometimes passes.
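Since the failure is non-deterministic, one way to gauge how often it occurs is to sort many fresh arrays of the problematic size and count the bad results. A minimal sketch (the `failure_rate` helper is hypothetical, not part of CUDA.jl):

    using CUDA

    # Sort `trials` fresh random arrays and report the fraction that come
    # back unsorted; on an affected GPU this should be greater than zero.
    function failure_rate(n = (1 << 16) + 129; trials = 100)
        failures = 0
        for _ in 1:trials
            c = CUDA.rand(Float32, n)
            CUDA.@sync sort!(c)
            failures += !issorted(Array(c))
        end
        return failures / trials
    end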


Expected behavior

`sort!` should leave the array sorted, so `issorted(Array(c))` should return `true`.

Version info

Details on Julia:

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × AMD Ryzen 7 5800X 8-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
  Threads: 1 on 2 virtual cores
Environment:
  JULIA_CPU_THREADS = 2

Details on CUDA:

julia> CUDA.versioninfo()
CUDA runtime 12.1, artifact installation
CUDA driver 12.1
NVIDIA driver 531.61.0

Libraries:
- CUBLAS: 12.1.0
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.4
- CUSPARSE: 12.0.2
- CUPTI: 18.0.0
- NVML: 12.0.0+531.61

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce RTX 4090 (sm_89, 18.994 GiB / 23.988 GiB available)

Additional context
This is on branch: https://github.com/xaellison/CUDA.jl/tree/ae_support_sm_89

xaellison added the bug label Apr 18, 2023
@maleadt (Member) commented May 3, 2023

MWE:

    using CUDA

    # mostly zeros, with a handful of scattered negative values
    host = Int8[fill(0, 113)..., -12, 0, 0, 0, 0, 0, 0, 0, 0, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1]
    @show length(host)
    device = CuArray(host)
    sort!(device; alg=CUDA.BitonicSort)
    println(device)
    println(sort(host))
    issorted(Array(device))

This happens because on my Lovelace GPU `blocks_per_mp` is 24, while on an older Turing GPU it is 16. Hard-coding it to 24 reproduces the bug on older hardware, so this looks like a bug in the bitonic sort implementation.
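For reference, the differing value can be queried directly, using the same attribute call that appears in the sorting.jl excerpt below:

    using CUDA

    # Reports 24 on Lovelace (sm_89) and 16 on Turing, per the numbers above.
    blocks_per_mp = CUDA.attribute(device(),
        CUDA.DEVICE_ATTRIBUTE_MAX_BLOCKS_PER_MULTIPROCESSOR)
    @show blocks_per_mp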

@maleadt (Member) commented May 3, 2023

I'm also confused by that launch configuration logic; @xaellison, could you explain it? If you want to ensure enough blocks are launched, you should generally use the numbers returned by the launch configuration query (it returns both a maximum thread count and a minimum block count) instead of computing them yourself from device attributes.
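For context, that occupancy-driven pattern looks roughly like the following sketch; `my_kernel`, `args`, and `n` are hypothetical placeholders, not names from sorting.jl:

    using CUDA

    # Compile the kernel without launching it, then ask the occupancy API
    # for suggested bounds instead of deriving them from device attributes.
    kernel = @cuda launch=false my_kernel(args...)
    config = launch_configuration(kernel.fun)  # NamedTuple: (blocks, threads)

    # Use the returned values to size the actual launch.
    threads = min(n, config.threads)
    blocks = cld(n, threads)
    kernel(args...; threads, blocks)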

maleadt mentioned this issue May 3, 2023
@maleadt (Member) commented Jun 13, 2023

@xaellison Bump: got a minute to explain your reasoning behind the launch configuration?

CUDA.jl/src/sorting.jl

Lines 864 to 873 in a719eb3

# determine launch configuration
blocks_per_mp = if CUDA.driver_version() >= v"11.0"
    CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_BLOCKS_PER_MULTIPROCESSOR)
else
    16
end
blocks_per_mp = 16 # XXX: JuliaGPU/CUDA.jl#1874
threads = min(threads1, threads2)
min_pseudo_block = threads ÷ blocks_per_mp
log_threads = threads |> log2 |> Int

@xaellison (Contributor, Author)

Hey @maleadt, there are two things going on here:

  1. This launch config is complex because I assumed that comparator_kernel and comparator_small_kernel needed to have the same block size. I believe that can be relaxed, since comparator_kernel has no logic that requires block-level coordination.
  2. The sort is correct iff blocks_per_mp is a power of two: 4, 8, and 16 work, but 6, 12, and 24 do not. That holds on both my Turing and Lovelace machines, which seems like a bug (see the sketch below).

Hopefully I can take a closer look soon and fix this.
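Given point 2, one possible stopgap (my assumption, not necessarily the fix that landed in #1979) is to round the reported attribute down to the nearest power of two:

    using CUDA

    blocks_per_mp = CUDA.attribute(device(),
        CUDA.DEVICE_ATTRIBUTE_MAX_BLOCKS_PER_MULTIPROCESSOR)

    # prevpow(2, x) is Base Julia: the largest power of 2 <= x (24 -> 16),
    # restoring the power-of-two property the sort appears to rely on.
    blocks_per_mp = prevpow(2, blocks_per_mp)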
