CUDA support for ImageBufAlgo (experimental and very incomplete) #1929

Open

Commits on Oct 13, 2018

  1. CUDA support for ImageBufAlgo (experimental and very incomplete)

    First stab at this. It's experimental, and the general organization may
    change as we extend it.
    
    * To get these features, you must build with `USE_CUDA=1`, in which case
      the build will look for the Cuda toolkit. For simplicity, we're setting
      a version floor of Cuda 7.0 and sm_30.
    
    * To enable at runtime (duh, still only if you built with Cuda support
      enabled), you can either set `OIIO::attribute("cuda",1)` or use the
      magic environment variable `OPENIMAGEIO_CUDA=1`. When running oiiotool,
      the command line argument `--cuda` turns the attribute on (or cheat with
      the aforementioned env variable).
    
    * When the attribute is set, ImageBufs with "local" (not ImageCache-backed)
      float buffers (no other data types yet) will allocate and free with
      cudaMallocManaged/cudaFree (other cases will use the usual malloc/free).
      We are thus heavily leveraging Unified Memory and never do any explicit
      copying of data back and forth.
    
    * Certain ImageBufAlgo functions, then, have the option of calling
      Cuda implementations when all the stars align -- Cuda support enabled,
      Cuda turned on, the ImageBufs in question all have local storage that
      was allocated as visible to Cuda, the buffers are all float, and other
      restrictions to just the most common cases (all image inputs have
      identical ROIs, etc.).
    
    * Implemented this for IBA::add() and sub() initially. Will extend to
      other operations in the future and as the need arises.
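
The "all the stars align" test described in the bullets above can be sketched as a single predicate. This is only an illustration of the listed conditions, not the code in this commit: `Roi`, `BufInfo`, and `can_use_cuda` are hypothetical names invented here, and the real checks would live inside ImageBuf and ImageBufAlgo.

```cpp
#include <vector>

// Hypothetical stand-ins for illustration only; these are not OIIO types.
struct Roi {
    int xbegin, xend, ybegin, yend, chbegin, chend;
    bool operator==(const Roi& o) const {
        return xbegin == o.xbegin && xend == o.xend && ybegin == o.ybegin
            && yend == o.yend && chbegin == o.chbegin && chend == o.chend;
    }
};

struct BufInfo {
    bool local_storage;  // "local" buffer, i.e. not ImageCache-backed
    bool cuda_visible;   // memory was allocated so that Cuda can see it
    bool is_float;       // pixel data type is float
    Roi roi;             // region the operation would cover
};

// Returns true only when every condition from the list above holds:
// built with Cuda, Cuda turned on at runtime, and all buffers are local,
// Cuda-visible, float, and share an identical ROI.
inline bool can_use_cuda(bool built_with_cuda, bool cuda_attribute_on,
                         const std::vector<BufInfo>& bufs)
{
    if (!built_with_cuda || !cuda_attribute_on || bufs.empty())
        return false;
    for (const BufInfo& b : bufs) {
        if (!b.local_storage || !b.cuda_visible || !b.is_float)
            return false;
        if (!(b.roi == bufs.front().roi))
            return false;
    }
    return true;
}
```

When the predicate returns false, the operation would simply fall through to the existing CPU path, so callers never need to know which implementation ran.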
    
    Results and discussion:
    
    Perf: add and sub operations on 1920x1080 3 channel float images, on my
    workstation (16 core Xeon Silver 4110; its ISA is AVX-512 but I'm only
    compiling for SSE4.2 support at the moment), run in about 20ms single
    threaded, ~3.8ms multithreaded. With Cuda enabled (NVIDIA Quadro P5000,
    Pascal architecture), I am getting about 12ms (i.e., moderately faster
    than single core, quite a bit slower than fully using all the CPU
    cores).
    
    Now, this is not an especially good case for GPU -- the compute-to-memory
    ratio is very poor, just a single math op for every 12 bytes of transfer
    on or off the GPU. When I contrive to do an example with about 10x more
    math per pixel, the Cuda times are approximately equal to the CPU times
    when I take advantage of all the CPU cores. Maybe it only helps if we
    do a bunch of IBA operations in a row before needing the results. Maybe
    it's only worth Cuda-accelerating the most expensive operations (resize,
    area ops, etc.), but we'll never get a gain from something simple like add?
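
To put the compute-to-memory ratio in numbers, here is a small back-of-the-envelope sketch. It assumes the "12 bytes" means two 4-byte float reads plus one 4-byte float write per output value; the function names are made up for this illustration and do not exist in OIIO.

```cpp
#include <cstdint>

// Assumption: each output value costs two float reads and one float write.
constexpr std::int64_t kBytesPerFloat = sizeof(float);
constexpr std::int64_t kAccessesPerValue = 3;

// Number of float values in a width x height x channels image.
constexpr std::int64_t value_count(int width, int height, int channels) {
    return std::int64_t(width) * height * channels;
}

// Total bytes touched by one binary op (such as add) over `values` values.
constexpr std::int64_t bytes_moved(std::int64_t values) {
    return values * kAccessesPerValue * kBytesPerFloat;
}

// Effective throughput in GB/s (1 GB = 1e9 bytes) for a given runtime in ms.
constexpr double effective_gb_per_s(std::int64_t bytes, double milliseconds) {
    return double(bytes) / (milliseconds * 1e-3) / 1e9;
}
```

For a 1920x1080 3 channel image this gives 6,220,800 values and about 74.6 MB touched per operation, with exactly one add or subtract per 12 bytes. The timings quoted above then work out to roughly 6.2 GB/s effective at 12ms and roughly 19.6 GB/s at 3.8ms. These are effective rates derived from wall-clock time only; they say nothing about where the time is actually spent.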
    
    If anybody can point out ways in which I'm being very wasteful, please do
    let me know!
    
    Even after we flesh out many more image operations to be
    Cuda-accelerated, and even if we see an improvement in all cases over CPU,
    I don't expect people to see much practical improvement in a typical
    oiiotool command line, since disk/network I/O to read input images and
    write results is almost certain to dominate runtime, compared to the
    math. But if you have a program that's doing a whole bunch of repeated
    image math via IBA calls themselves, that's where the bigger payoff is
    going to be, I think.
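
The reasoning here is essentially Amdahl's law: if only the math is accelerated, the overall gain is capped by how much of the runtime the math was to begin with. A minimal sketch, where `overall_speedup` is a hypothetical helper and the fractions in the note below are illustrative, not measurements:

```cpp
// Amdahl's law: overall speedup when a fraction `accel_fraction` (0..1) of
// the original runtime is made `factor` times faster and the rest (e.g.
// reading and writing image files) is unchanged.
constexpr double overall_speedup(double accel_fraction, double factor) {
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / factor);
}
```

As a purely hypothetical example, if image math were 10% of an oiiotool run and became 5x faster, the whole command would get only about 1.09x faster; if math were 90% of the runtime, as in a program chaining many IBA calls in memory, the same 5x would give about 3.6x overall.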
    
    Note that CUDA is extremely finicky about what compilers it can use,
    with an especially narrow idea of which "host compiler" is required by
    each version of the Cuda Toolkit/nvcc. I'm still working through those
    issues, and am considering the merits of compiling the Cuda code itself
    with clang (if available) rather than nvcc, just to ease up on these
    requirements. We'll be making the rest of the build issues more robust
    over time as well.
    lgritz committed Oct 13, 2018
    77afa5e