Add CUDA decoding support #242

Open
wants to merge 27 commits into main
Conversation

@ahmadsharif1 ahmadsharif1 (Contributor) commented Oct 4, 2024

Actually implement CUDA decoding in C++:

  1. Initialize a CUDA device if requested. We create a small tensor on the device to initialize the context.
  2. Use the CUDA device to decode the video to NV12 format.
  3. Use libNPP to convert from NV12 to RGB. We record a CUDA event after the conversion and wait on it, so there are no race conditions when the downstream consumer (which is on a different stream than libNPP) accesses this tensor.

Note that the frames the GPU decodes are not bit-accurate. This is by design, so the tests check that tensors are approximately equal rather than exactly equal. The actual tensor values depend on the GPU architecture because GPU math is not precise.
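To make the flow concrete, here is a minimal sketch of steps 2-3, assuming NPP's nppiNV12ToRGB_8u_P2C3R and the default NPP stream. The function name, variable names, and the surrounding plumbing are illustrative, not the actual code in this PR:

#include <ATen/cuda/CUDAContext.h>
#include <c10/util/Exception.h>
#include <cuda_runtime.h>
#include <npp.h>
#include <torch/types.h>

// Convert one decoded NV12 frame (two GPU planes) into an HWC uint8 RGB tensor and
// synchronize so downstream consumers on a different stream see the finished result.
torch::Tensor convertNV12FrameToRGB(
    const Npp8u* yPlane,   // luma plane, already on the GPU
    const Npp8u* uvPlane,  // interleaved chroma plane, already on the GPU
    int pitch,             // line stride of both planes, in bytes
    int width,
    int height,
    const torch::Device& device) {
  torch::Tensor rgb = torch::empty(
      {height, width, 3}, torch::dtype(torch::kUInt8).device(device));

  const Npp8u* src[2] = {yPlane, uvPlane};
  NppiSize roi = {width, height};
  NppStatus status = nppiNV12ToRGB_8u_P2C3R(
      src, pitch, rgb.data_ptr<uint8_t>(), width * 3, roi);
  TORCH_CHECK(status == NPP_SUCCESS, "NV12 -> RGB conversion failed: ", status);

  // Record an event on the stream NPP ran on and make the current PyTorch stream wait
  // on it, so downstream reads of `rgb` cannot race with the color conversion.
  cudaEvent_t conversionDone;
  cudaEventCreate(&conversionDone);
  cudaEventRecord(conversionDone, nppGetStream());
  cudaStreamWaitEvent(at::cuda::getCurrentCUDAStream(), conversionDone, 0);
  cudaEventDestroy(conversionDone);
  return rgb;
}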

Also added a gpu_benchmark with the following results:

python benchmarks/decoders/gpu_benchmark.py --video /tmp/frame_numbers_1920x1080_100.mp4
[--------------------- Decode+Resize Time --------------------]
                       |  video=frame_numbers_1920x1080_100.mp4
1 threads: ----------------------------------------------------
      D=cuda R=cuda    |                   12.1                
      D=cuda R=cpu     |                  148.0                
      D=cuda R=native  |                   11.5                
      D=cuda R=none    |                   11.5                
      D=cpu R=cuda     |                   16.9                
      D=cpu R=cpu      |                  134.2                
      D=cpu R=native   |                   23.2                
      D=cpu R=none     |                    9.4                

Times are in seconds (s).

Key: D=Decode, R=Resize
Native resize is done as part of the decode step.
R=none means there is no resize step, native or otherwise.

Results show that a single NVDEC is slower than a 22-core CPU without resizing, but faster with resizing.

I also added a "throughput mode" for the benchmark that decodes W videos in parallel using T threads. Results of this "throughput mode" shows that A100 has higher decode throughput than my 22-core CPU:

python benchmarks/decoders/gpu_benchmark.py --video /tmp/frame_numbers_1920x1080_100.mp4 --devices=cuda:0,cpu --resize_devices=none --num_threads 10 --num_videos 10

[---------------------------------- Decode+Resize Time ----------------------------------]
                               |  threads=10 work=10 video=frame_numbers_1920x1080_100.mp4
1 threads: -------------------------------------------------------------------------------
      D=cuda R=none T=10 W=10  |                            29.0                          
      D=cpu R=none T=10 W=10   |                            38.8                          

Times are in seconds (s).

Key: D=Decode, R=Resize, T=threads, W=work (number of videos to decode)
Native resize is done as part of the decode step.
R=none means there is no resize step, native or otherwise.

nvidia-smi shows 99% NVDEC utilization :)

# gpu         pid   type     sm    mem    enc    dec    jpg    ofa    command 
# Idx           #    C/G      %      %      %      %      %      %    name 
    0    2180816     C     70      5      -     99      -      -    python      

This throughput mode is representative of video decoding using the dataloader with multiple threads.

@facebook-github-bot added the CLA Signed label Oct 4, 2024
@ronghanghu commented Oct 7, 2024

Looking forward to this! It would be great to have GPU decoding added (back) to TorchCodec

const torch::Device& device,
AVCodecContext* codecContext);

VideoDecoder::DecodedOutput convertAVFrameToDecodedOutputOnDevice(
Member

Should "OnDevice" be "OnCUDA"? I know that within the context of CUDA development, "device" is often used to mean the GPU in contrast to the host, but in the context of torchcodec/pytorch the distinction isn't always as obvious to me. The CPU is a device too.

Contributor Author

Since this is an interface, I want it to be generic so we can support AMD, etc. in the future. That's why I call it "device".

@scotts scotts (Contributor) Oct 8, 2024

Agreed that we should say "OnCUDA" if the implementation only supports CUDA. We may support other kinds of devices in the future. If we have "device" in the name, that should mean the implementation works for any kind of device.

const VideoDecoder::VideoStreamDecoderOptions& options,
AVCodecContext* codecContext,
VideoDecoder::RawDecodedOutput& rawOutput) {
throwUnsupportedDeviceError(device);
Contributor

Should this function and the function below always throw? If yes, then we should just do something like TORCH_CHECK(false, "Unsupported device."). To avoid the need for a return value, mark the function as [[noreturn]]: https://en.cppreference.com/w/cpp/language/attributes/noreturn. We should rely on a TORCH macro to do the throwing for us rather than throwing ourselves, and we should make it obvious that the check will always fail.
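For reference, a minimal sketch of the suggested pattern (the function name and message are illustrative):

#include <c10/util/Exception.h>
#include <torch/types.h>

// TORCH_CHECK(false, ...) always throws, and [[noreturn]] tells the compiler that no
// return value is needed on this path.
[[noreturn]] void throwUnsupportedDeviceError(const torch::Device& device) {
  TORCH_CHECK(false, "Unsupported device: ", device.str());
}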

Contributor Author

Good suggestion. Done

Contributor

I think we should probably also annotate convertAVFrameToDecodedOutputOnDevice() and initializeDeviceContext() with [[noreturn]]. Let's also avoid two TORCH_CHECK calls. Whatever message we want to put on stderr, we can do it in one check.

Contributor Author

The two checks are there because one catches a programming/logic error on our part: we should never pass a CPU device into device functions.

The other catches a device type that support was not compiled in for.
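A sketch of how those two distinct checks might read (the function name and messages are illustrative, not the PR's actual code):

#include <c10/util/Exception.h>
#include <torch/types.h>

[[noreturn]] void reportUnsupportedDevice(const torch::Device& device) {
  // Logic error on our side: device-specific code paths should never be entered with a
  // CPU device.
  TORCH_CHECK(
      device.type() != torch::kCPU,
      "Device functions should never be called with a CPU device.");
  // User-facing error: support for this device type was not compiled into this build.
  TORCH_CHECK(
      false,
      "Unsupported device: ",
      device.str(),
      ". TorchCodec was not built with support for this device.");
}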

Comment on lines 90 to 91
at::DeviceIndex deviceIndex = device.index();
deviceIndex = std::max<at::DeviceIndex>(deviceIndex, 0);
Contributor

Two things:

  1. This is the second place we're doing this same logic. We should abstract it into a function, even though it's small. The function name will probably help with my second point.
  2. I'm not quite sure why we're doing it? Under what circumstance will ATen's reported index for a device be less than 0? It looks like it defaults to -1 in some cases (https://pytorch.org/cppdocs/api/structc10_1_1_device.html#_CPPv4N3c106Device6DeviceE10DeviceType11DeviceIndex), but wouldn't that be an error for us? Notably, this logic will make any value less than 0 be 0, which means maybe we could map multiple devices to 0. I don't think we should ever see such values, but it's confusing to me that our code makes it possible.

Contributor Author

FFmpeg doesn't accept negative values for the device index. I added a comment to that effect.

Contributor Author

For a single-GPU setup, libtorch returns -1 while FFmpeg assumes it will be 0, so we have to bridge that gap.

For a multi-GPU setup, I haven't seen -1 returned by torch, so there we won't have to do the max.

The -1 seems to be a libtorch-specific thing.
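A sketch of the small helper this logic could be factored into, per the suggestion above (the name is hypothetical):

#include <algorithm>
#include <c10/core/Device.h>
#include <torch/types.h>

// libtorch reports -1 ("current device") on a single-GPU setup, but FFmpeg rejects
// negative device indices, so clamp to 0.
at::DeviceIndex getFFMPEGCompatibleDeviceIndex(const torch::Device& device) {
  return std::max<at::DeviceIndex>(device.index(), 0);
}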

at::DeviceIndex deviceIndex = device.index();
deviceIndex = std::max<at::DeviceIndex>(deviceIndex, 0);
at::DeviceIndex originalDeviceIndex = at::cuda::current_device();
cudaSetDevice(deviceIndex);
Contributor

I can see that later on line 121 we restore the device index. Why? Can we explain why we need to set and then restore?

Contributor Author

I ended up using CUDADeviceGuard. Callers assume this function does not interfere with the current CUDA device.
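A minimal sketch of that RAII approach, using c10::cuda::CUDAGuard (the surrounding function is illustrative):

#include <c10/cuda/CUDAGuard.h>
#include <torch/types.h>

void doCudaWorkWithoutLeakingDeviceState(const torch::Device& device) {
  // Sets the current CUDA device to `device` and restores the previous one when the
  // guard goes out of scope, so callers never observe a changed current device.
  c10::cuda::CUDAGuard deviceGuard(device);
  // ... device-specific work goes here ...
}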

@@ -856,6 +856,25 @@ VideoDecoder::DecodedOutput VideoDecoder::convertAVFrameToDecodedOutput(
output.duration = getDuration(frame);
output.durationSeconds = ptsToSeconds(
getDuration(frame), formatContext_->streams[streamIndex]->time_base);
if (streamInfo.options.device.type() != torch::kCPU) {
Contributor

Is this logic truly general to all non-CPU devices? If no, then here and elsewhere, we should do something closer to:

if (streamInfo.options.device.type() == torch::kCUDA) {
  logicSpecificToCUDA();
}
else if (streamInfo.options.device.type() == torch::kCPU) {
  logicSpecificToCPU();
}
else {
  TORCH_CHECK(false, "Unsupported device");
}

Contributor Author

Right now we only support CUDA.

I am assuming that if we support AMD we will use the same interface, and use CMake or #ifdefs to link in the correct device code.

So VideoDecoder.cpp just assumes CMake or the linker will do the right thing and calls the device code for any type of non-CPU device.

At the moment the CUDA device code is linked in by CMake for CUDA builds. How that will be done for AMD is TBD.

Contributor

I see, that makes sense. That also means that we don't need to have CPU versions of functions that throw for all N devices we support.

@ahmadsharif1 ahmadsharif1 marked this pull request as ready for review October 8, 2024 14:25
enum AVHWDeviceType type = av_hwdevice_find_type_by_name("cuda");
TORCH_CHECK(type != AV_HWDEVICE_TYPE_NONE, "Failed to find cuda device");
torch::DeviceIndex deviceIndex = device.index();
// FFMPEG cannot handle negative device indices.
Contributor

I understand now what instigated this code, but I still can't evaluate if it's correct. Looking at the docs, a negative value indicates the "current device": https://pytorch.org/cppdocs/api/structc10_1_1_device.html#_CPPv4N3c106DeviceE

Is it safe to map all values of "current device" to 0? Is this a mapping we need to track? What happens when we are on a system with multiple GPUs? I'm assuming we don't fully understand the answers to these questions, and I don't want to block progress. So I think we should have a meatier comment both explaining what we do know, and indicating this may be a problem in the future.

Contributor Author

I have added a longer comment with a TODO to investigate that it works properly with a multi-GPU setup. I am sure once users start using it, we will hit more edge cases.

@facebook-github-bot (Contributor)

@ahmadsharif1 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ronghanghu
Hi @ahmadsharif1, thanks for the great PR to add back GPU support! Wondering if it's possible to also add back the device parameter into SimpleVideoDecoder, which was previously removed in https://github.com/pytorch/torchcodec/pull/196/files#diff-5ff4f051479ffd5d021001e2a101973746feda3a3f579bf2d072629329c421dc?
