GPU jpeg decoder: add batch support and hardware decoding #8496

Merged (41 commits) on Aug 7, 2024

Conversation

deekay42 (Contributor):

Over 8000 imgs/s on 1 A100 GPU


```
Platform: Linux-5.12.0-0_fbk7_zion_6511_gd766966f605a-x86_64-with-glibc2.34
Logical CPUs: 192

CUDA device: NVIDIA PG509-210
Total Memory: 84.99 GB

Mean image size: 551x676
[---------------------------------------------------------------- Image Decoding ----------------------------------------------------------------]
                                                                                                        |  1 images  |  100 images  |  1000 images
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------
      CPU (unfused): [torchvision.io.decode_jpeg(img, device='cpu') for img in encoded_images_trunc]    |   3301.9   |   271141.6   |   2541465.3 
      CPU (fused): torchvision.io.decode_jpeg(encoded_images_trunc, device='cpu')                       |   3239.7   |   288522.8   |   2596394.3 
      CUDA (unfused): [torchvision.io.decode_jpeg(img, device='cuda') for img in encoded_images_trunc]  |    603.7   |    60097.8   |    573783.4 
      CUDA (fused): torchvision.io.decode_jpeg(encoded_images_trunc, device='cuda')                     |    600.6   |    12972.6   |    127654.8 
12 threads: --------------------------------------------------------------------------------------------------------------------------------------
      CPU (unfused): [torchvision.io.decode_jpeg(img, device='cpu') for img in encoded_images_trunc]    |   3330.5   |   272498.9   |   2552944.3 
      CPU (fused): torchvision.io.decode_jpeg(encoded_images_trunc, device='cpu')                       |   3339.7   |   257796.7   |   2511005.4 
      CUDA (unfused): [torchvision.io.decode_jpeg(img, device='cuda') for img in encoded_images_trunc]  |    603.8   |    59138.0   |    588341.4 
      CUDA (fused): torchvision.io.decode_jpeg(encoded_images_trunc, device='cuda')                     |    605.0   |    13163.7   |    127891.4 
24 threads: --------------------------------------------------------------------------------------------------------------------------------------
      CPU (unfused): [torchvision.io.decode_jpeg(img, device='cpu') for img in encoded_images_trunc]    |   3227.5   |   276357.8   |   2518914.3 
      CPU (fused): torchvision.io.decode_jpeg(encoded_images_trunc, device='cpu')                       |   3277.7   |   257554.9   |   2497894.3 
      CUDA (unfused): [torchvision.io.decode_jpeg(img, device='cuda') for img in encoded_images_trunc]  |    607.9   |    58306.1   |    583932.6 
      CUDA (fused): torchvision.io.decode_jpeg(encoded_images_trunc, device='cuda')                     |    653.2   |    12604.1   |    124130.5 

Times are in microseconds (us).
```
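To make the comparison concrete, here is a minimal usage sketch of the unfused and fused (batched) decode calls benchmarked above; the file names are placeholders.

```python
# Minimal sketch of the unfused vs. fused (batched) decode calls benchmarked
# above; the JPEG file names are placeholders.
import torchvision

paths = ["img0.jpg", "img1.jpg"]  # hypothetical JPEG files
encoded_images = [torchvision.io.read_file(p) for p in paths]  # 1-D uint8 tensors

# Unfused: one decode call (and one kernel launch) per image.
decoded_unfused = [torchvision.io.decode_jpeg(img, device="cuda") for img in encoded_images]

# Fused: a single batched call returning a list of CHW uint8 CUDA tensors.
decoded_fused = torchvision.io.decode_jpeg(encoded_images, device="cuda")
```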

deekay42 and others added 30 commits April 22, 2024 22:17
Summary:
I'm adding GPU support to the existing torchvision.io.encode_jpeg function. If the input tensors are on the GPU, the CUDA version will be used; otherwise the CPU version is used.
Additionally, I'm adding a new function torchvision.io.encode_jpegs (plural) which uses a fused kernel and may be faster than successive calls to the singular version, which incurs kernel launch overhead for each call.
If it's alright, I'll be happy to refactor decode_jpeg to follow this convention in a follow-up PR.

Test Plan:
1. pytest test -vvv
2. ufmt format torchvision
3. flake8 torchvision

This reverts commit c5810ff.

pytorch-bot bot commented Jun 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/8496

Note: Links to docs will display an error until the docs builds have been completed.

❌ 13 New Failures, 6 Unrelated Failures

As of commit efa746d with merge base 5242d6a:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

deekay42 marked this pull request as ready for review June 17, 2024 20:27
ahmadsharif1 (Contributor) left a comment:


Flushing out some comments on the C++ code.

const torch::Tensor& data,
ImageReadMode mode,
C10_EXPORT std::vector<torch::Tensor> decode_jpegs_cuda(
const std::vector<torch::Tensor>& encoded_images,
Contributor:

Why is this not a single torch::Tensor that's stacked as opposed to a std::vector of Tensors?

Contributor (Author):

Because the encoded JPEGs are represented as variable-length byte streams (1-D uint8 tensors). They cannot be stacked into a batch because they don't have the same dimensions.
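A minimal sketch of that constraint (the byte lengths are made up for illustration):

```python
# Encoded JPEGs are 1-D uint8 tensors of differing lengths, so torch.stack
# (which requires equal shapes) fails and a list must be used instead.
import torch

a = torch.randint(0, 256, (1024,), dtype=torch.uint8)  # e.g. a 1 KiB byte stream
b = torch.randint(0, 256, (2048,), dtype=torch.uint8)  # e.g. a 2 KiB byte stream

try:
    torch.stack([a, b])
except RuntimeError as e:
    print(e)  # stack expects each tensor to be equal size

encoded_images = [a, b]  # hence the API takes a list of tensors
```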

Contributor:

Request: add that as a comment in the code itself for future readers

torchvision/csrc/io/image/cuda/decode_jpegs_cuda.cpp (resolved)

if (cudaJpegDecoder == nullptr || device != cudaJpegDecoder->target_device) {
if (cudaJpegDecoder != nullptr)
delete cudaJpegDecoder.release();
Contributor:

You can do cudaJpegDecoder.reset() instead of manually deleting

if (cudaJpegDecoder != nullptr)
delete cudaJpegDecoder.release();
cudaJpegDecoder = std::make_unique<CUDAJpegDecoder>(device);
std::atexit([]() { delete cudaJpegDecoder.release(); });
Contributor:

Use .reset() here as well instead of manually deleting?


CUDAJpegDecoder::~CUDAJpegDecoder() {
/*
The below code works on Mac and Linux, but fails on Windows.
Contributor:

Do you want to use a #ifdef _WIN32 here?

Contributor (Author):

I've thought about it, but the C++ destruction order of static objects across translation units is unspecified, so even if it passes on the specific Mac and Linux versions I've tested it on, it still relies on undefined behavior and may fail at any time.

const torch::Device target_device;
const c10::cuda::CUDAStream stream;

protected:
Contributor:

Out of curiosity, why protected and not private?

Contributor (Author):

No real reason. Happy to change it to private.

nvjpegStatus_t status;
cudaError_t cudaStatus;

cudaStatus = cudaStreamSynchronize(stream);
Contributor:

Why is this needed?

Contributor (Author):

That's how they do it here: https://github.com/NVIDIA/CUDALibrarySamples/blob/f17940ac4e705bf47a8c39f5365925c1665f6c98/nvJPEG/nvJPEG-Decoder/nvjpegDecoder.cpp#L36
After the buffers are allocated, they synchronize before starting the decoding.

}
}

cudaStatus = cudaStreamSynchronize(stream);
Contributor:

Why do you have the outer CUDAEvent in the caller of this function if you already wait for all ops to finish here?

Contributor (Author):

Line 563 is needed because I need to do some pruning right after. The outer CUDAEvent is needed to synchronize the internal CUDAJpegDecoder::stream with the calling code's current stream before returning the results.
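As a hypothetical illustration of that pattern (in Python rather than the PR's C++): work launched on an internal side stream has to be synchronized with the caller's current stream before the results are handed back.

```python
# Hypothetical sketch only: a stand-in for synchronizing an internal decode
# stream with the caller's current CUDA stream before returning results.
import torch

side_stream = torch.cuda.Stream()  # stand-in for CUDAJpegDecoder::stream

with torch.cuda.stream(side_stream):
    out = torch.randn(1024, device="cuda") * 2  # stand-in for the decode work

# Make the caller's current stream wait for the side stream's work before
# `out` is consumed, mirroring the outer CUDAEvent in the calling code.
torch.cuda.current_stream().wait_stream(side_stream)
print(out.sum().item())
```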


ahmadsharif1 (Contributor) left a comment:

Looks good. I only have minor comments. Let's wait for Nicolas to review the Python changes too.

nvjpegJpegStream_t jpeg_streams[2];
nvjpegDecodeParams_t nvjpeg_decode_params;
nvjpegJpegDecoder_t nvjpeg_decoder;
bool hw_decode_available{true};
Contributor:

It would be more defensive to set this to false by default and only set it to true at the end of the function, at line 228 of the decode_jpegs_cuda.cpp file.


Comment on lines +560 to +570
status = nvjpegDecodeJpegDevice(
nvjpeg_handle,
nvjpeg_decoder,
nvjpeg_decoupled_state,
&sw_output_buffer[i],
stream);
TORCH_CHECK(
status == NVJPEG_STATUS_SUCCESS,
"Failed to decode jpeg stream: ",
status);
}
Contributor:

I don't understand why this is needed if the decoding is done in software on the host.

Why is another decode on the GPU needed here?

(Add a comment in the code)

Contributor (Author):

There are many different types of JPEGs; for our purposes the most notable are baseline and progressive. The main difference between the two is that progressive JPEGs store the image as a series of successive scans that refine it from a coarse approximation to full quality. Baseline JPEGs can be decoded on the GPU right away, but progressive JPEGs need some preprocessing on the host before the GPU can process them. Added a comment.
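As an aside (not part of this PR), Pillow can report which kind a given file is, which is handy when reproducing the two code paths; the file name below is a placeholder.

```python
# Illustrative helper (not part of this PR): Pillow marks progressive JPEGs
# in the image's info dict.
from PIL import Image

def is_progressive(path: str) -> bool:
    with Image.open(path) as img:
        return bool(img.info.get("progressive", False))

print(is_progressive("example.jpg"))  # placeholder file name
```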

std::vector<nvjpegImage_t> hw_output_buffer;

// other JPEG types such as progressive JPEGs can be decoded one-by-one in
// software slow :(
Contributor:

Add more details here about the software decode process since it appears from the code below that some work is done on the GPU even in this case (and 2 transfers are needed)?

Contributor (Author):

Added to comment on line 400

const c10::cuda::CUDAStream stream;

private:
std::tuple<
Contributor:

Since the return type is a tuple, it's hard to tell what it returns (other than the type).

i.e. it's not obvious that the last element is the number of channels. Can you add a comment about the return type? EDIT: I noticed you do have a comment in the implementation. Maybe move that here?

Even more readable would be a struct with proper member names.
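The same readability argument, sketched in Python for illustration (the PR's code is C++, and these field names are assumptions, not the actual tuple contents):

```python
# Illustrative only: named fields document what each element is, unlike a
# bare tuple where the meaning of the last element is invisible.
from typing import List, NamedTuple
import torch

class DecodeOutput(NamedTuple):
    decoded_images: List[torch.Tensor]  # assumed field
    num_channels: int                   # the tuple's last, otherwise-unnamed element

out = DecodeOutput(decoded_images=[torch.zeros(3, 4, 4)], num_channels=3)
print(out.num_channels)  # self-documenting at the call site, unlike out[-1]
```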

torch.uint8 and device cpu
- output_format (nvjpegOutputFormat_t): NVJPEG_OUTPUT_RGB, NVJPEG_OUTPUT_Y
or NVJPEG_OUTPUT_UNCHANGED
- device (torch::Device): The desired CUDA device for the returned Tensors
Contributor:

This is a stale comment since there is no device arg

// which is related to the subsampling used I'm not sure why this is the
// case, but for now we're just using RGB and later removing channels from
// grayscale images.
output_format = NVJPEG_OUTPUT_UNCHANGED;
Contributor:

Nit: would it be simpler to just use NVJPEG_OUTPUT_RGB here since you are assuming this expands the channels anyway?

Also, add a TODO to investigate and fix this pruning behavior.

namespace image {

std::mutex decoderMutex;
std::unique_ptr<CUDAJpegDecoder> cudaJpegDecoder;
Contributor:

Nit: maybe use gCudaJpegDecoder to indicate it's a global variable?

We do not have a solution to this problem at the moment, so we'll
just leak the libnvjpeg & cuda variables for the time being and hope
that the CUDA runtime handles cleanup for us.
Please send a PR if you have a solution for this problem.
Contributor:

One request: maybe try the driver API to see if cuda is available?

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__DEVICE.html#group__CUDA__DEVICE_1g52b5ce05cb8c5fb6831b2c0ff2887c74

int dummy;
if (cuDeviceGetCount(&dummy) == CUDA_SUCCESS) {
...
}

If that doesn't work, don't bother with a unique_ptr for the global variable and just use a regular pointer.

Add a comment above the global variable saying this is not a unique_ptr because our destructor could race with CUDA's own destructors.

Contributor (Author):

Yeah, I tried that and it didn't work :(

"[torchvision.io.encode_jpeg(img) for img in decoded_images_device_trunc]",
"torchvision.io.encode_jpeg(decoded_images_device_trunc)",
],
["unfused", "fused"],
Contributor:

I could be wrong, but batched seems like a better term than fused since it appears to be batching images, not fusing kernels necessarily.

Contributor (Author):

If the images are batched, it uses a fused kernel.
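For reference, a minimal sketch of the two encode variants this benchmark compares; `imgs` here stands in for `decoded_images_device_trunc`, a list of uint8 CHW tensors already on the GPU.

```python
# Sketch of the unfused vs. fused (batched) encode calls from the benchmark;
# `imgs` stands in for decoded_images_device_trunc (uint8 CHW CUDA tensors).
import torch
import torchvision

imgs = [torch.randint(0, 256, (3, 64, 64), dtype=torch.uint8, device="cuda")
        for _ in range(4)]

encoded_unfused = [torchvision.io.encode_jpeg(img) for img in imgs]  # one call per image
encoded_fused = torchvision.io.encode_jpeg(imgs)                     # single batched call
```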

NicolasHug (Member) left a comment:

Thanks a ton @deekay42

NicolasHug changed the title from "Add gpu decode" to "GPU jpeg decoder: add batch support and hardware decoding" on Aug 6, 2024
NicolasHug merged commit 0d80848 into pytorch:main on Aug 7, 2024
41 of 60 checks passed

github-actions bot commented Aug 7, 2024

Hey @NicolasHug!

You merged this PR, but no labels were added.
The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

facebook-github-bot pushed a commit that referenced this pull request Aug 7, 2024
…8496)

Summary: Co-authored-by: Nicolas Hug <nicolashug@fb.com>

Differential Revision: D60903713

fbshipit-source-id: f0f9908e2be6436372132a575c7ff066129a1f78