Feature/get data gpu #230

Rad-hi · 2023-10-30T23:41:35Z

Keep the `retrieved` data on GPU

Description

Through the utilization of cupy.ndarrays (extra dependency), we were able to keep the retrieved data on GPU and provide a view of the memory on the Python side.

This PR introduces a slight modification to the get_data() function in a way that extends the API, but doesn't break any of its current functionalities.

Notes

In an effort to stay true to the original API, data on the GPU could be either deeply copied, or viewed by the Python API consumer.

A list of references is kept as a comment above the added code for potential enhancements, and/or better understanding of the feature.

Tests

This was tested with the ZED SDK 4.0, on both SVOs (recorded in HD2K-15), and on a live stream from a ZED Mini, with a custom wrapper of the Python API, on a Ubuntu 20.04.6 LTS system, with an Nvidia GTX 1650 4Gb graphics card.

Additionally, we are attempting its test on an Nvidia AGX Orin 32Gb developer kit, though we're having some Cython compilation errors that are prohibiting us from proceeding with the tests (would love to discuss, but not our topic).

Disclaimers

This is not thoroughly tested, and this PR is intended to spark a discussion around how this feature might be developed by the community, rather than being a final product (though we hope it's ready as is).

…emory through the seperate function call

adujardin · 2023-10-31T08:58:24Z

This is interesting, thanks for your effort! Do you have a sample code where you use the GPU buffer, maybe with Pytorch/TensorRT? Did you notice a significant performance improvement?

Rad-hi · 2023-10-31T11:52:57Z

Thanks!
Currently, I only used it with a custom TF that applies to the whole pointcloud (rotation along the X axis) using numba.cuda accelerated Python code:

import math
import time
import numpy as np
import cupy as cp
from numba import cuda
from numba.cuda.cudadrv.devicearray import DeviceNDArray
import numba
import torch

from typing import Union, Tuple, Callable, TypeVar
from typing_extensions import ParamSpec

T = TypeVar('T')
P = ParamSpec('P')
RotateCallable = Callable[[float, float, float, float], Tuple[float, float]]

Y = 1
Z = 2
ROT = 69.0


class TimeMe:
    '''
    Timing decorator that takes into account whether the timing would be done in GPU or CPU
    @depends: torch
    Ref: https://discuss.pytorch.org/t/how-to-measure-time-in-pytorch/26964/2
    '''
    def __init__(self, gpu: bool = False) -> None:
        self._gpu: bool = gpu
        if gpu:
            self._start = torch.cuda.Event(enable_timing=True)
            self._end = torch.cuda.Event(enable_timing=True)

    def __call__(self, func: Callable[P, T]) -> Callable[P, T]:
        def wrapper(*args: P.args, **kwargs: P.kwargs) -> Tuple[float, T]:
            all_ts: float = 0.0
            ts: float = 0.0

            if self._gpu:
                self._start.record()
            else:
                ts = time.time()

            result = func(*args, **kwargs)

            if self._gpu:
                self._end.record()
                torch.cuda.synchronize()

            all_ts = self._start.elapsed_time(self._end) if self._gpu else (time.time() - ts) * 1000.0

            return all_ts, result

        return wrapper


def rotate_yz_sin_cos(y: float, z: float, sin_: float, cos_: float) -> Tuple[float, float]:
    if math.isfinite(y) and math.isfinite(z):
        yy = y * cos_ - z * sin_
        zz = y * sin_ + z * cos_
        return yy, zz
    return math.nan, math.nan


# JIT compile for both CPU and GPU
rotate_yz_sin_cos_cpu: RotateCallable = numba.njit(nogil=True, cache=True)(rotate_yz_sin_cos)
rotate_yz_sin_cos_gpu: RotateCallable = cuda.jit(device=True, cache=True)(rotate_yz_sin_cos)


@cuda.jit
def rotate_pcl_kernel(pcl: DeviceNDArray, rot: float) -> DeviceNDArray:
    '''
    GPU kernel to rotate pointclouds with an angle rot
    Ref: https://github.com/harrism/numba_examples/blob/master/mandelbrot_numba.ipynb
    '''
    sin_ = math.sin(rot)
    cos_ = math.cos(rot)
    h, w, _ = pcl.shape

    startX, startY = cuda.grid(2)
    gridX = cuda.gridDim.x * cuda.blockDim.x
    gridY = cuda.gridDim.y * cuda.blockDim.y

    for i in range(startY, h, gridY):
        for j in range(startX, w, gridX):
            pcl[i][j][Y], pcl[i][j][Z] = rotate_yz_sin_cos_gpu(
                pcl[i][j][Y], pcl[i][j][Z], sin_, cos_)


@TimeMe(gpu=True)
def rotate_pcl_gpu(pcl: Union[cp.ndarray, np.ndarray],
                   rot: float,
                   h2d: bool = False) -> Union[cp.ndarray, np.ndarray]:
    blockdim = (16, 16)
    griddim = (16, 16)

    pcl = cuda.to_device(pcl) if h2d else pcl
    rotate_pcl_kernel[griddim, blockdim](pcl, rot)
    pcl = pcl.copy_to_host() if h2d else pcl

    # The rotation doesn't happen inplace only when H2D/D2H
    # transfers were requested; pcl originally on host memory 
    return pcl


@TimeMe()
@numba.njit(nogil=True, parallel=True, cache=True)
def rotate_pcl_cpu(pcl: np.ndarray, rot: float) -> None:
    sin_ = math.sin(rot)
    cos_ = math.cos(rot)
    h, w, _ = pcl.shape
    for i in numba.prange(h):
        for j in numba.prange(w):
            pcl[i][j][Y], pcl[i][j][Z] = rotate_yz_sin_cos_cpu(
                pcl[i][j][Y], pcl[i][j][Z], sin_, cos_)


if __name__ == '__main__':

    # TODO: add the Zed init and Mat creation
    pcl_gpu = pcl_mat.get_data(memory_type=sl.MEM.GPU)

    # Note that this will take "D2H" time
    # Ref: https://docs.cupy.dev/en/stable/reference/generated/cupy.ndarray.html#cupy.ndarray.get
    pcl_cpu = pcl_gpu.get()

    # rotation will happen inplace since pcl is on Device memory
    rot_gpu_lat, _ = rotate_pcl_gpu(pcl_gpu, ROT)
    rot_cpu_lat, _ = rotate_pcl_cpu(pcl_cpu, ROT)

    # NOTE: Tests for correctness were performed to make sure the rotation is identical on GPU/CPU,
    # and that the rotation actually happened; inplace.

Benchmark

System info:

GPU: NVIDIA GeForce GTX 1650, 4096 MB
CPU: 11th Gen Intel® Core™ i7-11370H @ 3.30GHz × 8

GPU warmup iterations:

SVO had 198 frames, and we used a queue of length 100 for profiling

Numbers:

[ROT_GPU] Mean: 1114.543 us, Std: 103.850 us, Max: 2000.896 us, Min: 1052.640 us, N Samples: 100.
[ROT_CPU] Mean: 5875.752 us, Std: 2744.054 us, Max: 16989.231 us, Min: 4649.878 us, N Samples: 100.
[D2H] Mean: 31647.370 us, Std: 1649.754 us, Max: 42148.352 us, Min: 29398.680 us, N Samples: 100.

As you can see, the gain is significant, and we can write all pcl processing code in the same manner, thus, never having to bring the pcl to the host memory. Additionally, we can run DL inference on the frames using TRT without having to pay for the D2H-H2D transfers.

Notes about the cupy choice:

Cupy has great interoperability with most common frameworks (notably numba.cuda, Pytorch, and TensorRT) https://docs.cupy.dev/en/stable/user_guide/interoperability.html.

The same results might be achieved using numba.cuda DeviceNDArrays, but we didn't test it.

adujardin · 2023-10-31T14:56:44Z

Impressive! We need to think about how we could go forward with this. Ideally, I think it would be best to have the cupy dependency optional to limit complexity for most users. I'll try to test this soon

Rad-hi · 2023-10-31T20:48:58Z

When it comes to dependency management, I do agree that cupy being an optional dependency is the way to go.

I also want to note that we've seen an increase of the maximum FPS. We went from around 12 FPS (grab/retrieve on CPU), to around 15 when we started keeping the data on the GPU (even with 2 Zed mini cameras working at the same time, on an Orin 32Gb). This led us to believe that your specs are measured using the C++ API !?

Testing session's findings

I spent some time testing it in depth, and I saw some confusing behavior that I would love your help about.
So, as I mentioned, I had code to test the correctness of the retrieved data. The code compares the CPU frame/pcl with its GPU counterpart (I considered the CPU data as the ground truth) element wise, and gives me back the match percentage:

@numba.njit(nogil=True, cache=True)
def almost_eq(a: Union[int, float], b: Union[int, float], e: Union[int, float]) -> bool: 
    if not (math.isfinite(a) and math.isfinite(b)):
        # If either is nan, consider them equal
        return True
    return abs(a - b) <= e

@numba.njit(nogil=True, cache=True)
def count_almost_eq(mat1: np.ndarray, mat2: np.ndarray, eupsilon: float) -> None:
    '''
    This method is needed (instead of np.count_nonzero(mat1 == mat2)) becauce
    np.nan == np.nan >>> False. Thus we need to manually check the elements.
    '''
    sh1 = mat1.shape
    sh2 = mat2.shape
    assert sh1 == sh2
    assert mat1.size == mat2.size

    h, w, _ = sh1

    eq = mat1.size

    for j in range(h):
        for i in range(w):
            x_eq = almost_eq(mat1[j][i][X], mat2[j][i][X], eupsilon)
            y_eq = almost_eq(mat1[j][i][Y], mat2[j][i][Y], eupsilon)
            z_eq = almost_eq(mat1[j][i][Z], mat2[j][i][Z], eupsilon)
            eq -= 1 * (not (x_eq and y_eq and z_eq))
    return eq

# Retrieve the image/measure
gpu_pcl = pcl_mat.get_data(memory_type=sl.MEM.GPU)
gpu_frame = frame_mat.get_data(memory_type=sl.MEM.GPU)

# Bring the data to Host memory to visualize it and test its correctness
gpu_pcl_np = gpu_pcl.get()
gpu_frame_np = gpu_frame.get()

print(f'PCL -- {(count_almost_eq(gpu_pcl, gpu_pcl_np, 0.000001) / pcl_cpu.size) * 100.0:.2f} % almost identical')
print(f'RGB -- {(count_almost_eq(cpu_frame, gpu_frame_np, 1) / cpu_frame.size) * 100.0:.2f} % almost identical')

I tested with live streams from a ZED Mini, a ZED 2, and SVOs, both on my laptop, and an Nvidia AGX Orin Dev kit, with the latest SDK (4.0).

The confusing behavior is that in HD1080 and HD720 (in all FPS options), everything works perfectly (100% match on both PCL, and Frame), but when I choose either HD2K or VGA (in all FPS options), the PCL still matches 100%, but the Frame only matches around 74 ±3 %.

Additionally, when I visualize the Frames (after converting Cupy arr to Numpy arr), the GPU one looks kinda shifted (for HD2K and VGA only):

I tried:

Adjusting the polling rate (how frequently I called grab/retrieve) by adding and removing some delays here and there,
Only retrieving the frame,

But nothing seems to fix this,

Am I missing something about how CUDA works, or how the frame data is “organized” in memory ?
Is this related to a striding mechanism in the frame representation in GPU (I am just guessing, not really sure what I am asking) ?

Thank you for indulging me.

Rad-hi · 2023-11-28T14:31:13Z

@adujardin I was wondering if you had time to test this out, or maybe if you can point me towards why this issue might be happening.
In order to further test the data retrieval, I created the image from the RGBA data in the PCL using this function:

@numba.njit(cache=True, parallel=True)
def xyzrgba2rgba(src: np.ndarray, dst: np.ndarray) -> None:
    '''
    Fill dst matrix with RGBA data from the XYZRGBA pcl.
    '''
    RGBA_AXIS = 3
    BYTE_MASK = 0xFF
    R, G, B = (0, 1, 2)
    RED_MASK = np.uint32(BYTE_MASK << (R * 8))
    GREEN_MASK = np.uint32(BYTE_MASK << (G * 8))
    BLUE_MASK = np.uint32(BYTE_MASK << (B * 8))

    h, w, _ = dst.shape
    for i in numba.prange(w):
        for j in numba.prange(h):
            if not np.isfinite(src[j][i][RGBA_AXIS]):
                continue
            rgba_bin = np.asarray(src[j][i][RGBA_AXIS], np.float32).view(np.uint32)
            dst[j][i][R] = np.uint8((rgba_bin & RED_MASK) >> (R * 8))
            dst[j][i][G] = np.uint8((rgba_bin & GREEN_MASK) >> (G * 8))
            dst[j][i][B] = np.uint8((rgba_bin & BLUE_MASK) >> (B * 8))
            dst[j][i][RGBA_AXIS] = BYTE_MASK


def image_from_pcl_rgba(pcl):
    im = np.ones_like(pcl, dtype=np.uint8)
    xyzrgba2rgba(pcl, im)
    return im

And it gave me the next two images (CPU retrieval VS image from GPU PCL), which points me towards the fact that the data is in fact being read correctly.

Do you have any ideas for me to try ?

Neeklow · 2024-04-23T09:54:46Z

Hi! I'm using just now the python API for a project, and I realized that one of the main bottlenecks is, also for me, the double passage of textures from gpu to cpu.
I'd be super interested in testing your PR but I'm rather confused on how to test the repo. Could I ask you some directions?
Thanks a lot!

Rad-hi · 2024-04-24T10:27:57Z

Hey, @Neeklow!
I would appreciate another tester.
So to test it, you need to build the Python API from source, for that:

clone this repo
navigate to src folder and follow the README for instructions for building the pluglin
Use the API as you are used to, but for the memory_type argument, supply sl.MEM.GPU instead of sl.MEM.CPU

It's to note that I only tested this with the ZED 4.0 SDK and with Python 3.8.

Neeklow · 2024-04-25T10:01:20Z

Thanks a lot @Rad-hi !
I'm having troubles to build the plugin.
I'm currently on a python 3.8 virtual environment (I'm assuming that shouldn't be a problem?).
After I install all the requirements, I run python setup.py build and I first receive

ZED SDK Version: OK
compilation flags: 
include dirs: ['C:\\Users\\neeklo\\Documents\\Repos\\ng-stories-people-room-mono\\workflow\\camera-feed-system\\.venv\\lib\\site-packages\\numpy\\core\\include', 'C:\\Program Files (x86)\\ZED SDK/include', ``'C:\\Users\\neeklo\\scoop\\apps\\cuda\\current/include']
library dirs: ['C:\\Users\\neeklo\\Documents\\Repos\\ng-stories-people-room-mono\\workflow\\camera-feed-system\\.venv\\lib\\site-packages\\numpy\\core\\include', 'C:\\Program Files (x86)\\ZED SDK/lib', 'C:\\Users\\neeklo\\scoop\\apps\\cuda\\current/lib/x64']
libraries: ['sl_zed64']
Building module: ('pyzed.sl', ['pyzed/sl.pyx'])
Compiling pyzed/sl.pyx because it changed.
[1/1] Cythonizing pyzed/sl.pyx

Followed by a mega Cython error from (Cython.Compiler.Errors.CompileError: pyzed/sl.pyx ... the error is huge, so I'm attaching errors-build.txt printed to avoid cluttering here). The currently installed Cython is Cython 3.0.10 (not sure if this make any difference).

Any help is really appreciated!!

Rad-hi · 2024-04-26T10:16:53Z

Okay so after inspecting the build log, I think I can see the issue:

When running this line, from sl_c cimport ( String, to_str, Camera as c_Camera, ERROR_CODE as c_ERROR_CODE, toString, python is not able to find sl_c.pxd:

from sl_c cimport ( String, to_str, Camera as c_Camera, ERROR_CODE as c_ERROR_CODE, toString
^
------------------------------------------------------------
pyzed\sl.pyx:28:0: 'sl_c.pxd' not found

this is from the errors-build.txt ~ line 21

Two possibilities:

you're not running the python3 setup.py build command under the src folder
this might be a path issue of some sort, where python is not able to locate the sl_c.pxd file, in which case you could help Python by adding the path to the pyzed folder to the system path like:

from libcpp.map cimport map

# Add the next two lines to sl.pyx, ~ line 27
import sys
sys.path.append('/ABSOLUTE/PATH/TO/ZED/API/zed-python-api/src/pyzed')

# And modify the import: note the pyzed.sl_c
from pyzed.sl_c cimport ( String, to_str, Camera as c_Camera, ERROR_CODE as c_ERROR_CODE, toString

Make sure to update the path with the appropriate value (I see you're using windows, and I have no idea how it works)

Hope this helps.

Neeklow · 2024-04-29T09:40:16Z

@Rad-hi thansk for the tips. That defineteyl helped but it seems I'm having troubles with MSCV compiler:

error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community\\VC\\Tools\\MSVC\\14.29.30133\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2

Is where I get to (i've uploaded the whole build error.txt again just in case).
I've been bashing my head around this for the past few days, and I'm a bit short of ideas. If you have any experience/i

dea where to look for solution I'd really appreciate (but I understand you're not on Win and I seem to understand this is a strictly windows Issue)!
Thanks a lot!

Rad-hi · 2024-04-29T15:38:55Z

I see @Neeklow, you seem to be encountering the same issues I encountered when trying to build this for the Orin (an NVIDIA board I am working on).

The actual error you're encountering is (line ~303 in the build log):

pyzed/sl.cpp(146250): error C2039: 'gnss_ignore_threshold': is not a member of 'sl::PositionalTrackingFusionParameters'

And the quickest fix for me was to comment out the gnss_ignore_threshold related code. You can find the changes in this branch https://github.com/Rad-hi/zed-python-api/tree/changes-for-orin.

Beware of the changes to the sl.pyx in the branch:

# This is added because sl_c.pxd wasn't being found
import os
import sys
HOME = os.path.expanduser('~')
sys.path.append(os.path.join(HOME, 'Desktop/zed-python-api/src'))

Since I placed the API under my desktop at the time of testing this. Apply the changes that you applied in the past fix here instead of keeping it this way.

I hope this helps and the API builds.

Rad-hi added 3 commits October 27, 2023 17:59

Added get_data() on GPU through cupy arrays

5ccacfa

fix image mat dtype uint8 not float32

592e82d

utalized deep_copy in GPU option, and removed option to get unowned m…

17b42ae

…emory through the seperate function call

Rad-hi added 3 commits November 1, 2023 12:15

fix cleanall leaves behind pyzed/sl.cpp

c20e4e6

fix previous commit since it breaks the install step

6d3c6e8

fix deep_copy in CPU seg-faults since arr is not a cdef

e5adef7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/get data gpu #230

Feature/get data gpu #230

Rad-hi commented Oct 30, 2023 •

edited

Loading

adujardin commented Oct 31, 2023

Rad-hi commented Oct 31, 2023

adujardin commented Oct 31, 2023

Rad-hi commented Oct 31, 2023 •

edited

Loading

Rad-hi commented Nov 28, 2023

Neeklow commented Apr 23, 2024 •

edited

Loading

Rad-hi commented Apr 24, 2024

Neeklow commented Apr 25, 2024

Rad-hi commented Apr 26, 2024

Neeklow commented Apr 29, 2024

Rad-hi commented Apr 29, 2024

Feature/get data gpu #230

Are you sure you want to change the base?

Feature/get data gpu #230

Conversation

Rad-hi commented Oct 30, 2023 • edited Loading

Keep the retrieved data on GPU

Description

Notes

Tests

Disclaimers

adujardin commented Oct 31, 2023

Rad-hi commented Oct 31, 2023

Benchmark

System info:

GPU warmup iterations:

Numbers:

Notes about the cupy choice:

adujardin commented Oct 31, 2023

Rad-hi commented Oct 31, 2023 • edited Loading

Testing session's findings

Rad-hi commented Nov 28, 2023

Neeklow commented Apr 23, 2024 • edited Loading

Rad-hi commented Apr 24, 2024

Neeklow commented Apr 25, 2024

Rad-hi commented Apr 26, 2024

Neeklow commented Apr 29, 2024

Rad-hi commented Apr 29, 2024

Rad-hi commented Oct 30, 2023 •

edited

Loading

Keep the `retrieved` data on GPU

Rad-hi commented Oct 31, 2023 •

edited

Loading

Neeklow commented Apr 23, 2024 •

edited

Loading