Feature/get data gpu #230
base: master
Conversation
…emory through the separate function call
This is interesting, thanks for your effort! Do you have sample code where you use the GPU buffer, maybe with PyTorch/TensorRT? Did you notice a significant performance improvement?
Thanks!
import math
import time
import numpy as np
import cupy as cp
from numba import cuda
from numba.cuda.cudadrv.devicearray import DeviceNDArray
import numba
import torch
from typing import Union, Tuple, Callable, TypeVar
from typing_extensions import ParamSpec
T = TypeVar('T')
P = ParamSpec('P')
RotateCallable = Callable[[float, float, float, float], Tuple[float, float]]
Y = 1
Z = 2
ROT = 69.0
class TimeMe:
'''
Timing decorator that takes into account whether the timing would be done in GPU or CPU
@depends: torch
Ref: https://discuss.pytorch.org/t/how-to-measure-time-in-pytorch/26964/2
'''
def __init__(self, gpu: bool = False) -> None:
self._gpu: bool = gpu
if gpu:
self._start = torch.cuda.Event(enable_timing=True)
self._end = torch.cuda.Event(enable_timing=True)
def __call__(self, func: Callable[P, T]) -> Callable[P, Tuple[float, T]]:
def wrapper(*args: P.args, **kwargs: P.kwargs) -> Tuple[float, T]:
all_ts: float = 0.0
ts: float = 0.0
if self._gpu:
self._start.record()
else:
ts = time.time()
result = func(*args, **kwargs)
if self._gpu:
self._end.record()
torch.cuda.synchronize()
all_ts = self._start.elapsed_time(self._end) if self._gpu else (time.time() - ts) * 1000.0
return all_ts, result
return wrapper
def rotate_yz_sin_cos(y: float, z: float, sin_: float, cos_: float) -> Tuple[float, float]:
if math.isfinite(y) and math.isfinite(z):
yy = y * cos_ - z * sin_
zz = y * sin_ + z * cos_
return yy, zz
return math.nan, math.nan
# JIT compile for both CPU and GPU
rotate_yz_sin_cos_cpu: RotateCallable = numba.njit(nogil=True, cache=True)(rotate_yz_sin_cos)
rotate_yz_sin_cos_gpu: RotateCallable = cuda.jit(device=True, cache=True)(rotate_yz_sin_cos)
@cuda.jit
def rotate_pcl_kernel(pcl: DeviceNDArray, rot: float) -> None:
'''
GPU kernel to rotate pointclouds with an angle rot
Ref: https://github.com/harrism/numba_examples/blob/master/mandelbrot_numba.ipynb
'''
sin_ = math.sin(rot)
cos_ = math.cos(rot)
h, w, _ = pcl.shape
startX, startY = cuda.grid(2)
gridX = cuda.gridDim.x * cuda.blockDim.x
gridY = cuda.gridDim.y * cuda.blockDim.y
for i in range(startY, h, gridY):
for j in range(startX, w, gridX):
pcl[i][j][Y], pcl[i][j][Z] = rotate_yz_sin_cos_gpu(
pcl[i][j][Y], pcl[i][j][Z], sin_, cos_)
@TimeMe(gpu=True)
def rotate_pcl_gpu(pcl: Union[cp.ndarray, np.ndarray],
rot: float,
h2d: bool = False) -> Union[cp.ndarray, np.ndarray]:
blockdim = (16, 16)
griddim = (16, 16)
pcl = cuda.to_device(pcl) if h2d else pcl
rotate_pcl_kernel[griddim, blockdim](pcl, rot)
pcl = pcl.copy_to_host() if h2d else pcl
# When h2d is requested (pcl originally in host memory), the rotation is not
# reflected in the caller's array, so the rotated copy must be taken from the return value
return pcl
@TimeMe()
@numba.njit(nogil=True, parallel=True, cache=True)
def rotate_pcl_cpu(pcl: np.ndarray, rot: float) -> None:
sin_ = math.sin(rot)
cos_ = math.cos(rot)
h, w, _ = pcl.shape
for i in numba.prange(h):
for j in numba.prange(w):
pcl[i][j][Y], pcl[i][j][Z] = rotate_yz_sin_cos_cpu(
pcl[i][j][Y], pcl[i][j][Z], sin_, cos_)
if __name__ == '__main__':
# TODO: add the ZED init and Mat creation (requires `import pyzed.sl as sl` and a grabbed frame)
pcl_gpu = pcl_mat.get_data(memory_type=sl.MEM.GPU)
# Note that this will take "D2H" time
# Ref: https://docs.cupy.dev/en/stable/reference/generated/cupy.ndarray.html#cupy.ndarray.get
pcl_cpu = pcl_gpu.get()
# rotation will happen inplace since pcl is on Device memory
rot_gpu_lat, _ = rotate_pcl_gpu(pcl_gpu, ROT)
rot_cpu_lat, _ = rotate_pcl_cpu(pcl_cpu, ROT)
# NOTE: Tests for correctness were performed to make sure the rotation is identical on GPU/CPU,
# and that the rotation actually happened in place.
Benchmark
System info:
GPU warmup iterations:
Numbers:
As you can see, the gain is significant, and we can write all pcl processing code in the same manner, thus never having to bring the pcl to the host memory. Additionally, we can run DL inference on the frames using TRT without having to pay for the D2H-H2D transfers.
Notes about the cupy choice: Cupy has great interoperability with most common frameworks (notably …). The same results might be achieved using …
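To make the interoperability point concrete, here is a minimal sketch (not part of the PR code; it assumes recent cupy, torch, and numba builds and a CUDA-capable device) showing that one device buffer can be viewed from cupy, PyTorch, and Numba with no copies:
import cupy as cp
import torch
from numba import cuda
from torch.utils.dlpack import from_dlpack

# Allocate (or receive) a buffer that already lives on the GPU
pcl_gpu = cp.random.rand(720, 1280, 4).astype(cp.float32)

# Zero-copy view as a PyTorch CUDA tensor via DLPack
pcl_torch = from_dlpack(pcl_gpu.toDlpack())

# Zero-copy view as a Numba device array via __cuda_array_interface__
pcl_numba = cuda.as_cuda_array(pcl_gpu)

# All three objects reference the same device memory, so kernels launched
# through any of them operate on the same data without D2H/H2D transfers
assert pcl_torch.data_ptr() == pcl_gpu.data.ptr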
Impressive! We need to think about how we could go forward with this. Ideally, I think it would be best to have the cupy dependency optional to limit complexity for most users. I'll try to test this soon.
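One way the optional dependency could look (purely a sketch with a hypothetical helper name; the real fallback behavior would be up to the maintainers, and it assumes the PR's extended get_data()) is to guard the cupy import and refuse GPU retrieval when it is missing:
import pyzed.sl as sl

# Guarded import: cupy only becomes a hard requirement when GPU retrieval is used
try:
    import cupy as cp
    _HAS_CUPY = True
except ImportError:
    cp = None
    _HAS_CUPY = False

def get_data_checked(mat, memory_type=sl.MEM.CPU):
    # Hypothetical helper, not part of the PR: fail early with a clear message
    if memory_type == sl.MEM.GPU and not _HAS_CUPY:
        raise RuntimeError('GPU retrieval requested but cupy is not installed; '
                           'install cupy or retrieve with sl.MEM.CPU')
    return mat.get_data(memory_type=memory_type)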
When it comes to dependency management, I do agree that cupy being an optional dependency is the way to go. I also want to note that we've seen an increase of the maximum FPS: we went from around 12 FPS (grab/retrieve on CPU) to around 15 when we started keeping the data on the GPU (even with 2 …).
Testing session's findings
I spent some time testing it in depth, and I saw some confusing behavior that I would love your help with.
# Continues the earlier script (same imports and constants); X is the first pcl axis
X = 0
@numba.njit(nogil=True, cache=True)
def almost_eq(a: Union[int, float], b: Union[int, float], e: Union[int, float]) -> bool:
if not (math.isfinite(a) and math.isfinite(b)):
# If either is nan, consider them equal
return True
return abs(a - b) <= e
@numba.njit(nogil=True, cache=True)
def count_almost_eq(mat1: np.ndarray, mat2: np.ndarray, epsilon: float) -> int:
'''
This method is needed (instead of np.count_nonzero(mat1 == mat2)) because
np.nan == np.nan >>> False. Thus we need to manually check the elements.
'''
'''
sh1 = mat1.shape
sh2 = mat2.shape
assert sh1 == sh2
assert mat1.size == mat2.size
h, w, _ = sh1
eq = mat1.size
for j in range(h):
for i in range(w):
x_eq = almost_eq(mat1[j][i][X], mat2[j][i][X], epsilon)
y_eq = almost_eq(mat1[j][i][Y], mat2[j][i][Y], epsilon)
z_eq = almost_eq(mat1[j][i][Z], mat2[j][i][Z], epsilon)
eq -= 1 * (not (x_eq and y_eq and z_eq))
return eq
# Retrieve the image/measure
gpu_pcl = pcl_mat.get_data(memory_type=sl.MEM.GPU)
gpu_frame = frame_mat.get_data(memory_type=sl.MEM.GPU)
# Bring the data to Host memory to visualize it and test its correctness
gpu_pcl_np = gpu_pcl.get()
gpu_frame_np = gpu_frame.get()
print(f'PCL -- {(count_almost_eq(gpu_pcl, gpu_pcl_np, 0.000001) / pcl_cpu.size) * 100.0:.2f} % almost identical')
print(f'RGB -- {(count_almost_eq(cpu_frame, gpu_frame_np, 1) / cpu_frame.size) * 100.0:.2f} % almost identical')
I tested with live streams from a …
The confusing behavior is that in …
Additionally, when I visualize the Frames (after converting the Cupy arr to a Numpy arr), the GPU one looks kind of shifted (for …).
I tried:
But nothing seems to fix this. Am I missing something about how CUDA works, or how the frame data is “organized” in memory? Thank you for indulging me.
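For reference, this is roughly how I inspect the layout of the buffers involved (a minimal sketch that reuses the arrays from the snippet above; it only prints shape/strides/contiguity, which is what I suspect differs between the GPU and CPU frames):
def describe(name, arr):
    # Works for both numpy and cupy arrays: show how the buffer is laid out
    print(f'{name}: shape={arr.shape} dtype={arr.dtype} '
          f'strides={arr.strides} c_contiguous={arr.flags.c_contiguous}')

describe('gpu_frame', gpu_frame)        # cupy view of the device Mat
describe('gpu_frame_np', gpu_frame_np)  # after .get()
describe('cpu_frame', cpu_frame)        # retrieved with sl.MEM.CPU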
@adujardin I was wondering if you had time to test this out, or maybe if you can point me towards why this issue might be happening.
@numba.njit(cache=True, parallel=True)
def xyzrgba2rgba(src: np.ndarray, dst: np.ndarray) -> None:
'''
Fill dst matrix with RGBA data from the XYZRGBA pcl.
'''
RGBA_AXIS = 3
BYTE_MASK = 0xFF
R, G, B = (0, 1, 2)
RED_MASK = np.uint32(BYTE_MASK << (R * 8))
GREEN_MASK = np.uint32(BYTE_MASK << (G * 8))
BLUE_MASK = np.uint32(BYTE_MASK << (B * 8))
h, w, _ = dst.shape
for i in numba.prange(w):
for j in numba.prange(h):
if not np.isfinite(src[j][i][RGBA_AXIS]):
continue
rgba_bin = np.asarray(src[j][i][RGBA_AXIS], np.float32).view(np.uint32)
dst[j][i][R] = np.uint8((rgba_bin & RED_MASK) >> (R * 8))
dst[j][i][G] = np.uint8((rgba_bin & GREEN_MASK) >> (G * 8))
dst[j][i][B] = np.uint8((rgba_bin & BLUE_MASK) >> (B * 8))
dst[j][i][RGBA_AXIS] = BYTE_MASK
def image_from_pcl_rgba(pcl):
im = np.ones_like(pcl, dtype=np.uint8)
xyzrgba2rgba(pcl, im)
return im
And it gave me the next two images (…).
Do you have any ideas for me to try?
Hi! I'm currently using the Python API for a project, and I realized that one of the main bottlenecks, for me as well, is the double transfer of textures from GPU to CPU.
Hey, @Neeklow!
Note that I only tested this with the …
Thanks a lot @Rad-hi!
Followed by a mega Cython error from (…). Any help is really appreciated!!
Okay, so after inspecting the build log, I think I can see the issue: when running this line, …
Two possibilities:
from libcpp.map cimport map
# Add the next two lines to sl.pyx, ~ line 27
import sys
sys.path.append('/ABSOLUTE/PATH/TO/ZED/API/zed-python-api/src/pyzed')
# And modify the import: note the pyzed.sl_c
from pyzed.sl_c cimport ( String, to_str, Camera as c_Camera, ERROR_CODE as c_ERROR_CODE, toString
Make sure to update the path with the appropriate value (I see you're using Windows, and I have no idea how it works there).
Hope this helps.
@Rad-hi thanks for the tips. That definitely helped, but it seems I'm having trouble with the MSVC compiler:
… is where I get to (I've uploaded the whole build error.txt again just in case). If you have any idea where to look for a solution I'd really appreciate it (but I understand you're not on Windows, and this seems to be a strictly Windows issue)!
I see @Neeklow, you seem to be encountering the same issues I encountered when trying to build this for the Orin (an NVIDIA board I am working on). The actual error you're encountering is (line ~303 in the build log):
And the quickest fix for me was to comment out the …
Beware of the changes to the …
# This is added because sl_c.pxd wasn't being found
import os
import sys
HOME = os.path.expanduser('~')
sys.path.append(os.path.join(HOME, 'Desktop/zed-python-api/src'))
since I placed the API under my desktop at the time of testing this. Apply the changes that you applied in the past fix here instead of keeping it this way.
I hope this helps and the API builds.
Keep the retrieved data on GPU
Description
Through the utilization of cupy.ndarrays (extra dependency), we were able to keep the retrieved data on GPU and provide a view of the memory on the Python side.
This PR introduces a slight modification to the get_data() function in a way that extends the API, but doesn't break any of its current functionality.
Notes
In an effort to stay true to the original API, data on the GPU can be either deeply copied or viewed by the Python API consumer.
A list of references is kept as a comment above the added code for potential enhancements and/or a better understanding of the feature.
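As a usage sketch of the extension (the camera and Mat setup below is the standard pyzed flow; whether the returned array is a deep copy or a view is controlled by the PR's option, which is omitted here since the exact keyword may evolve):
import pyzed.sl as sl
import cupy as cp

zed = sl.Camera()
# ... open the camera and grab a frame as usual ...

pcl_mat = sl.Mat()
zed.retrieve_measure(pcl_mat, sl.MEASURE.XYZRGBA, sl.MEM.GPU)

# New in this PR: access the data without leaving device memory
pcl_gpu = pcl_mat.get_data(memory_type=sl.MEM.GPU)  # cupy.ndarray
assert isinstance(pcl_gpu, cp.ndarray)

# Pay the D2H transfer only when host access is actually needed
pcl_cpu = pcl_gpu.get()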
Tests
This was tested with the ZED SDK 4.0, on both SVOs (recorded in HD2K-15) and on a live stream from a ZED Mini, with a custom wrapper of the Python API, on an Ubuntu 20.04.6 LTS system with an Nvidia GTX 1650 4Gb graphics card.
Additionally, we are attempting to test it on an Nvidia AGX Orin 32Gb developer kit, though we're having some Cython compilation errors that are prohibiting us from proceeding with the tests (would love to discuss, but not our topic).
Disclaimers
This is not thoroughly tested, and this PR is intended to spark a discussion around how this feature might be developed by the community, rather than being a final product (though we hope it's ready as is).