GPU matmul refactoring and optimization #53
Merged
Overview
First, by the time this code executes, the NDArray_Matmul method has already guaranteed that both arrays reside on the same device. Since array "a" is known to be on the GPU, both arrays must be on the GPU.
Consequently, the preprocessor directive checking for the presence of cuBLAS is redundant: an array cannot be placed in GPU memory at all unless cuBLAS is available. That directive has been removed.
Second, because both arrays are already resident in GPU memory, there is no need to allocate additional device memory and copy them over. The cudaMalloc, cudaMemcpy, and cudaFree calls for the input arrays have therefore been removed.
Finally, the resulting array has been renamed from d_C to deviceResult to make the code clearer.
Together, these changes improve the performance of the matmul operation.
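The simplified path can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual diff: the handle and the device-pointer names (`a_device`, `b_device`) and the dimension variables are assumptions, but it shows the key idea that only the output buffer still needs a `cudaMalloc` before calling cuBLAS directly on the already-resident inputs.

```c
// Sketch: inputs are assumed to already be device pointers, so no
// cudaMalloc/cudaMemcpy is needed for them -- only for the result.
float *deviceResult;
cudaMalloc((void **)&deviceResult, m * n * sizeof(float));

float alpha = 1.0f, beta = 0.0f;
// cuBLAS is column-major; swapping the operands computes the
// row-major product C = A * B without any explicit transposes.
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            n, m, k,
            &alpha,
            b_device, n,   // B is k x n (row-major)
            a_device, k,   // A is m x k (row-major)
            &beta,
            deviceResult, n);
```

Removing the per-call staging copies matters because for device-resident operands the host-device transfers, not the GEMM itself, can dominate the runtime of small and medium multiplications.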
Benchmark before changes (NDArray): [benchmark screenshot]
Benchmark after changes (NDArray): [benchmark screenshot]
PyTorch, for comparison: [benchmark screenshot]
Visualization of the multiplication rate as the number of iterations increases, for NDArray and PyTorch.
Note: after the 50th iteration, throughput started to drop for both libraries. This slowdown correlates with the graphics card heating up, i.e. thermal throttling.
Test setup