GPU matmul refactoring and optimization #53
Merged
Overview
First, by the time this code executes, the NDArray_Matmul method has already guaranteed that both arrays reside on the same device. Since array "a" is known to be on the GPU, both arrays must be on the GPU.
Consequently, the preprocessor directive checking for the presence of cuBLAS is redundant: an array cannot be placed in GPU memory at all unless cuBLAS is available. That directive has been removed.
Second, because both arrays are already resident in GPU memory, there is no need to allocate additional device memory and copy them over. The cudaMalloc, cudaMemcpy, and cudaFree calls for the input arrays have therefore been removed.
Finally, the resulting array has been renamed from d_C to deviceResult to make the code clearer.
Together, these changes improve the performance of the matmul operation.
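The simplified path can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual diff: the handle and the device-pointer names (`a_device`, `b_device`) and the dimension variables are assumptions, but it shows the key idea that only the output buffer still needs a `cudaMalloc` before calling cuBLAS directly on the already-resident inputs.

```c
// Sketch: inputs are assumed to already be device pointers, so no
// cudaMalloc/cudaMemcpy is needed for them -- only for the result.
float *deviceResult;
cudaMalloc((void **)&deviceResult, m * n * sizeof(float));

float alpha = 1.0f, beta = 0.0f;
// cuBLAS is column-major; swapping the operands computes the
// row-major product C = A * B without any explicit transposes.
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            n, m, k,
            &alpha,
            b_device, n,   // B is k x n (row-major)
            a_device, k,   // A is m x k (row-major)
            &beta,
            deviceResult, n);
```

Removing the per-call staging copies matters because for device-resident operands the host-device transfers, not the GEMM itself, can dominate the runtime of small and medium multiplications.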
Benchmark before changes (NDArray): [benchmark screenshot]
Benchmark after changes (NDArray): [benchmark screenshot]
PyTorch, for comparison: [benchmark screenshot]
Visualization of the multiplication rate as the number of iterations increases, for NDArray and PyTorch.
Note: after the 50th iteration, throughput started to drop for both libraries. This slowdown correlates with the graphics card heating up, i.e. thermal throttling.
Test setup