vadimbelan/cuda-2024

Content

How To

  1. Create a GitHub account (if you do not have one);
  2. Make sure SSH clone & commit works (see "Connecting to GitHub with SSH");
  3. Fork this repo (just click the Fork button at the top of the page);
  4. Clone your forked repo to your local machine, using your GitHub username instead of username:
git clone git@github.com:username/cuda-2024.git
cd cuda-2024
  5. Go to your group folder, e.g.:
cd 3821B1FI1
  6. Go to the needed task folder, e.g.:
cd 1_gelu_omp
  7. Create a new folder named with your surname and name (make sure it is the same for all tasks), e.g.:
mkdir petrov_ivan
  8. Copy your task source/header files (including the main program) into this folder (use copy instead of cp on Windows), e.g.:
cd petrov_ivan
cp /home/usr/lab/*.cpp .
cp /home/usr/lab/*.h .
  9. Push your sources to your GitHub repo, e.g.:
cd ..
git add .
git commit -m "1_gelu_omp task"
git push
  10. Go to your repo in the browser, click the Contribute button at the top of the page, then Open pull request. Provide a meaningful title and description, then click Create pull request.

Time Measurement

The following scheme is used to measure task execution time:

int main() {
    // ...

    // Warming-up
    Task(input, size / 8);

    // Performance Measuring
    auto start = std::chrono::high_resolution_clock::now();
    auto c = Task(input, size);
    auto end = std::chrono::high_resolution_clock::now();

    // ...
}

Configuration

  • CPU: Intel Core i5 12600K (4 cores, 4 threads)
  • RAM: 16 GB
  • GPU: NVIDIA RTX 4060 (8 GB)
  • Host Compiler: GCC 11.4.0
  • CUDA: 12.6

Tasks

Task #1: OpenMP GELU Implementation

The Gaussian Error Linear Unit (GELU) is an activation function frequently used in Deep Neural Networks (DNNs) and can be thought of as a smoother ReLU.

To approximate the GELU function, use the following formula:

$\mathrm{GELU}(x) = 0.5x\left(1 + \tanh\left(\sqrt{2/\pi}\,\left(x + 0.044715x^3\right)\right)\right)$

Implement the function with the following interface in C++:

std::vector<float> GeluOMP(const std::vector<float>& input);

The result vector must have the same size as the input. Use OpenMP to make your function parallel and fast.

Two files are expected to be uploaded:

  • gelu_omp.h
#ifndef __GELU_OMP_H
#define __GELU_OMP_H

#include <vector>

std::vector<float> GeluOMP(const std::vector<float>& input);

#endif // __GELU_OMP_H
  • gelu_omp.cpp
#include "gelu_omp.h"

std::vector<float> GeluOMP(const std::vector<float>& input) {
    // Place your implementation here
}
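
For illustration, here is a minimal sketch of one possible implementation (not the reference solution); the constant 0.7978845608f is a precomputed $\sqrt{2/\pi}$, and the file is assumed to be compiled with -fopenmp:

#include "gelu_omp.h"

#include <cmath>
#include <cstddef>

std::vector<float> GeluOMP(const std::vector<float>& input) {
    std::vector<float> output(input.size());
    const float k = 0.7978845608028654f;  // sqrt(2 / pi)
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(input.size());

    // Each element is independent, so the loop parallelizes trivially.
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        const float x = input[i];
        output[i] = 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
    }
    return output;
}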

Task #2: CUDA GELU Implementation

Implement the function with the following interface in CUDA C++ using the formula described above:

std::vector<float> GeluCUDA(const std::vector<float>& input);

The result vector must have the same size as the input. Use CUDA so that your function runs on an NVIDIA GPU. Try to make it fast.

Two files are expected to be uploaded:

  • gelu_cuda.h
#ifndef __GELU_CUDA_H
#define __GELU_CUDA_H

#include <vector>

std::vector<float> GeluCUDA(const std::vector<float>& input);

#endif // __GELU_CUDA_H
  • gelu_cuda.cu
#include "gelu_cuda.h"

std::vector<float> GeluCUDA(const std::vector<float>& input) {
    // Place your implementation here
}
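
For illustration, one possible structure of the .cu file is a simple one-thread-per-element kernel (error checking omitted for brevity):

#include "gelu_cuda.h"

#include <cuda_runtime.h>

__global__ void GeluKernel(const float* in, float* out, size_t n) {
    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) {
        const float x = in[i];
        const float k = 0.7978845608028654f;  // sqrt(2 / pi)
        out[i] = 0.5f * x * (1.0f + tanhf(k * (x + 0.044715f * x * x * x)));
    }
}

std::vector<float> GeluCUDA(const std::vector<float>& input) {
    const size_t n = input.size();
    std::vector<float> output(n);
    if (n == 0) return output;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, input.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = static_cast<int>((n + threads - 1) / threads);
    GeluKernel<<<blocks, threads>>>(d_in, d_out, n);

    cudaMemcpy(output.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return output;
}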

Task #3: Naive Matrix Multiplication using OpenMP

General matrix multiplication (GEMM) is a very basic and broadly used linear algebra operation applied in high performance computing (HPC), statistics, deep learning and other domains. There are a lot of GEMM algorithms with different mathematical complexity, from $O(n^3)$ for the naive and block approaches to $O(n^{2.371552})$ for the method described by Williams et al. in 2024 [1]. But despite the variety of algorithms with lower complexity, block matrix multiplication remains the most used implementation in practice since it fits modern hardware better.

To make learning matrix multiplication smoother, let us start with the naive approach. To compute the result C of multiplying matrices A and B, where C = A * B and all matrices have size $n \times n$, one should use the following formula for each element of C (we consider only square matrices for simplicity):

$c_{ij}=\sum_{k=1}^na_{ik}b_{kj}$

To complete the task one should implement a function that multiplies two square matrices using OpenMP, with the following interface:

std::vector<float> NaiveGemmOMP(const std::vector<float>& a,
                                const std::vector<float>& b,
                                int n);

Each matrix must be stored in a linear array by rows, so that a.size()==n*n. The function takes two matrices and their size as inputs, and returns the result matrix, also stored by rows.

For simplicity, let's assume the matrix size is always a power of 2.

Two files are expected to be uploaded:

  • naive_gemm_omp.h:
#ifndef __NAIVE_GEMM_OMP_H
#define __NAIVE_GEMM_OMP_H

#include <vector>

std::vector<float> NaiveGemmOMP(const std::vector<float>& a,
                                const std::vector<float>& b,
                                int n);

#endif // __NAIVE_GEMM_OMP_H
  • naive_gemm_omp.cpp:
#include "naive_gemm_omp.h"

std::vector<float> NaiveGemmOMP(const std::vector<float>& a,
                                const std::vector<float>& b,
                                int n) {
    // Place your implementation here
}
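
A minimal sketch of one possible implementation; the i-k-j loop order keeps the inner accesses to b and c sequential in memory, which already helps even the naive algorithm:

#include "naive_gemm_omp.h"

std::vector<float> NaiveGemmOMP(const std::vector<float>& a,
                                const std::vector<float>& b,
                                int n) {
    std::vector<float> c(static_cast<size_t>(n) * n, 0.0f);

    // Rows of C are independent, so parallelize over i.
#pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int k = 0; k < n; ++k) {
            const float aik = a[i * n + k];
            for (int j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    return c;
}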

Task #4: Naive Matrix Multiplication using CUDA

In this task one should implement the naive approach to matrix multiplication in CUDA, trying to make it reasonably fast (pay attention to global memory access patterns in your code).

Each matrix must be stored in a linear array by rows, so that a.size()==n*n. The function takes two matrices and their size as inputs, and returns the result matrix, also stored by rows.

For simplicity, let's assume the matrix size is always a power of 2.

Two files are expected to be uploaded:

  • naive_gemm_cuda.h:
#ifndef __NAIVE_GEMM_CUDA_H
#define __NAIVE_GEMM_CUDA_H

#include <vector>

std::vector<float> NaiveGemmCUDA(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 int n);

#endif // __NAIVE_GEMM_CUDA_H
  • naive_gemm_cuda.cu:
#include "naive_gemm_cuda.h"

std::vector<float> NaiveGemmCUDA(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 int n) {
    // Place your implementation here
}
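
A possible sketch of the .cu file (error checking omitted). Mapping threadIdx.x to the column index keeps the accesses to b and c coalesced within a warp:

#include "naive_gemm_cuda.h"

#include <cuda_runtime.h>

__global__ void NaiveGemmKernel(const float* a, const float* b, float* c, int n) {
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k) {
            sum += a[row * n + k] * b[k * n + col];
        }
        c[row * n + col] = sum;
    }
}

std::vector<float> NaiveGemmCUDA(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 int n) {
    const size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    std::vector<float> c(static_cast<size_t>(n) * n);

    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    const dim3 block(16, 16);
    const dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    NaiveGemmKernel<<<grid, block>>>(d_a, d_b, d_c, n);

    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}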

Task #5: Block Matrix Multiplication using OpenMP

In real applications the block-based approach to matrix multiplication can run several times faster than the naive version because it is cache friendly. To prove this in practice, implement such a version in C++ using OpenMP.

In the block version the algorithm can be divided into three stages:

  1. Split the matrices into blocks (the block size normally affects performance significantly, so choose it consciously);
  2. Multiply two blocks to get a partial result;
  3. Repeat step 2 for all row/column blocks, accumulating values into a single result block.

From a math perspective, block matrix multiplication can be described by the following formula, where $C_{IJ}$, $A_{IK}$ and $B_{KJ}$ are sub-matrices of size $\mathit{block\_size} \times \mathit{block\_size}$:

$C_{IJ}=\sum_{K=1}^{\mathit{block\_count}} A_{IK} B_{KJ}$

Each matrix must be stored in a linear array by rows, so that a.size()==n*n. The function takes two matrices and their size as inputs, and returns the result matrix, also stored by rows.

For simplicity, let's assume the matrix size is always a power of 2.

Two files are expected to be uploaded:

  • block_gemm_omp.h:
#ifndef __BLOCK_GEMM_OMP_H
#define __BLOCK_GEMM_OMP_H

#include <vector>

std::vector<float> BlockGemmOMP(const std::vector<float>& a,
                                const std::vector<float>& b,
                                int n);

#endif // __BLOCK_GEMM_OMP_H
  • block_gemm_omp.cpp:
#include "block_gemm_omp.h"

std::vector<float> BlockGemmOMP(const std::vector<float>& a,
                                const std::vector<float>& b,
                                int n) {
    // Place your implementation here
}

As in the previous task, we consider all matrices to be square.
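
A minimal sketch of one possible tiled implementation. The block size of 64 is an arbitrary choice to tune; std::min keeps it valid for small power-of-two sizes, and since both n and the block size are powers of 2 the blocks always divide the matrix evenly:

#include "block_gemm_omp.h"

#include <algorithm>

std::vector<float> BlockGemmOMP(const std::vector<float>& a,
                                const std::vector<float>& b,
                                int n) {
    std::vector<float> c(static_cast<size_t>(n) * n, 0.0f);
    const int bs = std::min(64, n);  // block size, tune for your cache

    // Each (bi, bj) block of C is independent, so parallelize over both.
#pragma omp parallel for collapse(2)
    for (int bi = 0; bi < n; bi += bs) {
        for (int bj = 0; bj < n; bj += bs) {
            for (int bk = 0; bk < n; bk += bs) {
                // Multiply block A(bi, bk) by block B(bk, bj) and accumulate into C(bi, bj).
                for (int i = bi; i < bi + bs; ++i) {
                    for (int k = bk; k < bk + bs; ++k) {
                        const float aik = a[i * n + k];
                        for (int j = bj; j < bj + bs; ++j) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}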

Task #6: Block Matrix Multiplication using CUDA

In CUDA C++ the block-based approach looks similar, but to get better performance one should use CUDA shared memory to hold the current blocks during computation. With this in mind, the algorithm is the following:

  1. A single CUDA block computes a single block of the result matrix C, and a single CUDA thread computes a single element of C;
  2. For each A block in the row and each B block in the column:
    1. Load the A block into shared memory;
    2. Load the B block into shared memory;
    3. Synchronize all threads in the block;
    4. Compute BlockA * BlockB and accumulate the partial result into the C block;
    5. Synchronize all threads in the block;
  3. Write the resulting C block back to global memory.

Each matrix must be stored in a linear array by rows, so that a.size()==n*n. The function takes two matrices and their size as inputs, and returns the result matrix, also stored by rows.

For simplicity, let's assume the matrix size is always a power of 2.

Two files are expected to be uploaded:

  • block_gemm_cuda.h:
#ifndef __BLOCK_GEMM_CUDA_H
#define __BLOCK_GEMM_CUDA_H

#include <vector>

std::vector<float> BlockGemmCUDA(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 int n);

#endif // __BLOCK_GEMM_CUDA_H
  • block_gemm_cuda.cu:
#include "block_gemm_cuda.h"

std::vector<float> BlockGemmCUDA(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 int n) {
    // Place your implementation here
}
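
A possible sketch of the .cu file with 16x16 tiles held in shared memory (the tile size is an assumption to tune, and error checking is omitted). It accumulates the per-thread partial sum in a register instead of a shared-memory C block, which is an equivalent and common variant, and it assumes n is a multiple of the tile size, which holds for power-of-two n >= 16:

#include "block_gemm_cuda.h"

#include <cuda_runtime.h>

constexpr int TILE = 16;

__global__ void BlockGemmKernel(const float* a, const float* b, float* c, int n) {
    __shared__ float as[TILE][TILE];
    __shared__ float bs[TILE][TILE];

    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Load one tile of A and one tile of B into shared memory.
        as[threadIdx.y][threadIdx.x] = a[row * n + t * TILE + threadIdx.x];
        bs[threadIdx.y][threadIdx.x] = b[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Multiply the two tiles and accumulate.
        for (int k = 0; k < TILE; ++k) {
            sum += as[threadIdx.y][k] * bs[k][threadIdx.x];
        }
        __syncthreads();
    }
    c[row * n + col] = sum;
}

std::vector<float> BlockGemmCUDA(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 int n) {
    const size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    std::vector<float> c(static_cast<size_t>(n) * n);

    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    const dim3 block(TILE, TILE);
    const dim3 grid(n / TILE, n / TILE);
    BlockGemmKernel<<<grid, block>>>(d_a, d_b, d_c, n);

    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}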

Task #7: Matrix Multiplication using cuBLAS

The most performant way to multiply two matrices on particular hardware is usually to use a vendor-provided library. In CUDA this is cuBLAS. Use the cuBLAS API to implement general matrix multiplication in the most performant way.

Each matrix must be stored in a linear array by rows, so that a.size()==n*n. The function takes two matrices and their size as inputs, and returns the result matrix, also stored by rows.

For simplicity, let's assume the matrix size is always a power of 2.

Note that the cuBLAS API expects matrices to be stored by columns, so an additional transpose (or a reordering of the operands) may be required.

Two files are expected to be uploaded:

  • gemm_cublas.h:
#ifndef __GEMM_CUBLAS_H
#define __GEMM_CUBLAS_H

#include <vector>

std::vector<float> GemmCUBLAS(const std::vector<float>& a,
                              const std::vector<float>& b,
                              int n);

#endif // __GEMM_CUBLAS_H
  • gemm_cublas.cu:
#include "gemm_cublas.h"

std::vector<float> GemmCUBLAS(const std::vector<float>& a,
                              const std::vector<float>& b,
                              int n) {
    // Place your implementation here
}
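
One possible sketch using cublasSgemm (error checking omitted, link with -lcublas). Since cuBLAS works in column-major order, the sketch relies on the identity $(AB)^T = B^T A^T$: passing the row-major arrays in swapped order makes the column-major routine write the row-major product of A and B directly, so no explicit transpose is needed:

#include "gemm_cublas.h"

#include <cublas_v2.h>
#include <cuda_runtime.h>

std::vector<float> GemmCUBLAS(const std::vector<float>& a,
                              const std::vector<float>& b,
                              int n) {
    const size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    std::vector<float> c(static_cast<size_t>(n) * n);

    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f;
    const float beta = 0.0f;
    // Column-major B * A has the same memory layout as row-major A * B.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha, d_b, n, d_a, n,
                &beta, d_c, n);
    cublasDestroy(handle);

    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}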

Task #8: FFT (Fast Fourier Transform) using cuFFT

Another widely used operation in HPC & signal processing is the discrete Fourier transform. The naive approach (computing it by definition) has $O(n^2)$ complexity and is not used in practice because it is too slow. The better way is the Fast Fourier Transform (FFT) algorithm with $O(n \log n)$ complexity.

Due to its frequent use, an FFT implementation is normally part of vendor-optimized libraries for various hardware. For NVIDIA GPUs one should use the cuFFT library.

To pass the task one should implement a function that takes $batch$ signals of $n$ complex elements each and performs a complex-to-complex forward and then inverse Fourier transform on them. For better performance use the cuFFT API.

The required function should have the following prototype:

std::vector<float> FffCUFFT(const std::vector<float>& input, int batch);

Here $batch$ is the number of independent signals, and $input$ contains complex values as interleaved $(real, imaginary)$ pairs of floats, stored pair by pair. So the $input$ array size must be equal to $2 \cdot n \cdot batch$.

The function should perform the following actions:

  1. Compute forward Fourier transform for $input$;
  2. Compute inverse Fourier transform for the result of step 1;
  3. Normalize result of step 2 by $n$.

The returned array must store the result of step 3 in the same $(real, imaginary)$ pair format as $input$ and have the same size.

Note that, due to the mathematical properties of the Fourier transform, the result array will contain (up to floating-point error) the same values as the input one. This property can be used for self-checking.

Two files are expected to be uploaded:

  • fft_cufft.h:
#ifndef __FFT_CUFFT_H
#define __FFT_CUFFT_H

#include <vector>

std::vector<float> FffCUFFT(const std::vector<float>& input, int batch);

#endif // __FFT_CUFFT_H
  • fft_cufft.cu:
#include "fft_cufft.h"

std::vector<float> FffCUFFT(const std::vector<float>& input, int batch) {
    // Place your implementation here
}
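
A possible sketch (error checking omitted, link with -lcufft). cufftComplex is an interleaved pair of floats, which matches the $(real, imaginary)$ input format directly, and a small helper kernel performs the normalization by $n$ on the GPU:

#include "fft_cufft.h"

#include <cuda_runtime.h>
#include <cufft.h>

__global__ void Normalize(float* data, int total, float factor) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < total) data[i] *= factor;
}

std::vector<float> FffCUFFT(const std::vector<float>& input, int batch) {
    std::vector<float> output(input.size());
    const int n = static_cast<int>(input.size()) / (2 * batch);  // signal length
    const size_t bytes = input.size() * sizeof(float);

    cufftComplex* d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, input.data(), bytes, cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  // step 1: forward transform in place
    cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);  // step 2: inverse transform in place
    cufftDestroy(plan);

    // Step 3: cuFFT does not normalize the inverse transform, so scale every float by 1/n.
    const int total = static_cast<int>(input.size());
    const int threads = 256;
    Normalize<<<(total + threads - 1) / threads, threads>>>(
        reinterpret_cast<float*>(d_data), total, 1.0f / n);

    cudaMemcpy(output.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return output;
}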

Task #9: OpenCL GELU Implementation

Implement the GELU function with the following interface in OpenCL, using the formula described in Task #1:

std::vector<float> GeluOCL(const std::vector<float>& input);

The result vector must have the same size as the input. Use OpenCL so that your function runs on an NVIDIA GPU. Try to make it fast.

Use the CL_DEVICE_TYPE_GPU flag to choose a GPU device. Use platform 0 and device 0. Store your OpenCL kernel in a string constant.

Two files are expected to be uploaded:

  • gelu_ocl.h
#ifndef __GELU_OCL_H
#define __GELU_OCL_H

#include <vector>

std::vector<float> GeluOCL(const std::vector<float>& input);

#endif // __GELU_OCL_H
  • gelu_ocl.cpp
#include "gelu_ocl.h"

std::vector<float> GeluOCL(const std::vector<float>& input) {
    // Place your implementation here
}
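
A possible sketch of the host code (error checking omitted, link with -lOpenCL). It uses platform 0 and device 0 with CL_DEVICE_TYPE_GPU, keeps the kernel in a string constant as required, and assumes an OpenCL 2.0+ runtime for clCreateCommandQueueWithProperties:

#include "gelu_ocl.h"

#include <CL/cl.h>

static const char* kGeluKernel = R"CLC(
__kernel void gelu(__global const float* in, __global float* out, int n) {
    int i = get_global_id(0);
    if (i < n) {
        float x = in[i];
        const float k = 0.7978845608028654f; // sqrt(2 / pi)
        out[i] = 0.5f * x * (1.0f + tanh(k * (x + 0.044715f * x * x * x)));
    }
}
)CLC";

std::vector<float> GeluOCL(const std::vector<float>& input) {
    std::vector<float> output(input.size());
    if (input.empty()) return output;
    const int n = static_cast<int>(input.size());
    const size_t bytes = input.size() * sizeof(float);

    // Platform 0, device 0, GPU only.
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue queue = clCreateCommandQueueWithProperties(context, device, nullptr, nullptr);

    // Build the kernel from the string constant above.
    cl_program program = clCreateProgramWithSource(context, 1, &kGeluKernel, nullptr, nullptr);
    clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(program, "gelu", nullptr);

    cl_mem d_in = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 bytes, const_cast<float*>(input.data()), nullptr);
    cl_mem d_out = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, nullptr, nullptr);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);
    clSetKernelArg(kernel, 2, sizeof(int), &n);

    const size_t local = 256;
    const size_t global = ((input.size() + local - 1) / local) * local;
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, &local, 0, nullptr, nullptr);
    clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, bytes, output.data(), 0, nullptr, nullptr);

    clReleaseMemObject(d_in);
    clReleaseMemObject(d_out);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);
    return output;
}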

Results

1_gelu_omp (134217728 elements)

| Group     | Name             | Result |
|-----------|------------------|--------|
| 3821B1FI3 | kuznetsov_artyom | 0.2609 |
| REF       | REF              | 0.8126 |

2_gelu_cuda (134217728 elements)

| Group     | Name             | Result |
|-----------|------------------|--------|
| 3821B1FI3 | kuznetsov_artyom | 0.2407 |
| REF       | REF              | 0.2598 |

3_naive_gemm_omp (1024 elements)

| Group     | Name             | Result       |
|-----------|------------------|--------------|
| REF       | REF              | 0.8379       |
| 3821B1FI3 | kuznetsov_artyom | BUILD FAILED |

4_naive_gemm_cuda (8192 elements)

| Group | Name | Result |
|-------|------|--------|
| REF   | REF  | 0.1620 |

5_block_gemm_omp (1024 elements)

| Group | Name | Result |
|-------|------|--------|
| REF   | REF  | 0.1980 |

6_block_gemm_cuda (8192 elements)

| Group | Name | Result |
|-------|------|--------|
| REF   | REF  | 0.1650 |

7_gemm_cublas (8192 elements)

| Group | Name | Result |
|-------|------|--------|
| REF   | REF  | 0.3054 |

8_fft_cufft (131072 elements)

| Group | Name | Result |
|-------|------|--------|
| REF   | REF  | 0.2309 |

9_gelu_ocl (134217728 elements)

| Group | Name | Result |
|-------|------|--------|
| REF   | REF  | 0.2621 |

Tasks Done

3821B1FI3

| Group     | Name             | Passed |
|-----------|------------------|--------|
| 3821B1FI3 | kuznetsov_artyom | 2/9    |
