cuda-compress

A CUDA version of CvxCompress. I wrote this code to practice my CUDA skills. It is not yet useful in practice and is still under heavy construction.

Supported features

8 x 8 x 8 forward and inverse transform.
32 x 32 x 32 forward and inverse transform.

NOTE: Only the forward transform has been optimized.

Test with pre-generated input

Test all forward and inverse kernels with random input data

$ ./test_all_blocks.x
wl79_8x8x8 	 [8, 8, 8] [11, 9, 8] 
OK
wl79_32x32x32 	 [32, 32, 32] [11, 9, 8] 
OK
opt1wl79_32x32x32 	 [32, 32, 32] [11, 9, 8] 
OK
opt2wl79_32x32x32 	 [32, 32, 32] [11, 9, 8] 
OK
opt3wl79_32x32x32 	 [32, 32, 32] [11, 9, 8] 
OK
opt4wl79_32x32x32 	 [32, 32, 32] [11, 9, 8] 
OK
opt5wl79_32x32x32 	 [32, 32, 32] [11, 9, 8] 
OK
opt6wl79_32x32x32 	 [32, 32, 32] [11, 9, 8] 
FAILED
opt7wl79_32x32x32 	 [32, 32, 32] [11, 9, 8] 
OK

Optimization version 6 does not work, and it is not very fast either :).

Test with user-generated input

This test lets you load in a binary data array from disk.

All tests check that the forward transform followed by the inverse transform recovers the original data. The CPU test uses a single block only. I mainly use this test during development. On its own, this test is not strong enough to guarantee the correctness of the code. For example, if both the forward and inverse kernels do nothing, the test will pass! I plan to add more tests.
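To illustrate why a round-trip check alone is weak, here is a minimal Python sketch. It uses a one-level 1D Haar transform as a stand-in (not the wl79 wavelet used by the kernels), and every function name in it is hypothetical. A pair of do-nothing "kernels" passes the round-trip check, while comparing the forward output against a trusted reference catches the problem.

```python
import random

def haar_forward(x):
    """One-level 1D Haar transform: averages first, then differences."""
    half = len(x) // 2
    avg = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(half)]
    diff = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(half)]
    return avg + diff

def haar_inverse(y):
    """Undo haar_forward: a + d and a - d recover each pair of samples."""
    half = len(y) // 2
    x = []
    for a, d in zip(y[:half], y[half:]):
        x += [a + d, a - d]
    return x

def noop(x):
    """A broken 'kernel' that returns its input unchanged."""
    return list(x)

def linf(a, b):
    """Largest absolute difference between two sequences."""
    return max(abs(p - q) for p, q in zip(a, b))

def round_trip_ok(forward, inverse, x, tol=1e-6):
    """The weak check: inverse(forward(x)) must reproduce x."""
    return linf(inverse(forward(x)), x) <= tol

def matches_reference(forward, reference, x, tol=1e-6):
    """A stronger check: forward(x) must agree with a trusted reference."""
    return linf(forward(x), reference(x)) <= tol

random.seed(0)
data = [random.random() for _ in range(32)]

print(round_trip_ok(haar_forward, haar_inverse, data))  # real transform pair
print(round_trip_ok(noop, noop, data))                  # do-nothing pair, same check
print(matches_reference(noop, haar_forward, data))      # do-nothing vs. reference
```

The second call passes even though nothing was computed. A reference comparison like the third call (for example against a slow CPU implementation) is one way such a test could be strengthened.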

Generate example data

$ ./write_volume.x rand32.bin 32 32 32 1 2 3 3
Generating random grid 
Writing 1x2x3 blocks of dimension 32x32x32 to rand32.bin

Run test

$ ./test_wavelet_transform_slow.x rand32.bin
reading: rand32.bin 
block dimension: 32 32 32 
number of blocks: 1 2 3 
[32, 32, 32] Computing GPU forward transform... 
Throughput: 4788.78 Mcells/s 
[32, 32, 32] Computing GPU inverse transform... 
Throughput: 263.827 Mcells/s 
Running error checking... 
abs. l2 error = 0.000101344 l1 error = 0.0368315 linf error = 9.53674e-07 
rel. l2 error = 3.96186e-07 l1 error = 3.74868e-07 linf error = 9.5368e-07
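The error report above could be reproduced along the following lines. This is a sketch under assumptions, not the code in norms.h: it assumes that l1 is the sum of absolute values, l2 is the square root of the sum of squares, linf is the largest absolute value, and that each relative error is the absolute error divided by the same norm of the reference data. The printed numbers are at least consistent with that reading: 0.0368315 / 3.74868e-07 is about 9.8e4, roughly half of the 6 x 32^3 = 196608 cells, which is what the l1 norm would be if the input were uniform random in [0, 1) (itself an assumption).

```python
import math

def norms(v):
    """Return (l1, l2, linf) of a sequence of floats."""
    l1 = sum(abs(a) for a in v)
    l2 = math.sqrt(sum(a * a for a in v))
    linf = max(abs(a) for a in v)
    return l1, l2, linf

def error_report(reference, result):
    """Absolute and relative errors of result with respect to reference."""
    diff = [r - s for r, s in zip(reference, result)]
    abs_err = norms(diff)
    ref_norms = norms(reference)
    # Assumption: relative error = norm(diff) / norm(reference), per norm.
    rel_err = tuple(e / n for e, n in zip(abs_err, ref_norms))
    return abs_err, rel_err

reference = [1.0, 2.0, 2.0]   # original data
result = [1.0, 2.0, 2.5]      # data after forward + inverse transform
abs_err, rel_err = error_report(reference, result)
print("abs. l1, l2, linf:", abs_err)
print("rel. l1, l2, linf:", rel_err)
```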

Throughput

Test the throughput of each kernel. I obtained the output below using an RTX 2080 Ti card.

$ ./test_wavelet_transform_with_input.x
Kernel name       	 Wavelet transform 	 Block dimension 	 Grid dimension 	 Throughput
wl79_32x32x32        	 Forward 	         [32, 32, 32] 	         [320, 384, 416] 	 1968.43 Mcells/s
opt1wl79_32x32x32    	 Forward 	         [32, 32, 32] 	         [320, 384, 416] 	 8027.09 Mcells/s
opt2wl79_32x32x32    	 Forward 	         [32, 32, 32] 	         [320, 384, 416] 	 15068.8 Mcells/s
opt3wl79_32x32x32    	 Forward 	         [32, 32, 32] 	         [320, 384, 416] 	 19548.7 Mcells/s
opt4wl79_32x32x32    	 Forward 	         [32, 32, 32] 	         [320, 384, 416] 	 29149.3 Mcells/s
opt5wl79_32x32x32    	 Forward 	         [32, 32, 32] 	         [320, 384, 416] 	 35942 Mcells/s
opt6wl79_32x32x32    	 Forward 	         [32, 32, 32] 	         [320, 384, 416] 	 33031.6 Mcells/s
opt7wl79_32x32x32    	 Forward 	         [32, 32, 32] 	         [320, 384, 416] 	 54877.9 Mcells/s

Only the 32x32x32 kernel has been optimized at the moment. Only the forward transform has been optimized for versions 5 and up. Version 6 doesn't work yet. More optimizations to come...
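To put the throughput numbers in perspective, the short Python sketch below converts them into kernel run times and a speedup. It assumes that "Mcells/s" means millions of grid cells per second and that the grid dimension [320, 384, 416] is given in cells; neither is stated explicitly above, and the helper names are made up for illustration.

```python
def throughput_mcells(grid, seconds):
    """Throughput in Mcells/s for a grid processed in the given time."""
    nx, ny, nz = grid
    return nx * ny * nz / seconds / 1e6

def elapsed_ms(grid, mcells_per_s):
    """Run time in milliseconds implied by a throughput in Mcells/s."""
    nx, ny, nz = grid
    return nx * ny * nz / (mcells_per_s * 1e6) * 1e3

grid = (320, 384, 416)   # grid dimension from the table (assumed to be cells)
baseline = 1968.43       # wl79_32x32x32, Mcells/s
opt7 = 54877.9           # opt7wl79_32x32x32, Mcells/s

print(f"cells: {grid[0] * grid[1] * grid[2]}")
print(f"baseline: {elapsed_ms(grid, baseline):.2f} ms")
print(f"opt7: {elapsed_ms(grid, opt7):.2f} ms")
print(f"speedup: {opt7 / baseline:.1f}x")
```

Under these assumptions, the forward transform in version 7 is a bit under 28 times faster than the unoptimized kernel on the same grid.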
