Performance tracking #105

Closed · pca006132 opened this issue May 7, 2022 · 6 comments

Comments

@pca006132 (Collaborator)

Since this library is intended to be fast, I figured we should have an issue to track its performance.

I did some microbenchmarking using the perfTest binary and a unionPerfTest program I wrote to test lazy union (unioning 100 spheres of diameter 2.5 with varying separation distance, i.e. they may or may not overlap). I tested the single-threaded CPP backend, the OpenMP backend, and the CUDA backend, all on the same laptop with an i5-8300H and a GTX 1050 Mobile. Here is my spreadsheet, and here is my branch with some optimizations that look quite effective for CUDA and small meshes :). I will open a separate PR for the branch after the build script PR is merged.
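
For reference, a rough sketch of the kind of union benchmark described above (illustrative only, not the actual unionPerfTest source; it assumes Manifold::Sphere, Translate, and the + boolean operator from the public API, and the grid layout and separation value are just examples):

```cpp
// Illustrative union microbenchmark: union 100 spheres of diameter 2.5 laid
// out on a 10x10 grid with a configurable separation distance.
// Not the actual unionPerfTest source; API names are assumed from the library.
#include <chrono>
#include <cstdio>
#include <glm/glm.hpp>
#include "manifold.h"

int main() {
  using namespace manifold;
  const float radius = 1.25f;     // diameter = 2.5
  const float separation = 2.0f;  // < 2.5 means neighboring spheres overlap

  auto start = std::chrono::high_resolution_clock::now();
  Manifold result;
  for (int i = 0; i < 10; ++i) {
    for (int j = 0; j < 10; ++j) {
      result = result + Manifold::Sphere(radius).Translate(
                            glm::vec3(i * separation, j * separation, 0.0f));
    }
  }
  // Query the result so any lazily deferred booleans are evaluated before the
  // clock stops.
  std::printf("triangles: %d\n", result.NumTri());
  auto end = std::chrono::high_resolution_clock::now();
  std::printf("union of 100 spheres took %.3f ms\n",
              std::chrono::duration<double, std::milli>(end - start).count());
}
```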

From the results, we can see that:

  1. For small meshes, OMP > CPP > CUDA. I wonder if we could run the Thrust APIs on OMP for operations involving fewer than ~1k vertices or so.
  2. For more complicated models, using the CUDA backend may exhaust GPU memory. Is it possible to fall back to CPP mode when that happens?
  3. We have to be careful about copying vectors: eliminating unnecessary copies reduces the execution time for the CUDA backend significantly (-30% in the microbenchmark). One way would be to disable the copy constructor for VecDH and provide an explicit copy method (see the sketch after this list).
  4. It seems that multithreading does not help much in our case, even though my computer has 4 cores / 8 threads. Perhaps we should look into this, because quite a lot of users may not have a GPU with CUDA (my main laptop uses AMD, so no CUDA there).
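
A minimal sketch of what point 3 could look like (illustrative only; the real VecDH also manages device storage and is more involved than this):

```cpp
// Illustrative only: a simplified VecDH-like wrapper with the implicit copy
// constructor deleted, so every deep copy has to be spelled out explicitly.
// The real VecDH also manages device (GPU) storage, which is omitted here.
#include <cstddef>
#include <vector>

template <typename T>
class VecDH {
 public:
  VecDH() = default;
  explicit VecDH(std::size_t size) : host_(size) {}

  // No silent deep copies: callers must pass by reference or move instead.
  VecDH(const VecDH&) = delete;
  VecDH& operator=(const VecDH&) = delete;

  // Moves stay cheap and implicit.
  VecDH(VecDH&&) noexcept = default;
  VecDH& operator=(VecDH&&) noexcept = default;

  // The one intentional way to make a deep copy.
  VecDH Copy() const {
    VecDH out;
    out.host_ = host_;
    return out;
  }

  std::size_t size() const { return host_.size(); }

 private:
  std::vector<T> host_;
};
```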
@elalish (Owner) commented May 9, 2022

I'm open to these optimizations, but I would recommend avoiding too much code complexity, especially as related to Thrust. I'd prefer to switch from Thrust to C++17 parallel algorithms, which will allow us to remove the VecDH class entirely and let the compiler handle these copies/moves internally. Hopefully the compilers might even take care of switching backends based on data size? I'm also hoping we'll get compilers that can target AMD GPUs from the parallel STL, though I'm not sure how far out that is.
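
For context, roughly what such a call would look like with the C++17 parallel algorithms, with a plain std::vector standing in for VecDH (a toy sketch, not code from this library):

```cpp
// Toy example: a transform written against the C++17 parallel algorithms.
// A plain std::vector replaces VecDH, and the execution policy (rather than a
// Thrust backend macro) selects how the work is parallelized.
#include <algorithm>
#include <execution>
#include <vector>

std::vector<float> Scale(const std::vector<float>& in, float factor) {
  std::vector<float> out(in.size());
  std::transform(std::execution::par_unseq, in.begin(), in.end(), out.begin(),
                 [factor](float x) { return x * factor; });
  return out;
}
```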

@pca006132 (Collaborator, Author)

The main performance improvement comes from these two patches: 168fd07 and a3dbe15, which are not Thrust-specific, so this should be fine. I will submit a PR soon.

As for the parallel STL, I've tried it with clang, which uses TBB for parallel execution, and the performance seems pretty good (for the cases I've tested). It seems that NVIDIA also supports the parallel STL with their nvc++ compiler, but I'm not sure whether that ships with the CUDA toolkit or something else.
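
For example, a standalone reduction that exercises the parallel STL; the compile commands in the comments reflect my understanding (libstdc++'s parallel algorithms are implemented on top of TBB, and nvc++ ships with the NVIDIA HPC SDK rather than the CUDA toolkit) and should be double-checked against the respective docs:

```cpp
// Standalone parallel-STL reduction.
// With libstdc++ (GCC or clang) the parallel policies run on top of TBB, e.g.:
//   clang++ -std=c++17 reduce.cpp -ltbb
// NVIDIA's nvc++ (from the HPC SDK, not the regular CUDA toolkit) can offload
// the same code to the GPU with:
//   nvc++ -std=c++17 -stdpar reduce.cpp
#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

int main() {
  std::vector<double> v(1 << 20, 1.0);
  double sum = std::reduce(std::execution::par_unseq, v.begin(), v.end(), 0.0);
  std::printf("sum = %f\n", sum);
}
```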

pca006132 mentioned this issue on May 9, 2022
@pca006132 (Collaborator, Author)

Another interesting behavior I just found: compiling with -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_OMP makes the code 2-3x SLOWER than the sequential CPP backend, which is really surprising.
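
For reference, this is the kind of host-side call that macro redirects (a toy sketch, not code from this library): a Thrust algorithm invoked on host iterators with no explicit execution policy dispatches to whichever host system was selected at compile time.

```cpp
// With no execution policy, this transform runs on the Thrust host system:
// the serial CPP backend by default, or OpenMP when compiled with
// -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_OMP.
#include <thrust/functional.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>

void Negate(thrust::host_vector<float>& v) {
  thrust::transform(v.begin(), v.end(), v.begin(), thrust::negate<float>());
}
```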

@elalish (Owner) commented May 9, 2022

Yeah, I quit using it because it seemed like OMP was not getting much love from Thrust. Maybe file an issue on their repo and see if they have any insights?

@elalish (Owner) commented Jun 4, 2022

Seems like you've addressed most of this and more, and the lazy boolean has its own issue. Should we call this fixed?

@pca006132 (Collaborator, Author)

Yes, I think we can call this fixed.

elalish closed this as completed on Jun 4, 2022