Performance tracking #105

Closed · pca006132 opened this issue May 7, 2022 · 6 comments

Comments

@pca006132 (Collaborator)

Since this library is intended to be fast, I figured we should have an issue to track its performance.

I did some microbenchmarking using the perfTest binary and a unionPerfTest program I wrote to test lazy union (unioning 100 spheres of diameter 2.5 with varying separation distance, i.e. they may or may not overlap). I tested the single-threaded CPP backend, the OpenMP backend, and the CUDA backend, all on the same laptop with an i5-8300H and a GTX 1050 Mobile. Here is my spreadsheet, and here is my branch with some optimizations that look quite effective for CUDA and small meshes :). I will open a separate PR for the branch after the build script PR is merged.
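
For reference, a rough sketch of the kind of union benchmark described above (illustrative only, not the actual unionPerfTest source; it assumes Manifold::Sphere, Translate, and the + boolean operator from the public API, and the grid layout and separation value are just examples):

```cpp
// Illustrative union microbenchmark: union 100 spheres of diameter 2.5 laid
// out on a 10x10 grid with a configurable separation distance.
// Not the actual unionPerfTest source; API names are assumed from the library.
#include <chrono>
#include <cstdio>
#include <glm/glm.hpp>
#include "manifold.h"

int main() {
  using namespace manifold;
  const float radius = 1.25f;     // diameter = 2.5
  const float separation = 2.0f;  // < 2.5 means neighboring spheres overlap

  auto start = std::chrono::high_resolution_clock::now();
  Manifold result;
  for (int i = 0; i < 10; ++i) {
    for (int j = 0; j < 10; ++j) {
      result = result + Manifold::Sphere(radius).Translate(
                            glm::vec3(i * separation, j * separation, 0.0f));
    }
  }
  // Query the result so any lazily deferred booleans are evaluated before the
  // clock stops.
  std::printf("triangles: %d\n", result.NumTri());
  auto end = std::chrono::high_resolution_clock::now();
  std::printf("union of 100 spheres took %.3f ms\n",
              std::chrono::duration<double, std::milli>(end - start).count());
}
```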

From the results, we can see that:

  1. For small meshes, OMP > CPP > CUDA. I wonder if we could run the Thrust APIs on OMP for operations involving fewer than ~1k vertices or so.
  2. For more complicated models, using the CUDA backend may exhaust GPU memory. Is it possible to fall back to CPP mode when that happens?
  3. We have to be careful about copying vectors: eliminating unnecessary copies reduces the execution time for the CUDA backend significantly (-30% in the microbenchmark). One way would be to disable the copy constructor for VecDH and provide an explicit copy method (see the sketch after this list).
  4. It seems that multithreading does not help much in our case, even though my computer has 4 cores / 8 threads. Perhaps we should look into this, because quite a lot of users may not have a GPU with CUDA (my main laptop uses AMD, so no CUDA there).
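
A minimal sketch of what point 3 could look like (illustrative only; the real VecDH also manages device storage and is more involved than this):

```cpp
// Illustrative only: a simplified VecDH-like wrapper with the implicit copy
// constructor deleted, so every deep copy has to be spelled out explicitly.
// The real VecDH also manages device (GPU) storage, which is omitted here.
#include <cstddef>
#include <vector>

template <typename T>
class VecDH {
 public:
  VecDH() = default;
  explicit VecDH(std::size_t size) : host_(size) {}

  // No silent deep copies: callers must pass by reference or move instead.
  VecDH(const VecDH&) = delete;
  VecDH& operator=(const VecDH&) = delete;

  // Moves stay cheap and implicit.
  VecDH(VecDH&&) noexcept = default;
  VecDH& operator=(VecDH&&) noexcept = default;

  // The one intentional way to make a deep copy.
  VecDH Copy() const {
    VecDH out;
    out.host_ = host_;
    return out;
  }

  std::size_t size() const { return host_.size(); }

 private:
  std::vector<T> host_;
};
```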
@elalish (Owner) commented May 9, 2022

I'm open to these optimizations, but I would recommend avoiding too much code complexity, especially as related to Thrust. I'd prefer to switch from Thrust to C++17 parallel algorithms, which will allow us to remove the VecDH class entirely and let the compiler handle these copies/moves internally. Hopefully the compilers might even take care of switching backends based on data size? I'm also hoping we'll get compilers that can target AMD GPUs from the parallel STL, though I'm not sure how far out that is.
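
For context, roughly what such a call would look like with the C++17 parallel algorithms, with a plain std::vector standing in for VecDH (a toy sketch, not code from this library):

```cpp
// Toy example: a transform written against the C++17 parallel algorithms.
// A plain std::vector replaces VecDH, and the execution policy (rather than a
// Thrust backend macro) selects how the work is parallelized.
#include <algorithm>
#include <execution>
#include <vector>

std::vector<float> Scale(const std::vector<float>& in, float factor) {
  std::vector<float> out(in.size());
  std::transform(std::execution::par_unseq, in.begin(), in.end(), out.begin(),
                 [factor](float x) { return x * factor; });
  return out;
}
```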

@pca006132 (Collaborator, Author)

The main performance improvement comes from these two patches: 168fd07 and a3dbe15, which are not Thrust-specific, so this should be fine. I will submit a PR soon.

As for the parallel STL, I've tried it with clang, which uses TBB for parallel execution, and the performance seems pretty good (for the cases I've tested). It seems that NVIDIA also supports the parallel STL with their nvc++ compiler, but I'm not sure whether that ships with the CUDA toolkit or something else.
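
For example, a standalone reduction that exercises the parallel STL; the compile commands in the comments reflect my understanding (libstdc++'s parallel algorithms are implemented on top of TBB, and nvc++ ships with the NVIDIA HPC SDK rather than the CUDA toolkit) and should be double-checked against the respective docs:

```cpp
// Standalone parallel-STL reduction.
// With libstdc++ (GCC or clang) the parallel policies run on top of TBB, e.g.:
//   clang++ -std=c++17 reduce.cpp -ltbb
// NVIDIA's nvc++ (from the HPC SDK, not the regular CUDA toolkit) can offload
// the same code to the GPU with:
//   nvc++ -std=c++17 -stdpar reduce.cpp
#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

int main() {
  std::vector<double> v(1 << 20, 1.0);
  double sum = std::reduce(std::execution::par_unseq, v.begin(), v.end(), 0.0);
  std::printf("sum = %f\n", sum);
}
```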

pca006132 mentioned this issue on May 9, 2022
@pca006132 (Collaborator, Author)

Another interesting behavior I just found: compiling with -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_OMP makes the code 2-3x SLOWER than the sequential CPP backend, which is really surprising.
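
For reference, this is the kind of host-side call that macro redirects (a toy sketch, not code from this library): a Thrust algorithm invoked on host iterators with no explicit execution policy dispatches to whichever host system was selected at compile time.

```cpp
// With no execution policy, this transform runs on the Thrust host system:
// the serial CPP backend by default, or OpenMP when compiled with
// -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_OMP.
#include <thrust/functional.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>

void Negate(thrust::host_vector<float>& v) {
  thrust::transform(v.begin(), v.end(), v.begin(), thrust::negate<float>());
}
```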

@elalish (Owner) commented May 9, 2022

Yeah, I quit using it because it seemed like OMP was not getting much love from Thrust. Maybe file an issue on their repo and see if they have any insights?

@elalish (Owner) commented Jun 4, 2022

Seems like you've addressed most of this and more, and the lazy boolean has its own issue. Should we call this fixed?

@pca006132 (Collaborator, Author)

Yes, I think we can call this fixed.

elalish closed this as completed on Jun 4, 2022