optimize collision detection #235

Merged
merged 1 commit into from
Oct 10, 2022

pca006132
Collaborator

Reduces the number of atomic instructions needed and performs the allocation only once. This improves performance for complicated models by ~10%. The idea is to run collision detection in two phases: the first counts the collisions for each query, and the second actually stores the collision information. After the first phase, an exclusive scan over the counts gives both the total count and each query's index offset, so every query gets unique output indices without atomics.
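
A minimal, serial sketch of the count / scan / fill scheme described above. It is not the code in this PR: `Interval`, `Overlaps` and `TwoPhaseCollisions` are hypothetical names, a brute-force 1D interval test stands in for the BVH traversal, and the scan is written by hand (`std::exclusive_scan` from `<numeric>` in C++17 would do the same job).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the real bounding-box test: closed 1D intervals.
struct Interval {
  int lo, hi;
};

inline bool Overlaps(const Interval& a, const Interval& b) {
  return a.lo <= b.hi && b.lo <= a.hi;
}

// Returns (query index, item index) pairs, grouped by query.
inline std::vector<std::pair<int, int>> TwoPhaseCollisions(
    const std::vector<Interval>& queries, const std::vector<Interval>& items) {
  const std::size_t n = queries.size();
  // One extra slot: after the scan, offsets[n] holds the total hit count.
  std::vector<int> offsets(n + 1, 0);

  // Phase 1: count only. Each query touches only its own slot, so a parallel
  // version of this loop would need no atomics.
  for (std::size_t q = 0; q < n; ++q)
    for (const Interval& item : items)
      if (Overlaps(queries[q], item)) ++offsets[q];

  // Exclusive scan (written by hand here): counts become start offsets.
  int running = 0;
  for (std::size_t q = 0; q <= n; ++q) {
    const int count = offsets[q];
    offsets[q] = running;
    running += count;
  }

  // Allocate exactly once, with exactly the required size.
  std::vector<std::pair<int, int>> out(static_cast<std::size_t>(offsets[n]));

  // Phase 2: repeat the same tests, writing into this query's private range.
  for (std::size_t q = 0; q < n; ++q) {
    int next = offsets[q];
    for (std::size_t i = 0; i < items.size(); ++i)
      if (Overlaps(queries[q], items[i]))
        out[static_cast<std::size_t>(next++)] = {static_cast<int>(q),
                                                 static_cast<int>(i)};
    assert(next == offsets[q + 1]);  // phase 1 and phase 2 must agree
  }
  return out;
}
```

The trade-off is that the overlap tests run twice; in exchange there is no shared counter and no over- or under-sized output buffer.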

(I just thought of this while lying in bed, and it turns out it really does improve performance.)

@pca006132 pca006132 requested a review from elalish October 9, 2022 18:50

codecov-commenter commented Oct 9, 2022

Codecov Report

Base: 92.12% // Head: 92.13% // Increases project coverage by +0.01% 🎉

Coverage data is based on head (0490c36) compared to base (bc83f65).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #235      +/-   ##
==========================================
+ Coverage   92.12%   92.13%   +0.01%     
==========================================
  Files          32       32              
  Lines        3429     3421       -8     
==========================================
- Hits         3159     3152       -7     
+ Misses        270      269       -1     
| Impacted Files | Coverage Δ | |
| --- | --- | --- |
| src/collider/src/collider.cpp | 95.68% <100.00%> (-0.22%) | ⬇️ |
| src/utilities/include/vec_dh.h | 82.50% <0.00%> (+0.40%) | ⬆️ |

Owner

@elalish elalish left a comment

Very interesting, thanks!

@@ -157,26 +157,28 @@ struct CreateRadixTree {
   }
 };

-template <typename T>
+template <typename T, const bool preallocate>
Owner

Nit: maybe allocateOnly?

int traverse2 = RecordCollision(child2, query);
if (traverse2 < 0) return;
Owner

Why is this no longer useful?

Collaborator Author

Because RecordCollision no longer exits early when there is insufficient memory: the first pass calculates the number of elements, and we allocate exactly that many.

Owner

Ah, right; I'd forgotten about that. Thanks!
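
To illustrate the point in the thread above: once the output is sized exactly, neither mode of the recorder can run out of space, so there is no failure code for the caller to check. The sketch below is a hypothetical simplification, not the project's `FindCollisions` or `RecordCollision`; `Recorder`, `countOnly` and the member names are assumptions.

```cpp
#include <cassert>

// Hypothetical recorder with a compile-time mode switch.
template <bool countOnly>
struct Recorder {
  int* out;     // unused (may be nullptr) when countOnly is true
  int* counts;  // pass 1: per-query count; pass 2: per-query write cursor

  // Returns void: with exact sizing there is no out-of-memory path to report.
  void Record(int query, int hit) const {
    assert(counts != nullptr);
    if (countOnly) {
      ++counts[query];  // pass 1: only count the hit
    } else {
      out[counts[query]++] = hit;  // pass 2: write at this query's cursor
    }
  }
};
```

In pass 2 the `counts` array is assumed to hold the start offsets produced by the exclusive scan, so each query advances its own cursor and never collides with another query's range.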

// actually recording collisions
for_each_n(
policy, zip(querriesIn.cbegin(), countAt(0)), querriesIn.size(),
FindCollisions<T, false>({querryTri.ptrDpq(), counts.ptrD(),
Owner

It's a little surprising to me that running this function twice is faster than running it once simply because of removing the atomics. Would you mind doing a touch more perf comparison? I'd like to know how it compares for CPU and GPU, for meshes with lots of intersections (menger sponge is good) and few (our perf spheres are good).

int maxOverlaps = querriesIn.size() * 4;
SparseIndices querryTri(maxOverlaps);
int nOverlaps = 0;
while (1) {
Owner

It's certainly nice to remove this; I use similar logic in LevelSet; I'm curious if the same applies, or if it becomes a perf problem when the SDF function is computationally heavy.

Collaborator Author

Sorry, I forgot to address this. I think it should be possible, but it will need some tests to see whether the performance improvement is significant.
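
For contrast, the guess-and-retry pattern discussed in this thread can be sketched roughly as follows. This is an illustration under assumptions, not the removed code: `RecordAll` and `CollectWithRetry` are hypothetical names, the list of hits is given up front purely to keep the example small, and the buffer grows to the exact required size rather than by a fixed multiple.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single pass: every hit bumps one shared atomic counter. Hits
// that do not fit in `out` are still counted, but are dropped.
inline std::size_t RecordAll(const std::vector<int>& hits,
                             std::vector<int>& out) {
  std::atomic<std::size_t> used{0};
  for (int hit : hits) {  // in real code this loop body would run in parallel
    const std::size_t slot = used.fetch_add(1);
    if (slot < out.size()) out[slot] = hit;
  }
  return used.load();
}

// Guess a capacity, run the pass, and redo it from scratch if it overflowed.
inline std::vector<int> CollectWithRetry(const std::vector<int>& hits,
                                         std::size_t initialCapacity,
                                         int* passes) {
  assert(passes != nullptr);
  std::vector<int> out(initialCapacity);
  *passes = 0;
  while (true) {
    ++*passes;
    const std::size_t needed = RecordAll(hits, out);
    if (needed <= out.size()) {
      out.resize(needed);  // trim the unused tail
      return out;
    }
    out.resize(needed);  // too small: grow, then redo the whole pass
  }
}
```

Whether a counting pass beats this loop when each evaluation is expensive (as with a heavy SDF function) is exactly the open question raised above; the sketch only shows where the repeated work and the shared counter come from.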

@pca006132
Collaborator Author

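Regarding performance: for the older version I measured only the kernel launch of the last loop iteration, and for the newer version I measured each kernel launch. The instrumentation patch for the newer version comes first, followed by the one for the older version.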

diff --git a/src/collider/src/collider.cpp b/src/collider/src/collider.cpp
index c13f33d..9f0b816 100644
--- a/src/collider/src/collider.cpp
+++ b/src/collider/src/collider.cpp
@@ -21,6 +21,8 @@
 #include <intrin.h>
 #endif
 
+#include <chrono>
+
 // Adjustable parameters
 constexpr int kInitialLength = 128;
 constexpr int kLengthMultiple = 4;
@@ -277,18 +279,45 @@ SparseIndices Collider::Collisions(const VecDH<T>& querriesIn) const {
   auto policy = autoPolicy(querriesIn.size());
   // compute the number of collisions to determine the size for allocation and
   // offset, this avoids the need for atomic
+  auto start = std::chrono::high_resolution_clock::now();
   for_each_n(policy, zip(querriesIn.cbegin(), countAt(0)), querriesIn.size(),
              FindCollisions<T, true>(
                  {thrust::pair<int*, int*>(nullptr, nullptr), counts.ptrD(),
                   nodeBBox_.ptrD(), internalChildren_.ptrD()}));
+  auto allocateEnd = std::chrono::high_resolution_clock::now();
   // compute start index for each query and total count
-  exclusive_scan(policy, counts.begin(), counts.end(), counts.begin());
+  exclusive_scan(ExecutionPolicy::Par, counts.begin(), counts.end(),
+                 counts.begin());
   SparseIndices querryTri(counts.back());
+  auto scanEnd = std::chrono::high_resolution_clock::now();
   // actually recording collisions
   for_each_n(
       policy, zip(querriesIn.cbegin(), countAt(0)), querriesIn.size(),
       FindCollisions<T, false>({querryTri.ptrDpq(), counts.ptrD(),
                                 nodeBBox_.ptrD(), internalChildren_.ptrD()}));
+  auto opEnd = std::chrono::high_resolution_clock::now();
+
+  std::cout << "First pass:"
+            << std::chrono::duration_cast<std::chrono::microseconds>(
+                   allocateEnd - start)
+                   .count()
+            << "us" << std::endl;
+  std::cout << "Exclusive scan: "
+            << std::chrono::duration_cast<std::chrono::microseconds>(
+                   scanEnd - allocateEnd)
+                   .count()
+            << "us" << std::endl;
+  std::cout << "Second pass: "
+            << std::chrono::duration_cast<std::chrono::microseconds>(opEnd -
+                                                                     scanEnd)
+                   .count()
+            << "us" << std::endl;
+  std::cout << "Total: "
+            << std::chrono::duration_cast<std::chrono::microseconds>(opEnd -
+                                                                     start)
+                   .count()
+            << "us" << std::endl
+            << std::endl;
   return querryTri;
 }

and

diff --git a/src/collider/src/collider.cpp b/src/collider/src/collider.cpp
index 9fd62f7..d9b1399 100644
--- a/src/collider/src/collider.cpp
+++ b/src/collider/src/collider.cpp
@@ -21,6 +21,8 @@
 #include <intrin.h>
 #endif
 
+#include <chrono>
+
 // Adjustable parameters
 constexpr int kInitialLength = 128;
 constexpr int kLengthMultiple = 4;
@@ -271,15 +273,20 @@ SparseIndices Collider::Collisions(const VecDH<T>& querriesIn) const {
   int maxOverlaps = querriesIn.size() * 4;
   SparseIndices querryTri(maxOverlaps);
   int nOverlaps = 0;
+
+  long us = 0;
   while (1) {
     // scalar number of overlaps found
     VecDH<int> nOverlapsD(1, 0);
     // calculate Bounding Box overlaps
+    auto start = std::chrono::high_resolution_clock::now();
     for_each_n(
         autoPolicy(querriesIn.size()), zip(querriesIn.cbegin(), countAt(0)),
         querriesIn.size(),
         FindCollisions<T>({querryTri.ptrDpq(), nOverlapsD.ptrD(), maxOverlaps,
                            nodeBBox_.ptrD(), internalChildren_.ptrD()}));
+    auto end = std::chrono::high_resolution_clock::now();
+    us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
     nOverlaps = nOverlapsD[0];
     if (nOverlaps <= maxOverlaps)
       break;
@@ -293,6 +300,7 @@ SparseIndices Collider::Collisions(const VecDH<T>& querriesIn) const {
       querryTri.Resize(maxOverlaps);
     }
   }
+  std::cout << "Last pass: " << us << "us" << std::endl;
   // remove unused part of array
   querryTri.Resize(nOverlaps);
   return querryTri;

Measurements:

Samples.Sponge4 with GPU enabled (all times in microseconds; old version first, new version below the dashed line, with the intermediate timings removed for easier comparison):

Last pass: 14956us
Last pass: 14989us
Last pass: 28us
Last pass: 25us
Last pass: 148704us
Last pass: 12597us
Last pass: 131944us
Last pass: 616us
Last pass: 2775us
Last pass: 524us
Last pass: 27us
Last pass: 133us
Last pass: 745us
Last pass: 7198us
Last pass: 102us
Last pass: 0us
[       OK ] Samples.Sponge4 (5358 ms)
-----------------------------------------------
Total: 2595us
Total: 2438us
Total: 41us
Total: 26us
Total: 40638us
Total: 12752us
Total: 35032us
Total: 3438us
Total: 9165us
Total: 1686us
Total: 23us
Total: 1155us
Total: 2263us
Total: 13014us
Total: 1151us
Total: 167us
[       OK ] Samples.Sponge4 (4929 ms)

Samples.Sponge4 with GPU disabled:

Last pass: 15515us
Last pass: 13730us
Last pass: 14us
Last pass: 31us
Last pass: 136345us
Last pass: 149043us
Last pass: 125201us
Last pass: 14583us
Last pass: 2537us
Last pass: 4690us
Last pass: 74us
Last pass: 15320us
Last pass: 34985us
Last pass: 4748us
Last pass: 14861us
Last pass: 0us
[       OK ] Samples.Sponge4 (6689 ms)
-----------------------------------------------
Total: 2328us
Total: 2254us
Total: 21us
Total: 24us
Total: 37263us
Total: 26838us
Total: 36122us
Total: 4203us
Total: 4684us
Total: 3844us
Total: 60us
Total: 1859us
Total: 4060us
Total: 8444us
Total: 1206us
Total: 1us
[       OK ] Samples.Sponge4 (5815 ms)

perfTest with GPU enabled:

Last pass: 46us
Last pass: 45us
Last pass: 31us
Last pass: 27us
nTri = 512, time = 0.00192921 sec
Last pass: 163us
Last pass: 155us
Last pass: 171us
Last pass: 164us
nTri = 2048, time = 0.00514149 sec
Last pass: 63us
Last pass: 62us
Last pass: 235us
Last pass: 233us
nTri = 8192, time = 0.0115131 sec
Last pass: 569us
Last pass: 391us
Last pass: 1296us
Last pass: 1189us
nTri = 32768, time = 0.0382821 sec
Last pass: 487us
Last pass: 469us
Last pass: 1362us
Last pass: 1263us
nTri = 131072, time = 0.0665376 sec
Last pass: 670us
Last pass: 592us
Last pass: 3167us
Last pass: 3491us
nTri = 524288, time = 0.160342 sec
Last pass: 882us
Last pass: 1079us
Last pass: 12695us
Last pass: 13856us
nTri = 2097152, time = 0.482283 sec
-----------------------------------------------
Total: 71us
Total: 71us
Total: 57us
Total: 52us
nTri = 512, time = 0.00196068 sec
Total: 214us
Total: 217us
Total: 276us
Total: 277us
nTri = 2048, time = 0.00514988 sec
Total: 86us
Total: 82us
Total: 147us
Total: 221us
nTri = 8192, time = 0.0108598 sec
Total: 195us
Total: 203us
Total: 754us
Total: 619us
nTri = 32768, time = 0.0357975 sec
Total: 789us
Total: 709us
Total: 1399us
Total: 1621us
nTri = 131072, time = 0.0606809 sec
Total: 1772us
Total: 1771us
Total: 5726us
Total: 5415us
nTri = 524288, time = 0.161661 sec
Total: 5931us
Total: 6019us
Total: 20939us
Total: 21027us
nTri = 2097152, time = 0.506978 sec

perfTest with GPU disabled:

Last pass: 47us
Last pass: 46us
Last pass: 32us
Last pass: 28us
nTri = 512, time = 0.00121911 sec
Last pass: 179us
Last pass: 196us
Last pass: 175us
Last pass: 178us
nTri = 2048, time = 0.00480173 sec
Last pass: 55us
Last pass: 60us
Last pass: 265us
Last pass: 354us
nTri = 8192, time = 0.0105302 sec
Last pass: 145us
Last pass: 159us
Last pass: 977us
Last pass: 1206us
nTri = 32768, time = 0.0321581 sec
Last pass: 433us
Last pass: 470us
Last pass: 4077us
Last pass: 3797us
nTri = 131072, time = 0.0570639 sec
Last pass: 1251us
Last pass: 1346us
Last pass: 15618us
Last pass: 15108us
nTri = 524288, time = 0.174977 sec
Last pass: 4647us
Last pass: 4740us
Last pass: 62250us
Last pass: 61359us
nTri = 2097152, time = 0.743967 sec
Last pass: 18869us
Last pass: 19588us
Last pass: 260384us
Last pass: 260623us
nTri = 8388608, time = 3.17449 sec
-----------------------------------------------
Total: 67us
Total: 70us
Total: 58us
Total: 50us
nTri = 512, time = 0.00125373 sec
Total: 219us
Total: 213us
Total: 290us
Total: 289us
nTri = 2048, time = 0.00445196 sec
Total: 78us
Total: 78us
Total: 141us
Total: 313us
nTri = 8192, time = 0.0100573 sec
Total: 188us
Total: 195us
Total: 644us
Total: 741us
nTri = 32768, time = 0.0299541 sec
Total: 537us
Total: 584us
Total: 2096us
Total: 2159us
nTri = 131072, time = 0.05087 sec
Total: 1916us
Total: 2071us
Total: 8933us
Total: 9114us
nTri = 524288, time = 0.162212 sec
Total: 7092us
Total: 7066us
Total: 40636us
Total: 41369us
nTri = 2097152, time = 0.711159 sec
Total: 30125us
Total: 30681us
Total: 190289us
Total: 195106us
nTri = 8388608, time = 3.10329 sec

TL;DR: the newer version is slightly slower for perfTest with GPU enabled, but consistently faster than the old version in the other cases.
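
As a rough cross-check of the ~10% figure in the description, the wall-clock times printed by the test runner for Samples.Sponge4 above can be compared directly. This is only a sketch: `ReductionPercent` is a hypothetical helper, and the inputs are single runs whose times include the timing output itself.

```cpp
#include <cassert>

// Percentage by which the new time is lower than the old one.
inline double ReductionPercent(double oldMs, double newMs) {
  assert(oldMs > 0.0);
  return 100.0 * (oldMs - newMs) / oldMs;
}
```

With the numbers above, `ReductionPercent(5358, 4929)` gives roughly 8% for the GPU run and `ReductionPercent(6689, 5815)` gives roughly 13% for the CPU run.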

Owner

@elalish elalish left a comment

Thanks for the perf comparison; looks great!

@elalish elalish merged commit d15b872 into elalish:master Oct 10, 2022
@pca006132 pca006132 deleted the cuda-fix2 branch August 15, 2023 12:54
cartesian-theatrics pushed a commit to SovereignShop/manifold that referenced this pull request Mar 11, 2024
collider: improve performance
