universal vector #121

Merged: 4 commits into elalish:master from universal-vector, May 17, 2022

Conversation

pca006132 (Collaborator):

Use a universal vector to implement VecDH. Closes #120. This gives some performance improvement, reduces memory overhead, and lets the GPU do demand paging, i.e. it will not OOM when GPU memory is insufficient.
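
The no-OOM behavior comes from CUDA managed memory, which thrust::universal_vector allocates. A minimal standalone sketch of the effect (not project code; the size is an arbitrary assumption meant to exceed a small GPU's memory):

```cpp
#include <cstddef>
#include <cstdio>
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>
#include <thrust/universal_vector.h>

int main() {
  // Managed allocations may exceed physical GPU memory; pages migrate between
  // host and device on demand, so this can succeed where a plain
  // thrust::device_vector of the same size would throw bad_alloc.
  const size_t n = size_t(3) << 30;  // ~12 GB of int; adjust for your system
  thrust::universal_vector<int> v(n, 1);
  const long long sum = thrust::reduce(thrust::device, v.begin(), v.end(), 0LL);
  std::printf("sum = %lld\n", sum);
  return 0;
}
```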

Note:

  1. thrust::universal_vector<T>::push_back is very slow; I've already filed an issue upstream. I worked around this problem by implementing a cache that is only used when we push elements to a vector or reserve memory (see the sketch after this list).
  2. This requires a sufficiently recent CUDA toolkit (11.4) for the universal_vector header. I haven't yet tried compiling against the thrust submodule with an older version of CUDA, so I'm not sure whether that works.
  3. We now have to manually annotate the thrust functions with an executor. This might actually be a feature, since it could let us choose the backend depending on the workload.
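
A minimal sketch of the caching idea from item 1 (hypothetical names; the real implementation is in utilities/include/vec_dh.h): appends go to a plain std::vector on the host, and the universal_vector is only filled by one bulk insert before use, sidestepping the slow per-element push_back.

```cpp
#include <cstddef>
#include <thrust/universal_vector.h>
#include <vector>

// Hypothetical illustration of the push_back workaround: buffer appends in a
// host-side std::vector, then move them into the universal_vector with a
// single bulk insert.
template <typename T>
class CachedVec {
 public:
  void reserve(std::size_t n) { cache_.reserve(n); }

  // Fast path: append to the host cache, never to the universal_vector,
  // whose per-element push_back is very slow (the upstream thrust issue).
  void push_back(const T& v) { cache_.push_back(v); }

  // Flush the cache with one bulk copy before any bulk or device-side use.
  thrust::universal_vector<T>& synced() {
    if (!cache_.empty()) {
      vec_.insert(vec_.end(), cache_.begin(), cache_.end());
      cache_.clear();
    }
    return vec_;
  }

 private:
  thrust::universal_vector<T> vec_;
  std::vector<T> cache_;
};
```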

Benchmark (note that my computer only has 8 GB of RAM; at nTri = 8388608 the previous version was hitting swap and was very slow, so the improvement there is not actually as high as it looks):

====== CPP, Host Device Vector =======

nTri = 512, time = 0.00166264 sec
nTri = 2048, time = 0.00448657 sec
nTri = 8192, time = 0.0138731 sec
nTri = 32768, time = 0.0546407 sec
nTri = 131072, time = 0.219077 sec
nTri = 524288, time = 0.898074 sec
nTri = 2097152, time = 3.75272 sec
nTri = 8388608, time = 29.3012 sec
	Command being timed: "./perfTest"
	User time (seconds): 25.34
	System time (seconds): 7.71
	Percent of CPU this job got: 79%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:41.75
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 6927824
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 7640
	Minor (reclaiming a frame) page faults: 6355521
	Voluntary context switches: 8696
	Involuntary context switches: 2278
	Swaps: 0
	File system inputs: 464256
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

====== CPP, Unified Memory =======

nTri = 512, time = 0.00234139 sec
nTri = 2048, time = 0.00402178 sec
nTri = 8192, time = 0.0129426 sec
nTri = 32768, time = 0.0496874 sec
nTri = 131072, time = 0.201387 sec
nTri = 524288, time = 0.824124 sec
nTri = 2097152, time = 3.29724 sec
nTri = 8388608, time = 14.9187 sec
	Command being timed: "./perfTest"
	User time (seconds): 22.84
	System time (seconds): 3.47
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:26.31
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 5378220
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 2
	Minor (reclaiming a frame) page faults: 4284852
	Voluntary context switches: 2
	Involuntary context switches: 79
	Swaps: 0
	File system inputs: 136
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0



====== OMP, Host Device Vector =======
nTri = 512, time = 0.00135095 sec
nTri = 2048, time = 0.00298485 sec
nTri = 8192, time = 0.00762013 sec
nTri = 32768, time = 0.0314378 sec
nTri = 131072, time = 0.134311 sec
nTri = 524288, time = 0.606134 sec
nTri = 2097152, time = 2.51857 sec
nTri = 8388608, time = 23.9099 sec
	Command being timed: "./perfTest"
	User time (seconds): 141.85
	System time (seconds): 16.85
	Percent of CPU this job got: 483%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:32.84
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 7058568
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 18459
	Minor (reclaiming a frame) page faults: 7395832
	Voluntary context switches: 20115
	Involuntary context switches: 10690
	Swaps: 0
	File system inputs: 966904
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

====== OMP, Unified Memory =======
nTri = 512, time = 0.00122855 sec
nTri = 2048, time = 0.00253237 sec
nTri = 8192, time = 0.00616688 sec
nTri = 32768, time = 0.0220143 sec
nTri = 131072, time = 0.0980233 sec
nTri = 524288, time = 0.463016 sec
nTri = 2097152, time = 1.98967 sec
nTri = 8388608, time = 8.25589 sec
	Command being timed: "./perfTest"
	User time (seconds): 106.40
	System time (seconds): 11.07
	Percent of CPU this job got: 765%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.34
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 5657400
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 15
	Minor (reclaiming a frame) page faults: 5149280
	Voluntary context switches: 232
	Involuntary context switches: 1642
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0



====== CUDA, Host Device Vector =======
nTri = 512, time = 0.00396219 sec
nTri = 2048, time = 0.00530954 sec
nTri = 8192, time = 0.00963976 sec
nTri = 32768, time = 0.0254835 sec
nTri = 131072, time = 0.0854017 sec
nTri = 524288, time = 0.319703 sec
nTri = 2097152, time = 1.2162 sec
(OOMed)
	Command being timed: "./perfTest"
	User time (seconds): 2.95
	System time (seconds): 0.87
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.83
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1379072
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1
	Minor (reclaiming a frame) page faults: 854796
	Voluntary context switches: 82
	Involuntary context switches: 9
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

====== CUDA, Unified Memory =======
nTri = 512, time = 0.0123526 sec
nTri = 2048, time = 0.0141026 sec
nTri = 8192, time = 0.0195818 sec
nTri = 32768, time = 0.0374303 sec
nTri = 131072, time = 0.120768 sec
nTri = 524288, time = 0.266435 sec
nTri = 2097152, time = 0.924646 sec
nTri = 8388608, time = 4.01982 sec
	Command being timed: "./perfTest"
	User time (seconds): 5.78
	System time (seconds): 1.68
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.48
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1744420
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 12576
	Minor (reclaiming a frame) page faults: 216926
	Voluntary context switches: 133
	Involuntary context switches: 49
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

elalish (Owner) left a comment:

Excellent job working around thrust's bug. The improvements here in performance, memory, and code quality are huge! Let's clean it up a little and merge it.

-VecDH<bool> wholeHalfedgeP(inP_.halfedge_.size(), true);
-VecDH<bool> wholeHalfedgeQ(inQ_.halfedge_.size(), true);
+VecDH<char> wholeHalfedgeP(inP_.halfedge_.size(), true);
+VecDH<char> wholeHalfedgeQ(inQ_.halfedge_.size(), true);
elalish (Owner):

I suppose because CUDA doesn't support addressing single bits? Works for me.

pca006132 (Collaborator, Author) commented May 15, 2022:

No, this is because std::vector<T> is used for the cache, and you know the problem with std::vector<bool>: it's a bit-packed specialization, so its elements aren't addressable like a normal array.
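
For reference, a standalone illustration of the std::vector<bool> problem (plain standard C++, not project code):

```cpp
#include <vector>

int main() {
  std::vector<char> chars(8, 1);
  char* p = chars.data();  // fine: contiguous storage, one byte per element
  (void)p;

  std::vector<bool> bits(8, true);
  // bool* q = bits.data();  // ill-formed: the bool specialization is
  //                         // bit-packed and has no data() member at all.
  auto ref = bits[0];  // operator[] yields a proxy object, not bool&
  ref = false;         // writes through the proxy into the packed bits
  return 0;
}
```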

* performance.
* Note that it is *NOT SAFE* to first obtain a host(device) pointer, perform
* some device(host) modification, and then read the host(device) pointer again
* (on the same vector). The memory will be inconsistent in that case.
elalish (Owner):

👍

pca006132 (Collaborator, Author):

I'm also wondering whether it would be good to add a simple wrapper over some thrust functions like for_each that chooses whether to run on the host or on the device. Since we use unified memory now, it is safe to run them on the host whenever they can run on the device.
IIRC thrust can also specify which backend to use, but I have not yet tried that. I guess it would be useful for implementing a dynamic backend.
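
A minimal sketch of such a wrapper (hypothetical; the size threshold is an assumed heuristic, not anything this PR implements):

```cpp
#include <cstddef>
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>

// Hypothetical dispatch wrapper: with unified memory the same iterators are
// valid on both sides, so small workloads can run on the host (avoiding
// kernel-launch overhead) while large ones go to the device. The functor
// must be marked __host__ __device__ for this to compile for both targets.
template <typename Iter, typename F>
void for_each_auto(Iter first, Iter last, F f) {
  constexpr std::size_t kDeviceThreshold = 1 << 16;  // assumed cutoff
  if (static_cast<std::size_t>(last - first) < kDeviceThreshold) {
    thrust::for_each(thrust::host, first, last, f);
  } else {
    thrust::for_each(thrust::device, first, last, f);
  }
}
```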

elalish (Owner) commented May 15, 2022:

A wrapper for a dynamic backend could be cool. How would you choose between device and host? Length of vector? Let's save that for a follow-on PR.

pca006132 (Collaborator, Author):

For the dynamic backend, the idea is just to use an enum for the backend tag and fall back to a supported backend when the specified one is not available. It seems that thrust uses a distinct type for each execution policy, so we can't do this through their API directly; we'd have to write some templates that generate code for the different backends and select the appropriate one at runtime.
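
A sketch of what that could look like (hypothetical design; MANIFOLD_USE_CUDA is an assumed build flag used purely for illustration):

```cpp
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>

enum class Backend { Host, Cuda };

// Hypothetical runtime dispatch: each thrust execution policy is a distinct
// type, so a runtime tag has to branch to separately instantiated calls.
// If the requested backend was not compiled in, fall back to one that was.
template <typename Iter, typename F>
void for_each_backend(Backend b, Iter first, Iter last, F f) {
#ifdef MANIFOLD_USE_CUDA  // assumed build flag
  if (b == Backend::Cuda) {
    thrust::for_each(thrust::device, first, last, f);
    return;
  }
#endif
  thrust::for_each(thrust::host, first, last, f);  // always-available fallback
}
```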

pca006132 (Collaborator, Author) commented May 16, 2022:

One issue with the CUDA backend: allocation of VecDH is now slower than before, and I'm not sure why.

It seems to be generating a lot of page faults that require IO. Investigating...

#include <iostream>
#include <thrust/universal_vector.h>
#include <thrust/execution_policy.h>
#include "structs.h"
elalish (Owner):

Do we need structs or iostream in here?

@@ -18,6 +18,7 @@
#include <thrust/remove.h>
#include <thrust/sort.h>
#include <thrust/unique.h>
#include <thrust/execution_policy.h>
elalish (Owner):

nit: can remove this too

elalish (Owner) left a comment:

Looking great, thank you!

elalish merged commit bfc8e5c into elalish:master on May 17, 2022.
pca006132 deleted the universal-vector branch on December 22, 2022 at 05:35.
cartesian-theatrics pushed a commit to SovereignShop/manifold referencing this pull request on Mar 11, 2024.