universal vector #121

Merged: 4 commits into elalish:master from universal-vector, May 17, 2022

Conversation

pca006132 (Collaborator):

Use a universal vector to implement VecDH. Closes #120. This gives some performance improvement, reduces memory overhead, and lets the GPU do demand paging, i.e. it will not OOM when GPU memory is insufficient.
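
The no-OOM behavior comes from CUDA managed memory, which thrust::universal_vector allocates. A minimal standalone sketch of the effect (not project code; the size is an arbitrary assumption meant to exceed a small GPU's memory):

```cpp
#include <cstddef>
#include <cstdio>
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>
#include <thrust/universal_vector.h>

int main() {
  // Managed allocations may exceed physical GPU memory; pages migrate between
  // host and device on demand, so this can succeed where a plain
  // thrust::device_vector of the same size would throw bad_alloc.
  const size_t n = size_t(3) << 30;  // ~12 GB of int; adjust for your system
  thrust::universal_vector<int> v(n, 1);
  const long long sum = thrust::reduce(thrust::device, v.begin(), v.end(), 0LL);
  std::printf("sum = %lld\n", sum);
  return 0;
}
```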

Note:

  1. thrust::universal_vector<T>::push_back is very slow; I've already filed an issue upstream. I worked around this problem by implementing a cache that is only used when we push elements to a vector or reserve memory (see the sketch after this list).
  2. This requires a sufficiently recent CUDA toolkit (11.4) for the universal_vector header. I haven't yet tried compiling against the thrust submodule with an older version of CUDA, so I'm not sure whether that works.
  3. We now have to manually annotate the thrust functions with an executor. This might actually be a feature, since it could let us choose the backend depending on the workload.
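
A minimal sketch of the caching idea from item 1 (hypothetical names; the real implementation is in utilities/include/vec_dh.h): appends go to a plain std::vector on the host, and the universal_vector is only filled by one bulk insert before use, sidestepping the slow per-element push_back.

```cpp
#include <cstddef>
#include <thrust/universal_vector.h>
#include <vector>

// Hypothetical illustration of the push_back workaround: buffer appends in a
// host-side std::vector, then move them into the universal_vector with a
// single bulk insert.
template <typename T>
class CachedVec {
 public:
  void reserve(std::size_t n) { cache_.reserve(n); }

  // Fast path: append to the host cache, never to the universal_vector,
  // whose per-element push_back is very slow (the upstream thrust issue).
  void push_back(const T& v) { cache_.push_back(v); }

  // Flush the cache with one bulk copy before any bulk or device-side use.
  thrust::universal_vector<T>& synced() {
    if (!cache_.empty()) {
      vec_.insert(vec_.end(), cache_.begin(), cache_.end());
      cache_.clear();
    }
    return vec_;
  }

 private:
  thrust::universal_vector<T> vec_;
  std::vector<T> cache_;
};
```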

Benchmark (note that my computer only has 8 GB of RAM; at nTri = 8388608 the previous version was hitting swap and was very slow, so the improvement there is not actually as high as it looks):

====== CPP, Host Device Vector =======

nTri = 512, time = 0.00166264 sec
nTri = 2048, time = 0.00448657 sec
nTri = 8192, time = 0.0138731 sec
nTri = 32768, time = 0.0546407 sec
nTri = 131072, time = 0.219077 sec
nTri = 524288, time = 0.898074 sec
nTri = 2097152, time = 3.75272 sec
nTri = 8388608, time = 29.3012 sec
	Command being timed: "./perfTest"
	User time (seconds): 25.34
	System time (seconds): 7.71
	Percent of CPU this job got: 79%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:41.75
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 6927824
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 7640
	Minor (reclaiming a frame) page faults: 6355521
	Voluntary context switches: 8696
	Involuntary context switches: 2278
	Swaps: 0
	File system inputs: 464256
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

====== CPP, Unified Memory =======

nTri = 512, time = 0.00234139 sec
nTri = 2048, time = 0.00402178 sec
nTri = 8192, time = 0.0129426 sec
nTri = 32768, time = 0.0496874 sec
nTri = 131072, time = 0.201387 sec
nTri = 524288, time = 0.824124 sec
nTri = 2097152, time = 3.29724 sec
nTri = 8388608, time = 14.9187 sec
	Command being timed: "./perfTest"
	User time (seconds): 22.84
	System time (seconds): 3.47
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:26.31
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 5378220
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 2
	Minor (reclaiming a frame) page faults: 4284852
	Voluntary context switches: 2
	Involuntary context switches: 79
	Swaps: 0
	File system inputs: 136
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0



====== OMP, Host Device Vector =======
nTri = 512, time = 0.00135095 sec
nTri = 2048, time = 0.00298485 sec
nTri = 8192, time = 0.00762013 sec
nTri = 32768, time = 0.0314378 sec
nTri = 131072, time = 0.134311 sec
nTri = 524288, time = 0.606134 sec
nTri = 2097152, time = 2.51857 sec
nTri = 8388608, time = 23.9099 sec
	Command being timed: "./perfTest"
	User time (seconds): 141.85
	System time (seconds): 16.85
	Percent of CPU this job got: 483%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:32.84
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 7058568
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 18459
	Minor (reclaiming a frame) page faults: 7395832
	Voluntary context switches: 20115
	Involuntary context switches: 10690
	Swaps: 0
	File system inputs: 966904
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

====== OMP, Unified Memory =======
nTri = 512, time = 0.00122855 sec
nTri = 2048, time = 0.00253237 sec
nTri = 8192, time = 0.00616688 sec
nTri = 32768, time = 0.0220143 sec
nTri = 131072, time = 0.0980233 sec
nTri = 524288, time = 0.463016 sec
nTri = 2097152, time = 1.98967 sec
nTri = 8388608, time = 8.25589 sec
	Command being timed: "./perfTest"
	User time (seconds): 106.40
	System time (seconds): 11.07
	Percent of CPU this job got: 765%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.34
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 5657400
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 15
	Minor (reclaiming a frame) page faults: 5149280
	Voluntary context switches: 232
	Involuntary context switches: 1642
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0



====== CUDA, Host Device Vector =======
nTri = 512, time = 0.00396219 sec
nTri = 2048, time = 0.00530954 sec
nTri = 8192, time = 0.00963976 sec
nTri = 32768, time = 0.0254835 sec
nTri = 131072, time = 0.0854017 sec
nTri = 524288, time = 0.319703 sec
nTri = 2097152, time = 1.2162 sec
(OOMed)
	Command being timed: "./perfTest"
	User time (seconds): 2.95
	System time (seconds): 0.87
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.83
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1379072
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1
	Minor (reclaiming a frame) page faults: 854796
	Voluntary context switches: 82
	Involuntary context switches: 9
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

====== CUDA, Unified Memory =======
nTri = 512, time = 0.0123526 sec
nTri = 2048, time = 0.0141026 sec
nTri = 8192, time = 0.0195818 sec
nTri = 32768, time = 0.0374303 sec
nTri = 131072, time = 0.120768 sec
nTri = 524288, time = 0.266435 sec
nTri = 2097152, time = 0.924646 sec
nTri = 8388608, time = 4.01982 sec
	Command being timed: "./perfTest"
	User time (seconds): 5.78
	System time (seconds): 1.68
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.48
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1744420
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 12576
	Minor (reclaiming a frame) page faults: 216926
	Voluntary context switches: 133
	Involuntary context switches: 49
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

elalish (Owner) left a comment:

Excellent job working around thrust's bug. The improvements here in performance, memory, and code quality are huge! Let's clean it up a little and merge it.

-VecDH<bool> wholeHalfedgeP(inP_.halfedge_.size(), true);
-VecDH<bool> wholeHalfedgeQ(inQ_.halfedge_.size(), true);
+VecDH<char> wholeHalfedgeP(inP_.halfedge_.size(), true);
+VecDH<char> wholeHalfedgeQ(inQ_.halfedge_.size(), true);
elalish (Owner):

I suppose because CUDA doesn't support addressing single bits? Works for me.

pca006132 (Collaborator, Author) commented May 15, 2022:

No, this is because std::vector<T> is used for the cache, and you know the problem with std::vector<bool>: it's a bit-packed specialization, so its elements aren't addressable like a normal array.
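
For reference, a standalone illustration of the std::vector<bool> problem (plain standard C++, not project code):

```cpp
#include <vector>

int main() {
  std::vector<char> chars(8, 1);
  char* p = chars.data();  // fine: contiguous storage, one byte per element
  (void)p;

  std::vector<bool> bits(8, true);
  // bool* q = bits.data();  // ill-formed: the bool specialization is
  //                         // bit-packed and has no data() member at all.
  auto ref = bits[0];  // operator[] yields a proxy object, not bool&
  ref = false;         // writes through the proxy into the packed bits
  return 0;
}
```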

* performance.
* Note that it is *NOT SAFE* to first obtain a host(device) pointer, perform
* some device(host) modification, and then read the host(device) pointer again
* (on the same vector). The memory will be inconsistent in that case.
elalish (Owner):

👍

pca006132 (Collaborator, Author):

I'm also wondering whether it would be good to add a simple wrapper over some thrust functions like for_each that chooses whether to run on the host or on the device. Since we use unified memory now, it is safe to run them on the host whenever they can run on the device.
IIRC thrust can also specify which backend to use, but I have not yet tried that. I guess it would be useful for implementing a dynamic backend.
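
A minimal sketch of such a wrapper (hypothetical; the size threshold is an assumed heuristic, not anything this PR implements):

```cpp
#include <cstddef>
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>

// Hypothetical dispatch wrapper: with unified memory the same iterators are
// valid on both sides, so small workloads can run on the host (avoiding
// kernel-launch overhead) while large ones go to the device. The functor
// must be marked __host__ __device__ for this to compile for both targets.
template <typename Iter, typename F>
void for_each_auto(Iter first, Iter last, F f) {
  constexpr std::size_t kDeviceThreshold = 1 << 16;  // assumed cutoff
  if (static_cast<std::size_t>(last - first) < kDeviceThreshold) {
    thrust::for_each(thrust::host, first, last, f);
  } else {
    thrust::for_each(thrust::device, first, last, f);
  }
}
```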

elalish (Owner) commented May 15, 2022:

A wrapper for a dynamic backend could be cool. How would you choose between device and host? Length of vector? Let's save that for a follow-on PR.

pca006132 (Collaborator, Author):

For the dynamic backend, the idea is just to use an enum for the backend tag and fall back to a supported backend when the specified one is not available. It seems that thrust uses a distinct type for each execution policy, so we can't do this through their API directly; we'd have to write some templates that generate code for the different backends and select the appropriate one at runtime.
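
A sketch of what that could look like (hypothetical design; MANIFOLD_USE_CUDA is an assumed build flag used purely for illustration):

```cpp
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>

enum class Backend { Host, Cuda };

// Hypothetical runtime dispatch: each thrust execution policy is a distinct
// type, so a runtime tag has to branch to separately instantiated calls.
// If the requested backend was not compiled in, fall back to one that was.
template <typename Iter, typename F>
void for_each_backend(Backend b, Iter first, Iter last, F f) {
#ifdef MANIFOLD_USE_CUDA  // assumed build flag
  if (b == Backend::Cuda) {
    thrust::for_each(thrust::device, first, last, f);
    return;
  }
#endif
  thrust::for_each(thrust::host, first, last, f);  // always-available fallback
}
```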

pca006132 (Collaborator, Author) commented May 16, 2022:

One issue with the CUDA backend: allocation of VecDH is now slower than before, and I'm not sure why.

It seems to be generating a lot of page faults that require IO. Investigating...

#include <iostream>
#include <thrust/universal_vector.h>
#include <thrust/execution_policy.h>
#include "structs.h"
elalish (Owner):

Do we need structs or iostream in here?

@@ -18,6 +18,7 @@
#include <thrust/remove.h>
#include <thrust/sort.h>
#include <thrust/unique.h>
#include <thrust/execution_policy.h>
elalish (Owner):

nit: can remove this too

elalish (Owner) left a comment:

Looking great, thank you!

elalish merged commit bfc8e5c into elalish:master on May 17, 2022.
pca006132 deleted the universal-vector branch on December 22, 2022 at 05:35.
cartesian-theatrics pushed a commit to SovereignShop/manifold referencing this pull request on Mar 11, 2024.