gzp spawns one fewer thread than there are CPUs, which hurts performance #11
Thanks for the detailed issue! I think this comes down to accounting and the fact that […]. Looking at pigz, it oversubscribes threads: https://github.com/madler/pigz/blob/b6da942b9ca15eb9149837f07b2b3b6ff21d9845/pigz.c#L2206, in that it will spawn as many threads as there are cores, plus a writer thread, plus the main thread.
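The pigz scheme described above can be sketched with standard-library threads. This is a simplified model, not gzp's or pigz's actual code, and the "compression" step is a stand-in that just records block sizes:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// pigz-style accounting: one worker per core does the heavy lifting, a
// separate writer thread reorders and emits results, and the main thread
// only reads input and dispatches blocks.
fn run_pipeline(blocks: Vec<Vec<u8>>, workers: usize) -> Vec<(usize, usize)> {
    let (work_tx, work_rx) = mpsc::channel::<(usize, Vec<u8>)>();
    let work_rx = Arc::new(Mutex::new(work_rx));
    let (done_tx, done_rx) = mpsc::channel::<(usize, usize)>();

    let mut handles = Vec::new();
    for _ in 0..workers {
        let rx = Arc::clone(&work_rx);
        let tx = done_tx.clone();
        handles.push(thread::spawn(move || loop {
            // lock only long enough to pull one block off the queue
            let msg = rx.lock().unwrap().recv();
            match msg {
                // stand-in for compress(): report the block's size
                Ok((idx, block)) => tx.send((idx, block.len())).unwrap(),
                Err(_) => break, // channel closed: no more input
            }
        }));
    }
    drop(done_tx); // writer's iterator ends once all workers finish

    // writer thread: collect results and restore input order
    let writer = thread::spawn(move || {
        let mut out: Vec<(usize, usize)> = done_rx.iter().collect();
        out.sort_by_key(|&(idx, _)| idx);
        out
    });

    for (idx, block) in blocks.into_iter().enumerate() {
        work_tx.send((idx, block)).unwrap();
    }
    drop(work_tx); // close the queue so workers exit
    for h in handles {
        h.join().unwrap();
    }
    writer.join().unwrap()
}

fn main() {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let sizes = run_pipeline(vec![vec![0u8; 128]; 8], cores);
    assert_eq!(sizes, (0..8usize).map(|i| (i, 128usize)).collect::<Vec<_>>());
    println!("{} compression threads + writer + main", cores);
}
```

The point of the oversubscription is that the writer and main threads mostly block on I/O and channels, so giving every core a compression worker keeps all cores busy.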
Do you have any thoughts on what would make the most sense for gzp? What I'm thinking at the moment is that I'll change the documentation / function names so that […]. This branch of […] |
The profile linked above shows that the main thread and writer thread together do not occupy an entire core, so one core out of 4 ends up being mostly idle in my configuration. crc32 and writing are very fast, it seems. I believe it's best to match the number of compression threads to the number of CPUs, as pigz already does. |
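Sizing the pool to the CPU count is a one-liner with the Rust standard library. This is a sketch of the policy being proposed, not gzp's actual code:

```rust
use std::thread;

// Match compression threads to logical CPUs, as pigz does, instead of
// subtracting one for the writer thread.
fn compression_threads() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

fn main() {
    let n = compression_threads();
    assert!(n >= 1);
    println!("spawning {n} compression threads");
}
```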
Ah! Sorry about that, changes have been pushed. |
I don't have that exact setup anymore, but I've tried it on the same machine with a different Linux OS and the results are inconclusive: having more threads seems to help on higher compression ratios but hinder at lower ones. Benches, with the same Shakespeare file repeated 100 times:
pigz 2.4 from Ubuntu 18.04 repos is still faster:
|
Perhaps the difference in performance comes down to differences in the underlying zlib implementation? Is there a flag for […]? |
Weird. Here are my results, not in a thread-limited environment, but limiting with flags:
That's […]. Results for […]:
Even giving both […]. The default zlib library for crabz is zlib-ng. So to get apples-to-apples (ish? does pigz link to system zlib?) zlib, change the […]. |
I can't imagine that it matters that much, but what version of rust are you running? |
Right now comparing with pigz 2.4 installed via apt on Ubuntu 18.04. For crabz I use a git checkout and then […]. |
Overcommit seems to help my 4-core system, but just barely:
In single-threaded mode […]:
But pigz overtakes crabz when using 4 threads:
Removing overcommit hurts performance slightly in the case of […]:
I've tried the Rust backend instead of zlib-ng and saw the exact same performance in both single-threaded and multi-threaded mode. So I guess the actionable takeaways are:
I'll test a dual-core system next and see if 1 or 2 threads works best there. |
I just pushed a new commit to […]. This gave an appreciable performance bump on my system. Building after […]. I agree on point 2 though. I want to re-test, now that work is getting to the compressors faster, whether the zlib library really makes no difference; but if the gap is narrow enough I'd rather have an all-Rust backend. Thanks for sharing the profile info, looking at that now. |
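The effect of buffer sizing on dispatch can be illustrated with std's bounded channels. This is a generic sketch of the idea, not gzp's actual change; the capacity is the number of blocks that can sit queued between the reader and a compression worker:

```rust
use std::sync::mpsc;
use std::thread;

// Push `n` fixed-size blocks through a bounded queue and return the total
// bytes the worker consumed. A small capacity stalls the reader; a larger
// one keeps the worker fed.
fn pump(n: usize, block_len: usize, capacity: usize) -> usize {
    let (tx, rx) = mpsc::sync_channel::<Vec<u8>>(capacity);

    let worker = thread::spawn(move || {
        let mut total = 0usize;
        while let Ok(block) = rx.recv() {
            total += block.len(); // stand-in for compressing the block
        }
        total
    });

    for _ in 0..n {
        // with a small capacity this send blocks until the worker catches up
        tx.send(vec![0u8; block_len]).unwrap();
    }
    drop(tx); // close the queue so the worker exits
    worker.join().unwrap()
}

fn main() {
    assert_eq!(pump(100, 1024, 32), 100 * 1024);
    println!("ok");
}
```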
On my quad-core Ryzen overcommit is a toss-up. However, preliminary results indicate that having 2 compression threads on a dual-core system increases performance dramatically. I'll post the full dual-core results shortly. Full timings from the quad-core Ryzen with the buffer size changes:
|
Having 2 compression threads instead of 1 seems to be greatly beneficial on a dual-core system. On a dual-core AMD Stoney Ridge system:
|
Here's a profile of the latest code on my 4-core machine with 4 threads: https://share.firefox.dev/2WnspHl I've also enabled debug info in release mode to make the profile more detailed. |
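Enabling debug info for release builds, as mentioned above, is done through Cargo's profile settings:

```toml
# Cargo.toml: keep optimizations but emit debug info so perf and the
# Firefox Profiler can symbolicate release builds.
[profile.release]
debug = true
```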
Weird. I'm not sure what else to try at the moment to figure out why […]. |
As to why, I see that […]. This indicates that […]. |
That makes sense... and is a flaw.
|
I just pushed another set of changes to […]. |
Oh yeah, that did the trick for the quad-core! All 4 compression threads are utilized now, and performance is either on par with pigz (for […]). |
The dual-core system took a noticeable hit, but still beats […]:
Also, why do you use flume? |
Dual-core profiles: |
The "after" profile shows the time spent in Flume in the checksumming thread go up from 1s to 1.8s, so I wonder if crossbeam-deque would perform better under the high contention? Since it's already in the binary because of rayon, it's probably worth a shot. |
flume was a holdover from the initial versions of gzp. I did try one more thing though, which stripped out rayon entirely. I think it should bring back that 2-core performance. The cost is that instead of letting rayon manage a threadpool, this keeps […]. Same branch if you are interested! Also, thanks for bearing with me through this, your feedback has been extremely helpful 👍 |
I'll test it out! If crossbeam-deque and flume provide identical performance, I'd stick with flume because it has dramatically less unsafe code in it. Crossbeam is really quite complex due to the custom lock-free algorithms, and it's all unsafe code, naturally. If that complexity can be avoided, I'm all for it. |
Also, speaking of dependencies, I've run […]. |
On my quad-core […]:
When I see numbers like these I usually assume I messed up correctness and the program actually does less work than it's supposed to. But no, the round-tripped file decompresses to the original data correctly! 🎉 🚀 🥳 |
Dual-core is back to the original numbers for […]:
|
That's awesome! I've moved to flume only. I need to do more rigorous testing to decide on a default backend between zlib-ng, zlib, and the rust backend. Regarding […]. Thanks again for working on this! I'll be putting out new releases of both crabz and gzp […]. |
Thanks to you for acting on this! |
See […]. |
I've run some tests comparing crabz to pigz using the benchmarking setup described in the crabz readme. On a 4-core system with no hyperthreading, crabz was measurably slower.
I've profiled both using perf, and it turned out that crabz spends the vast majority of the time in zlib compression, so parallelization overhead is not an issue. However, crabz only spawned 3 threads performing compression while pigz spawned 4 compression threads. After passing -p3 to pigz so that it would only spawn 3 compression threads, the compression time became identical to crabz.
I suspect this is also why you're not seeing any parallelization gains on dual-core systems.
Technical details
crabz profile: https://share.firefox.dev/3zeVRxN
pigz profile: https://share.firefox.dev/2WeYe4V
crabz installed via cargo install crabz on a clean Ubuntu 20.04 installation; pigz installed via apt.