
perf: use spdk mempool per-core cache for io objects pool #1612

Merged · 1 commit merged into openebs:develop on Mar 26, 2024

Conversation

@dsharma-dc (Contributor):

No description provided.

Signed-off-by: Diwakar Sharma <diwakar.sharma@datacore.com>
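
The PR has no written description, so for context: the title refers to SPDK's per-core mempool cache, enabled by passing a non-zero cache_size when the io-object pool is created. Below is a minimal sketch of the API involved; the actual io-engine call site, pool name, and sizing are assumptions for illustration.

```c
/* Sketch only: illustrates the SPDK mempool API the PR title refers to.
 * The pool name, count, and object size here are hypothetical. */
#include <spdk/env.h>

struct spdk_mempool *
create_io_obj_pool(size_t count, size_t obj_size)
{
    /* A non-zero cache_size makes the mempool keep a per-core cache of free
     * objects, so most get/put operations avoid touching the shared ring.
     * SPDK_MEMPOOL_DEFAULT_CACHE_SIZE asks SPDK to pick its default size
     * (the discussion below talks about a 512-object cache). */
    return spdk_mempool_create("io_obj_pool",
                               count,
                               obj_size,
                               SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
                               SPDK_ENV_SOCKET_ID_ANY);
}
```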
@dsharma-dc (Contributor, Author):

With a simple fio run locally against a malloc'ed pool (4k block size, 8 fio jobs, iodepth 32, randwrite workload), I see throughput improvements as shown below. This particular snippet shows a stark improvement; the gap isn't always this large, but across a number of runs the cached version is mostly better. (A representative fio invocation follows the results.)

With per-core cache:
Jobs: 8 (f=8): [w(8)][79.2%][w=834MiB/s][w=213k IOPS][eta 00m:25s]
Jobs: 8 (f=8): [w(8)][89.9%][w=787MiB/s][w=201k IOPS][eta 00m:12s]

Vanilla:
Jobs: 8 (f=8): [w(8)][76.7%][w=576MiB/s][w=148k IOPS][eta 00m:28s]
Jobs: 8 (f=8): [w(8)][89.9%][w=537MiB/s][w=137k IOPS][eta 00m:12s]
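
For reference, a representative fio invocation matching the parameters above (a sketch: the ioengine, direct-I/O flag, 60s runtime, and target path are assumptions, not taken from the comment):

```bash
# Hypothetical target path; the actual device under test is not given above.
fio --name=randw --filename=/dev/nbd0 \
    --rw=randwrite --bs=4k --numjobs=8 --iodepth=32 \
    --ioengine=libaio --direct=1 --time_based --runtime=60 --group_reporting
```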

@tiagolobocastro (Contributor):

Hmm, I see slightly slower casperf performance with a single-core null device. For example:
Total : 10909417 IO/s 5326: MB/s
Total : 11440796 IO/s 5586: MB/s
I haven't checked multi-core yet; I think I've got a patch somewhere to test that.

@dsharma-dc (Contributor, Author):

> Hmm, I see slightly slower casperf performance with a single-core null device […]

I wouldn't expect this to show improvements with a single core, because the caching is unnecessary in the single-core case; since the cache holds only 512 objects, there will be more cache misses. With multiple cores it will help by keeping threads from dipping their hands into the common pool and contending with each other.
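
To make that argument concrete, here is a minimal conceptual model of a per-core cache in front of a shared pool (illustration only, not the actual SPDK/DPDK implementation; all names are hypothetical, and a mutex stands in for the real lock-free shared ring):

```c
#include <stddef.h>
#include <pthread.h>

#define CACHE_SIZE 512 /* per-core cache capacity, matching the 512 objects above */

struct shared_pool {
    pthread_mutex_t lock;  /* stand-in for the contended shared structure */
    void          **objs;
    size_t          count;
};

struct core_cache {
    void  *objs[CACHE_SIZE];
    size_t count;
};

/* Allocate: serve from the local cache when possible; only on a miss does a
 * core touch the shared pool and contend with the other cores. */
void *pool_get(struct shared_pool *p, struct core_cache *c)
{
    if (c->count > 0)
        return c->objs[--c->count];  /* fast path: core-local, no locking */

    void *obj = NULL;
    pthread_mutex_lock(&p->lock);    /* slow path: dip into the common pool */
    if (p->count > 0)
        obj = p->objs[--p->count];
    pthread_mutex_unlock(&p->lock);
    return obj;
}

/* Free: return to the local cache; spill to the shared pool only when full.
 * The quicker objects come back, the likelier the next pool_get() hits the
 * fast path. */
void pool_put(struct shared_pool *p, struct core_cache *c, void *obj)
{
    if (c->count < CACHE_SIZE) {
        c->objs[c->count++] = obj;   /* fast path */
        return;
    }
    pthread_mutex_lock(&p->lock);
    p->objs[p->count++] = obj;       /* spill back to the common pool */
    pthread_mutex_unlock(&p->lock);
}
```

On a single core there is no contention to avoid, so the extra layer only adds bookkeeping, which is consistent with the flat or slightly worse single-core numbers reported here.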

@tiagolobocastro (Contributor):

> I wouldn't expect this to show improvements with a single core, because the caching is unnecessary in the single-core case […]

Yes, but it seems it's decreasing single-core performance. Or am I missing something?

@dsharma-dc (Contributor, Author) commented on Mar 25, 2024:

> Yes, but it seems it's decreasing single-core performance. Or am I missing something?

Hmm, I'm unclear on what our most reliable benchmark is. I ran the same test as earlier, now on a single-core io-engine instance (3 runs), and I see this:

With cache:
WRITE: bw=417MiB/s (438MB/s), IOPS ~107k, 48.6MiB/s-56.0MiB/s (51.0MB/s-58.7MB/s), io=24.5GiB (26.3GB), run=60001-60002msec
WRITE: bw=418MiB/s (438MB/s), IOPS ~108k, 49.1MiB/s-57.6MiB/s (51.5MB/s-60.4MB/s), io=24.5GiB (26.3GB), run=60001-60001msec
WRITE: bw=423MiB/s (443MB/s), IOPS ~108k, 45.7MiB/s-60.1MiB/s (48.0MB/s-63.0MB/s), io=24.8GiB (26.6GB), run=60001-60002msec

Vanilla:
WRITE: bw=404MiB/s (424MB/s), IOPS ~103k, 46.3MiB/s-53.9MiB/s (48.6MB/s-56.6MB/s), io=23.7GiB (25.4GB), run=60001-60002msec
WRITE: bw=410MiB/s (430MB/s), IOPS ~105k, 46.1MiB/s-57.4MiB/s (48.4MB/s-60.2MB/s), io=24.0GiB (25.8GB), run=60001-60002msec
WRITE: bw=411MiB/s (431MB/s), IOPS ~105k, 47.9MiB/s-53.8MiB/s (50.2MB/s-56.5MB/s), io=24.1GiB (25.9GB), run=60001-60001msec

@dsharma-dc (Contributor, Author):

Also, theoretically I would think that read workloads should see more improvement: reads generally have a shorter path length, which means cached objects are returned to the cache sooner, and hence there is less chance of dipping into the common pool.

@tiagolobocastro (Contributor):

With fio I seem to get consistent results for multi-core: always ~10k IOPS more with the cache, at a cost of 3x2MiB hugepages (4-core config), so the tradeoff seems worth it!
Single core with fio seems a little more volatile, sometimes better, sometimes worse, so perhaps it was just noise after all.
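
For reference, a back-of-the-envelope check on that memory cost (an assumption: the extra hugepages cover the additional elements reserved so the pool stays usable while objects sit in the per-core caches):

$$
\frac{3 \times 2\,\mathrm{MiB}}{4\ \text{cores} \times 512\ \text{objects/core}} = \frac{6\,\mathrm{MiB}}{2048\ \text{objects}} \approx 3\,\mathrm{KiB}\ \text{per cached object}
$$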

@dsharma-dc (Contributor, Author):

bors merge

@bors-openebs-mayastor (bot):

Build succeeded:

bors-openebs-mayastor (bot) merged commit bf6450d into openebs:develop on Mar 26, 2024. 4 checks passed.