Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallel Hash Index #2615

Merged
merged 3 commits into from
Jan 3, 2024
Merged

Parallel Hash Index #2615

merged 3 commits into from
Jan 3, 2024

Conversation

Riolku
Copy link
Collaborator

@Riolku Riolku commented Dec 23, 2023

Depends on #2612, #2613, and #2614. Disables 3 RDF tests. Commit log is below.

processor: use queue-based index building

This also moves index building to its own file. Future work may move it
to its own standalone operator.

These changes break RDF tests, so they have been disabled. They cause
higher memory usage, so LDBC and LSQB buffer pool sizes have been
adjusted. They vastly increase the performance - ingesting 100 million
integers from a parquet file with 64 threads takes about 90 seconds on
master, but about 5 seconds with this change.

storage: use parallel hash index

The design is quite simple: every hash index is now represented
internally as 256 hash indexes. This way, when copying, we can easily
operator on multiple indexes at once without locking.

function: use splitmix64 for hashing

SplitMix64 is an excellent integer hashing function. According to [this
blog][1], it is the main function to beat in terms of hashing. It is
simple and provides much better output than our previous ones.

In particular, this function does a good job of shuffling the higher
bits of the output, a property critical for the new hash index design.

@Riolku Riolku force-pushed the multiple-hash-index-no-distribution branch from 35d746a to 07e5966 Compare December 23, 2023 03:35
@Riolku Riolku changed the title Multiple hash index no distribution Parallel Hash Index Dec 23, 2023
Copy link

codecov bot commented Dec 23, 2023

Codecov Report

Attention: 20 lines in your changes are missing coverage. Please review.

Comparison is base (508dc50) 93.40% compared to head (da0e70f) 91.31%.

Files Patch % Lines
...rc/processor/operator/persistent/index_builder.cpp 89.88% 9 Missing ⚠️
src/include/storage/index/hash_index_builder.h 57.89% 8 Missing ⚠️
src/common/file_system/local_file_system.cpp 0.00% 1 Missing ⚠️
src/processor/operator/persistent/copy_node.cpp 96.87% 1 Missing ⚠️
src/storage/index/hash_index_builder.cpp 97.56% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2615      +/-   ##
==========================================
- Coverage   93.40%   91.31%   -2.10%     
==========================================
  Files        1041     1046       +5     
  Lines       39002    39181     +179     
==========================================
- Hits        36431    35779     -652     
- Misses       2571     3402     +831     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@ray6080 ray6080 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks! It will be good to keep a summary of our benchmark result here as a record, which can be revisited easily. Besides, can you also open an issue on the index file size bloating problems when the table is small?

auto guard = inMemOverflowFile ?
std::make_optional<MutexGuard<InMemFile>>(inMemOverflowFile->lock()) :
std::nullopt;
auto memFile = guard ? guard->get() : nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you open an issue to optimize locks on memFile for string keys insertion? We can discuss how to do this. One way is to let each thread grab a page at a time, so reducing the total num of lock acquisitions.

@@ -44,6 +44,7 @@ struct KUZU_API SystemConfig {
*/
class Database {
friend class EmbeddedShell;
friend class ClientContext;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this change here for?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to make ClientContext get the correct number of threads (i.e. the number from the system config).

src/include/storage/index/hash_index_builder.h Outdated Show resolved Hide resolved
src/processor/operator/persistent/index_builder.cpp Outdated Show resolved Hide resolved
src/processor/operator/persistent/index_builder.cpp Outdated Show resolved Hide resolved
@Riolku
Copy link
Collaborator Author

Riolku commented Dec 29, 2023

Forgot to bump the storage version..

@Riolku Riolku force-pushed the multiple-hash-index-no-distribution branch 4 times, most recently from 0b568be to 327adbf Compare December 29, 2023 05:08
SplitMix64 is an excellent integer hashing function. According to [this
blog][1], it is the main function to beat in terms of hashing. It is
simple and provides much better output than our previous ones.

In particular, this function does a good job of shuffling the higher
bits of the output, a property critical for the new hash index design.
The design is quite simple: every hash index is now represented
internally as 256 hash indexes. This way, when copying, we can easily
operator on multiple indexes at once without locking.
@Riolku Riolku force-pushed the multiple-hash-index-no-distribution branch from 327adbf to 90c2afc Compare January 3, 2024 06:09
This also moves index building to its own file. Future work may move it
to its own standalone operator.

These changes break RDF tests, so they have been disabled. They cause
higher memory usage, so LDBC and LSQB buffer pool sizes have been
adjusted. They vastly increase the performance - ingesting 100 million
integers from a parquet file with 64 threads takes about 90 seconds on
master, but about 5 seconds with this change.
@Riolku Riolku force-pushed the multiple-hash-index-no-distribution branch from 90c2afc to da0e70f Compare January 3, 2024 06:15
@Riolku
Copy link
Collaborator Author

Riolku commented Jan 3, 2024

Dirty Benchmark Results on loading of 100 million integers from a parquet file:

master, 64 threads: 90s.
branch, 1 thread: 30s (cache locality!)
branch, 2 threads: 18s
branch, 4 threads: 14s
branch: 8 threads: 10s
branch, 64 threads: 8s

LDBC 10GB:
master, 64 threads: 15s
branch, 64 threads: 10s

LDBC 10GB Parquet:
master, 64 threads: 14s
branch, 64 threads: 7.2s

Notably, the remaining bottleneck in our copy pipeline is counting the rows in the CSV file.

@Riolku Riolku merged commit 226b441 into master Jan 3, 2024
14 checks passed
@Riolku Riolku deleted the multiple-hash-index-no-distribution branch January 3, 2024 07:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants