-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parallel Hash Index #2615
Parallel Hash Index #2615
Conversation
35d746a
to
07e5966
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #2615 +/- ##
==========================================
- Coverage 93.40% 91.31% -2.10%
==========================================
Files 1041 1046 +5
Lines 39002 39181 +179
==========================================
- Hits 36431 35779 -652
- Misses 2571 3402 +831 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks! It will be good to keep a summary of our benchmark result here as a record, which can be revisited easily. Besides, can you also open an issue on the index file size bloating problems when the table is small?
auto guard = inMemOverflowFile ? | ||
std::make_optional<MutexGuard<InMemFile>>(inMemOverflowFile->lock()) : | ||
std::nullopt; | ||
auto memFile = guard ? guard->get() : nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you open an issue to optimize locks on memFile for string keys insertion? We can discuss how to do this. One way is to let each thread grab a page at a time, so reducing the total num of lock acquisitions.
@@ -44,6 +44,7 @@ struct KUZU_API SystemConfig { | |||
*/ | |||
class Database { | |||
friend class EmbeddedShell; | |||
friend class ClientContext; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this change here for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to make ClientContext get the correct number of threads (i.e. the number from the system config).
Forgot to bump the storage version.. |
0b568be
to
327adbf
Compare
SplitMix64 is an excellent integer hashing function. According to [this blog][1], it is the main function to beat in terms of hashing. It is simple and provides much better output than our previous ones. In particular, this function does a good job of shuffling the higher bits of the output, a property critical for the new hash index design.
The design is quite simple: every hash index is now represented internally as 256 hash indexes. This way, when copying, we can easily operator on multiple indexes at once without locking.
327adbf
to
90c2afc
Compare
This also moves index building to its own file. Future work may move it to its own standalone operator. These changes break RDF tests, so they have been disabled. They cause higher memory usage, so LDBC and LSQB buffer pool sizes have been adjusted. They vastly increase the performance - ingesting 100 million integers from a parquet file with 64 threads takes about 90 seconds on master, but about 5 seconds with this change.
90c2afc
to
da0e70f
Compare
Dirty Benchmark Results on loading of 100 million integers from a parquet file: master, 64 threads: 90s. LDBC 10GB: LDBC 10GB Parquet: Notably, the remaining bottleneck in our copy pipeline is counting the rows in the CSV file. |
Depends on #2612, #2613, and #2614. Disables 3 RDF tests. Commit log is below.