Parallel Hash Index #2615

Riolku · 2023-12-23T03:34:12Z

Depends on #2612, #2613, and #2614. Disables 3 RDF tests. Commit log is below.

processor: use queue-based index building

This also moves index building to its own file. Future work may move it
to its own standalone operator.

These changes break three RDF tests, which have been disabled. They also
raise memory usage, so the LDBC and LSQB buffer pool sizes have been
adjusted. In exchange, performance improves substantially: ingesting
100 million integers from a Parquet file with 64 threads takes about
90 seconds on master but about 5 seconds with this change.

storage: use parallel hash index

The design is quite simple: every hash index is now represented
internally as 256 hash indexes. This way, when copying, we can easily
operate on multiple indexes at once without locking.
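
To make the "256 hash indexes, no locking" idea concrete, here is a minimal sketch. It is not the code in this PR: the class and function names are hypothetical, `std::unordered_map` stands in for the on-disk hash index, the multiplicative hash is a placeholder, and routing by the top 8 bits of the hash is an assumption. The point it illustrates is that each worker owns a disjoint set of sub-indexes, so no two threads ever write to the same one.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative only: names and layout are hypothetical, not the PR's classes.
constexpr std::size_t NUM_SUB_INDEXES = 256;

// Placeholder hash (Fibonacci multiplicative hashing) to keep the sketch self-contained.
inline std::uint64_t placeholderHash(std::uint64_t key) {
    return key * 0x9E3779B97F4A7C15ULL;
}

// Assumption: the top 8 bits of the hash select one of the 256 sub-indexes.
inline std::size_t subIndexOf(std::uint64_t hash) {
    return static_cast<std::size_t>(hash >> 56);
}

class PartitionedIndex {
public:
    using Entry = std::pair<std::uint64_t, std::uint64_t>;

    // Work done by one worker: insert only the entries whose sub-index it owns.
    // Worker w owns sub-indexes w, w + numWorkers, w + 2 * numWorkers, ...
    void buildSlice(const std::vector<Entry>& entries, std::size_t worker,
                    std::size_t numWorkers) {
        assert(numWorkers > 0 && worker < numWorkers);
        for (const auto& entry : entries) {
            const std::size_t idx = subIndexOf(placeholderHash(entry.first));
            if (idx % numWorkers == worker) {
                subIndexes[idx].emplace(entry.first, entry.second);
            }
        }
    }

    // Run all slices in parallel. Ownership is disjoint, so no mutex is needed.
    void build(const std::vector<Entry>& entries, std::size_t numWorkers) {
        std::vector<std::thread> workers;
        for (std::size_t w = 0; w < numWorkers; ++w) {
            workers.emplace_back(
                [this, &entries, w, numWorkers] { buildSlice(entries, w, numWorkers); });
        }
        for (auto& worker : workers) {
            worker.join();
        }
    }

    // A lookup touches exactly one sub-index.
    bool lookup(std::uint64_t key, std::uint64_t& valueOut) const {
        const auto& sub = subIndexes[subIndexOf(placeholderHash(key))];
        const auto it = sub.find(key);
        if (it == sub.end()) {
            return false;
        }
        valueOut = it->second;
        return true;
    }

    std::size_t size() const {
        std::size_t total = 0;
        for (const auto& sub : subIndexes) {
            total += sub.size();
        }
        return total;
    }

private:
    std::array<std::unordered_map<std::uint64_t, std::uint64_t>, NUM_SUB_INDEXES> subIndexes;
};
```

Calling `index.build(entries, 8)` would spawn eight workers. In this sketch every worker scans the full input and skips entries it does not own, which keeps the example short; routing entries to their owner up front would avoid the repeated scans.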

function: use splitmix64 for hashing

SplitMix64 is an excellent integer hashing function. According to [this
blog][1], it is the main function to beat in terms of hashing. It is
simple and provides much better output than our previous ones.

In particular, this function does a good job of shuffling the higher
bits of the output, a property critical for the new hash index design.
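
For reference, the widely published SplitMix64 step looks like the sketch below. Whether the version in this PR includes the initial increment or matches this form exactly is not stated here, so treat it as the textbook function rather than the PR's code. `subIndexFromHash` and `countSubIndexesHit` are hypothetical helpers, and selecting a sub-index from the top 8 bits is an assumption used only to show why well-mixed high bits matter.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Textbook SplitMix64 step applied to a single 64-bit key.
inline std::uint64_t splitmix64(std::uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;                   // golden-ratio increment
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;  // first xor-shift-multiply round
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;  // second xor-shift-multiply round
    return x ^ (x >> 31);                         // final xor-shift
}

// Hypothetical helper: pick one of 256 sub-indexes from the top 8 bits of the hash.
inline std::size_t subIndexFromHash(std::uint64_t hash) {
    return static_cast<std::size_t>(hash >> 56);
}

// Hypothetical helper: count how many distinct sub-indexes the keys 0..numKeys-1 reach.
inline std::size_t countSubIndexesHit(std::uint64_t numKeys) {
    std::array<bool, 256> seen{};
    std::size_t distinct = 0;
    for (std::uint64_t key = 0; key < numKeys; ++key) {
        const std::size_t idx = subIndexFromHash(splitmix64(key));
        assert(idx < seen.size());
        if (!seen[idx]) {
            seen[idx] = true;
            ++distinct;
        }
    }
    return distinct;
}
```

With an identity hash, sequential integer keys would all share the same top byte and pile into a single sub-index. `countSubIndexesHit` gives a quick way to check how evenly a candidate hash spreads such keys.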

codecov · 2023-12-23T03:45:37Z

Codecov Report

Attention: 20 lines in your changes are missing coverage. Please review.

Comparison is base (508dc50) 93.40% compared to head (da0e70f) 91.31%.

| Files | Patch % | Lines |
|---|---|---|
| ...rc/processor/operator/persistent/index_builder.cpp | 89.88% | 9 Missing ⚠️ |
| src/include/storage/index/hash_index_builder.h | 57.89% | 8 Missing ⚠️ |
| src/common/file_system/local_file_system.cpp | 0.00% | 1 Missing ⚠️ |
| src/processor/operator/persistent/copy_node.cpp | 96.87% | 1 Missing ⚠️ |
| src/storage/index/hash_index_builder.cpp | 97.56% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2615      +/-   ##
==========================================
- Coverage   93.40%   91.31%   -2.10%     
==========================================
  Files        1041     1046       +5     
  Lines       39002    39181     +179     
==========================================
- Hits        36431    35779     -652     
- Misses       2571     3402     +831

ray6080

LGTM! Thanks! It would be good to keep a summary of our benchmark results here as a record that can be revisited easily. Also, can you open an issue on the index file size bloat when the table is small?

ray6080 · 2023-12-24T06:16:35Z

src/storage/index/hash_index_builder.cpp

+    auto guard = inMemOverflowFile ?
+                     std::make_optional<MutexGuard<InMemFile>>(inMemOverflowFile->lock()) :
+                     std::nullopt;
+    auto memFile = guard ? guard->get() : nullptr;


Can you open an issue to optimize locking on memFile for string key insertion? We can discuss how to do this. One option is to let each thread grab a page at a time, which reduces the total number of lock acquisitions.
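
A rough sketch of the page-at-a-time idea, to anchor the discussion. Everything here is hypothetical (class names, the 4096-byte page size, the in-memory page store); it does not reflect how `InMemFile` is implemented. Each writer takes the lock only when it needs a fresh page, then appends strings into that private page without locking.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <mutex>
#include <string>
#include <string_view>
#include <vector>

constexpr std::size_t OVERFLOW_PAGE_SIZE = 4096;  // assumed page size

// Where a string lives inside the overflow file.
struct OverflowPtr {
    std::uint32_t pageIdx;
    std::uint32_t offset;
};

// Hypothetical shared file: the mutex is taken once per page, not once per string.
class SharedOverflowFile {
public:
    using Page = std::array<char, OVERFLOW_PAGE_SIZE>;
    struct ClaimedPage {
        std::uint32_t pageIdx;
        char* data;
    };

    ClaimedPage claimPage() {
        std::lock_guard<std::mutex> guard(mtx);
        pages.push_back(std::make_unique<Page>());
        return {static_cast<std::uint32_t>(pages.size() - 1), pages.back()->data()};
    }

    std::size_t numPagesClaimed() const {
        std::lock_guard<std::mutex> guard(mtx);
        return pages.size();
    }

    // Intended for use after writers are done; it does not synchronize with appends.
    std::string read(OverflowPtr ptr, std::size_t length) const {
        std::lock_guard<std::mutex> guard(mtx);
        assert(ptr.pageIdx < pages.size() && ptr.offset + length <= OVERFLOW_PAGE_SIZE);
        return std::string(pages[ptr.pageIdx]->data() + ptr.offset, length);
    }

private:
    mutable std::mutex mtx;
    std::vector<std::unique_ptr<Page>> pages;  // heap pages stay put when the vector grows
};

// One instance per thread: appends go to a privately owned page without locking.
class PageLocalWriter {
public:
    explicit PageLocalWriter(SharedOverflowFile& sharedFile) : file(sharedFile) {}

    OverflowPtr append(std::string_view value) {
        assert(value.size() <= OVERFLOW_PAGE_SIZE);
        if (current.data == nullptr || used + value.size() > OVERFLOW_PAGE_SIZE) {
            current = file.claimPage();  // the only step that takes the lock
            used = 0;
        }
        if (!value.empty()) {
            std::memcpy(current.data + used, value.data(), value.size());
        }
        const OverflowPtr ptr{current.pageIdx, static_cast<std::uint32_t>(used)};
        used += value.size();
        return ptr;
    }

private:
    SharedOverflowFile& file;
    SharedOverflowFile::ClaimedPage current{0, nullptr};
    std::size_t used = 0;
};
```

With 16-byte strings, a writer in this sketch takes the lock once per 256 appends instead of once per append. The cost is unused space at the tail of each writer's last page, which would need weighing against the file size concern raised above.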

ray6080 · 2023-12-24T06:20:14Z

src/include/main/database.h

@@ -44,6 +44,7 @@ struct KUZU_API SystemConfig {
 */
 class Database {
    friend class EmbeddedShell;
+    friend class ClientContext;


What's this change here for?

This lets ClientContext read the correct number of threads (i.e., the number from the system config).

Riolku · 2023-12-29T01:59:34Z

Forgot to bump the storage version.

Riolku · 2024-01-03T07:27:16Z

Rough benchmark results for loading 100 million integers from a Parquet file:

master, 64 threads: 90s
branch, 1 thread: 30s (cache locality!)
branch, 2 threads: 18s
branch, 4 threads: 14s
branch, 8 threads: 10s
branch, 64 threads: 8s

LDBC 10GB:
master, 64 threads: 15s
branch, 64 threads: 10s

LDBC 10GB Parquet:
master, 64 threads: 14s
branch, 64 threads: 7.2s

Notably, the remaining bottleneck in our copy pipeline is counting the rows in the CSV file.

Riolku force-pushed the multiple-hash-index-no-distribution branch from 35d746a to 07e5966 Compare December 23, 2023 03:35

Riolku changed the title ~~Multiple hash index no distribution~~ Parallel Hash Index Dec 23, 2023

ray6080 approved these changes Dec 24, 2023

Riolku force-pushed the multiple-hash-index-no-distribution branch 4 times, most recently from 0b568be to 327adbf Compare December 29, 2023 05:08

Riolku added 2 commits January 3, 2024 01:09

storage: use parallel hash index

d9746ca

Riolku force-pushed the multiple-hash-index-no-distribution branch from 327adbf to 90c2afc Compare January 3, 2024 06:09

Riolku force-pushed the multiple-hash-index-no-distribution branch from 90c2afc to da0e70f Compare January 3, 2024 06:15

This was referenced Jan 3, 2024

Parallel Hash Index Database Bloat #2625

Open

Per-index Memfile Locking Contention #2626

Closed

Riolku merged commit 226b441 into master Jan 3, 2024
14 checks passed

Riolku deleted the multiple-hash-index-no-distribution branch January 3, 2024 07:30
