Improve efficiency of merging bulk insertions into the hash index #3403

benjaminwinger · 2024-04-29T15:01:29Z

This improves the performance of bulk inserts by avoiding re-hashing entries multiple times and by using a merge method whose cost scales with the number of entries being inserted rather than with the number of slots on disk.
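As a rough illustration of the idea (not the code in this PR), the sketch below stores each entry's hash when it is first buffered and reuses it at merge time, then visits only the on-disk slots that actually receive entries. `PendingEntry`, `ToyDiskIndex`, `toyHash`, and `mergeBulk` are hypothetical names invented for this example, and the "disk" is just an in-memory vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical types for illustration only; these are not Kuzu's data structures.
struct PendingEntry {
    uint64_t key;
    uint64_t hash; // computed once when the entry is buffered, reused at merge time
};

struct ToyDiskIndex {
    std::vector<std::vector<uint64_t>> slots; // stand-in for the on-disk slots
    size_t slotsVisited = 0;                  // counts slots touched by merges

    explicit ToyDiskIndex(size_t numSlots) : slots(numSlots) {}
};

// Placeholder hash function (an assumption, chosen only to spread keys out).
inline uint64_t toyHash(uint64_t key) {
    return key * 0x9E3779B97F4A7C15ULL;
}

// Merge buffered entries into the index. The work is proportional to the
// number of pending entries (plus grouping overhead), not to slots.size().
inline void mergeBulk(ToyDiskIndex& index, const std::vector<PendingEntry>& pending) {
    // Group keys by destination slot using the cached hash; nothing is re-hashed.
    std::map<size_t, std::vector<uint64_t>> keysBySlot;
    for (const auto& entry : pending) {
        keysBySlot[entry.hash % index.slots.size()].push_back(entry.key);
    }
    // Touch only the slots that received at least one entry.
    for (const auto& [slotId, keys] : keysBySlot) {
        index.slotsVisited++;
        auto& slot = index.slots[slotId];
        slot.insert(slot.end(), keys.begin(), keys.end());
    }
}
```

With this shape, merging 10 entries into an index with 1024 slots touches at most 10 slots, whereas a merge that walks every slot would touch all 1024 regardless of the batch size.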

The difference is somewhat less than I was expecting for small inserts, and the second copy performance has degraded since the last time I recorded it in #2938 (comment). I suspect this is due at least in part to the locking added in #3388.

Copy benchmarks (all are copies into a 60 million node table containing a single integer primary key column):

| Number of nodes inserted | Before | After |
|---|---|---|
| 1 | ~530ms | ~500ms |
| 1024 | ~590ms | ~530ms |
| 60000 | ~1650ms | ~1400ms |
| 600000 | ~8400ms | ~5200ms |
| 6000000 | ~12400ms | ~7600ms |
| 60000000 | ~21600ms | ~17300ms |

Edit: The numbers above are for a second copy run in the same process. Subsequent testing has revealed that running the second copy in a new process (this was done with kuzu_shell) yields significantly better performance for small copies (e.g. ~20ms for a single tuple instead of ~500ms), but that was also the case before this change.

Copying a single node is more or less identical in performance to inserting, so I will work on merging the implementations in the hash index to just use the bulk storage. That will be a larger and messier PR though.

ray6080 · 2024-04-29T20:03:11Z

src/storage/index/hash_index.cpp

+                    if (!diskSlotPage) {
+                        diskSlotPage = diskSlotId / NUM_SLOTS_PER_PAGE;
+                    }
+                    if (diskSlotId / NUM_SLOTS_PER_PAGE == diskSlotPage) {


Though it may not make a big difference, could we stop early if the page of diskSlotId is not equal to diskSlotPage?

No, even if one of the in-memory slots doesn't have any elements that match a given page, later ones could.
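A small sketch of why stopping early would drop entries, under assumptions that are not taken from the PR: each in-memory slot is modelled as a list of target disk slot IDs, `NUM_SLOTS_PER_PAGE` is set to 4, and `diskSlotPage` is a `std::optional`. The function `collectForFirstPage` and its `earlyStop` flag are hypothetical and exist only to compare the two behaviours.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical value chosen for this example only.
constexpr uint64_t NUM_SLOTS_PER_PAGE = 4;

// Each inner vector models one in-memory slot and holds the disk slot IDs
// its entries are destined for. Returns the disk slot IDs that fall on the
// same page as the first entry seen.
inline std::vector<uint64_t> collectForFirstPage(
    const std::vector<std::vector<uint64_t>>& inMemSlots, bool earlyStop) {
    std::optional<uint64_t> diskSlotPage;
    std::vector<uint64_t> matched;
    for (const auto& slot : inMemSlots) {
        bool anyMatch = false;
        for (uint64_t diskSlotId : slot) {
            if (!diskSlotPage) {
                diskSlotPage = diskSlotId / NUM_SLOTS_PER_PAGE;
            }
            if (diskSlotId / NUM_SLOTS_PER_PAGE == *diskSlotPage) {
                matched.push_back(diskSlotId);
                anyMatch = true;
            }
        }
        // The proposed shortcut: give up as soon as one in-memory slot
        // contributes nothing to the current page.
        if (earlyStop && !anyMatch) {
            break;
        }
    }
    return matched;
}
```

For the input `{{0, 1}, {8, 9}, {2}}`, the second in-memory slot has nothing on page 0 but the third does, so the full scan collects `0, 1, 2` while the early-stopping variant collects only `0, 1`.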

benjaminwinger force-pushed the hash-index-scaling branch from 0630136 to 0e3fa33 Compare April 29, 2024 17:24

Improve efficiency of merging bulk insertions into the hash index

962c329

benjaminwinger force-pushed the hash-index-scaling branch from 0e3fa33 to 962c329 Compare April 29, 2024 18:42

ray6080 approved these changes Apr 29, 2024


ray6080 merged commit 88cc154 into master Apr 29, 2024
17 checks passed

ray6080 deleted the hash-index-scaling branch April 29, 2024 22:05

benjaminwinger mentioned this pull request May 1, 2024

Remove unnecessary calls to WAL::flushAllPages and clear the dirty flag when flushing pages #3427

Merged
