More efficient ColumnChunk string dictionary caching #2994
Merged
This came out of #2979: while testing in-memory compression there, I found that the original implementation of the DictionaryChunk's index performed poorly on that dataset (largely because the size and number of chunks in the partitioner magnified the per-entry overhead of the dictionary's cache).
This changes the DictionaryChunk's indexTable to store just the index, using custom comparison and hashing functions instead of keeping a second copy of the string. Besides removing that extra copy of the string data, it reduces the memory overhead of the cache from 32 bytes per entry (`sizeof(std::string)`, though for short strings that may also include the string data itself via the small-string optimization, so the saving won't necessarily be >= 28 bytes per string in the dictionary) to just 4 bytes per entry, plus whatever overhead `std::unordered_set` itself requires, which hasn't really changed.
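
For readers unfamiliar with the trick, here is a minimal sketch of an index-only table along these lines. The class and member names (`StringDictionary`, `kProbe`, etc.) are illustrative, not the actual DictionaryChunk code: the set stores 4-byte indices, custom hash/equality functors resolve an index back to its string, and a sentinel index lets us look up a string that hasn't been inserted yet without materializing a `std::string`:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_set>
#include <vector>

class StringDictionary {
 public:
  StringDictionary() = default;
  // The hash/equality functors capture `this`, so copying or moving the
  // dictionary would leave them dangling; disable both in this sketch.
  StringDictionary(const StringDictionary&) = delete;
  StringDictionary& operator=(const StringDictionary&) = delete;

  // Returns the dictionary index for `value`, inserting it if unseen.
  std::uint32_t GetOrAdd(std::string_view value) {
    probe_ = value;  // the looked-up string lives here during find/insert
    auto it = index_table_.find(kProbe);
    if (it != index_table_.end()) return *it;
    auto index = static_cast<std::uint32_t>(strings_.size());
    strings_.emplace_back(value);
    index_table_.insert(index);  // the set stores only the 4-byte index
    return index;
  }

  const std::string& Get(std::uint32_t index) const { return strings_[index]; }

 private:
  // Sentinel meaning "compare against probe_, not a stored string".
  static constexpr std::uint32_t kProbe = UINT32_MAX;

  std::string_view View(std::uint32_t i) const {
    return i == kProbe ? probe_ : std::string_view(strings_[i]);
  }

  // Custom hash/equality resolve an index back to its string, so the
  // set never needs a second copy of the string data.
  struct Hash {
    const StringDictionary* dict;
    std::size_t operator()(std::uint32_t i) const {
      return std::hash<std::string_view>{}(dict->View(i));
    }
  };
  struct Eq {
    const StringDictionary* dict;
    bool operator()(std::uint32_t a, std::uint32_t b) const {
      return dict->View(a) == dict->View(b);
    }
  };

  std::vector<std::string> strings_;
  std::string_view probe_;
  std::unordered_set<std::uint32_t, Hash, Eq> index_table_{
      16, Hash{this}, Eq{this}};
};
```

Rehashing stays safe because the functors recompute hashes from the stored strings, which don't move when the vector grows (only the `std::string` objects do, and `View` re-reads them each time).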
I did a few benchmarks: copying 60,000,000 strings containing random numbers improved from ~25s to ~16s for long strings (~20 characters) and from 12s to 10s for short strings. Both datasets should have no duplicates, so this tests the worst case where every entry gets added to the dictionary; the test table has one string column with a serial primary key.