Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip reading null data if the property is known to have no nulls #1945

Merged
merged 3 commits into from
Aug 31, 2023

Conversation

benjaminwinger
Copy link
Collaborator

No description provided.

@codecov
Copy link

codecov bot commented Aug 16, 2023

Codecov Report

Patch coverage: 95.91% and project coverage change: +0.02% 🎉

Comparison is base (58ee272) 89.68% compared to head (0991de8) 89.70%.
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1945      +/-   ##
==========================================
+ Coverage   89.68%   89.70%   +0.02%     
==========================================
  Files         877      879       +2     
  Lines       32319    32394      +75     
==========================================
+ Hits        28984    29058      +74     
- Misses       3335     3336       +1     
Files Changed Coverage Δ
src/include/common/null_mask.h 100.00% <ø> (ø)
src/include/transaction/transaction.h 100.00% <ø> (ø)
src/storage/copier/column_chunk.cpp 84.09% <75.00%> (+0.15%) ⬆️
src/common/null_mask.cpp 93.24% <80.00%> (-0.96%) ⬇️
...storage/store/nodes_statistics_and_deleted_ids.cpp 84.61% <90.90%> (+0.61%) ⬆️
src/storage/store/node_column.cpp 78.24% <91.17%> (-0.99%) ⬇️
src/include/common/vector/value_vector.h 100.00% <100.00%> (ø)
src/include/storage/copier/column_chunk.h 100.00% <100.00%> (ø)
src/include/storage/store/node_column.h 100.00% <100.00%> (ø)
...e/storage/store/nodes_statistics_and_deleted_ids.h 100.00% <100.00%> (+5.45%) ⬆️
... and 10 more

... and 18 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@ray6080 ray6080 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks good to me. But before we merge, can we get some performance number regarding to COPY and read-only queries? and we should document them within the PR description so it can be easily traced back. (Thus, I'm holding this PR for now, and also see my comments for necessary changes)

We can run the COPY and a set of queries on our local machine (document your configuration when reporting). The dataset is ldbc SF10, and if you prefer a larger one, ldbc SF30 should be good enough. The comparison is between this branch (after you rebase latest master), and the latest master branch.
We should report the COPY time of several relatively large node tables, including Comment, Post, and Person. For read-only queries, we should report end-to-end query evaluation time (use our benchmark tool is the best option) for LSQB queries (see test_files/lsqb) and some LDBC queries (see test_files/ldbc).

Let's discuss further once you get started on this. Thanks!

src/include/storage/copier/column_chunk.h Outdated Show resolved Hide resolved
src/include/storage/copier/column_chunk.h Outdated Show resolved Hide resolved
src/include/storage/store/table_statistics.h Outdated Show resolved Hide resolved
src/include/storage/store/table_statistics.h Show resolved Hide resolved
src/storage/store/property_statistics.cpp Outdated Show resolved Hide resolved
src/include/storage/store/table_statistics.h Outdated Show resolved Hide resolved
src/storage/store/node_column.cpp Outdated Show resolved Hide resolved
src/storage/store/node_column.cpp Outdated Show resolved Hide resolved
src/storage/store/node_column.cpp Outdated Show resolved Hide resolved
@benjaminwinger benjaminwinger force-pushed the null-read-opt branch 2 times, most recently from c75ee27 to 4c9f231 Compare August 23, 2023 13:45
@benjaminwinger
Copy link
Collaborator Author

i tried benchmarking on progressively larger versions of the LDBC dataset, but there weren't any consistent speedups, which I eventually realised after examining the nodes.statistics_and_deleted.ids file was because every single property was being marked as having nulls, which I'm fairly sure shouldn't be correct, so something is wrong, though I'm not sure what yet.

@benjaminwinger
Copy link
Collaborator Author

benjaminwinger commented Aug 30, 2023

I played a bit more with benchmarks, but I can't see any clear performance benefits which are noticeable through the variation between runs with the LDBC datasets.

However I'm not sure we should really expect a significant performance improvement. The benchmarks are run repeatedly, so the actual cost of reading the data from disk/ssd is presumably minimized by OS caching, while null data is packed into a single bit per value, so the total data to read is, well, 64 times smaller than for a 64-bit integer property.

The best I've managed was to create a dataset with an int64 primary key and a bool and a huge number of entries (I was hoping that it might be able to skip reading the primary key, as the query in question was just counting the number of nodes where the boolean is true, but since the performance difference is so small, I guess it must be reading the primary keys too). I ran it twice (50 iterations each, with standard deviation indicated by the black bars), since there is still some significant variation between runs.
null-benchmark.pdf.
The performance benefit is small, but noticeable.

Increasing to 4 boolean columns to increase the ratio of null data to regular data makes the difference a little more pronounced:
null-benchmark-4-bools.pdf.

@benjaminwinger benjaminwinger merged commit 9aaff49 into master Aug 31, 2023
9 of 10 checks passed
@benjaminwinger benjaminwinger deleted the null-read-opt branch August 31, 2023 16:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants