Skip reading null data if the property is known to have no nulls #1945
Conversation
Force-pushed from f6eb709 to 732229d
Codecov Report — Patch coverage:

@@ Coverage Diff @@
##           master    #1945      +/-   ##
==========================================
+ Coverage   89.68%   89.70%   +0.02%
==========================================
  Files         877      879       +2
  Lines       32319    32394      +75
==========================================
+ Hits        28984    29058      +74
- Misses       3335     3336       +1
==========================================

☔ View full report in Codecov by Sentry.
This PR looks good to me. But before we merge, can we get some performance numbers for `COPY` and read-only queries? We should document them in the PR description so they can easily be traced back. (So I'm holding this PR for now; also see my comments for necessary changes.)

We can run `COPY` and a set of queries on our local machine (document your configuration when reporting). The dataset is LDBC SF10, and if you prefer a larger one, LDBC SF30 should be good enough. The comparison is between this branch (after you rebase on the latest master) and the latest master branch.

We should report the `COPY` time of several relatively large node tables, including Comment, Post, and Person. For read-only queries, we should report end-to-end query evaluation time (using our benchmark tool is the best option) for the LSQB queries (see `test_files/lsqb`) and some LDBC queries (see `test_files/ldbc`).

Let's discuss further once you get started on this. Thanks!
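The repeated-run timing described above can be sketched as a small harness. This is a minimal illustration only, not the project's actual benchmark tool; the `benchmark` helper and the stand-in workload are hypothetical placeholders for the real `COPY`/query calls:

```python
import statistics
import time


def benchmark(fn, iterations=50, warmup=3):
    """Time fn() repeatedly; return (mean, stdev) of wall-clock seconds."""
    for _ in range(warmup):
        fn()  # warm OS caches so runs measure steady-state cost, not cold I/O
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


# Stand-in workload; replace with the actual COPY or query invocation.
mean, stdev = benchmark(lambda: sum(range(100_000)))
print(f"mean={mean:.6f}s stdev={stdev:.6f}s")
```

Reporting the standard deviation alongside the mean makes it easier to tell whether a branch-vs-master difference exceeds run-to-run noise.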
Force-pushed from c75ee27 to 4c9f231
I tried benchmarking on progressively larger versions of the LDBC dataset, but there weren't any consistent speedups, which I eventually realised after examining the
Force-pushed from 383ce82 to 0991de8
I played a bit more with benchmarks, but I can't see any clear performance benefits that stand out from the variation between runs with the LDBC datasets. However, I'm not sure we should really expect a significant improvement: the benchmarks are run repeatedly, so the actual cost of reading the data from disk/SSD is presumably minimized by OS caching, and null data is packed into a single bit per value, so the total data to read is 64 times smaller than for a 64-bit integer property.

The best I've managed was to create a dataset with an int64 primary key, a bool column, and a huge number of entries. (I was hoping that it might be able to skip reading the primary key, as the query in question was just counting the number of nodes where the boolean is true, but since the performance difference is so small, I guess it must be reading the primary keys too.) I ran it twice (50 iterations each, with standard deviation indicated by the black bars), since there is still some significant variation between runs. Increasing to 4 boolean columns, to increase the ratio of null data to regular data, makes the difference a little more pronounced:
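The 64× figure above comes straight from the bit widths involved; a quick back-of-envelope check, assuming a packed 1-bit-per-value null bitmap and 8-byte integers (the node count is an arbitrary illustration):

```python
NUM_VALUES = 10_000_000  # hypothetical node count

int64_bytes = NUM_VALUES * 8        # 8 bytes per 64-bit integer value
null_bitmap_bytes = NUM_VALUES / 8  # 1 bit per value, packed 8 per byte

ratio = int64_bytes / null_bitmap_bytes
print(ratio)  # → 64.0
```

So even if the null reads are skipped entirely, the saved bytes are a small fraction of what the int64 column costs, which is consistent with the benchmark differences being lost in run-to-run noise.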