Skip reading null data if the property is known to have no nulls #1945
Conversation
Force-pushed from f6eb709 to 732229d
Codecov Report — Patch coverage:

@@ Coverage Diff @@
##           master    #1945      +/-   ##
==========================================
+ Coverage   89.68%   89.70%   +0.02%
==========================================
  Files         877      879       +2
  Lines       32319    32394      +75
==========================================
+ Hits        28984    29058      +74
- Misses       3335     3336       +1
==========================================

☔ View full report in Codecov by Sentry.
This PR looks good to me. But before we merge, can we get some performance numbers for `COPY` and read-only queries? We should document them in the PR description so they can easily be traced back. (So I'm holding this PR for now; also see my comments for necessary changes.)

We can run `COPY` and a set of queries on our local machine (document your configuration when reporting). The dataset is LDBC SF10, and if you prefer a larger one, LDBC SF30 should be good enough. The comparison is between this branch (after you rebase on the latest master) and the latest master branch.

We should report the `COPY` time of several relatively large node tables, including Comment, Post, and Person. For read-only queries, we should report end-to-end query evaluation time (using our benchmark tool is the best option) for the LSQB queries (see `test_files/lsqb`) and some LDBC queries (see `test_files/ldbc`).

Let's discuss further once you get started on this. Thanks!
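The repeated-run timing described above can be sketched as a small harness. This is a minimal illustration only, not the project's actual benchmark tool; the `benchmark` helper and the stand-in workload are hypothetical placeholders for the real `COPY`/query calls:

```python
import statistics
import time


def benchmark(fn, iterations=50, warmup=3):
    """Time fn() repeatedly; return (mean, stdev) of wall-clock seconds."""
    for _ in range(warmup):
        fn()  # warm OS caches so runs measure steady-state cost, not cold I/O
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


# Stand-in workload; replace with the actual COPY or query invocation.
mean, stdev = benchmark(lambda: sum(range(100_000)))
print(f"mean={mean:.6f}s stdev={stdev:.6f}s")
```

Reporting the standard deviation alongside the mean makes it easier to tell whether a branch-vs-master difference exceeds run-to-run noise.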
Force-pushed from c75ee27 to 4c9f231
I tried benchmarking on progressively larger versions of the LDBC dataset, but there weren't any consistent speedups, which I eventually realised after examining the
Force-pushed from 383ce82 to 0991de8
I played a bit more with benchmarks, but I can't see any clear performance benefits that stand out from the variation between runs with the LDBC datasets. However, I'm not sure we should really expect a significant improvement: the benchmarks are run repeatedly, so the actual cost of reading the data from disk/SSD is presumably minimized by OS caching, and null data is packed into a single bit per value, so the total data to read is 64 times smaller than for a 64-bit integer property.

The best I've managed was to create a dataset with an int64 primary key, a bool column, and a huge number of entries. (I was hoping that it might be able to skip reading the primary key, as the query in question was just counting the number of nodes where the boolean is true, but since the performance difference is so small, I guess it must be reading the primary keys too.) I ran it twice (50 iterations each, with standard deviation indicated by the black bars), since there is still some significant variation between runs. Increasing to 4 boolean columns, to increase the ratio of null data to regular data, makes the difference a little more pronounced:
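The 64× figure above comes straight from the bit widths involved; a quick back-of-envelope check, assuming a packed 1-bit-per-value null bitmap and 8-byte integers (the node count is an arbitrary illustration):

```python
NUM_VALUES = 10_000_000  # hypothetical node count

int64_bytes = NUM_VALUES * 8        # 8 bytes per 64-bit integer value
null_bitmap_bytes = NUM_VALUES / 8  # 1 bit per value, packed 8 per byte

ratio = int64_bytes / null_bitmap_bytes
print(ratio)  # → 64.0
```

So even if the null reads are skipped entirely, the saved bytes are a small fraction of what the int64 column costs, which is consistent with the benchmark differences being lost in run-to-run noise.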