Sequential read for StringPropertyList #1327

Merged on Mar 14, 2023 (1 commit)

Collaborator

@rfdavid rfdavid commented Feb 28, 2023

This PR implements an overflow page cache in the DiskOverflowFile::readStringToVector function, in the same way it was implemented in scanSequentialStringOverflow, so that it can be reused in different parts of the code.
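
As a rough illustration of why such a cache helps sequential reads, here is a minimal sketch that counts buffer-manager pins with and without remembering the last pinned page. Everything in it (MockBufferManager, PageCache, getFrame, the two counting functions) is a hypothetical stand-in written for this example and is not the project's actual API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a buffer manager; it only counts pin/unpin calls.
struct MockBufferManager {
    explicit MockBufferManager(std::size_t numPages)
        : frames(numPages, std::vector<uint8_t>(4096, 0)) {}
    const uint8_t* pin(uint32_t pageIdx) {
        ++pinCalls;
        return frames.at(pageIdx).data();
    }
    void unpin(uint32_t /*pageIdx*/) { ++unpinCalls; }
    int pinCalls = 0;
    int unpinCalls = 0;
    std::vector<std::vector<uint8_t>> frames;
};

constexpr uint32_t INVALID_PAGE = UINT32_MAX;

// Remembers the most recently pinned page (assumed layout, for illustration).
struct PageCache {
    uint32_t pageIdx = INVALID_PAGE;
    const uint8_t* frame = nullptr;
};

// Returns the frame for pageIdx, re-pinning only when the page changes.
const uint8_t* getFrame(MockBufferManager& bm, PageCache& cache, uint32_t pageIdx) {
    if (cache.pageIdx != pageIdx) {
        if (cache.pageIdx != INVALID_PAGE) {
            bm.unpin(cache.pageIdx);
        }
        cache.frame = bm.pin(pageIdx);
        cache.pageIdx = pageIdx;
    }
    return cache.frame;
}

// Touches one byte per access through the cache; returns the number of pins.
int countPinsWithCache(MockBufferManager& bm, const std::vector<uint32_t>& accesses) {
    PageCache cache;
    const int before = bm.pinCalls;
    for (uint32_t pageIdx : accesses) {
        const uint8_t* frame = getFrame(bm, cache, pageIdx);
        (void)frame[0];
    }
    if (cache.pageIdx != INVALID_PAGE) {
        bm.unpin(cache.pageIdx); // release the last cached page
    }
    return bm.pinCalls - before;
}

// Pins and unpins on every access; returns the number of pins.
int countPinsWithoutCache(MockBufferManager& bm, const std::vector<uint32_t>& accesses) {
    const int before = bm.pinCalls;
    for (uint32_t pageIdx : accesses) {
        const uint8_t* frame = bm.pin(pageIdx);
        (void)frame[0];
        bm.unpin(pageIdx);
    }
    return bm.pinCalls - before;
}
```

When consecutive strings sit on the same overflow page, the cached variant pins once per page rather than once per string, which is the effect the cache is meant to capture.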

Benchmark

  • Node table with two fields, STRING and INT64: CREATE NODE TABLE City(name STRING, population INT64, PRIMARY KEY (name))
  • 986125 rows in total
  • name string has ~ 25kb for each node property
  • Processor: M2 Max, 32 GB RAM

Results of SCAN_NODE_PROPERTY after running PROFILE MATCH (c:City) RETURN c.name; 5 times on master branch and on this branch:

Run     1          2         3         4         5
Master  37181.326  1571.216  1405.098  1409.182  1417.050
Branch  37127.368  1562.977  1395.112  1518.187  1394.889

Related to #756 (note: this only addresses string overflow, not lists)

I have read and agree to the terms under CLA.md

void DiskOverflowFile::readStringsToVector(TransactionType trxType, ValueVector& valueVector) {
assert(!valueVector.state->isFlat());
OverflowCache overflowCache;
Contributor

I'm slightly concerned that creating this object could add overhead. Compilers usually optimize this kind of thing away, so it shouldn't matter.

But can you benchmark it a bit to see whether we become slower when we wrap things in a struct? If we do become slower, I would use three primitive types instead.
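
One way to check this is a small micro-benchmark that runs the same loop with the state held in a struct and with the state held in three local variables. The sketch below is only an assumption of how such a comparison could be set up: CacheState, scanWithStruct, scanWithLocals, and timeMillis are hypothetical names, and the loop is a simplified stand-in for the real read path.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

using Pages = std::vector<std::vector<uint8_t>>;

// Hypothetical three-field cache state, wrapped in a struct.
struct CacheState {
    uint32_t pageIdx;
    uint32_t fileId;
    const uint8_t* frame;
};

// Variant 1: the cached state lives in a struct.
uint64_t scanWithStruct(const Pages& pages, const std::vector<uint32_t>& accesses) {
    CacheState cache{UINT32_MAX, 0, nullptr};
    uint64_t sum = 0;
    for (uint32_t idx : accesses) {
        if (cache.pageIdx != idx) {
            cache.pageIdx = idx;
            cache.fileId = 0;
            cache.frame = pages[idx].data();
        }
        sum += cache.frame[0] + cache.fileId;
    }
    return sum;
}

// Variant 2: the same state held in three primitive locals.
uint64_t scanWithLocals(const Pages& pages, const std::vector<uint32_t>& accesses) {
    uint32_t pageIdx = UINT32_MAX;
    uint32_t fileId = 0;
    const uint8_t* frame = nullptr;
    uint64_t sum = 0;
    for (uint32_t idx : accesses) {
        if (pageIdx != idx) {
            pageIdx = idx;
            fileId = 0;
            frame = pages[idx].data();
        }
        sum += frame[0] + fileId;
    }
    return sum;
}

// Returns the wall-clock time of fn() in milliseconds.
template <typename Fn>
double timeMillis(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Timing each variant over a large access list, with optimizations enabled and several repetitions, would indicate whether the struct wrapper costs anything measurable; the returned sums should be used (for example, printed) so the compiler cannot discard the loops.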

Contributor

@ray6080 ray6080 left a comment

Thanks! @rfdavid Looks good to me.
Just two minor comments:

  1. In the description you wrote "name string has ~ 25kb for each node property". Did you mean 25 bytes? I don't think we can store strings whose length is 25 KB right now.
  2. Can you add one more performance comparison, this time for reading lists of strings? My understanding is that we should see some performance gains in that case.

We can take another quick look, then we can get this in. Thanks!

void readStringsToVector(
transaction::TransactionType trxType, common::ValueVector& valueVector);
void readStringToVector(transaction::TransactionType trxType, common::ku_string_t& kuStr,
struct OverflowPageCache {
Contributor

Can you move this struct to private? I guess it doesn't need to be exposed as public for now.

Collaborator Author

Hi! Thank you so much for your comments.

  1. You are right. I actually created a CSV with a very long string, but after querying, I realized that only 4096 bytes were stored.
  2. I'll benchmark it soon. I believe the numbers should be better, considering the cache is implemented for the child type STRING (unless the object creation is causing some overhead, as mentioned by @andyfengHKU).

unpinOverflowPageCache(overflowPageCache);
overflowPageCache.frame = bufferManager.pin(*fileHandleToPin, pageIdxToPin);
overflowPageCache.fileHandle = fileHandleToPin;
overflowPageCache.pageIdx = pageIdxToPin;
Contributor

I would consider wrapping these three lines in a pinOverflowPageCache function, but I don't have a strong opinion on this. Up to you.

        overflowPageCache.frame = bufferManager.pin(*fileHandleToPin, pageIdxToPin);
        overflowPageCache.fileHandle = fileHandleToPin;
        overflowPageCache.pageIdx = pageIdxToPin;

Collaborator Author

I think that makes sense: I already have unpinOverflowPageCache, so a matching pin helper definitely improves readability. Thanks!
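
The pin/unpin pairing discussed in this thread could look roughly like the sketch below. The FileHandle and BufferManager types are hypothetical mocks, and the helper signatures (including the switchPage wrapper) are assumptions made so the example is self-contained; the merged code may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical mock of a file handle that owns its pages.
struct FileHandle {
    std::vector<std::vector<uint8_t>> pages;
};

// Hypothetical mock buffer manager that tracks the number of pinned pages.
struct BufferManager {
    int pinned = 0;
    uint8_t* pin(FileHandle& fh, uint32_t pageIdx) {
        ++pinned;
        return fh.pages.at(pageIdx).data();
    }
    void unpin(FileHandle& /*fh*/, uint32_t /*pageIdx*/) { --pinned; }
};

// Field names follow the snippet under review; the types are assumed.
struct OverflowPageCache {
    FileHandle* fileHandle = nullptr;
    uint32_t pageIdx = UINT32_MAX;
    uint8_t* frame = nullptr;
};

// Releases the cached page, if any, and resets the cache.
void unpinOverflowPageCache(BufferManager& bm, OverflowPageCache& cache) {
    if (cache.pageIdx != UINT32_MAX) {
        bm.unpin(*cache.fileHandle, cache.pageIdx);
        cache = OverflowPageCache{};
    }
}

// Wraps the three assignments from the review comment in one helper.
void pinOverflowPageCache(BufferManager& bm, FileHandle* fileHandleToPin,
    uint32_t pageIdxToPin, OverflowPageCache& cache) {
    cache.frame = bm.pin(*fileHandleToPin, pageIdxToPin);
    cache.fileHandle = fileHandleToPin;
    cache.pageIdx = pageIdxToPin;
}

// Hypothetical caller: swaps the cached page only when the target changes.
void switchPage(BufferManager& bm, FileHandle* fh, uint32_t pageIdx,
    OverflowPageCache& cache) {
    if (cache.fileHandle == fh && cache.pageIdx == pageIdx) {
        return; // cache hit, nothing to pin
    }
    unpinOverflowPageCache(bm, cache);
    pinOverflowPageCache(bm, fh, pageIdx, cache);
}
```

With both helpers in place, the call site reads as an unpin followed by a pin, and at most one overflow page stays pinned through the cache at a time.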

Also: I have read and agree to the terms under CLA.md
@andyfengHKU andyfengHKU merged commit c8a4542 into kuzudb:master Mar 14, 2023
3 participants