Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Materialize scan results correctly when columns are not present in the segments #16619

Merged
merged 7 commits into from
Jun 23, 2024

Conversation

LakshSingla
Copy link
Contributor

@LakshSingla LakshSingla commented Jun 17, 2024

Description

The query engine is unable to estimate the correct size in bytes of the subquery results when the scan query has columns which are missing from the segments. This is because the ScanQueryEngine receives all the columns of the scan query, and populates the row signature with null type if its unable to find the column in the segment.

This PR modifies the materializing logic to materialize the results of the columns whose types are known, and check that the columns whose types are unknown always have null values. This is helpful because:
a. If the type is unknown and the column contains all null values, we don't need to materialize the results
b. If the type is unknown and the column contains non-null values in any row, we are running into the case of missing types, and we should throw an error.

Release note

Fixes a bug causing maxSubqueryBytes to not work when segments have missing columns.


Key changed/added classes in this PR
  • MyFoo
  • OurBar
  • TheirBaz

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

while (populateCursor()) { // Do till we don't have any more rows, or the next row isn't compatible with the current row
if (!frameWriter.addSelection()) { // Add the cursor's row to the frame, till the frame is full
break;
}

for (Integer columnNumber : nullTypedColumns) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: I wonder why use a fastutil IntList - if it gets iterated with a foreach ; plain get?
this could be moved into some method like validateRow - that will naturally do a CSE of the currentRows.get(currentRowIndex) so that it will be only evaluated once

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason to use FastUtil IntList as such. I just thought it might be faster to create than an arraylist.

this could be moved into some method like validateRow - that will naturally do a CSE of the currentRows.get(currentRowIndex) so that it will be only evaluated once

It is getting evaluated once here right? Unless I misinterpreted your comment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was just a note; this loop is validating one row; but to access that it has to do a function call currentRows.get(currentRowIndex) ; which became part of the loop body - moving it into a method could make it clear that it works on a row - and it will naturally remove the currentRows.get(currentRowIndex) as that's the row :)

@@ -200,26 +229,33 @@ public FrameSignaturePair next()
// start all the processing
populateCursor();
boolean firstRowWritten = false;
// While calling populateCursor() repeatedly, currentRowSignature might change. Therefore we store the signature
// While calling populateCursor() repeatedly, currentRowSignature might change. Therefore, we store the signature
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

....what if the signature changes - is that a problem? shouldn't that be an Exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there are two cursors, CursorA with RowSignatureA and CursorB with RowSignatureB and the cursor is at the last row of CursorA, populate call will return false, i.e. the two cursors cannot be batched together, and set currentRowSignature to the RowSignatureB (i.e. prepare the variables for the next write). We still want to return the old frame with the old signature therefore we need to preserve the signature with which we have written the frame.
Per your previous suggestion, frameWriterFactory.signature() would be sufficient and cleaner, and I will use that instead.

Copy link
Member

@kgyrtkirk kgyrtkirk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good - left some minor notes

while (populateCursor()) { // Do till we don't have any more rows, or the next row isn't compatible with the current row
if (!frameWriter.addSelection()) { // Add the cursor's row to the frame, till the frame is full
break;
}

for (Integer columnNumber : nullTypedColumns) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was just a note; this loop is validating one row; but to access that it has to do a function call currentRows.get(currentRowIndex) ; which became part of the loop body - moving it into a method could make it clear that it works on a row - and it will naturally remove the currentRows.get(currentRowIndex) as that's the row :)

firstRowWritten = true;
// Check that the columns with the null types are actually null before advancing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: isn't this comment misplaced? (note: this detail is not necessary - but it could live as an apidoc of the validateRow if that would be around)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaned up the code

@LakshSingla
Copy link
Contributor Author

Thanks for the review! @kgyrtkirk

@LakshSingla LakshSingla merged commit 00c9643 into apache:master Jun 23, 2024
86 of 87 checks passed
@LakshSingla LakshSingla deleted the missing-col-frames branch June 23, 2024 17:45
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants