
Add support to first/last aggregators for numeric types during ingestion #10949

Closed
wants to merge 19 commits

Conversation

FrankChen021
Member

@FrankChen021 commented Mar 5, 2021

Fixes #10702

Description

This PR fixes #10702 by adding support for doubleFirst/floatFirst/longFirst and doubleLast/floatLast/longLast during the ingestion phase. It also reverts #10794 to bring back the UI.

The implementation is inspired by the current stringFirst/stringLast implementation, so the code looks similar. However, this PR does not refactor the current stringFirst/stringLast implementation to share code with double/float/long; that might be done in the future.


Key changed/added classes in this PR

  • AbstractSerializableLongObjectPairSerde is provided to share serialization code for the long/double/float types
  • GenericFirstAggregateCombiner is provided to share first-aggregator code for the long/double/float types
  • GenericLastAggregateCombiner is provided to share last-aggregator code for the long/double/float types (a rough sketch of how these pieces fit together follows below)
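
A rough sketch of the intended structure (illustrative only; the concrete serde mentioned at the end is an example, and the method shapes follow the abstract methods discussed in the review below):

public abstract class AbstractSerializableLongObjectPairSerde<T extends SerializablePair<Long, ?>>
    extends ComplexMetricSerde
{
  // Shared serialization/column handling lives here; each numeric variant only
  // supplies the conversion between a ByteBuffer and its concrete pair type.
  protected abstract T toPairObject(ByteBuffer buffer);

  protected abstract byte[] pairToBytes(T val);
}

// A concrete serde (for example, one for SerializablePairLongDouble) implements just
// those two methods, while GenericFirstAggregateCombiner/GenericLastAggregateCombiner
// reuse the same long-timestamp comparison logic across the long/double/float variants.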

What's not included in this PR

  1. stringFirst/stringLast should also share the three base classes listed above. I will open a new PR for that, to keep the changes in this PR as small as possible.
  2. SQL queries on re-indexed columns with double/float/long first and last aggregators WON'T work. This involves some changes in complex type handling which might be better done in another PR.

Test Scenario

This PR contains UT and IT cases to cover all of the doubleFirst/doubleLast, floatFirst/floatLast, and longFirst/longLast aggregators, including:

  • buffer aggregation
  • aggregate combiner
  • serdes
  • aggregation during indexing
  • aggregation during re-indexing
  • native query

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@FrankChen021
Member Author

One tricky problem remains: first/last aggregators are not supported in SQL queries on re-indexed long/float/double columns, while these aggregators work well in native queries.

The type of re-indexed double/float/long/string first/last columns is marked as COMPLEX in the schema, and the underlying type is lost when the type is converted into a RelDataType.

// Loses information about exactly what kind of complex column this is.

Since the underlying data type of the column is lost during SQL planning, the current EarliestLatestReturnTypeInference cannot infer the correct return type, and so cannot create the correct type of aggregator for double/float/long.

One way I can think of is to define macros such as double_latest for the different data types at the SQL layer.
@gianm @clintropolis Do you have any other suggestions?

@clintropolis
Member

One way I can think of is to define macros such as double_latest for the different data types at the SQL layer.
@gianm @clintropolis Do you have any other suggestions?

Hmm, so what I have had in mind to solve this is to be able to determine whether a RowSignature should be "finalized" or not in terms of the aggregator types. #9638 added some of the pieces needed for this (getFinalizedType, etc.) and touches on this idea in the PR description; I just haven't yet gotten back to this core refactoring work, or quite had time to fully think through how to determine when we need the 'finalized' signature or not.

#10277 also added tracking of the "name" of the complex type on ColumnCapabilities (which typically is what populates the RowSignature) so that is potentially available to give greater detail than ValueType.COMPLEX, but I think the finalized type would be the useful information here.

That said, I haven't had a look at this PR at all yet. I will try to get to it sometime soon, maybe I will have some ideas while looking over the code.

@FrankChen021
Member Author

Hi @clintropolis, thanks for your suggestion. I'll try to solve it.

@clintropolis
Member

Hi @clintropolis, thanks for your suggestion. I'll try to solve it.

Depending on how big of a change this is, it might be worth splitting out a separate PR to go in before this one. I'll try to think about this a bit as well.

@FrankChen021
Member Author

Hmm, so what I have had in mind to solve this is to be able to determine whether a RowSignature should be "finalized" or not in terms of the aggregator types. #9638 added some of the pieces needed for this (getFinalizedType, etc.) and touches on this idea in the PR description; I just haven't yet gotten back to this core refactoring work, or quite had time to fully think through how to determine when we need the 'finalized' signature or not.

#10277 also added tracking of the "name" of the complex type on ColumnCapabilities (which typically is what populates the RowSignature) so that is potentially available to give greater detail than ValueType.COMPLEX, but I think the finalized type would be the useful information here.

That said, I haven't had a look at this PR at all yet. I will try to get to it sometime soon, maybe I will have some ideas while looking over the code.

The name of the complex type has been set in the first/last aggregators:

https://github.com/apache/druid/pull/10949/files#diff-9fedc71bcede0adcbb1deadbef33e9e1de175ee11209eb4d5d676580104f2c03R231

I also checked how RowSignature works today, and found that there's no way to get that name from RowSignature, because the name is not passed in when the RowSignature is instantiated:

rowSignatureBuilder.add(entry.getKey(), valueType);

So, is it reasonable to make some changes here to pass the type name, as well as the value type, to RowSignature when the value type is COMPLEX?
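
To illustrate the question (purely hypothetical; no such overload exists today, and complexTypeName is just a placeholder name):

if (valueType == ValueType.COMPLEX) {
  // hypothetical overload that would carry the complex type name along with COMPLEX
  rowSignatureBuilder.add(entry.getKey(), valueType, complexTypeName);
} else {
  rowSignatureBuilder.add(entry.getKey(), valueType);
}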

@FrankChen021
Member Author

Hi @clintropolis @suneet-s, could you review this PR whenever it's convenient for you? Since this PR is a little large, I think the SQL problem could be split out into another PR.

@clintropolis
Member

Hi @clintropolis @suneet-s, could you review this PR whenever it's convenient for you? Since this PR is a little large, I think the SQL problem could be split out into another PR.

Sorry, I will try to get to this soon! I think I have a similar problem to solve with complex types in something else I'm working on, so I will be thinking about how we can deal with the differences between intermediary and finalized types a bit better as well.

@FrankChen021
Member Author

FrankChen021 commented Apr 1, 2021

Sorry, I will try to get to this soon! I think I have a similar problem to solve with complex types in something else I'm working on, so I will be thinking about how we can deal with the differences between intermediary and finalized types a bit better as well.

If there are any ideas or progress on handling complex types, could you let me know? I'm also working on the SQL problem.

@suneet-s
Contributor

suneet-s commented Apr 1, 2021

@FrankChen021 thanks for bringing this back to the top of my radar. I will look through this over the next week or so.

Contributor

@suneet-s left a comment

Reviewed about 12 files. Posting an incomplete review

import java.nio.ByteBuffer;

/**
* The class serializes a Pair<Long, ?> object for double/float/longFirst and double/float/longLast aggregators
Contributor

Can you describe why you chose not to make SerializablePairLongStringSerde extend this class?

Member Author

It's reasonable to refactor SerializablePairLongStringSerde to be a subclass of AbstractSerializablePairSerde, but I think that kind of change is not tightly related to this PR, which is already a little large. A follow-up PR can resolve this once this PR is merged.

Comment on lines 113 to 115
protected abstract T toPairObject(ByteBuffer buffer);

protected abstract byte[] pairToBytes(T val);
Contributor

Javadocs, please.

Member Author

Done

/**
* The class serializes a Pair<Long, ?> object for double/float/longFirst and double/float/longLast aggregators
*/
public abstract class AbstractSerializablePairSerde<T extends SerializablePair<Long, ?>> extends ComplexMetricSerde
Contributor

nit: change class name to AbstractSerializableLongObjectPairSerde

Suggested change
public abstract class AbstractSerializablePairSerde<T extends SerializablePair<Long, ?>> extends ComplexMetricSerde
public abstract class AbstractSerializableLongObjectPairSerde<T extends SerializablePair<Long, ?>> extends ComplexMetricSerde

Member Author

Done

@Override
public int compare(@Nullable T o1, @Nullable T o2)
{
return Longs.compare(o1.lhs, o2.lhs);
Contributor

This does not correctly handle the case where either o1 or o2 is null. See StringFirstAggregatorFactory#VALUE_COMPARATOR; we'll want similar behavior here.

Would it be possible to update the integration tests that were added to surface this error?
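
For illustration, a null-safe version might look roughly like this (a sketch only; the exact null ordering is an assumption and should mirror StringFirstAggregatorFactory#VALUE_COMPARATOR):

@Override
public int compare(@Nullable T o1, @Nullable T o2)
{
  if (o1 == null) {
    return o2 == null ? 0 : -1; // assumption: nulls sort first
  }
  if (o2 == null) {
    return 1;
  }
  return Longs.compare(o1.lhs, o2.lhs);
}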

Member Author

I will first check whether a UT can catch this error, because the IT cases share some data sources with other cases.


protected abstract T toPairObject(ByteBuffer buffer);

protected abstract byte[] pairToBytes(T val);
Contributor

Suggested change
protected abstract byte[] pairToBytes(T val);
protected abstract byte[] pairToBytes(@Nullable T val);

Member Author

In the new commit, a null check is added in the caller of this method, so it's not necessary to declare this parameter as @Nullable.
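
Roughly, the caller-side check looks like this (shape assumed, not the exact committed code):

@Override
public byte[] toBytes(@Nullable T val)
{
  if (val == null) {
    // assumption: an empty array stands in for a null pair
    return new byte[0];
  }
  return pairToBytes(val);
}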

}

@Override
public byte[] toBytes(T val)
Contributor

Suggested change
public byte[] toBytes(T val)
public byte[] toBytes(@Nullable T val)

Member Author

Done

}

@Override
protected byte[] pairToBytes(SerializablePairLongFloat val)
Contributor

Since the superclass says val can be null, all these implementations should be able to handle a null val.

I haven't dug in yet to know what this means, but this same pattern exists in all 3 implementations.

Member Author

In the new commit, a null check is added in the caller of this method, so there is no need to handle null in this method.

@suneet-s
Contributor

@FrankChen021 To help with reviewing this PR, could you update the PR description to include some notes on how you chose to implement the solution? For example, it looks like AbstractSerializablePairSerde was based on SerializablePairLongStringSerde.

Also, it would help if you included a description of the different scenarios you tested, and of known unsupported conditions, like your comment about first/last aggregators not working in SQL queries.

@FrankChen021
Member Author

@suneet-s Thanks for your review. I will address all the comments you left and update this PR later today.

@FrankChen021
Member Author

@FrankChen021 To help with reviewing this PR, could you update the PR description to include some notes on how you chose to implement the solution? For example, it looks like AbstractSerializablePairSerde was based on SerializablePairLongStringSerde.

Also, it would help if you included a description of the different scenarios you tested, and of known unsupported conditions, like your comment about first/last aggregators not working in SQL queries.

The description of this PR has been updated. Let me know if anything is missing.

@suneet-s
Contributor

@FrankChen021 To help with reviewing this PR, could you update the PR description to include some notes on how you chose to implement the solution? For example, it looks like AbstractSerializablePairSerde was based on SerializablePairLongStringSerde.
Also, it would help if you included a description of the different scenarios you tested, and of known unsupported conditions, like your comment about first/last aggregators not working in SQL queries.

The description of this PR has been updated. Let me know if anything is missing.

Thanks @FrankChen021, I will take a look again this week!

@suneet-s
Contributor

SQL queries on re-indexed columns with double/float/long first and last aggregators WON'T work. This involves some changes in complex type handling which might be better done in another PR.

What happens when a user issues the EARLIEST(expr, maxBytesPerString) function on a longFirst column - do we expect that it will fail? Is the error message in this case clear?

I'm asking because it looks like #10332 added handling of complex type columns, which used to be ok because stringFirst/Last was the only type of complex column. But now that we've introduced these column types, the expected behavior is less clear. Perhaps you can add some tests to CalciteQueryTest to validate the behavior that we want users to see when they issue sql queries on these column types.

@FrankChen021
Member Author

SQL queries on re-indexed columns with double/float/long first and last aggregators WON'T work. This involves some changes in complex type handling which might be better done in another PR.

What happens when a user issues the EARLIEST(expr, maxBytesPerString) function on a longFirst column - do we expect that it will fail? Is the error message in this case clear?

I'm asking because it looks like #10332 added handling of complex type columns, which used to be ok because stringFirst/Last was the only type of complex column. But now that we've introduced these column types, the expected behavior is less clear. Perhaps you can add some tests to CalciteQueryTest to validate the behavior that we want users to see when they issue sql queries on these column types.

EARLIEST/LATEST both work well on stringFirst/Last columns. They also work on columns that are not double/float/longFirst/Last columns.

For double/long/floatFirst/Last columns, the following exception message is returned:

Error: Plan validation failed

org.apache.calcite.runtime.CalciteContextException: From line 3, column 3 to line 3, column 24: Cannot apply 'EARLIEST' to arguments of type 'EARLIEST(<OTHER>)'. Supported form(s): 'EARLIEST(<NUMERIC>)' 'EARLIEST(<BOOLEAN>)' 'EARLIEST(expr, maxBytesPerString)'

org.apache.calcite.tools.ValidationException

Contributor

@abhishekagarwal87 left a comment

@FrankChen021 Thank you for this PR. I have some minor comments. Do you also want to add:

  • unit tests for the serde
  • documentation changes

@Override
public int compare(@Nullable T o1, @Nullable T o2)
{
return getLongObjectPairComparator().compare(o1, o2);
Contributor

It would be nice not to create a new comparator object for each comparison op.

Member Author

Implementations of getLongObjectPairComparator do not create a new comparator object; they hold a static comparator object.
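
For example, the shape is roughly this (field name and generics are illustrative; null handling is omitted in this sketch):

// The comparator is created once and shared, so no allocation happens per comparison.
private static final Comparator<SerializablePairLongDouble> PAIR_COMPARATOR =
    (o1, o2) -> Longs.compare(o1.lhs, o2.lhs);

@Override
protected Comparator<SerializablePairLongDouble> getLongObjectPairComparator()
{
  return PAIR_COMPARATOR;
}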

}

@Override
public ObjectStrategy<T> getObjectStrategy()
Contributor

I think it will be cleaner to have each subclass implement getObjectStrategy() and then we need not have three abstract methods.

Member Author

Very good suggestion.

if (pair.lhs < firstTime) {
  firstTime = pair.lhs;

  // rhs might be NULL under SQL-compatibility mode
Contributor

A bit out of my depth here: what will happen if the aggregate was stored as null in a segment because SQL compatibility was on in the task writing the segment, but SQL compatibility is turned off when the segment data is being read? Should it still be read as null?

Member Author

Good question, I will check it later.

Member Author

For this question, the short answer is yes. The query-time processing code is here:

return new SerializablePair<>(((Number) map.get("lhs")).longValue(), null);

I think we should return the default value 0 in this case. What do you think, @clintropolis?
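
For reference, a sketch of the idea (illustrative only, not committed code; 0 here assumes the double variant, and NullHandling.replaceWithDefault() gates default-value mode):

long timestamp = ((Number) map.get("lhs")).longValue();
Object rhs = map.get("rhs");
if (rhs == null) {
  // In default-value mode, substitute the numeric default instead of null.
  return new SerializablePair<>(timestamp, NullHandling.replaceWithDefault() ? 0D : null);
}
return new SerializablePair<>(timestamp, ((Number) rhs).doubleValue());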

@@ -36,11 +36,16 @@
private static final int NULL_VALUE = -1;

/**
* Returns whether a given value selector *might* contain SerializablePairLongString objects.
Contributor

The class may require a rename now. Maybe FirstLastUtils?

Member Author

Yes, the selectorNeedsFoldCheck method in StringFirstLastUtils is now shared by long/float/doubleFirst/Last, so it should be extracted out of this class.

I have not made more changes to this class file because stringFirst/Last will be refactored in a new PR to share the new abstract classes provided in this PR, which means this class will also be involved.

Contributor

Oh OK, so you will make the changes in that PR, is that right?

Member Author

Yes

AggregateCombiner floatFirstAggregateCombiner = combiningAggFactory.makeAggregateCombiner();

SerializablePair[] inputPairs = {
new SerializablePair<>(5L, 134.3f),
Contributor

Can you also add tests with null values as input and/or as the expected result?

Member Author

Sorry for the late response. Null-related tests have been added in the latest commit.

@Override
public ObjectStrategy getObjectStrategy()
{
return new ObjectStrategy<SerializablePairLongDouble>()
Contributor

We can use a static ObjectStrategy since it is stateless. It seems that right now we are creating a new object for every deserialization.
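
Something along these lines, for example (the static factory shown here is a hypothetical name; the existing anonymous class body would move into it):

// Build the stateless strategy once and hand out the same instance, instead of
// constructing a new anonymous ObjectStrategy on every getObjectStrategy() call.
private static final ObjectStrategy<SerializablePairLongDouble> OBJECT_STRATEGY =
    createObjectStrategy(); // hypothetical static factory wrapping the current anonymous class

@Override
public ObjectStrategy<SerializablePairLongDouble> getObjectStrategy()
{
  return OBJECT_STRATEGY;
}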

Member Author

Done.


@abhishekagarwal87
Contributor

@FrankChen021 - have you run any performance tests/benchmarks for ingestion-time rollup? That would be handy to rule out any perf bugs.

@FrankChen021
Member Author

@FrankChen021 - have you run any performance tests/benchmarks for ingestion-time rollup? That would be handy to rule out any perf bugs.

I have not, but I will. Thanks for pointing this out.

@FrankChen021 mentioned this pull request Oct 17, 2021
@stale

stale bot commented Apr 28, 2022

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale bot added the stale label Apr 28, 2022
@abhishekagarwal87
Contributor

@FrankChen021 - This is a useful capability. Would you be able to take it to completion? Perf tests would be nice, but they need not block this PR.

@stale

stale bot commented Aug 4, 2022

This issue is no longer marked as stale.

stale bot removed the stale label Aug 4, 2022
@FrankChen021
Member Author

Hi @abhishekagarwal87, the changes in this PR were previously used by our team. I'm not working on that project now, so I don't have a large chunk of time to resolve the conflicts.

Also, SQL functions are not supported on rolled-up first/last columns, which is another restriction that needs to be resolved (see the comment above, #10949 (comment)). At the time this PR was created, that was not possible, so we used native queries to bypass the problem.

It would be great if you could pick up this PR.

@github-actions

github-actions bot commented Oct 9, 2023

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

github-actions bot added the stale label Oct 9, 2023

github-actions bot commented Nov 6, 2023

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions bot closed this Nov 6, 2023

Successfully merging this pull request may close these issues.

DoubleFirstAggregatorFactory is not supported during ingestion for rollup
4 participants