
Add support to first/last aggregators for numeric types during ingestion #10949

Closed
wants to merge 19 commits

Conversation

FrankChen021
Member

@FrankChen021 commented Mar 5, 2021

Fixes #10702

Description

This PR fixes #10702 by adding support for doubleFirst/floatFirst/longFirst and doubleLast/floatLast/longLast during the ingestion phase. It also reverts #10794 to bring back the UI.

The implementation is inspired by the current stringFirst/stringLast implementation, so the code looks similar. However, this PR does not refactor the current stringFirst/stringLast implementation to share code with double/float/long; that might be done in the future.


Key changed/added classes in this PR

  • AbstractSerializableLongObjectPairSerde is provided to share serialization code for the long/double/float types
  • GenericFirstAggregateCombiner is provided to share first-aggregator code for the long/double/float types
  • GenericLastAggregateCombiner is provided to share last-aggregator code for the long/double/float types (a rough sketch of how these pieces fit together follows below)
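
A rough sketch of the intended structure (illustrative only; the concrete serde mentioned at the end is an example, and the method shapes follow the abstract methods discussed in the review below):

public abstract class AbstractSerializableLongObjectPairSerde<T extends SerializablePair<Long, ?>>
    extends ComplexMetricSerde
{
  // Shared serialization/column handling lives here; each numeric variant only
  // supplies the conversion between a ByteBuffer and its concrete pair type.
  protected abstract T toPairObject(ByteBuffer buffer);

  protected abstract byte[] pairToBytes(T val);
}

// A concrete serde (for example, one for SerializablePairLongDouble) implements just
// those two methods, while GenericFirstAggregateCombiner/GenericLastAggregateCombiner
// reuse the same long-timestamp comparison logic across the long/double/float variants.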

What's not included in this PR

  1. stringFirst/stringLast should also share the three base classes listed above. I will open a new PR for that, to keep the changes in this PR as small as possible.
  2. SQL queries on re-indexed columns with double/float/long first and last aggregators WON'T work. This involves some changes in complex type handling which might be better done in another PR.

Test Scenario

This PR contains UT and IT cases to cover all of the doubleFirst/doubleLast, floatFirst/floatLast, and longFirst/longLast aggregators, including:

  • buffer aggregation
  • aggregate combiner
  • serdes
  • aggregation during indexing
  • aggregation during re-indexing
  • native query

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@FrankChen021
Member Author

One tricky problem remains: first/last aggregators are not supported in SQL queries on re-indexed long/float/double columns, while these aggregators work well in native queries.

The type of re-indexed double/float/long/string first/last columns is marked as COMPLEX in the schema, and the underlying type is lost when the type is converted into a RelDataType.

// Loses information about exactly what kind of complex column this is.

Since the underlying data type of the column is lost during SQL planning, the current EarliestLatestReturnTypeInference cannot infer the correct return type, and so cannot create the correct type of aggregator for double/float/long.

One way I can think of is to define macros such as double_latest for the different data types at the SQL layer.
@gianm @clintropolis Do you have any other suggestions?

@clintropolis
Member

One way I can think of is to define macros such as double_latest for the different data types at the SQL layer.
@gianm @clintropolis Do you have any other suggestions?

Hmm, so what I have had in mind to solve this is to be able to determine whether a RowSignature should be "finalized" or not in terms of the aggregator types. #9638 added some of the pieces needed for this (getFinalizedType, etc.) and touches on this idea in the PR description; I just haven't yet gotten back to this core refactoring work, or quite had time to fully think through how to determine when we need the 'finalized' signature or not.

#10277 also added tracking of the "name" of the complex type on ColumnCapabilities (which typically is what populates the RowSignature) so that is potentially available to give greater detail than ValueType.COMPLEX, but I think the finalized type would be the useful information here.

That said, I haven't had a look at this PR at all yet. I will try to get to it sometime soon, maybe I will have some ideas while looking over the code.

@FrankChen021
Member Author

Hi @clintropolis, thanks for your suggestion. I'll try to solve it.

@clintropolis
Member

Hi @clintropolis, thanks for your suggestion. I'll try to solve it.

Depending on how big of a change this is, it might be worth splitting out a separate PR to go in before this one. I'll try to think about this a bit as well.

@FrankChen021
Member Author

Hmm, so what I have had in mind to solve this is to be able to determine whether a RowSignature should be "finalized" or not in terms of the aggregator types. #9638 added some of the pieces needed for this (getFinalizedType, etc.) and touches on this idea in the PR description; I just haven't yet gotten back to this core refactoring work, or quite had time to fully think through how to determine when we need the 'finalized' signature or not.

#10277 also added tracking of the "name" of the complex type on ColumnCapabilities (which typically is what populates the RowSignature) so that is potentially available to give greater detail than ValueType.COMPLEX, but I think the finalized type would be the useful information here.

That said, I haven't had a look at this PR at all yet. I will try to get to it sometime soon, maybe I will have some ideas while looking over the code.

The name of the complex type has been set in the first/last aggregators:

https://github.com/apache/druid/pull/10949/files#diff-9fedc71bcede0adcbb1deadbef33e9e1de175ee11209eb4d5d676580104f2c03R231

I also checked how RowSignature works today, and found that there's no way to get that name from RowSignature, because the name is not passed in when the RowSignature is instantiated:

rowSignatureBuilder.add(entry.getKey(), valueType);

So, is it reasonable to make some changes here to pass the type name, as well as the value type, to RowSignature when the value type is COMPLEX?
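
To illustrate the question (purely hypothetical; no such overload exists today, and complexTypeName is just a placeholder name):

if (valueType == ValueType.COMPLEX) {
  // hypothetical overload that would carry the complex type name along with COMPLEX
  rowSignatureBuilder.add(entry.getKey(), valueType, complexTypeName);
} else {
  rowSignatureBuilder.add(entry.getKey(), valueType);
}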

@FrankChen021
Member Author

Hi @clintropolis @suneet-s, could you review this PR whenever it's convenient for you? Since this PR is a little large, I think the SQL problem could be split out into another PR.

@clintropolis
Member

Hi @clintropolis @suneet-s, could you review this PR whenever it's convenient for you? Since this PR is a little large, I think the SQL problem could be split out into another PR.

Sorry, I will try to get to this soon! I think I have a similar problem to solve with complex types in something else I'm working on, so I will be thinking about how we can deal with the differences between intermediary and finalized types a bit better as well.

@FrankChen021
Member Author

FrankChen021 commented Apr 1, 2021

Sorry, I will try to get to this soon! I think I have a similar problem to solve with complex types in something else I'm working on, so I will be thinking about how we can deal with the differences between intermediary and finalized types a bit better as well.

If there are any ideas or progress on handling complex types, could you let me know? I'm also working on the SQL problem.

@suneet-s
Contributor

suneet-s commented Apr 1, 2021

@FrankChen021 thanks for bringing this back to the top of my radar. I will look through this over the next week or so.

Contributor

@suneet-s left a comment

Reviewed about 12 files. Posting an incomplete review

import java.nio.ByteBuffer;

/**
* The class serializes a Pair<Long, ?> object for double/float/longFirst and double/float/longLast aggregators
Contributor

Can you describe why you chose not to make SerializablePairLongStringSerde extend this class?

Member Author

It's reasonable to refactor SerializablePairLongStringSerde to be a subclass of AbstractSerializablePairSerde, but I think that kind of change is not tightly related to this PR, which is already a little large. A follow-up PR can resolve this once this PR is merged.

Comment on lines 113 to 115
protected abstract T toPairObject(ByteBuffer buffer);

protected abstract byte[] pairToBytes(T val);
Contributor

Javadocs, please.

Member Author

Done

/**
* The class serializes a Pair<Long, ?> object for double/float/longFirst and double/float/longLast aggregators
*/
public abstract class AbstractSerializablePairSerde<T extends SerializablePair<Long, ?>> extends ComplexMetricSerde
Contributor

nit: change class name to AbstractSerializableLongObjectPairSerde

Suggested change
public abstract class AbstractSerializablePairSerde<T extends SerializablePair<Long, ?>> extends ComplexMetricSerde
public abstract class AbstractSerializableLongObjectPairSerde<T extends SerializablePair<Long, ?>> extends ComplexMetricSerde

Member Author

Done

@Override
public int compare(@Nullable T o1, @Nullable T o2)
{
return Longs.compare(o1.lhs, o2.lhs);
Contributor

This does not correctly handle the case where either o1 or o2 is null. See StringFirstAggregatorFactory#VALUE_COMPARATOR; we'll want similar behavior here.

Would it be possible to update the integration tests that were added to surface this error?
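
For illustration, a null-safe version might look roughly like this (a sketch only; the exact null ordering is an assumption and should mirror StringFirstAggregatorFactory#VALUE_COMPARATOR):

@Override
public int compare(@Nullable T o1, @Nullable T o2)
{
  if (o1 == null) {
    return o2 == null ? 0 : -1; // assumption: nulls sort first
  }
  if (o2 == null) {
    return 1;
  }
  return Longs.compare(o1.lhs, o2.lhs);
}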

Member Author

I will first check whether a UT can catch this error, because the IT cases share some data sources with other cases.


protected abstract T toPairObject(ByteBuffer buffer);

protected abstract byte[] pairToBytes(T val);
Contributor

Suggested change
protected abstract byte[] pairToBytes(T val);
protected abstract byte[] pairToBytes(@Nullable T val);

Member Author

In the new commit, a null check is added in the caller of this method, so it's not necessary to declare this parameter as @Nullable.
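
Roughly, the caller-side check looks like this (shape assumed, not the exact committed code):

@Override
public byte[] toBytes(@Nullable T val)
{
  if (val == null) {
    // assumption: an empty array stands in for a null pair
    return new byte[0];
  }
  return pairToBytes(val);
}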

}

@Override
public byte[] toBytes(T val)
Contributor

Suggested change
public byte[] toBytes(T val)
public byte[] toBytes(@Nullable T val)

Member Author

Done

}

@Override
protected byte[] pairToBytes(SerializablePairLongFloat val)
Contributor

Since the superclass says val can be null, all these implementations should be able to handle a null val.

I haven't dug in yet to know what this means, but this same pattern exists in all 3 implementations.

Member Author

In the new commit, a null check is added in the caller of this method, so there is no need to handle null in this method.

@suneet-s
Contributor

@FrankChen021 To help with reviewing this PR, could you update the PR description to include some notes on how you chose to implement the solution? For example, it looks like AbstractSerializablePairSerde was based on SerializablePairLongStringSerde.

Also, it would help if you included a description of the different scenarios you tested, and of known unsupported conditions, like your comment about first/last aggregators not working in SQL queries.

@FrankChen021
Member Author

@suneet-s Thanks for your review. I will address all the comments you left and update this PR later today.

@FrankChen021
Member Author

@FrankChen021 To help with reviewing this PR, could you update the PR description to include some notes on how you chose to implement the solution? For example, it looks like AbstractSerializablePairSerde was based on SerializablePairLongStringSerde.

Also, it would help if you included a description of the different scenarios you tested, and of known unsupported conditions, like your comment about first/last aggregators not working in SQL queries.

The description of this PR has been updated. Let me know if anything is missing.

@suneet-s
Contributor

@FrankChen021 To help with reviewing this PR, could you update the PR description to include some notes on how you chose to implement the solution? For example, it looks like AbstractSerializablePairSerde was based on SerializablePairLongStringSerde.
Also, it would help if you included a description of the different scenarios you tested, and of known unsupported conditions, like your comment about first/last aggregators not working in SQL queries.

The description of this PR has been updated. Let me know if anything is missing.

Thanks @FrankChen021, I will take a look again this week!

@suneet-s
Contributor

SQL queries on re-indexed columns with double/float/long first and last aggregators WON'T work. This involves some changes in complex type handling which might be better done in another PR.

What happens when a user issues the EARLIEST(expr, maxBytesPerString) function on a longFirst column - do we expect that it will fail? Is the error message in this case clear?

I'm asking because it looks like #10332 added handling of complex type columns, which used to be ok because stringFirst/Last was the only type of complex column. But now that we've introduced these column types, the expected behavior is less clear. Perhaps you can add some tests to CalciteQueryTest to validate the behavior that we want users to see when they issue sql queries on these column types.

@FrankChen021
Member Author

SQL queries on re-indexed columns with double/float/long first and last aggregators WON'T work. This involves some changes in complex type handling which might be better done in another PR.

What happens when a user issues the EARLIEST(expr, maxBytesPerString) function on a longFirst column - do we expect that it will fail? Is the error message in this case clear?

I'm asking because it looks like #10332 added handling of complex type columns, which used to be ok because stringFirst/Last was the only type of complex column. But now that we've introduced these column types, the expected behavior is less clear. Perhaps you can add some tests to CalciteQueryTest to validate the behavior that we want users to see when they issue sql queries on these column types.

EARLIEST/LATEST both work well on stringFirst/Last columns. They also work on columns that are not double/float/longFirst/Last columns.

For double/long/floatFirst/Last columns, the following exception message is returned:

Error: Plan validation failed

org.apache.calcite.runtime.CalciteContextException: From line 3, column 3 to line 3, column 24: Cannot apply 'EARLIEST' to arguments of type 'EARLIEST(<OTHER>)'. Supported form(s): 'EARLIEST(<NUMERIC>)' 'EARLIEST(<BOOLEAN>)' 'EARLIEST(expr, maxBytesPerString)'

org.apache.calcite.tools.ValidationException

Contributor

@abhishekagarwal87 left a comment

@FrankChen021 Thank you for this PR. I have some minor comments. Do you also want to add:

  • unit tests for the serde
  • documentation changes

@Override
public int compare(@Nullable T o1, @Nullable T o2)
{
return getLongObjectPairComparator().compare(o1, o2);
Contributor

It would be nice not to create a new comparator object for each comparison op.

Member Author

Implementations of getLongObjectPairComparator do not create a new comparator object; they hold a static comparator object.
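
For example, the shape is roughly this (field name and generics are illustrative; null handling is omitted in this sketch):

// The comparator is created once and shared, so no allocation happens per comparison.
private static final Comparator<SerializablePairLongDouble> PAIR_COMPARATOR =
    (o1, o2) -> Longs.compare(o1.lhs, o2.lhs);

@Override
protected Comparator<SerializablePairLongDouble> getLongObjectPairComparator()
{
  return PAIR_COMPARATOR;
}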

}

@Override
public ObjectStrategy<T> getObjectStrategy()
Contributor

I think it will be cleaner to have each subclass implement getObjectStrategy() and then we need not have three abstract methods.

Member Author

Very good suggestion.

if (pair.lhs < firstTime) {
  firstTime = pair.lhs;

  // rhs might be NULL under SQL-compatibility mode
Contributor

A bit out of my depth here: what will happen if the aggregate was stored as null in a segment because SQL compatibility was on in the task writing the segment, but SQL compatibility is turned off when the segment data is being read? Should it still be read as null?

Member Author

Good question, I will check it later.

Member Author

For this question, the short answer is yes. The query-time processing code is here:

return new SerializablePair<>(((Number) map.get("lhs")).longValue(), null);

I think we should return the default value 0 in this case. What do you think, @clintropolis?
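
For reference, a sketch of the idea (illustrative only, not committed code; 0 here assumes the double variant, and NullHandling.replaceWithDefault() gates default-value mode):

long timestamp = ((Number) map.get("lhs")).longValue();
Object rhs = map.get("rhs");
if (rhs == null) {
  // In default-value mode, substitute the numeric default instead of null.
  return new SerializablePair<>(timestamp, NullHandling.replaceWithDefault() ? 0D : null);
}
return new SerializablePair<>(timestamp, ((Number) rhs).doubleValue());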

@@ -36,11 +36,16 @@
private static final int NULL_VALUE = -1;

/**
* Returns whether a given value selector *might* contain SerializablePairLongString objects.
Contributor

The class may require a rename now. Maybe FirstLastUtils?

Member Author

Yes, the selectorNeedsFoldCheck method in StringFirstLastUtils is now shared by long/float/doubleFirst/Last, so it should be extracted out of this class.

I have not made more changes to this class file because stringFirst/Last will be refactored in a new PR to share the new abstract classes provided in this PR, which means this class will also be involved.

Contributor

Oh OK, so you will make the changes in that PR, is that right?

Member Author

Yes

AggregateCombiner floatFirstAggregateCombiner = combiningAggFactory.makeAggregateCombiner();

SerializablePair[] inputPairs = {
new SerializablePair<>(5L, 134.3f),
Contributor

Can you also add tests with null values as input and/or as the expected result?

Member Author

Sorry for the late response. Null-related tests have been added in the latest commit.

@Override
public ObjectStrategy getObjectStrategy()
{
return new ObjectStrategy<SerializablePairLongDouble>()
Contributor

We can use a static ObjectStrategy since it is stateless. It seems that right now we are creating a new object for every deserialization.
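
Something along these lines, for example (the static factory shown here is a hypothetical name; the existing anonymous class body would move into it):

// Build the stateless strategy once and hand out the same instance, instead of
// constructing a new anonymous ObjectStrategy on every getObjectStrategy() call.
private static final ObjectStrategy<SerializablePairLongDouble> OBJECT_STRATEGY =
    createObjectStrategy(); // hypothetical static factory wrapping the current anonymous class

@Override
public ObjectStrategy<SerializablePairLongDouble> getObjectStrategy()
{
  return OBJECT_STRATEGY;
}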

Member Author

Done.


@abhishekagarwal87
Contributor

@FrankChen021 - have you run any performance tests/benchmarks for ingestion-time rollup? That would be handy to rule out any perf bugs.

@FrankChen021
Member Author

@FrankChen021 - have you run any performance tests/benchmarks for ingestion-time rollup? That would be handy to rule out any perf bugs.

I have not, but I will. Thanks for pointing this out.

@FrankChen021 mentioned this pull request Oct 17, 2021
@stale

stale bot commented Apr 28, 2022

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale bot added the stale label Apr 28, 2022
@abhishekagarwal87
Contributor

@FrankChen021 - This is a useful capability. Would you be able to take it to completion? Perf tests would be nice, but they need not block this PR.

@stale

stale bot commented Aug 4, 2022

This issue is no longer marked as stale.

stale bot removed the stale label Aug 4, 2022
@FrankChen021
Member Author

Hi @abhishekagarwal87, the changes in this PR were previously used by our team. I'm not working on that project now, so I don't have a large chunk of time to resolve the conflicts.

Also, SQL functions are not supported on rolled-up first/last columns, which is another restriction that needs to be resolved (see the comment above, #10949 (comment)). At the time this PR was created, that was not possible, so we used native queries to bypass the problem.

It would be great if you could pick up this PR.

@github-actions

github-actions bot commented Oct 9, 2023

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

github-actions bot added the stale label Oct 9, 2023

github-actions bot commented Nov 6, 2023

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions bot closed this Nov 6, 2023

Successfully merging this pull request may close these issues.

DoubleFirstAggregatorFactory is not supported during ingestion for rollup
4 participants