
revert ColumnAnalysis type, add typeSignature and use it for DruidSchema #11895

Merged
merged 11 commits into apache:master on Nov 11, 2021

Conversation

@clintropolis (Member) commented Nov 9, 2021

Description

#11713 changed the segment metadata query column analysis `type` field, but this might not be very friendly to actual users (i.e. not DruidSchema) who call this query directly, because merging query results from mixed versions of Druid would result in an error due to mismatched types.

This PR reverts `type` to its original behavior and adds a new `typeSignature` field, which is used by DruidSchema. Error messaging when merging ColumnAnalysis has also been improved to include the mismatched types involved in the merge.


Key changed/added classes in this PR
  • ColumnAnalysis
  • SegmentAnalyzer
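The dual-field idea can be pictured with a minimal sketch (hypothetical class and field names; the real ColumnAnalysis is a Jackson-annotated class in Druid's processing module): the legacy friendly `type` string is kept for API compatibility, while the new `typeSignature` carries the internal type representation.

```java
// Minimal sketch of the dual-field idea behind this PR (hypothetical class name;
// not the actual Druid ColumnAnalysis implementation).
public class ColumnAnalysisSketch {
  private final String type;          // legacy friendly name, e.g. "hyperUnique"
  private final String typeSignature; // internal signature, e.g. "COMPLEX<hyperUnique>"

  public ColumnAnalysisSketch(String type, String typeSignature) {
    this.type = type;
    this.typeSignature = typeSignature;
  }

  public String getType() {
    return type;
  }

  public String getTypeSignature() {
    return typeSignature;
  }
}
```

For a simple column the two fields coincide (e.g. both `LONG`); they diverge only for COMPLEX columns.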

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

Dimension columns will have type `STRING`, `FLOAT`, `DOUBLE`, or `LONG`.
Metric columns will have type `FLOAT`, `DOUBLE`, or `LONG`, or the name of the underlying complex type such as `hyperUnique` in case of COMPLEX metric.
Timestamp column will have type `LONG`.
All columns will contain a `typeSignature` which is the Druid internal representation of the type information for this column, is what is show in [`INFORMATION_SCHEMA.COLUMNS`](../querying/sql.md#columns-table) table in SQL, and is typically the value used to supply Druid with JSON type information at query or ingest time. This value will be `STRING`, `FLOAT`, `DOUBLE`, `LONG`, or `COMPLEX<typeName>` (e.g. `COMPLEX<hyperUnique>`).
Contributor:

show -> shown

Timestamp column will have type `LONG`.
All columns will contain a `typeSignature` which is the Druid internal representation of the type information for this column, is what is show in [`INFORMATION_SCHEMA.COLUMNS`](../querying/sql.md#columns-table) table in SQL, and is typically the value used to supply Druid with JSON type information at query or ingest time. This value will be `STRING`, `FLOAT`, `DOUBLE`, `LONG`, or `COMPLEX<typeName>` (e.g. `COMPLEX<hyperUnique>`).

Additionally, columns will have a legacy friendly `type` name. This might match `typeSignature` for some column types (`STRING`, `FLOAT`, `DOUBLE`, or `LONG`) but for COMPLEX columns will only contain the name of the underlying complex type such as `hyperUnique`.
Contributor:

Consider adding:

New applications should use typeSignature, not type.

@@ -73,6 +78,12 @@ public String getType()
return type;
}

@JsonProperty
public ColumnType getTypeSignature()
Contributor:

Consider placing this getter above getType, which I think will put it earlier in the JSON and make it seem more "official".

@@ -181,6 +205,7 @@ public String toString()
{
return "ColumnAnalysis{" +
"type='" + type + '\'' +
", columnType=" + typeSignature +
Contributor:

Why not call this typeSignature?

Member Author (@clintropolis):

oops, 👍

);
}

if (!typeSignature.equals(rhs.getTypeSignature())) {
Contributor:

This'll NPE if typeSignature is null — i.e. if it came from a server that is on an older version. I think this is OK, since people are supposed to update Brokers after segment-serving-servers, but I'd still feel better if this code was resilient to null typeSignature.

Which reminds me: I think we'll also need a trivial change in the cache key to make sure SegmentAnalysis objects from shared caches are not re-used.
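A null-tolerant check along the lines suggested here could look like the following sketch (hypothetical helper, not the code in this PR): a missing `typeSignature` from an older server falls back to the other side instead of being dereferenced.

```java
import java.util.Objects;

// Hypothetical null-tolerant merge check sketching the reviewer's suggestion:
// an older server may omit typeSignature entirely, so a null on either side
// defers to the other side instead of throwing an NPE.
public class TypeSignatureMerge {
  public static String merge(String lhs, String rhs) {
    if (lhs == null) {
      return rhs; // older server omitted the field
    }
    if (rhs == null) {
      return lhs;
    }
    if (!Objects.equals(lhs, rhs)) {
      // Include both types in the message, matching this PR's improved error messaging.
      throw new IllegalArgumentException(
          "Cannot merge columns of type [" + lhs + "] and [" + rhs + "]");
    }
    return lhs;
  }
}
```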

Dimension columns will have type `STRING`, `FLOAT`, `DOUBLE`, or `LONG`.
Metric columns will have type `FLOAT`, `DOUBLE`, or `LONG`, or the name of the underlying complex type such as `hyperUnique` in case of COMPLEX metric.
Timestamp column will have type `LONG`.
All columns will contain a `typeSignature` which is the Druid internal representation of the type information for this column, is what is show in [`INFORMATION_SCHEMA.COLUMNS`](../querying/sql.md#columns-table) table in SQL, and is typically the value used to supply Druid with JSON type information at query or ingest time. This value will be `STRING`, `FLOAT`, `DOUBLE`, `LONG`, or `COMPLEX<typeName>` (e.g. `COMPLEX<hyperUnique>`).
Contributor:

Suggested change
All columns will contain a `typeSignature` which is the Druid internal representation of the type information for this column, is what is show in [`INFORMATION_SCHEMA.COLUMNS`](../querying/sql.md#columns-table) table in SQL, and is typically the value used to supply Druid with JSON type information at query or ingest time. This value will be `STRING`, `FLOAT`, `DOUBLE`, `LONG`, or `COMPLEX<typeName>` (e.g. `COMPLEX<hyperUnique>`).
All columns contain a `typeSignature` that Druid uses to represent the column type information internally. The `typeSignature` is typically the same value used to identify the JSON type information at query or ingest time. One of: `STRING`, `FLOAT`, `DOUBLE`, `LONG`, or `COMPLEX<typeName>`, e.g. `COMPLEX<hyperUnique>`.


Additionally, columns will have a legacy friendly `type` name. This might match `typeSignature` for some column types (`STRING`, `FLOAT`, `DOUBLE`, or `LONG`) but for COMPLEX columns will only contain the name of the underlying complex type such as `hyperUnique`.

The timestamp column will always have `typeSignature` and `type` as `LONG`.
Contributor:

Suggested change
The timestamp column will always have `typeSignature` and `type` as `LONG`.
The `typeSignature` and `type` for the timestamp column is always `LONG`.

Timestamp column will have type `LONG`.
All columns will contain a `typeSignature` which is the Druid internal representation of the type information for this column, is what is show in [`INFORMATION_SCHEMA.COLUMNS`](../querying/sql.md#columns-table) table in SQL, and is typically the value used to supply Druid with JSON type information at query or ingest time. This value will be `STRING`, `FLOAT`, `DOUBLE`, `LONG`, or `COMPLEX<typeName>` (e.g. `COMPLEX<hyperUnique>`).

Additionally, columns will have a legacy friendly `type` name. This might match `typeSignature` for some column types (`STRING`, `FLOAT`, `DOUBLE`, or `LONG`) but for COMPLEX columns will only contain the name of the underlying complex type such as `hyperUnique`.
Contributor:

Suggested change
Additionally, columns will have a legacy friendly `type` name. This might match `typeSignature` for some column types (`STRING`, `FLOAT`, `DOUBLE`, or `LONG`) but for COMPLEX columns will only contain the name of the underlying complex type such as `hyperUnique`.
Druid retains the legacy friendly `type` name. For some column types, the value may match the `typeSignature` (`STRING`, `FLOAT`, `DOUBLE`, or `LONG`). For `COMPLEX` columns, the `type` only contains the name of the underlying complex type such as `hyperUnique`.
New applications should use `typeSignature`, not `type`.
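The relationship between the two fields described above can be illustrated with a small sketch (hypothetical helper, not Druid API): the legacy `type` name is effectively the `typeSignature` with the `COMPLEX<...>` wrapper removed.

```java
// Hypothetical illustration of how the legacy type name relates to typeSignature:
// COMPLEX<hyperUnique> -> hyperUnique; simple types pass through unchanged.
public class LegacyTypeName {
  public static String legacyType(String typeSignature) {
    if (typeSignature.startsWith("COMPLEX<") && typeSignature.endsWith(">")) {
      return typeSignature.substring("COMPLEX<".length(), typeSignature.length() - 1);
    }
    return typeSignature; // STRING, FLOAT, DOUBLE, LONG are unchanged
  }
}
```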

@lgtm-com

lgtm-com bot commented Nov 9, 2021

This pull request introduces 2 alerts when merging bff926e into a5bd0b8 - view on LGTM.com

new alerts:

  • 2 for Dereferenced variable may be null

@lgtm-com

lgtm-com bot commented Nov 9, 2021

This pull request introduces 2 alerts when merging be1c2ba into a5bd0b8 - view on LGTM.com

new alerts:

  • 2 for Dereferenced variable may be null

@@ -47,6 +50,7 @@ public static ColumnAnalysis error(String reason)

@JsonCreator
public ColumnAnalysis(
@JsonProperty("typeSignature") ColumnType typeSignature,
Contributor:

can this change cause a problem during upgrade? E.g. an historical is on older version while broker is on newer version so the broker gets old json from the historical when running segment metadata queries?

Member Author (@clintropolis):

the supported upgrade order is brokers last, but it would just be null, which shouldn't be a problem i think

Contributor:

https://github.com/apache/druid/pull/11895/files#diff-5aa1494dc5a13e82654e42ce8cdb3cb628c95233f8828312b3ede40dac1b8544R156 - shall this be modified to pick the one if the other is null instead of throwing an error. I am throwing darts in the dark here so please ignore if this is not really a problem :)

Member Author (@clintropolis) commented Nov 9, 2021:

ah, if we supported updating the broker before data servers then we might want to consider that, but we don't (so that new query types, filters, aggregators, etc cannot be issued before data servers can understand them)

Contributor:

ah yes. you are right. thanks for clarifying.

Contributor @techdocsmith left a comment:

small nit. Otherwise LGTM.

docs/querying/segmentmetadataquery.md (outdated; resolved)
@lgtm-com

lgtm-com bot commented Nov 10, 2021

This pull request introduces 2 alerts when merging 9055810 into 6c196a5 - view on LGTM.com

new alerts:

  • 2 for Dereferenced variable may be null

@suneet-s suneet-s merged commit 5baa221 into apache:master Nov 11, 2021
@clintropolis clintropolis deleted the column-analysis-type-stuffs branch November 12, 2021 19:05
@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022