Add IS_MULTIVALUE and COMPLEX_TYPE columns to INFORMATION_SCHEMA.COLUMNS #9707

JulianJaffePinterest · 2020-04-15T22:54:45Z

Description

Currently, the INFORMATION_SCHEMA.COLUMNS table in Druid does not have full information about the columns in a data source, specifically whether or not a column can contain multi-valued entries and detailed information about which complex serde a metric uses. This means that users attempting to interact with Druid metadata must either fall back to (potentially expensive) SegmentMetadataQueries to determine this information or implement complicated handling logic.

Motivation

It is currently difficult to obtain complete metadata about a Druid datasource. The only way to determine whether or not a column can contain multiple values or what complex serde was used to encode a metric is to use SegmentMetadataQueries. These queries can be expensive (if aggregating over a large number of segments) and can miss information if not all segments are queried (e.g. columns that don't appear in a given segment won't be returned by a SegmentMetadataQuery, and columns that don't have multiple values in a queried segment will return ..."hasMultipleValues" : false,...). This can be worked around to some degree via merging results, but this reintroduces the problem that SegmentMetadataQueries can be expensive when run over many segments. If the INFORMATION_SCHEMA.COLUMNS table were to be extended to include information about whether a column could contain multiple values and what serde was used for a complex metric, tools that interact with Druid metadata such as the Calcite Druid adapter, proposed Spark and Hive readers, and other third party integrations could issue simple SQL-based queries to determine data source metadata instead of needing to rely on SegmentMetadataQueries. This would also align with the Druid recommendation to use the INFORMATION_SCHEMA tables for metadata if you're using SQL.

The text was updated successfully, but these errors were encountered:

gianm · 2020-04-22T16:29:32Z

+1 on the concept. This would be a great addition. The DruidSchema is actually in a place already where it could know which columns are multivalue — it gets it from SegmentMetadata queries and caches the results to avoid performance issues. It just doesn't plumb the information into the INFORMATION_SCHEMA.

I don't recall if SegmentMetadata queries return info about what complex types are in play, but if they don't, they could be extended to do so.

@JulianJaffePinterest any interest in contributing something that does this?

JulianJaffePinterest · 2020-04-28T20:27:08Z

Segment Metadata queries do return the complex type name (e.g. thetaSketch or variance). I'll try to grab some time to work on this soon.

JulianJaffePinterest · 2020-04-29T18:03:05Z

After doing some digging, it seems like at least the COMPLEX_TYPE portion of this should wait until @clintropolis completes his ongoing type system refactoring (#9638) to avoid stepping on each other's toes. Clint's also mentioned extending INFORMATION_SCHEMA.COLUMNS in a few other issues, but they don't seem to have open pull requests, so there's less risk of a conflict so I'll take a stab.

gianm · 2020-04-30T03:58:04Z

AFAIK @clintropolis hasn't actively been working on the SQL part of the type work. So plumbing through the stuff that already exists (like complex type name) should be open ground. @clintropolis please let us know if that ends up not being accurate.

clintropolis · 2020-04-30T04:16:43Z

Ah yeah, @gianm is right, i haven't worked my way up to the SQL stuff yet. Go ahead if you've got a change in mind, I can work around whatever you get wired up here. I don't believe i'll be removing getTypeName or anything dramatic from the aggregators, at most just changing it to getComplexTypeName or something like that to make it clearer that after the change in my refactor that it's only real purpose is to map to a complex serde for writing/reading segments.

Even after that change, it still seems useful for the information schema, I'd be happy to review your PR once it exists 👍

JulianJaffePinterest · 2021-10-27T09:54:30Z

I believe #11713 addresses the complex type half of this request. Leaving this open to track the IS_MULTIVALUE half.

JulianJaffePinterest added the Feature/Change Description label Apr 15, 2020

gianm added the Area - SQL label Apr 22, 2020

JulianJaffePinterest mentioned this issue Apr 28, 2020

Support directly reading and writing Druid data from Spark #9780

Open

JulianJaffePinterest mentioned this issue Apr 16, 2021

Spark Direct Readers and Writers for Druid. #10920

Closed

8 tasks

JulianJaffePinterest mentioned this issue Jul 22, 2021

Set up, confs, mixins, and clients. #11474

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add IS_MULTIVALUE and COMPLEX_TYPE columns to INFORMATION_SCHEMA.COLUMNS #9707

Add IS_MULTIVALUE and COMPLEX_TYPE columns to INFORMATION_SCHEMA.COLUMNS #9707

JulianJaffePinterest commented Apr 15, 2020

gianm commented Apr 22, 2020

JulianJaffePinterest commented Apr 28, 2020

JulianJaffePinterest commented Apr 29, 2020

gianm commented Apr 30, 2020

clintropolis commented Apr 30, 2020

JulianJaffePinterest commented Oct 27, 2021

Add IS_MULTIVALUE and COMPLEX_TYPE columns to INFORMATION_SCHEMA.COLUMNS #9707

Add IS_MULTIVALUE and COMPLEX_TYPE columns to INFORMATION_SCHEMA.COLUMNS #9707

Comments

JulianJaffePinterest commented Apr 15, 2020

Description

Motivation

gianm commented Apr 22, 2020

JulianJaffePinterest commented Apr 28, 2020

JulianJaffePinterest commented Apr 29, 2020

gianm commented Apr 30, 2020

clintropolis commented Apr 30, 2020

JulianJaffePinterest commented Oct 27, 2021