-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhanced MSQ table functions #13360
Enhanced MSQ table functions #13360
Conversation
HTTP, LOCALFILES and INLINE table functions powered by catalog metadata. Improvements to the Calcite test framework.
server/src/main/java/org/apache/druid/catalog/model/ModelProperties.java
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/catalog/model/PropertyAttributes.java
Outdated
Show resolved
Hide resolved
{ | ||
public static final String IS_SQL_FN_PARAM_KEY = "sqlFnArg"; | ||
public static final String IS_SQL_FN_OPTIONAL = "optional"; | ||
public static final String IS_PARAMETER = "param"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See this is used to determine if a property is an external table parameter. Do other types of table not also have parameters? Wondering if we need to distinguish?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is rather generic so it can be used in multiple places. As it turns out, the next revision will modify this a bit.
server/src/main/java/org/apache/druid/catalog/model/table/ExternalTableSpec.java
Outdated
Show resolved
Hide resolved
ImmutableMap.of(IS_SQL_FN_PARAM_KEY, true, IS_SQL_FN_OPTIONAL, true); | ||
public static final Map<String, Object> TABLE_PARAM = | ||
ImmutableMap.of(IS_PARAMETER, true); | ||
public static final Map<String, Object> SQL_AND_TABLE_PARAM = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: what is the difference between SQL_FN_PARAM and IS_PARAMETER?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is a bit messy in this version. The idea is that a property can be stored only in the catalog, can be only a SQL parameter, or can be either stored in the catalog or a SQL parameter. It became clear that this approach was becoming a bit too complex. So, the next revision uses a simpler system.
server/src/main/java/org/apache/druid/catalog/model/table/HttpTableDefn.java
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/catalog/model/table/InputFormats.java
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/catalog/model/table/InputFormats.java
Show resolved
Hide resolved
sql/src/main/java/org/apache/druid/sql/calcite/external/FunctionParameterImpl.java
Show resolved
Hide resolved
sql/src/main/java/org/apache/druid/sql/calcite/external/UserDefinedTableMacroFunction.java
Outdated
Show resolved
Hide resolved
sql/src/main/java/org/apache/druid/sql/calcite/external/ParameterizeOperator.java
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/catalog/model/PropertyAttributes.java
Show resolved
Hide resolved
To allow forward progress, I went ahead and stripped the Calcite test revisions out of the PR. They are in #13426. For this PR, there is some hacky temporary code to get things to work. That code will be replaced with the original code once the other PR is merged. The remaining code is almost all directly catalog related. |
A great improvement in usability! Specifying JSON-inside-SQL does get old. Thanks for improving this. After noticing this land in master, I realized I have a few comments arising from reading the new docs, so I'll write them here.
Btw, I also added the "Release Notes" label, since there is a "Release Notes" section in the PR description. The label helps the release manager find all the PRs that are associated with interesting release notes. |
@gian, thanks for the comments! A few answers:
Thanks for adding the release notes label. |
If you do choose to wire it up, that'd be nice — about 33% nicer for
Sounds good! Since it'll affect the user-facing syntax, if this is part of the plan, it'll be best to get that in before the next release. Otherwise we'll have compatibility to deal with.
Sounds good.
Ah. Nevertheless I feel we are better served by consistency here than by brevity. People get tripped up when the same thing has two different names in two different contexts. An option: we could have the JSON one accept the shorter name too: the JSON object could accept both the long and short name, and write out the old name on serialization for compatibility reasons. Then, we could document the short name for both. Thoughts, preferences?
I'm not sure. I don't know why it's not documented. IMO, unless there's good reason to document it now, it's better to leave it out of the table-format function, because often things are undocumented because they aren't meant to be supported parameters for end users.
Ah. I see what you mean. I don't think it's actually supposed to work that way, is the thing! Looking at the implementation of LocalInputSource, it seems like the intent is |
@gianm, to fix 4, how far do we want to go? I'm using they argument name "format" for the format. Should we change that to "inputFormat" to be consistent with existing code? Maybe not, since this particular field is not visible, except as an argument name to I'll change all the other parameters to the JSON forms. We can see if we get user feedback about name lengths, and can add new names later if so. Else, we just add to the confusion because we have to document back in the native ingest pages that fields that used to be called "veryLongOldName" are now just called "name". Consistency seems more important than brevity. For the HTTP input source, the JSON uses Also, note that the table function for the -- Current, unquotes
SELECT ... FROM TABLE(localFiles(...))
-- Consistent, but requires quotes
SELECT ... FROM TABLE("local"(...)) |
That's a good question. My goal here with is avoiding a situation where someone used to the JSON API switches to SQL, tries to use a familiar parameter, finds it isn't accepted, then eventually figures out it's just spelled differently and wonders why it was necessary to change it. (Or, worse, doesn't figure that out, and assumes the SQL function doesn't support that parameter.) I believe this risk is at its highest when dealing with parameters to SQL functions that have an "equivalent" entity in the JSON API. So, for the examples you brought up:
They will if they are in a position where they're using both the JSON and SQL APIs: perhaps they're migrating from one to the other, or perhaps they are in a situation where some of their jobs are run with JSON and some with SQL. They won't appreciate it if they're using the SQL API only. That is a tradeoff we have to make. A thought: is it possible for the SQL API to accept two spellings of the same parameter? (Or, to accept two parameters where it would be an error to specify both unless they matched.) In that case, we could solve for both desires by having SQL accept a short and long form. The short form would be preferred by users that exclusively use SQL; the long form would be there for people that have a foot in both SQL and JSON worlds. |
@gianm, Got the array parameter thing working. Required an horrible amount of Calcite hacking, but at least it works. This would be a good contribution back to Calcite, once things stabilize. See example below. We can accept parameter aliases, which is how I thought to address the long name issue, once things stabilize. It is a nuisance to introduce that extra bit of complexity while things are still in flux. For HTTP, I changed the parameter names back to For the local input source, I added the desired functionality to the catalog/table function layer. This allows a user to specify a base directory and format in the catalog, and a set of files in a table function which are relative to the base directory. The catalog code rewrites things into the awkward form required by the local input source itself. Example, using just a table function: -- Gives /tmp/a.csv and /tmp/b.csv
localfiles(
baseDir => '/tmp',
files => ARRAY['a.csv', 'b.csv'],
format => 'csv') With yet more Calcite hacking, array valued arguments can be given as parameters: localfiles(files => ?, format => 'csv') The actual list of files can be given as a |
In the next PR, INSERT INTO ...
SELECT ...
FROM TABLE(extern(
inputSource => '<input source>',
inputFormat => '<input format>'))
(a VARCHAR, b BIGINT, c DOUBLE, ...) |
Refined external table functions
This PR introduces the first step toward catalog integration with MSQ queries: a set of refined, easier-to-use table functions.
Simplified Table Functions
Recall that MSQ introduced the
extern
function:As explained in the Druid Catalog Proposal, we wish to make the user experience simpler by avoiding the use of JSON. We wish to use SQL syntax instead. To do that, we must provide a solution for the input source, input format and row signature.
The
extern
function handles all input sources. The first simplification is to create functions for each supported input source. This PR introduces three:inline
for theInlineInputSource
(type"input"
).http
for theHttpInputSource
(type"http"
).localFiles
for theLocalInputSource
(type"local"
). (Note that the name islocalFiles
becauselocal
is a SQL reserved word.)Second, we need a way to provide the various parameters for each input source. We do that with SQL named arguments. Example:
The various input formats are defined as optional arguments. The
format
argument choose the format.Finally, we need a way to specify the schema. As it turns out, Calcite introduced the
EXTEND
clause for Apache Phoenix (though it is not clear if Phoenix ever actually used that clause.) we repurpose that clause for our needs:The
EXTEND
list is simply a list of column definitions in standard SQL form.Implementation Notes
The set of arguments available for each function is defined by the catalog "meta-metadata" introduced in the recent Catalog Basics PR. A set of "operator conversion" classes create SQL parameter definitions based on the property definitions of each external table type defined by the catalog. Thinking ahead, this same metadata will allow us to parameterize catalog tables in the next PR.
The
EXTEND
notation required some amount of Calcite trickery. The Calcite code expects syntax of the form:However, we want something of the form:
We must change the parser to achieve the above. This is done via a Python script,
edit-parser.py
, which modifies the parser during the code generation phase. The parser-editing approach was taken because it is less intrusive than an approach based on copying and replacing the relevant bits of the Calcite grammar.As set of custom extensions to Calcite classes effectively "hides" our usage from Calcite by passing the schema behind Calcite's back. Specifically, we rewrite the
EXTEND
node to copy the schema into a temporary operator node, then discard theEXTEND
node so that it does not confuse Calcite.The plumbing required for the above made it possible to cache the external table produced by the table function to avoid the normal Calcite behavior which recreates the external table multiple times during planning.
Release Notes
Release notes should mention the availability of the the new functions. This is not the entire set of input sources: the most important one (S3) is not yet converted. Consult the documentation included in this PR for the details.
It turns out that serialization of the
LocalInputSource
is broken: no use of this class has likely worked in practice. See Issue #13359 for details.Other Changes
The location of the generated parser in SQL changed from
target/generated-sources/annotations
totarget/generated-sources
. The Maven step in which we generate the parser changed frominitialize
togenerate-sources
.This PR has: