Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhanced MSQ table functions #13360

Merged
merged 14 commits into from
Dec 8, 2022
Merged

Conversation

paul-rogers
Copy link
Contributor

@paul-rogers paul-rogers commented Nov 13, 2022

Refined external table functions

This PR introduces the first step toward catalog integration with MSQ queries: a set of refined, easier-to-use table functions.

Simplified Table Functions

Recall that MSQ introduced the extern function:

SELECT
 <column>
FROM TABLE(
  EXTERN(
    '<Druid input source>',
    '<Druid input format>',
    '<row signature>'
  )
)

As explained in the Druid Catalog Proposal, we wish to make the user experience simpler by avoiding the use of JSON. We wish to use SQL syntax instead. To do that, we must provide a solution for the input source, input format and row signature.

The extern function handles all input sources. The first simplification is to create functions for each supported input source. This PR introduces three:

  • inline for the InlineInputSource (type "input").
  • http for the HttpInputSource (type "http").
  • localFiles for the LocalInputSource (type "local"). (Note that the name is localFiles because local is a SQL reserved word.)

Second, we need a way to provide the various parameters for each input source. We do that with SQL named arguments. Example:

FROM TABLE(
  http(
    userName => 'bob',
    password => 'secret',
    uris => 'http:foo.com/bar.csv',
    format => 'csv'
    )
  )

The various input formats are defined as optional arguments. The format argument choose the format.

Finally, we need a way to specify the schema. As it turns out, Calcite introduced the EXTEND clause for Apache Phoenix (though it is not clear if Phoenix ever actually used that clause.) we repurpose that clause for our needs:

FROM TABLE(
  http(
    userName => 'bob',
    password => 'secret',
    uris => 'http:foo.com/bar.csv',
    format => 'csv'
    )
  ) EXTEND (x VARCHAR, y VARCHAR, z BIGINT)

The EXTEND list is simply a list of column definitions in standard SQL form.

Implementation Notes

The set of arguments available for each function is defined by the catalog "meta-metadata" introduced in the recent Catalog Basics PR. A set of "operator conversion" classes create SQL parameter definitions based on the property definitions of each external table type defined by the catalog. Thinking ahead, this same metadata will allow us to parameterize catalog tables in the next PR.

The EXTEND notation required some amount of Calcite trickery. The Calcite code expects syntax of the form:

<TABLENAME> EXTENDS <extend-list>

However, we want something of the form:

TABLE ( <name> ( <args ) ) EXTENDS <extend-list>

We must change the parser to achieve the above. This is done via a Python script, edit-parser.py, which modifies the parser during the code generation phase. The parser-editing approach was taken because it is less intrusive than an approach based on copying and replacing the relevant bits of the Calcite grammar.

As set of custom extensions to Calcite classes effectively "hides" our usage from Calcite by passing the schema behind Calcite's back. Specifically, we rewrite the EXTEND node to copy the schema into a temporary operator node, then discard the EXTEND node so that it does not confuse Calcite.

The plumbing required for the above made it possible to cache the external table produced by the table function to avoid the normal Calcite behavior which recreates the external table multiple times during planning.

Release Notes

Release notes should mention the availability of the the new functions. This is not the entire set of input sources: the most important one (S3) is not yet converted. Consult the documentation included in this PR for the details.

It turns out that serialization of the LocalInputSource is broken: no use of this class has likely worked in practice. See Issue #13359 for details.

Other Changes

The location of the generated parser in SQL changed from target/generated-sources/annotations to target/generated-sources. The Maven step in which we generate the parser changed from initialize to generate-sources.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

HTTP, LOCALFILES and INLINE table functions powered by
catalog metadata.

Improvements to the Calcite test framework.
@cryptoe cryptoe added Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - SQL labels Nov 13, 2022
{
public static final String IS_SQL_FN_PARAM_KEY = "sqlFnArg";
public static final String IS_SQL_FN_OPTIONAL = "optional";
public static final String IS_PARAMETER = "param";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See this is used to determine if a property is an external table parameter. Do other types of table not also have parameters? Wondering if we need to distinguish?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is rather generic so it can be used in multiple places. As it turns out, the next revision will modify this a bit.

ImmutableMap.of(IS_SQL_FN_PARAM_KEY, true, IS_SQL_FN_OPTIONAL, true);
public static final Map<String, Object> TABLE_PARAM =
ImmutableMap.of(IS_PARAMETER, true);
public static final Map<String, Object> SQL_AND_TABLE_PARAM =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: what is the difference between SQL_FN_PARAM and IS_PARAMETER?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is a bit messy in this version. The idea is that a property can be stored only in the catalog, can be only a SQL parameter, or can be either stored in the catalog or a SQL parameter. It became clear that this approach was becoming a bit too complex. So, the next revision uses a simpler system.

@paul-rogers paul-rogers mentioned this pull request Nov 23, 2022
4 tasks
@paul-rogers
Copy link
Contributor Author

Moved the test framework changes to a separate PR: #13426. The changes will disappear out of this PR once #13426 is merged into master and this PR is rebased on top. Please review the planner framework changes in #13426 instead of here.

@paul-rogers
Copy link
Contributor Author

To allow forward progress, I went ahead and stripped the Calcite test revisions out of the PR. They are in #13426. For this PR, there is some hacky temporary code to get things to work. That code will be replaced with the original code once the other PR is merged. The remaining code is almost all directly catalog related.

@paul-rogers paul-rogers merged commit 013a12e into apache:master Dec 8, 2022
@paul-rogers paul-rogers deleted the 221112-table-fns branch December 8, 2022 21:56
@gianm
Copy link
Contributor

gianm commented Dec 9, 2022

A great improvement in usability! Specifying JSON-inside-SQL does get old. Thanks for improving this.

After noticing this land in master, I realized I have a few comments arising from reading the new docs, so I'll write them here.

  1. The new extension-column syntax is cool, and more SQLy. Would it be possible to use this with EXTERN too? Like, make the third argument (row signature) optional, and allow people to provide the column list in SQL?
  2. Comma-separated files and URIs are a drag, I think. Local files and URIs can both legitimately have commas in them. Given that SQL supports arrays, could these be SQL arrays instead?
  3. The bits about "this selects CsvInputFormat" and "this selects JsonInputFormat" should be reworded, because we shouldn't use Java class names in user-facing documentation. Instead we should use the type codes these classes have in the API, which would be "csv input format" and "json input format" respectively. These could link to the docs for those formats, i.e. ../ingestion/data-formats.html#csv and ../ingestion-data-formats.html#json.
  4. Some table-function-format parameter names are different from the corresponding input-format parameters in ways that don't seem necessary. Like, skipRows vs skipHeaderRows. I think it'd be best if these were the same, so people can have an easier time translating between table-function-formats and input-formats.
  5. For JSON, keepNullColumns isn't documented for the json input format, but keepNulls is documented for the table function. I'm not sure why it isn't documented for the input format. But I do note that the description here in the table function docs is confusing: what does it mean to "keep" or "not keep" null values? Given the confusingness of the behavior, and the fact that it isn't documented for input formats, I suggest we should remove it from the table-function-format too.
  6. The "due to JSON serialization of LocalInputSource changes meaning of files #13359 this functionality is broken" piece bugged me. It's not broken for the intended use case of a local input source, where you have a single-server setup, or a shared volume across all servers in a multi-server setup. In those cases, nothing bad happens by resolving relative paths upon first serialization. I mean, a user won't notice a difference, because all servers would resolve the path the same way anyway. So I was going to suggest removing this wording, or at least adjusting it such that it it's clear that the functionality works fine in the scenario it was designed for. However, I decided to simply fix it instead: LocalInputSource: Serialize File paths without forcing resolution. #13534. That PR also removes the doc wording, since it's no longer needed.

Btw, I also added the "Release Notes" label, since there is a "Release Notes" section in the PR description. The label helps the release manager find all the PRs that are associated with interesting release notes.

@paul-rogers
Copy link
Contributor Author

@gian, thanks for the comments! A few answers:

  1. Yes, it is possible to use the syntax with extern. I've not wired that up, but doing so should be straightforward.
  2. It is on the to-do list to use an array type for list arguments. Might be in the next PR.
  3. So, on the docs, my intent was to refer to the existing docs rather than repeating stuff. I'll change it to use whatever names we use for those things elsewhere.
  4. The argument names are deliberately shorter than the JSON field names. Folks will be typing this, the name space is small, so no need for the verbose names we use in JSON. The JSON names were not selected with typing usability in mind. httpAuthenticationUsername Really?
  5. Thanks for the note on JSON. Is the keepNullColumns meant to be supported? If so, we can clean up the JSON input format docs, and I'll reference that here. If it is not supported, I'll remove the function arg. (There are some new JSON format args that I've not yet added for nested types.)
  6. I guess I disagree with the analysis that the functionality is not broken. If I set my baseDir to "/tmp", and my files to "x.csv", I expect Druid to read "/tmp/x.csv" not "/Users/paul/druid/x.csv". Not all local input files are relative to Druid's product directly. That said, thanks for fixing the issue. I'll rerun the tests with your fix and correct the wording.

Thanks for adding the release notes label.

@gianm
Copy link
Contributor

gianm commented Dec 9, 2022

@gian, thanks for the comments! A few answers:

1. Yes, it is possible to use the syntax with `extern`. I've not wired that up, but doing so should be straightforward.

If you do choose to wire it up, that'd be nice — about 33% nicer for EXTERN IMO 🙂

2. It is on the to-do list to use an array type for list arguments. Might be in the next PR.

Sounds good! Since it'll affect the user-facing syntax, if this is part of the plan, it'll be best to get that in before the next release. Otherwise we'll have compatibility to deal with.

3. So, on the docs, my intent was to refer to the existing docs rather than repeating stuff. I'll change it to use whatever names we use for those things elsewhere.

Sounds good.

4. The argument names are deliberately shorter than the JSON field names. Folks will be typing this, the name space is small, so no need for the verbose names we use in JSON. The JSON names were not selected with typing usability in mind. `httpAuthenticationUsername` Really?

Ah. Nevertheless I feel we are better served by consistency here than by brevity. People get tripped up when the same thing has two different names in two different contexts. An option: we could have the JSON one accept the shorter name too: the JSON object could accept both the long and short name, and write out the old name on serialization for compatibility reasons. Then, we could document the short name for both. Thoughts, preferences?

5. Thanks for the note on JSON. Is the `keepNullColumns` meant to be supported? If so, we can clean up the JSON input format docs, and I'll reference that here. If it is _not_ supported, I'll remove the function arg. (There are some new JSON format args that I've not yet added for nested types.)

I'm not sure. I don't know why it's not documented. IMO, unless there's good reason to document it now, it's better to leave it out of the table-format function, because often things are undocumented because they aren't meant to be supported parameters for end users.

6. I guess I disagree with the analysis that the functionality is not broken. If I set my `baseDir` to "/tmp", and my files to "x.csv", I expect Druid to read "/tmp/x.csv" not "/Users/paul/druid/x.csv". Not all local input files are relative to Druid's product directly. That said, thanks for fixing the issue. I'll rerun the tests with your fix and correct the wording.

Ah. I see what you mean. I don't think it's actually supposed to work that way, is the thing! Looking at the implementation of LocalInputSource, it seems like the intent is baseDir + filter are one thing, and files are another thing, and they aren't connected. That is, the intent is that relative-path files are interpreted relative to the Druid working directory, not the provided baseDir. Would definitely be good to clarify this in the documentation, since I don't think it's clear.

@paul-rogers
Copy link
Contributor Author

paul-rogers commented Dec 9, 2022

@gianm, to fix 4, how far do we want to go? I'm using they argument name "format" for the format. Should we change that to "inputFormat" to be consistent with existing code? Maybe not, since this particular field is not visible, except as an argument name to extern, where it has a different meaning?

I'll change all the other parameters to the JSON forms. We can see if we get user feedback about name lengths, and can add new names later if so. Else, we just add to the confusion because we have to document back in the native ingest pages that fields that used to be called "veryLongOldName" are now just called "name". Consistency seems more important than brevity.

For the HTTP input source, the JSON uses httpAuthenticationPassword for both the plain text and env var. In SQL, these have to be different. So, to keep the existing names, we use httpAuthenticationPassword for the plain text password and httpAuthenticationPasswordEnvVar for the "type": "environment" version. I hope users appreciate these long names...

Also, note that the table function for the local input source is called localFiles so it can be used without quotes. I can change it to local to be consistent with the type tag at the cost of the user having to quote the name:

-- Current, unquotes
SELECT ... FROM TABLE(localFiles(...))
-- Consistent, but requires quotes
SELECT ... FROM TABLE("local"(...))

@gianm
Copy link
Contributor

gianm commented Dec 10, 2022

@gianm, to fix 4, how far do we want to go?

That's a good question. My goal here with is avoiding a situation where someone used to the JSON API switches to SQL, tries to use a familiar parameter, finds it isn't accepted, then eventually figures out it's just spelled differently and wonders why it was necessary to change it. (Or, worse, doesn't figure that out, and assumes the SQL function doesn't support that parameter.) I believe this risk is at its highest when dealing with parameters to SQL functions that have an "equivalent" entity in the JSON API.

So, for the examples you brought up:

  • I don't think it's necessary to change format to inputFormat in EXTERN. The docs all talk about using EXTERN with positional arguments anyway, so it doesn't much matter.
  • I don't think LOCALFILES needs to be named LOCAL. In fact I prefer LOCALFILES for the reason you mentioned. The potential for confusion isn't as strong there: I don't think there's going to be an expectation that the JSON type codes for input formats will literally become SQL function names. Maybe someone out there would assume that, but I wouldn't.
  • For httpAuthenticationPassword, as you point out, it isn't possible to keep the existing name for equivalent functionality, due to inherent differences in SQL and JSON. (JSON is more flexible in accepting various different types for the same parameter.) So in this case I think it's fine to change it to anything that seems to make sense. My wish is mainly that we keep the naming consistent unless there is some technical reason to change it, which is the case here.

I hope users appreciate these long names...

They will if they are in a position where they're using both the JSON and SQL APIs: perhaps they're migrating from one to the other, or perhaps they are in a situation where some of their jobs are run with JSON and some with SQL. They won't appreciate it if they're using the SQL API only. That is a tradeoff we have to make.

A thought: is it possible for the SQL API to accept two spellings of the same parameter? (Or, to accept two parameters where it would be an error to specify both unless they matched.) In that case, we could solve for both desires by having SQL accept a short and long form. The short form would be preferred by users that exclusively use SQL; the long form would be there for people that have a foot in both SQL and JSON worlds.

@paul-rogers
Copy link
Contributor Author

@gianm, Got the array parameter thing working. Required an horrible amount of Calcite hacking, but at least it works. This would be a good contribution back to Calcite, once things stabilize. See example below.

We can accept parameter aliases, which is how I thought to address the long name issue, once things stabilize. It is a nuisance to introduce that extra bit of complexity while things are still in flux.

For HTTP, I changed the parameter names back to userName, password and passwordEnvVar. Reason: we said that httpAuthorizationPassword doesn't work for the two JSON cases, so we need new names. You suggested using shorter names when the names don't directly map to JSON, as is the case here. That gives password and passwordEnvVar. I could not bring myself to leave httpAuthorizationUsername(sic, with the lack of capitalization on "n"), so changed that touserName` as well. If folks get confused, we'll add the long name as an alias in the future.

For the local input source, I added the desired functionality to the catalog/table function layer. This allows a user to specify a base directory and format in the catalog, and a set of files in a table function which are relative to the base directory. The catalog code rewrites things into the awkward form required by the local input source itself. Example, using just a table function:

  -- Gives /tmp/a.csv and /tmp/b.csv
  localfiles(
             baseDir => '/tmp',
             files => ARRAY['a.csv', 'b.csv'],
             format => 'csv')

With yet more Calcite hacking, array valued arguments can be given as parameters:

  localfiles(files => ?, format => 'csv')

The actual list of files can be given as a List or array via a query parameter. This should simplify the user experience when an MSQ query is driven by a script: just set a query parameter rather than mucking about with the JSON inside of SQL in the body of the query.

@paul-rogers
Copy link
Contributor Author

In the next PR, extern() will work as @gianm suggested:

INSERT INTO ...
SELECT ...
FROM TABLE(extern(
  inputSource => '<input source>',
  inputFormat => '<input format>'))
  (a VARCHAR, b BIGINT, c DOUBLE, ...)

@clintropolis clintropolis added this to the 26.0 milestone Apr 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - SQL Release Notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants