
Delay reporting MSQ ingest success until segments are loaded #13770

Closed
paul-rogers opened this issue Feb 8, 2023 · 1 comment
Labels
Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262

paul-rogers commented Feb 8, 2023

Affected Version

26.0.0-SNAPSHOT

Description

I am creating a Jupyter notebook to illustrate how to use the new Druid catalog. As part of that task, I submit an MSQ ingestion task, wait for the Overlord to report task completion, then query the table. Each ingestion uses REPLACE and usually creates a new datasource.

When running queries, I occasionally (about 20% of the time) get an error saying that there is no such table. Yet, if I wait a few seconds, and try again, the query succeeds. The reason is clear: MSQ reported success as soon as ingestion is complete. It takes a while for the new segments to be loaded onto my one historical node. During that time, the Broker knows nothing about the new table.

To be very specific:

  • No segments for the target table exist.
  • Call /sql/task to submit an MSQ REPLACE query.
  • Poll Overlord waiting for the task to be marked as completed.
  • Immediately issue a /sql query against that same table.
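
The steps above can be sketched as a small client workflow. This is a hedged illustration, not code from the issue: the Router endpoint, datasource, and SQL are placeholders, and it assumes the standard `/druid/v2/sql/task`, `/druid/indexer/v1/task/{id}/status`, and `/druid/v2/sql` APIs.

```python
import json
import time
import urllib.request

ROUTER = "http://localhost:8888"  # placeholder Router endpoint; adjust for your cluster
TERMINAL_STATES = {"SUCCESS", "FAILED"}

def is_terminal(status: str) -> bool:
    """True once the Overlord reports a terminal task state."""
    return status in TERMINAL_STATES

def _post_json(url: str, payload: dict) -> dict:
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def ingest_then_query(replace_sql: str, select_sql: str) -> dict:
    # 1. Submit the MSQ REPLACE query via /druid/v2/sql/task.
    task_id = _post_json(f"{ROUTER}/druid/v2/sql/task",
                         {"query": replace_sql})["taskId"]
    # 2. Poll the Overlord until the task reaches a terminal state.
    while True:
        with urllib.request.urlopen(
                f"{ROUTER}/druid/indexer/v1/task/{task_id}/status") as resp:
            status = json.load(resp)["status"]["status"]
        if is_terminal(status):
            break
        time.sleep(2)
    # 3. Immediately query the same table. This is the race: the SELECT
    #    can fail with "no such table" until the new segments are loaded
    #    onto a Historical and announced to the Broker.
    return _post_json(f"{ROUTER}/druid/v2/sql", {"query": select_sql})
```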

This creates a race condition. Druid reports that the ingest is done, but it is not really done. The client has to be smart enough to know that the resulting query error is due to a race condition, not to one of possibly many other problems. This puts the burden on the client. Or, in my case, I have to add extra verbiage that says "if this query fails, wait a while and try again", which doesn't scream "easy to use."

The MSQ ITs (and now the Jupyter notebook) use a two-part wait loop: first wait for segment load, then wait for a simple SQL query to succeed. This approach works, but means that each client (the Druid console, the Jupyter notebook, custom clients) must all discover the issue, discover the workaround, and code up the workaround every place that an MSQ query is run followed by a SELECT query. Again, this is not "easy to use."
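
The two-part wait loop described above might look like the following sketch. It is an assumption-laden illustration, not the actual IT or notebook code: endpoints are placeholders, and it assumes the Coordinator's `/druid/coordinator/v1/datasources/{name}/loadstatus` API, which reports the percentage of a datasource's segments that are loaded.

```python
import json
import time
import urllib.error
import urllib.request

COORDINATOR = "http://localhost:8081"  # placeholder Coordinator endpoint
ROUTER = "http://localhost:8888"       # placeholder Router endpoint

def fully_loaded(load_status: dict, datasource: str) -> bool:
    """True when the Coordinator reports 100% of segments as available."""
    return load_status.get(datasource, 0.0) >= 100.0

def wait_until_queryable(datasource: str, probe_sql: str,
                         poll_secs: float = 2.0) -> None:
    # Phase 1: wait for the Coordinator to report the datasource's
    # segments as fully loaded.
    url = (f"{COORDINATOR}/druid/coordinator/v1/datasources/"
           f"{datasource}/loadstatus")
    while True:
        with urllib.request.urlopen(url) as resp:
            if fully_loaded(json.load(resp), datasource):
                break
        time.sleep(poll_secs)
    # Phase 2: wait for a trivial SQL query to actually succeed,
    # confirming the Broker can see the new table.
    body = json.dumps({"query": probe_sql}).encode()
    while True:
        req = urllib.request.Request(
            f"{ROUTER}/druid/v2/sql", data=body,
            headers={"Content-Type": "application/json"})
        try:
            urllib.request.urlopen(req).close()
            return
        except urllib.error.HTTPError:
            time.sleep(poll_secs)
```

Both phases are needed because segment load (phase 1) and Broker metadata refresh (phase 2) complete at different times, and every client must currently reimplement this dance.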

The ask is for MSQ to wait for segments to be loaded before declaring completion. That way, a client that waits for MSQ task completion can be assured that, when the task is complete, the table is actually ready to be queried. If we don't feel that such a check is generally useful, then provide an option to do the wait when requested (say, with a query context parameter).

@paul-rogers paul-rogers added the Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 label Feb 8, 2023
@LakshSingla (Contributor) commented:
I think #14322 completes all the requirements mentioned in this issue. cc: @adarshsanjeev
