
Delay reporting MSQ ingest success until segments are loaded #13770

Closed
paul-rogers opened this issue Feb 8, 2023 · 1 comment
Labels
Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262

paul-rogers commented Feb 8, 2023

Affected Version

26.0.0-SNAPSHOT

Description

I am creating a Jupyter notebook to illustrate how to use the new Druid catalog. As part of that task, I submit an MSQ ingestion task, wait for the Overlord to report task completion, then query the table. Each ingestion uses REPLACE and usually creates a new datasource.

When running queries, I occasionally (about 20% of the time) get an error saying that there is no such table. Yet, if I wait a few seconds, and try again, the query succeeds. The reason is clear: MSQ reported success as soon as ingestion is complete. It takes a while for the new segments to be loaded onto my one historical node. During that time, the Broker knows nothing about the new table.

To be very specific:

  • No segments for the target table exist.
  • Call /sql/task to submit an MSQ REPLACE query.
  • Poll Overlord waiting for the task to be marked as completed.
  • Immediately issue a /sql query against that same table.
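
The steps above can be sketched as a small client workflow. This is a hedged illustration, not code from the issue: the Router endpoint, datasource, and SQL are placeholders, and it assumes the standard `/druid/v2/sql/task`, `/druid/indexer/v1/task/{id}/status`, and `/druid/v2/sql` APIs.

```python
import json
import time
import urllib.request

ROUTER = "http://localhost:8888"  # placeholder Router endpoint; adjust for your cluster
TERMINAL_STATES = {"SUCCESS", "FAILED"}

def is_terminal(status: str) -> bool:
    """True once the Overlord reports a terminal task state."""
    return status in TERMINAL_STATES

def _post_json(url: str, payload: dict) -> dict:
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def ingest_then_query(replace_sql: str, select_sql: str) -> dict:
    # 1. Submit the MSQ REPLACE query via /druid/v2/sql/task.
    task_id = _post_json(f"{ROUTER}/druid/v2/sql/task",
                         {"query": replace_sql})["taskId"]
    # 2. Poll the Overlord until the task reaches a terminal state.
    while True:
        with urllib.request.urlopen(
                f"{ROUTER}/druid/indexer/v1/task/{task_id}/status") as resp:
            status = json.load(resp)["status"]["status"]
        if is_terminal(status):
            break
        time.sleep(2)
    # 3. Immediately query the same table. This is the race: the SELECT
    #    can fail with "no such table" until the new segments are loaded
    #    onto a Historical and announced to the Broker.
    return _post_json(f"{ROUTER}/druid/v2/sql", {"query": select_sql})
```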

This creates a race condition. Druid reports that the ingest is done, but it is not really done. The client has to be smart enough to know that the resulting query error is due to a race condition, not to one of possibly many other problems. This puts the burden on the client. Or, in my case, I have to add extra verbiage that says "if this query fails, wait a while and try again", which doesn't scream "easy to use."

The MSQ ITs (and now the Jupyter notebook) use a two-part wait loop: first wait for segment load, then wait for a simple SQL query to succeed. This approach works, but means that each client (the Druid console, the Jupyter notebook, custom clients) must all discover the issue, discover the workaround, and code up the workaround every place that an MSQ query is run followed by a SELECT query. Again, this is not "easy to use."
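
The two-part wait loop described above might look like the following sketch. It is an assumption-laden illustration, not the actual IT or notebook code: endpoints are placeholders, and it assumes the Coordinator's `/druid/coordinator/v1/datasources/{name}/loadstatus` API, which reports the percentage of a datasource's segments that are loaded.

```python
import json
import time
import urllib.error
import urllib.request

COORDINATOR = "http://localhost:8081"  # placeholder Coordinator endpoint
ROUTER = "http://localhost:8888"       # placeholder Router endpoint

def fully_loaded(load_status: dict, datasource: str) -> bool:
    """True when the Coordinator reports 100% of segments as available."""
    return load_status.get(datasource, 0.0) >= 100.0

def wait_until_queryable(datasource: str, probe_sql: str,
                         poll_secs: float = 2.0) -> None:
    # Phase 1: wait for the Coordinator to report the datasource's
    # segments as fully loaded.
    url = (f"{COORDINATOR}/druid/coordinator/v1/datasources/"
           f"{datasource}/loadstatus")
    while True:
        with urllib.request.urlopen(url) as resp:
            if fully_loaded(json.load(resp), datasource):
                break
        time.sleep(poll_secs)
    # Phase 2: wait for a trivial SQL query to actually succeed,
    # confirming the Broker can see the new table.
    body = json.dumps({"query": probe_sql}).encode()
    while True:
        req = urllib.request.Request(
            f"{ROUTER}/druid/v2/sql", data=body,
            headers={"Content-Type": "application/json"})
        try:
            urllib.request.urlopen(req).close()
            return
        except urllib.error.HTTPError:
            time.sleep(poll_secs)
```

Both phases are needed because segment load (phase 1) and Broker metadata refresh (phase 2) complete at different times, and every client must currently reimplement this dance.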

The ask is for MSQ to wait for segments to be loaded before declaring completion. That way, a client that waits for MSQ task completion can be assured that, when the task is complete, the table is actually ready to be queried. If we don't feel that such a check is generally useful, then provide an option to do the wait when requested (say, with a query context parameter).

@paul-rogers paul-rogers added the Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 label Feb 8, 2023
@LakshSingla (Contributor) commented:
I think #14322 completes all the requirements mentioned in this issue. cc: @adarshsanjeev
