-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow LATEST_BY and EARLIEST_BY to accept timestamp arrays #14800
Comments
FWIW, I think a crude way of breaking the tie-breakers currently in Druid would be to use GroupBy and nested query for deduplication b/w the primary timestamp. This might look something like the following for your use case: WITH
mainTable AS (
SELECT TIME_FLOOR(__time, 'PT1H') as flooredTime,
"dim",
__time,
time2,
value
)
SELECT floored_time, "dim", EARLIEST_BY("innerColumn", __time)
FROM (
SELECT floored_time, "dim", EARLIEST_BY("value", TIME_PARSE("time2")) AS innerColumn, __time
FROM mainTable
GROUP BY floored_time, "dim", __time
)
GROUP BY floored_time, "dim" |
In case someone does pick up the task, I looked into the code, and there seem to be 2 places where the modifications would lie:
|
@LakshSingla Thanks for the suggestion about doing the nested query for deduplication b/w the primary timestamp. Our current work around is doing some bitshifting magic to cram both timestamps into 1 (which we then discovered the bug here). The 64bit long is just a handful of bits too small to keep full precision. We set the lower 3 bits of Obviously just having a function to provided both columns is much cleaner 😄 |
I wonder if you could try other approaches because of your comment in the issue description:
if you want to check the select earliest_by(value, max(time,time2))
select latest_by(value, min(time,time2)) but since I don't see non-aggregate min/max; you will need to pack this logic into a select
EARLIEST_BY("value", case when "time" > "time2" then MILLIS_TO_TIMESTAMP("time") else MILLIS_TO_TIMESTAMP(time2) end),
LATEST_BY("value", case when "time" < "time2" then MILLIS_TO_TIMESTAMP("time") else MILLIS_TO_TIMESTAMP(time2) end)
from "d1" |
@kgyrtkirk In the simple case of these two rows:
The desired Now if I modify the data to this, you can see why
Using |
This issue has been marked as stale due to 280 days of inactivity. |
This issue has been closed due to lack of activity. If you think that |
Description
Currently,
LATEST_BY
andEARLIEST_BY
accept only 1 timestamp column. The proposed would allow the function to accept multiple timestamp columns, to be used as tie-breakers.for example:
LATEST_BY("myValueColumn", ARRAY[__time, "mySecondTimeColumn"])
. The second (ornth
) time column in the array would be used if the proceeding time columns are equal.Motivation
Our motivation is to use a second time column to determine which rows are "latest" or "earliest". If you have 2 rows with the exact same
__time
value, both rows are "latest" or "earliest" (depending on the data and query being used). It is no longer deterministic which row's value is returned. In our application, it "flip flops" between the different values, as which value "wins" is determined by druid's internal processing.Adding a second (or more than 1) time column to use would resolve ties on the primary time column.
Our application accepts events from customers where the customers may provide the timestamp of the event. They are also allowed to provide secondary time columns. We want to be able to find the "first" and "last" value for a given timestamp. A simple example being a timestamp for when the data was received by our application, so that we can emulate an "update" to a timestamp without losing data by rolling up the source data.
Example
csv data:
Query and expected result:
The text was updated successfully, but these errors were encountered: