
Allow LATEST_BY and EARLIEST_BY to accept timestamp arrays #14800

Closed

aedelbro opened this issue Aug 11, 2023 · 7 comments

Comments

@aedelbro

Description

Currently, LATEST_BY and EARLIEST_BY accept only one timestamp column. The proposal would allow these functions to accept multiple timestamp columns, to be used as tie-breakers.
For example: LATEST_BY("myValueColumn", ARRAY[__time, "mySecondTimeColumn"]). The second (or nth) time column in the array would be consulted only when all preceding time columns are equal.

Motivation

Our motivation is to use a second time column to determine which rows are "latest" or "earliest". If two rows have the exact same __time value, both are "latest" or "earliest" (depending on the data and query being used), and it is no longer deterministic which row's value is returned. In our application, the result "flip flops" between the different values, since which value "wins" is determined by Druid's internal processing.
Adding one or more secondary time columns would resolve ties on the primary time column.

Our application accepts events from customers, where the customers may provide the timestamp of the event. They are also allowed to provide secondary time columns. We want to be able to find the "first" and "last" value for a given timestamp. A simple example is a timestamp for when the data was received by our application, so that we can emulate an "update" to a timestamp without losing data when rolling up the source data.

Example

csv data:

__time,time2,value,dim
1,1,1,a
1,2,2,a
1,3,3,b
2,1,4,b
2,2,5,b
2,2,6,a
3,1,7,a
3,2,8,b
3,3,9,a

Query and expected result:

SELECT 
  TIME_FLOOR(__time, 'PT1H') as flooredTime,
  "dim",
  EARLIEST_BY("value", ARRAY[__time, "time2"]) as "earliest", 
  LATEST_BY("value", ARRAY[__time, "time2"]) as "latest" 
FROM "myDataSource" GROUP BY TIME_FLOOR(__time, 'PT1H'), "dim"

+--------------------------+-----+----------+--------+
|        flooredTime       | dim | earliest | latest |
+--------------------------+-----+----------+--------+
| 1970-01-01T00:00:00.000Z | a   |        1 |      9 |
| 1970-01-01T00:00:00.000Z | b   |        3 |      8 |
+--------------------------+-----+----------+--------+
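As a sanity check, the proposed tie-break semantics can be modeled in plain Python as lexicographic tuple comparison (this is an illustrative sketch, not Druid code; the TIME_FLOOR bucket is omitted because every sample row lands in the same hour):

```python
from collections import defaultdict

# Sample rows from the CSV above: (__time, time2, value, dim).
rows = [
    (1, 1, 1, "a"), (1, 2, 2, "a"), (1, 3, 3, "b"),
    (2, 1, 4, "b"), (2, 2, 5, "b"), (2, 2, 6, "a"),
    (3, 1, 7, "a"), (3, 2, 8, "b"), (3, 3, 9, "a"),
]

def earliest_latest_by(rows):
    """Group by dim; return the value of the row whose (__time, time2)
    tuple is smallest (earliest) or largest (latest). Python compares
    tuples lexicographically, which is exactly the proposed tie-break
    order: time2 is consulted only when __time is equal."""
    groups = defaultdict(list)
    for t1, t2, value, dim in rows:
        groups[dim].append(((t1, t2), value))
    return {
        dim: (min(g)[1], max(g)[1])  # (earliest, latest)
        for dim, g in groups.items()
    }

print(earliest_latest_by(rows))  # {'a': (1, 9), 'b': (3, 8)}
```

This reproduces the expected result table: earliest/latest of 1/9 for dim a and 3/8 for dim b.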
@LakshSingla
Contributor

FWIW, I think a crude way of breaking ties in Druid today would be to use GROUP BY and a nested query to deduplicate within the primary timestamp. This might look something like the following for your use case:

WITH 
mainTable AS (
  SELECT
    TIME_FLOOR(__time, 'PT1H') AS flooredTime,
    "dim",
    __time,
    time2,
    value
  FROM "myDataSource"
)

SELECT flooredTime, "dim", EARLIEST_BY("innerColumn", __time)
FROM (
  SELECT flooredTime, "dim", EARLIEST_BY("value", TIME_PARSE("time2")) AS innerColumn, __time
  FROM mainTable
  GROUP BY flooredTime, "dim", __time
)
GROUP BY flooredTime, "dim"
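For illustration, the two-level aggregation above can be modeled in Python over the sample data from the issue description (a sketch only, not Druid code; the TIME_FLOOR bucket is omitted because every sample row falls in the same hour):

```python
# Sample rows: (__time, time2, value, dim).
rows = [
    (1, 1, 1, "a"), (1, 2, 2, "a"), (1, 3, 3, "b"),
    (2, 1, 4, "b"), (2, 2, 5, "b"), (2, 2, 6, "a"),
    (3, 1, 7, "a"), (3, 2, 8, "b"), (3, 3, 9, "a"),
]

# Inner query: GROUP BY (dim, __time), EARLIEST_BY(value, time2) --
# deduplicate rows that share a primary timestamp using time2.
inner = {}
for t1, t2, value, dim in rows:
    key = (dim, t1)
    if key not in inner or t2 < inner[key][0]:
        inner[key] = (t2, value)

# Outer query: GROUP BY dim, EARLIEST_BY(innerColumn, __time) --
# then pick the earliest surviving row by the primary timestamp.
outer = {}
for (dim, t1), (_, value) in inner.items():
    if dim not in outer or t1 < outer[dim][0]:
        outer[dim] = (t1, value)

earliest = {dim: v for dim, (_, v) in outer.items()}
print(earliest)  # {'a': 1, 'b': 3}
```

This matches the "earliest" column of the expected result table in the issue description.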

@LakshSingla
Contributor

In case someone does pick up the task: I looked into the code, and there seem to be two places where the modifications would lie:

  1. Allowing the SQL function to accept timestamp arrays.
  2. Making the aggregator work with arrays. Currently it works correctly with any single type (even though SQL only allows timestamps, this can be bypassed by using native queries directly), but switching to an array seems to return an incorrect result.

@aedelbro
Author

@LakshSingla Thanks for the suggestion of using a nested query for deduplication within the primary timestamp.

Our current workaround is some bit-shifting magic to cram both timestamps into one (which is how we then discovered the bug here). A 64-bit long is just a handful of bits too small to keep full precision: we set the lower 3 bits of __time to zeros, shift it 22 bits left, right-shift the secondary time column by 16 bits, and add the two together. 😅
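For illustration, the packing scheme described above can be sketched in Python (a model of the scheme as described in the comment, not the actual application code). It shows the precision loss: the low 16 bits of the secondary timestamp (~65.5 s of resolution) are dropped, so nearby secondary timestamps can no longer break ties.

```python
def pack(primary_ms: int, secondary_ms: int) -> int:
    """Pack two millisecond timestamps into one value: zero the low
    3 bits of the primary, shift it left 22, then add the secondary
    shifted right 16. Note: an epoch-ms value >> 16 still needs ~25
    bits, more than the 22 free bits -- part of why a 64-bit long is
    'a handful of bits too small'."""
    return ((primary_ms & ~0b111) << 22) + (secondary_ms >> 16)

base = 1_691_712_000_000          # an example primary timestamp (ms)
t2a = 25_813_477 << 16            # a secondary timestamp (ms), on a 2^16 boundary
t2b = t2a + 65_535                # ~65.5 s later, same high bits

print(pack(base, t2a) == pack(base, t2b))  # True: the tie-break is lost
```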

Obviously, just having a function that accepts both columns is much cleaner 😄

@kgyrtkirk
Member

I wonder if you could try other approaches, because of your comment in the issue description:

The second (or nth) time column in the array would be used if the preceding time columns are equal.

If you want to check the nth column in case the previous ones are equal, aren't you looking for a function like
MAX(time, time2)?

select earliest_by(value, max(time,time2))
select latest_by(value, min(time,time2))

But since I don't see a non-aggregate min/max, you will need to pack this logic into a CASE statement. Have you tried something like this:

select
  EARLIEST_BY("value", case when "time" > "time2" then MILLIS_TO_TIMESTAMP("time") else MILLIS_TO_TIMESTAMP(time2) end),
  LATEST_BY("value", case when "time" < "time2" then MILLIS_TO_TIMESTAMP("time") else MILLIS_TO_TIMESTAMP(time2) end)
from "d1"

@aedelbro
Author

aedelbro commented Aug 23, 2023

@kgyrtkirk MIN and MAX are close but not quite the functionality needed.

In the simple case of these two rows:

+--------+-----+-------+-------+
| __time | id  | time2 | value |
+--------+-----+-------+-------+
|      5 | abc |     6 |    55 |
|      5 | def |     7 |    66 |
+--------+-----+-------+-------+

The desired LATEST_BY("value", ARRAY[__time, "time2"]) would be value 66 (id def), as the primary time column ties, it then falls back to time2 to decide what is "latest".

Now if I modify the data like this, you can see why MAX(__time, "time2") wouldn't work:

+--------+-----+-------+-------+
| __time | id  | time2 | value |
+--------+-----+-------+-------+
|     10 | abc |    11 |    55 |
|      5 | def |    12 |    66 |
+--------+-----+-------+-------+

Using MAX would still return 66, whereas 55 is the desired result: because 10 > 5, time2 should never be consulted at all (even though time2 is greater than __time for every row).
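The difference can be made concrete with a small Python sketch over the two rows above (illustrative only, not Druid code):

```python
# Rows from the example above: (__time, time2, value).
rows = [(10, 11, 55), (5, 12, 66)]

# MAX(__time, "time2") collapses each row to a single timestamp,
# so the row whose *larger* column wins, regardless of __time:
latest_by_max = max(rows, key=lambda r: max(r[0], r[1]))[2]

# The proposed ARRAY[__time, "time2"] compares __time first and only
# falls back to time2 on a tie (lexicographic tuple comparison):
latest_by_array = max(rows, key=lambda r: (r[0], r[1]))[2]

print(latest_by_max)    # 66 -- time2=12 beats time2=11
print(latest_by_array)  # 55 -- __time=10 beats __time=5 outright
```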


This issue has been marked as stale due to 280 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If this issue is still
relevant, please simply write any comment. Even if closed, you can still revive the
issue at any time or discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label May 30, 2024

This issue has been closed due to lack of activity. If you think that
is incorrect, or the issue requires additional review, you can revive the issue at
any time.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jun 27, 2024