
Allow LATEST_BY and EARLIEST_BY to accept timestamp arrays #14800

Closed

aedelbro opened this issue Aug 11, 2023 · 7 comments

Comments

@aedelbro

Description

Currently, LATEST_BY and EARLIEST_BY accept only one timestamp column. The proposal would allow these functions to accept multiple timestamp columns, to be used as tie-breakers.
For example: LATEST_BY("myValueColumn", ARRAY[__time, "mySecondTimeColumn"]). The second (or nth) time column in the array would be consulted only when all preceding time columns are equal.

Motivation

Our motivation is to use a second time column to determine which rows are "latest" or "earliest". If two rows have the exact same __time value, both are "latest" or "earliest" (depending on the data and query being used), and it is no longer deterministic which row's value is returned. In our application, the result "flip flops" between the different values, since which value "wins" is determined by Druid's internal processing.
Adding one or more secondary time columns would resolve ties on the primary time column.

Our application accepts events from customers, where the customers may provide the timestamp of the event. They are also allowed to provide secondary time columns. We want to be able to find the "first" and "last" value for a given timestamp. A simple example is a timestamp for when the data was received by our application, so that we can emulate an "update" to a timestamp without losing data when rolling up the source data.

Example

csv data:

__time,time2,value,dim
1,1,1,a
1,2,2,a
1,3,3,b
2,1,4,b
2,2,5,b
2,2,6,a
3,1,7,a
3,2,8,b
3,3,9,a

Query and expected result:

SELECT 
  TIME_FLOOR(__time, 'PT1H') as flooredTime,
  "dim",
  EARLIEST_BY("value", ARRAY[__time, "time2"]) as "earliest", 
  LATEST_BY("value", ARRAY[__time, "time2"]) as "latest" 
FROM "myDataSource" GROUP BY TIME_FLOOR(__time, 'PT1H'), "dim"

+--------------------------+-----+----------+--------+
|        flooredTime       | dim | earliest | latest |
+--------------------------+-----+----------+--------+
| 1970-01-01T00:00:00.000Z | a   |        1 |      9 |
| 1970-01-01T00:00:00.000Z | b   |        3 |      8 |
+--------------------------+-----+----------+--------+
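As a sanity check, the proposed tie-break semantics can be modeled in plain Python as lexicographic tuple comparison (this is an illustrative sketch, not Druid code; the TIME_FLOOR bucket is omitted because every sample row lands in the same hour):

```python
from collections import defaultdict

# Sample rows from the CSV above: (__time, time2, value, dim).
rows = [
    (1, 1, 1, "a"), (1, 2, 2, "a"), (1, 3, 3, "b"),
    (2, 1, 4, "b"), (2, 2, 5, "b"), (2, 2, 6, "a"),
    (3, 1, 7, "a"), (3, 2, 8, "b"), (3, 3, 9, "a"),
]

def earliest_latest_by(rows):
    """Group by dim; return the value of the row whose (__time, time2)
    tuple is smallest (earliest) or largest (latest). Python compares
    tuples lexicographically, which is exactly the proposed tie-break
    order: time2 is consulted only when __time is equal."""
    groups = defaultdict(list)
    for t1, t2, value, dim in rows:
        groups[dim].append(((t1, t2), value))
    return {
        dim: (min(g)[1], max(g)[1])  # (earliest, latest)
        for dim, g in groups.items()
    }

print(earliest_latest_by(rows))  # {'a': (1, 9), 'b': (3, 8)}
```

This reproduces the expected result table: earliest/latest of 1/9 for dim a and 3/8 for dim b.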
@LakshSingla
Contributor

FWIW, I think a crude way of breaking ties in Druid today would be to use GROUP BY and a nested query to deduplicate within the primary timestamp. This might look something like the following for your use case:

WITH 
mainTable AS (
  SELECT
    TIME_FLOOR(__time, 'PT1H') AS flooredTime,
    "dim",
    __time,
    time2,
    value
  FROM "myDataSource"
)

SELECT flooredTime, "dim", EARLIEST_BY("innerColumn", __time)
FROM (
  SELECT flooredTime, "dim", EARLIEST_BY("value", TIME_PARSE("time2")) AS innerColumn, __time
  FROM mainTable
  GROUP BY flooredTime, "dim", __time
)
GROUP BY flooredTime, "dim"
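For illustration, the two-level aggregation above can be modeled in Python over the sample data from the issue description (a sketch only, not Druid code; the TIME_FLOOR bucket is omitted because every sample row falls in the same hour):

```python
# Sample rows: (__time, time2, value, dim).
rows = [
    (1, 1, 1, "a"), (1, 2, 2, "a"), (1, 3, 3, "b"),
    (2, 1, 4, "b"), (2, 2, 5, "b"), (2, 2, 6, "a"),
    (3, 1, 7, "a"), (3, 2, 8, "b"), (3, 3, 9, "a"),
]

# Inner query: GROUP BY (dim, __time), EARLIEST_BY(value, time2) --
# deduplicate rows that share a primary timestamp using time2.
inner = {}
for t1, t2, value, dim in rows:
    key = (dim, t1)
    if key not in inner or t2 < inner[key][0]:
        inner[key] = (t2, value)

# Outer query: GROUP BY dim, EARLIEST_BY(innerColumn, __time) --
# then pick the earliest surviving row by the primary timestamp.
outer = {}
for (dim, t1), (_, value) in inner.items():
    if dim not in outer or t1 < outer[dim][0]:
        outer[dim] = (t1, value)

earliest = {dim: v for dim, (_, v) in outer.items()}
print(earliest)  # {'a': 1, 'b': 3}
```

This matches the "earliest" column of the expected result table in the issue description.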

@LakshSingla
Contributor

In case someone does pick up the task: I looked into the code, and there seem to be two places where the modifications would lie:

  1. Allowing the SQL function to accept timestamp arrays.
  2. Making the aggregator work with arrays. Currently it works correctly with any single type (even though SQL only allows timestamps, this can be bypassed by using native queries directly), but switching to an array seems to return an incorrect result.

@aedelbro
Author

@LakshSingla Thanks for the suggestion of using a nested query for deduplication within the primary timestamp.

Our current workaround is some bit-shifting magic to cram both timestamps into one (which is how we then discovered the bug here). A 64-bit long is just a handful of bits too small to keep full precision: we set the lower 3 bits of __time to zeros, shift it 22 bits left, right-shift the secondary time column by 16 bits, and add the two together. 😅
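For illustration, the packing scheme described above can be sketched in Python (a model of the scheme as described in the comment, not the actual application code). It shows the precision loss: the low 16 bits of the secondary timestamp (~65.5 s of resolution) are dropped, so nearby secondary timestamps can no longer break ties.

```python
def pack(primary_ms: int, secondary_ms: int) -> int:
    """Pack two millisecond timestamps into one value: zero the low
    3 bits of the primary, shift it left 22, then add the secondary
    shifted right 16. Note: an epoch-ms value >> 16 still needs ~25
    bits, more than the 22 free bits -- part of why a 64-bit long is
    'a handful of bits too small'."""
    return ((primary_ms & ~0b111) << 22) + (secondary_ms >> 16)

base = 1_691_712_000_000          # an example primary timestamp (ms)
t2a = 25_813_477 << 16            # a secondary timestamp (ms), on a 2^16 boundary
t2b = t2a + 65_535                # ~65.5 s later, same high bits

print(pack(base, t2a) == pack(base, t2b))  # True: the tie-break is lost
```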

Obviously, just having a function that accepts both columns is much cleaner 😄

@kgyrtkirk
Member

I wonder if you could try other approaches, because of your comment in the issue description:

The second (or nth) time column in the array would be used if the preceding time columns are equal.

If you want to check the nth column in case the previous ones are equal, aren't you looking for a function like
MAX(time, time2)?

select earliest_by(value, max(time,time2))
select latest_by(value, min(time,time2))

But since I don't see a non-aggregate min/max, you will need to pack this logic into a CASE statement. Have you tried something like this:

select
  EARLIEST_BY("value", case when "time" > "time2" then MILLIS_TO_TIMESTAMP("time") else MILLIS_TO_TIMESTAMP(time2) end),
  LATEST_BY("value", case when "time" < "time2" then MILLIS_TO_TIMESTAMP("time") else MILLIS_TO_TIMESTAMP(time2) end)
from "d1"

@aedelbro
Author

aedelbro commented Aug 23, 2023

@kgyrtkirk MIN and MAX are close but not quite the functionality needed.

In the simple case of these two rows:

+--------+-----+-------+-------+
| __time | id  | time2 | value |
+--------+-----+-------+-------+
|      5 | abc |     6 |    55 |
|      5 | def |     7 |    66 |
+--------+-----+-------+-------+

The desired LATEST_BY("value", ARRAY[__time, "time2"]) would be value 66 (id def), as the primary time column ties, it then falls back to time2 to decide what is "latest".

Now if I modify the data like this, you can see why MAX(__time, "time2") wouldn't work:

+--------+-----+-------+-------+
| __time | id  | time2 | value |
+--------+-----+-------+-------+
|     10 | abc |    11 |    55 |
|      5 | def |    12 |    66 |
+--------+-----+-------+-------+

Using MAX would still return 66, whereas 55 is the desired result: because 10 > 5, time2 should never be consulted at all (even though time2 is greater than __time for every row).
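The difference can be made concrete with a small Python sketch over the two rows above (illustrative only, not Druid code):

```python
# Rows from the example above: (__time, time2, value).
rows = [(10, 11, 55), (5, 12, 66)]

# MAX(__time, "time2") collapses each row to a single timestamp,
# so the row whose *larger* column wins, regardless of __time:
latest_by_max = max(rows, key=lambda r: max(r[0], r[1]))[2]

# The proposed ARRAY[__time, "time2"] compares __time first and only
# falls back to time2 on a tie (lexicographic tuple comparison):
latest_by_array = max(rows, key=lambda r: (r[0], r[1]))[2]

print(latest_by_max)    # 66 -- time2=12 beats time2=11
print(latest_by_array)  # 55 -- __time=10 beats __time=5 outright
```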


This issue has been marked as stale due to 280 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If this issue is still
relevant, please simply write any comment. Even if closed, you can still revive the
issue at any time or discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label May 30, 2024

This issue has been closed due to lack of activity. If you think that
is incorrect, or the issue requires additional review, you can revive the issue at
any time.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jun 27, 2024