[Lens] Add derivative function #61775

Closed
2 tasks done
timroes opened this issue Mar 30, 2020 · 14 comments
Labels
enhancement New value added to drive a business result Feature:Lens Project:LensDefault Team:Visualizations Visualization editors, elastic-charts and infrastructure

Comments

@timroes
Contributor

timroes commented Mar 30, 2020

Add a derivative pipeline aggregation to Lens. See #56696 for more discussion.
Tasks:

@timroes timroes added enhancement New value added to drive a business result Team:Visualizations Visualization editors, elastic-charts and infrastructure Feature:Lens labels Mar 30, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-app (Team:KibanaApp)

@wylieconlon
Contributor

wylieconlon commented Sep 4, 2020

For our purposes, a derivative is a function that subtracts the previous value from the current value in a date histogram, giving the difference between sequential buckets. Derivatives in this context are discrete (non-continuous) and may have gaps.

Because the date histogram has a duration in time, the derivative function supports scaling values to a specific time interval, such as "derivative per second". Derivative values can be positive or negative.
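
To illustrate the scaling idea, here is a minimal sketch, not the Lens implementation. `scaleDerivative`, `FixedUnit` and `UNIT_MS` are hypothetical names, and the month unit is left out on the assumption that only fixed-length units can be scaled with a simple ratio.

```typescript
// Hypothetical helper, not part of Lens: fixed-length units in milliseconds.
// 'M' (month) is left out because its length varies.
type FixedUnit = 'ms' | 's' | 'h' | 'd';

const UNIT_MS: Record<FixedUnit, number> = {
  ms: 1,
  s: 1000,
  h: 60 * 60 * 1000,
  d: 24 * 60 * 60 * 1000,
};

// Scale a per-bucket difference to a per-unit rate.
// bucketIntervalMs is the width of one date histogram bucket.
function scaleDerivative(diff: number, bucketIntervalMs: number, scaleTo: FixedUnit): number {
  const unitsPerBucket = bucketIntervalMs / UNIT_MS[scaleTo];
  return diff / unitsPerBucket;
}

// A difference of -9 over a 3 hour bucket is -3 per hour.
const perHour = scaleDerivative(-9, 3 * UNIT_MS.h, 'h');
console.log(perHour); // -3
```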

User inputs

The derivative function requires a group by columns parameter, but this can be automatically set by Lens. This makes the derivative function only have optional inputs. Optional inputs are:

  • Scaled time unit (per second, etc)
  • Policy for handling gaps: Skip or replace with zeros (this is separate from the fitting functions)

This leads to a function signature like:

interface DerivativeArgs {
  // Table is used to determine the time units
  table: KibanaDatatable;
  // Required list of group by columns. Don't group the time field
  groupByIds: string[];
  scaleTo?: 'ms' | 's' | 'h' | 'd' | 'M';
  gapPolicy?: 'skip' | 'insert_zeros';
}

type DerivativeFunction = (input: DerivativeArgs) => KibanaDatatable;

Form design

This form is missing a way to set a "gap policy", but is otherwise close:

Table example with gap skipping (default)

| timestamp per 3 hours | Count | Derivative | Derivative per hour |
|---|---|---|---|
| 2020-08-21 15:00 | 10 | - | - |
| 2020-08-21 18:00 | - | - | - |
| 2020-08-21 21:00 | 19 | - | - |
| 2020-08-22 00:00 | 22 | 3 | 1 |
| 2020-08-22 03:00 | 13 | -9 | -3 |

Table example with zeroes

| timestamp per 3 hours | Count | Derivative | Derivative per hour |
|---|---|---|---|
| 2020-08-21 15:00 | 10 | - | - |
| 2020-08-21 18:00 | - | -10 | -3.33 |
| 2020-08-21 21:00 | 19 | 19 | 6.33 |
| 2020-08-22 00:00 | 22 | 3 | 1 |
| 2020-08-22 03:00 | 13 | -9 | -3 |

As you can see in this example, the value goes negative if there is missing data. I find this behavior a little annoying, so I consider it up for debate whether it should go negative or return to 0 for missing data.
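
The two tables can be reproduced with a small sketch like the one below. It is only an illustration under stated assumptions: `differences` is a hypothetical name, an empty bucket is represented as `null`, and a `null` result stands for the "-" cells.

```typescript
type GapPolicy = 'skip' | 'insert_zeros';

// Compute bucket-to-bucket differences for one series.
// `null` in the input marks an empty bucket; `null` in the output means "no derivative".
function differences(
  values: Array<number | null>,
  gapPolicy: GapPolicy = 'skip'
): Array<number | null> {
  const result: Array<number | null> = [];
  let previous: number | null = null;
  for (let i = 0; i < values.length; i++) {
    const raw = values[i] ?? null;
    // With 'insert_zeros', an empty bucket is treated as a count of 0.
    const current = raw === null && gapPolicy === 'insert_zeros' ? 0 : raw;
    if (previous === null || current === null) {
      // First bucket, a gap, or the bucket right after a gap: nothing to subtract.
      result.push(null);
    } else {
      result.push(current - previous);
    }
    previous = current;
  }
  return result;
}

const counts = [10, null, 19, 22, 13];
console.log(differences(counts, 'skip'));         // [null, null, null, 3, -9]
console.log(differences(counts, 'insert_zeros')); // [null, -10, 19, 3, -9]
```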

Example visualizations

Derivatives can only be used in XY charts and data tables. They can't be rendered in pie charts because the values can go negative.

The simplest way to render a derivative is as a line chart. Derivatives can be calculated on a single line or as many lines indicating categories:

Derivative line chart


But because we want Lens to do more than the bare minimum, we should also consider the most frequent requests that users have. For example, a common request is to have "red and green" colors to indicate derivatives, with a black color to indicate the underlying values. Here's an example I did in TSVB which required a lot of manual setup. Can Lens make this easy?

Derivative red and green

Going even beyond this, @monfera has worked on examples of derivatives where the derivative is shown as a cumulative derivative, also known as a waterfall chart. This chart type also uses red and green coloring, and shows negative values in the context of the overall trendline. Another feature of waterfall charts is that we can apply them as annotations on top of bar charts.

Derivative waterfall without annotation
Derivative waterfall with annotation

Implementation notes

The derivative function should be implemented as part of the standard library of expression functions, instead of using the aggregation features of Elasticsearch. This gives us the ability to compose more functions on top of the derivative. For example, the "time scaling" feature might actually be implemented as a separate expression function, making derivative a combination of two expression functions.

I don't consider the red/green styling or waterfall charts to be requirements for shipping a derivative feature. When we choose to implement this feature, it should be done as a chart styling option that might be applied automatically for derivatives, but that can also be applied to any data that goes positive and negative.

Steps to implement:

  1. Not blocked by ongoing discussion about how tables will provide the time interval, because we can make forward progress by using one of these PRs as a hack
  2. Start implementing the underlying Lens dependencies:
       • Formatter to append "per second": [Lens] Support a formatter or format option which append "per second" "per minute" "per hour" #76714
       • Make changes to the way that operations are defined in the Lens datasource, so that they can output more than esaggs
  3. Write the expression function to do the table manipulation

Stretch goals are:

  • Implement red/green styling
  • Build a waterfall chart option

@wylieconlon wylieconlon changed the title [Lens] Add derivative pipeline aggregation [Lens] Add derivative function Sep 4, 2020
@wylieconlon
Contributor

@AlonaNadler Does the "gap policy" phrasing make sense as shown here? Do you have a better way to describe the use case for zeroing out the charts? We definitely need to implement this in code, but I noticed that we don't support this in TSVB or Visualize.

Also, I've listed some stretch goals of supporting red/green styling and waterfall charts, as shown above. Do you agree, or do you think these are required for Lens by default?

@AlonaNadler

Can we address the gap policy as part of the fitting function?
Users can then decide whether the missing values are zero.
If nothing is specified in the fitting function, the behavior should be similar to the TSVB default.

Solving the gap policy is not a high priority in my opinion.
The stretch goals seem great, especially the red/green styling, but they are not mandatory for Lens default.

@wylieconlon
Contributor

@AlonaNadler I agree that gap policy is not a high priority. If we decide to do it, it will be completely separate from the fitting function for technical reasons.

@wylieconlon
Contributor

The function signature I proposed earlier is not complete for several reasons, and this is an attempt to update the proposed signature.

  1. There were no parameters to indicate which number to derive from
  2. There were no parameters to generate a new column
  3. The time scaling parameters are no longer needed because we will implement a separate time scaling function

Generating a new column requires us to have a new column ID, human-readable name, and formatHint.

In total, I think this is the new interface for the derivative expression function:

interface DerivativeArgs {
  groupBy: string[];
  inputColumn: string;
  outputColumnId: string;
  outputColumnName: string;
  outputColumnSerializedFormat: string;
  gapPolicy?: 'skip' | 'insert_zeroes';
}

My confidence level in this signature is higher than before because I wrote an actual expression function with these arguments, but it could change again if we run into consistency issues with the other time series functions.
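
For illustration, a function with roughly these arguments could look like the sketch below. This is not the expression function mentioned above: `SketchTable` is a simplified stand-in for the real datatable type, `derivativeSketch` is a hypothetical name, the serialized format and gap policy arguments are left out, and rows are assumed to be sorted by time within each group.

```typescript
// Simplified stand-in for the datatable type; the real type has more fields.
interface SketchColumn {
  id: string;
  name: string;
}
type SketchRow = Record<string, string | number | null | undefined>;
interface SketchTable {
  columns: SketchColumn[];
  rows: SketchRow[];
}

interface SketchDerivativeArgs {
  groupBy: string[];
  inputColumn: string;
  outputColumnId: string;
  outputColumnName: string;
}

function derivativeSketch(table: SketchTable, args: SketchDerivativeArgs): SketchTable {
  // Last seen numeric value per series.
  const previousByGroup = new Map<string, number>();
  const rows: SketchRow[] = table.rows.map((row) => {
    // Rows sharing the same values in all groupBy columns form one series.
    const groupKey = JSON.stringify(args.groupBy.map((id) => row[id] ?? null));
    const current = row[args.inputColumn];
    const previous = previousByGroup.get(groupKey);
    let diff: number | undefined;
    if (typeof current === 'number') {
      if (previous !== undefined) {
        diff = current - previous;
      }
      previousByGroup.set(groupKey, current);
    } else {
      // Gap: forget the previous value so the next bucket is skipped too.
      previousByGroup.delete(groupKey);
    }
    return { ...row, [args.outputColumnId]: diff };
  });
  return {
    columns: [...table.columns, { id: args.outputColumnId, name: args.outputColumnName }],
    rows,
  };
}
```

Keeping one "previous value" per group key is what lets a single pass over the table handle many lines at once, for example one line per category.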

@flash1293
Contributor

This looks almost good to me, thanks for taking these things into account. While thinking a little about it I came up with some light additional touches (but I suspect we will continue iterating on this while actually implementing):

interface DerivativeArgs {
  groupBy?: string[];
  inputColumn: string;
  outputColumnId?: string;
  outputColumnName?: string;
  outputColumnSerializedFormat?: string;
  gapPolicy?: 'skip' | 'insert_zeroes';
}

(basically making output column and groupBy configuration optional)

The behavior would be as follows:

  • outputColumnId defaults to inputColumn
  • outputColumnName defaults to the name of inputColumn
  • outputColumnSerializedFormat defaults to the format of inputColumn
  • groupBy defaults to an empty array for cases where you don't have grouping columns. This is necessary because you can't set an expression argument to an empty array (unless I'm missing something)
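
The defaults listed above could be resolved in one place, as in this sketch. `resolveDerivativeArgs` and `InputColumnInfo` are hypothetical names, and the `'skip'` default for `gapPolicy` is an assumption taken from the "gap skipping (default)" table earlier in the thread.

```typescript
interface OptionalDerivativeArgs {
  groupBy?: string[];
  inputColumn: string;
  outputColumnId?: string;
  outputColumnName?: string;
  outputColumnSerializedFormat?: string;
  gapPolicy?: 'skip' | 'insert_zeroes';
}

// Hypothetical description of the input column, looked up from the table.
interface InputColumnInfo {
  name: string;
  serializedFormat: string;
}

// Fill in every optional argument so later code never has to check for undefined.
function resolveDerivativeArgs(
  args: OptionalDerivativeArgs,
  input: InputColumnInfo
): Required<OptionalDerivativeArgs> {
  return {
    groupBy: args.groupBy ?? [],
    inputColumn: args.inputColumn,
    outputColumnId: args.outputColumnId ?? args.inputColumn,
    outputColumnName: args.outputColumnName ?? input.name,
    outputColumnSerializedFormat: args.outputColumnSerializedFormat ?? input.serializedFormat,
    gapPolicy: args.gapPolicy ?? 'skip', // assumed default
  };
}
```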

@monfera
Contributor

monfera commented Sep 30, 2020

A naming suggestion, since names are important for UX and even DX: would it be possible to change the working term "derivative" to "differences" in the UI? I may be overlooking a good reason for calling it derivative. We don't have continuous functions, our binned time series aren't differentiable, and even if we disregard the lack of continuity, it's not some kind of tangent at a point, and not even a ratio of dx and dy (i.e. not angle related). It's just the dy, and it is a backward-looking measure, not centered or infinitesimally small.

"Derivative" has some looser meaning too (=stuff you compute from other things, derived information) but it's not an ideal fit either, eg. it's too specific for that.

@flash1293
Contributor

I guess we inherited this from Elasticsearch (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-derivative-aggregation.html), so one point in favor of "derivative" is familiarity for users used to the stack.

But we are not following Elasticsearch terminology in a lot of places so it's totally valid to rethink this.

cc @gchaps maybe you have another idea.

Naming is something we can iterate on separately from the functionality (as long as it's happening between releases of course)

@monfera
Contributor

monfera commented Sep 30, 2020

Calculation of differences: subtracting the value of the previous bin from the value of the current bin leads to an accumulation of minuscule errors, which may or may not matter(*). This could be decided upfront, as the implementation and runtime costs are the same.

If it matters, a robust way to compute deltas is to run through the series bin by bin and compute the difference between (A) the cumulative sum of the differences computed so far and (B) the current value. The new difference is the current bin value bins[N].value minus the cumulative sum (bins[0].delta + ... + bins[N-1].delta). This way, numerical errors do not accumulate, i.e. there's a known, very small upper bound on their sum over arbitrary intervals.
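
A minimal sketch of the two approaches, for comparison. The function names are hypothetical, and the sketch assumes that the first delta carries the starting value (a baseline of 0), so that the running sum of deltas reconstructs the series.

```typescript
// Plain approach: subtract the previous bin value from the current one.
function naiveDeltas(values: number[]): number[] {
  const deltas: number[] = [];
  let previous = 0; // assumed baseline before the first bin
  for (const value of values) {
    deltas.push(value - previous);
    previous = value;
  }
  return deltas;
}

// Approach described above: subtract the running sum of the deltas emitted so far.
function cumulativeDeltas(values: number[]): number[] {
  const deltas: number[] = [];
  let cumulative = 0; // sum of all deltas emitted so far
  for (const value of values) {
    const delta = value - cumulative;
    deltas.push(delta);
    cumulative += delta;
  }
  return deltas;
}

// Adding the deltas back up gives an approximation of the last bin value.
function reintegrate(deltas: number[]): number {
  return deltas.reduce((sum, delta) => sum + delta, 0);
}
```

With exactly representable inputs such as integer counts, both functions return the same deltas. They can only differ when floating-point rounding is involved, because the second variant folds any rounding error from earlier steps into the next delta.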

(*) It might matter when differences are eg. reintegrated downstream over an interval; with the robust method you know what epsilon to use when judging if the resulting number is (likely) a zero, or some very small positive or negative number.

It may also matter when eg. an ES payload is sent, for compression, in a delta-encoded way; eg. hourly temperature values won't wildly differ from each other, so it's more efficient to send serial deltas down the network for somewhat continuous phenomena, or when there are long stretches of unchanged values (delta=0, compresses well with RLE)
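
As a small illustration of the compression point, here is a sketch with made-up temperature readings. `deltaEncode` and `runLengthEncode` are hypothetical helpers and do not describe how Elasticsearch actually encodes payloads.

```typescript
// Keep the first value, then store each value as the difference from its predecessor.
function deltaEncode(values: number[]): number[] {
  const encoded: number[] = [];
  let previous = 0;
  for (const value of values) {
    encoded.push(value - previous);
    previous = value;
  }
  return encoded;
}

// Collapse repeated values into [value, runLength] pairs.
function runLengthEncode(values: number[]): Array<[number, number]> {
  const runs: Array<[number, number]> = [];
  for (const value of values) {
    const last = runs[runs.length - 1];
    if (last !== undefined && last[0] === value) {
      last[1] += 1;
    } else {
      runs.push([value, 1]);
    }
  }
  return runs;
}

// Made-up hourly temperatures with long unchanged stretches.
const temperatures = [20, 20, 20, 20, 21, 21, 21];
const encodedTemperatures = runLengthEncode(deltaEncode(temperatures));
console.log(encodedTemperatures); // [[20, 1], [0, 3], [1, 1], [0, 2]]
```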

Btw. an alternative to the name "differences" would be "deltas" (or singular forms, or variation eg. series delta)

@monfera
Contributor

monfera commented Sep 30, 2020

Thanks @flash1293 - some earlier discussions, e.g. with Raya and Vijay, touched on the tradeoffs of using the industry standard terminology versus the term used by Elasticsearch when they differ. I'm not sure whether the product design principles for Lens made it fall one way or the other, or whether it was somewhat accidental. For example, there may be some decision document that voted in favor of "bucket" instead of the more standard term "bin". Again, who knows, there may be a good reason for calling it differentiation besides momentum or accident.

@gchaps
Contributor

gchaps commented Sep 30, 2020

I lean toward using "difference" or "delta" because they are easier to understand at a glance.

@monfera
Contributor

monfera commented Oct 1, 2020

As we use the quite precise "cumulative sum" and not "integration" elsewhere, consistency is another argument for using differences, deltas, running deltas or some such here.

@flash1293 flash1293 added the loe:needs-research This issue requires some research before it can be worked on or estimated label Oct 2, 2020
@flash1293 flash1293 removed the loe:needs-research This issue requires some research before it can be worked on or estimated label Oct 12, 2020
@flash1293 flash1293 self-assigned this Oct 20, 2020
@flash1293 flash1293 removed their assignment Nov 16, 2020
@flash1293
Contributor

Closed by #84384

8 participants