Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Long/Float dimensions with generic dimension handling interface #2621

Closed
wants to merge 1 commit into from

Conversation

jon-wei
Copy link
Contributor

@jon-wei jon-wei commented Mar 9, 2016

This PR adds support for Long and Float numeric dimensions using a new dimension handling interface.

It depends on the following two PRs:

druid-io/druid-api#75
#2607

Implementation notes:

  • At a high level, type-specific handling for dimensions in the IncrementalIndex, IndexMerger, and querying paths has been extracted into a new interface. There are 4 associated classes with the new interface:
    • DimensionHandler: Top level class, provides comparators and used to create the other 3 classes
    • DimensionIndexer: Handles logic related to in-memory ingestion of rows (IncrementalIndex)
    • DimensionMerger: Handles logic related to merging of segments (IndexMerger)
    • DimensionColumnReader: Handles logic related to reading from QueryableIndex/Columns
  • There are String/Long/Float implementations of these classes.
  • Dictionary encoding is not used for Long and Float dimensions. No index structures are currently created for numeric dimensions. Numeric dims will use the existing long/float column formats on disk.
  • The DimensionSelector interface now has additional functions for retrieving rows from numeric dimension columns. There are separate functions for long/float primitives, to avoid primitive boxing/unboxing. Benchmark runs showed a fairly significant increase in processing time when using a Comparable instead of IndexedInts directly.
  • As there are no bitmap indexes on numeric dims, a full scan will be performed on numeric dims for filtering logic that uses filter.getBitmapIndex()
  • To support the requirement above, ColumnSelectorBitmapIndexSelector now supports bitmap generation from the result of a column scan
  • SelectorFilter can do direct numeric comparisons on Long/Float columns, no string conversion needed
  • Long and Float values will be converted to Strings for any filters or extraction functions that require String inputs.
  • null values for Long and Float dims will be converted to 0L or 0.0F, respectively.
  • Three benchmarks have been added:
    • IncrementalIndexAddRowsBenchmark: adds rows with different value types to IncrementalIndex
    • IndexedIntsWrapping: evaluates performance of wrapping/unboxing of ints
    • QueryableIndexLoadingBenchmark: evaluates performance of reading from a QueryableIndex with/without integer boxing, using DimensionSelectors

Unit test notes:

  • The druid.sample.tsv file has been modified to include Long and Float dimension columns (market_long, market_float, quality_long, quality_float). The original druid.sample.tsv file has been kept in the repo.
  • The numeric values for these new dimensions have a 1-to-1 mapping with the values in 'market' and 'quality'
  • For larger test suites like GroupByQueryRunnerTest and TopN, I have parameterized the use of the "quality" and "market" dimensions; in different test iterations, the tests will use the long or float versions of these dimension columns.
  • Smaller test suites have separate functions for testing numeric dimensions.

Example benchmark run showing difference between using Comparable (loadQIndexWrap) and IndexedInts (loadQIndex)

64 dim, 20000 cardinality
Benchmark                                       Mode  Cnt  Score   Error  Units
QueryableIndexLoadingBenchmark.loadQIndexWrap  avgt  100  1.414 ± 0.016  us/op
QueryableIndexLoadingBenchmark.loadQIndex       avgt  100  1.127 ± 0.009  us/op

Benchmark                                       Mode  Cnt  Score   Error  Units
QueryableIndexLoadingBenchmark.loadQIndexWrap  avgt  100  1.405 ± 0.008  us/op
QueryableIndexLoadingBenchmark.loadQIndex       avgt  100  1.166 ± 0.011  us/op

@jon-wei jon-wei added this to the 0.9.1 milestone Mar 9, 2016
@jon-wei jon-wei force-pushed the rebased_flex_dims2 branch 2 times, most recently from efe0a4f to f9b4da5 Compare March 15, 2016 21:26
@@ -0,0 +1,382 @@
package io.druid.benchmark;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

header

public int size()
{
// only single-value numerics are supported
return 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where does the code check for this constraint?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fjy Added a size() > 1 check to LongDimensionIndexer and FloatDimensionIndexer, they'll throw an exception now if a multi-value row is passed to them

theEvent.put(dim, dimVals);
Object dimVals;
ValueType type = capabilities == null ? ValueType.STRING : capabilities.getType();
//switch(capabilities.getType()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

@jon-wei jon-wei force-pushed the rebased_flex_dims2 branch 3 times, most recently from 56da4c9 to 9f0c08f Compare March 17, 2016 01:07
@jon-wei
Copy link
Contributor Author

jon-wei commented Mar 17, 2016

@fjy added headers and removed dead code

@fjy
Copy link
Contributor

fjy commented Mar 17, 2016

@jon-wei merge conflicts

@jon-wei
Copy link
Contributor Author

jon-wei commented Mar 17, 2016

@fjy I'll do a rebase after #2607 is merged, so I can get rid of that as a dependent commit, it should cut down on the # of files that appear here

@jon-wei jon-wei force-pushed the rebased_flex_dims2 branch 2 times, most recently from db4a3c9 to 49530dd Compare March 24, 2016 02:41
@jon-wei
Copy link
Contributor Author

jon-wei commented Mar 24, 2016

Rebased

@jon-wei
Copy link
Contributor Author

jon-wei commented Mar 26, 2016

I will try to reduce the size of this PR by splitting out the generic dimension interface changes, with support for only String dimensions.

Long/Float support will then be a separate follow-on PR that uses the generic dimension interface.

@jon-wei
Copy link
Contributor Author

jon-wei commented Mar 30, 2016

Closing this PR, I've moved the new dimension interface into a separate PR:
#2760

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants