Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add generic dimension handling interface #2760

Closed
wants to merge 7 commits into from

Conversation

jon-wei
Copy link
Contributor

@jon-wei jon-wei commented Mar 30, 2016

This PR is a trimmed down version of #2621 that introduces a generic dimension handling infrastructure used for Long/Float dimension handling. However, only String dimensions are currently supported, to keep the patch at a reviewable size. Actual support for Longs/Floats will come in a later PR.

At a high level, type-specific handling for dimensions in the IncrementalIndex, IndexMerger, and querying paths has been extracted into a new interface. There are 4 associated classes with the new interface:

  • DimensionHandler: Top level class, provides comparators and used to create the other 3 classes
  • DimensionIndexer: Handles logic related to in-memory ingestion of rows (IncrementalIndex)
  • DimensionMerger: Handles logic related to merging of segments (IndexMerger)
  • DimensionColumnReader: Handles logic related to reading from QueryableIndex/Columns
    The javadoc comments within those classes provide further documentation on their usage/design.

StringDimensionHandler contains the implementations of these classes for String dimensions.

This patch also introduces query context "typeHints" for use by the GroupBy and TopN queries. The typehints are map of dimension name -> dimension ValueType. They are used to handle situations where a GroupBy or TopN has to merge results from two segments where the same dimension has two different types. The typeHints specify the type to merge the dimension values as (e.g. if a String and Float dimension with the same name are being merged, the typeHints will control whether the Strings are converted to Floats for merging, or vice versa).

The patch also changes the query engines to prepare them for accepting Long/Float dimensions later on.

Benchmark results, using the benchmarks introduced in #2875

EC2, c4.8xlarge, all tests use 100k rows/segment, 25 warmup iterations, 50 iterations

with dimension interface

IncrementalIndexReadBenchmark.read  avgt   50  336583.509 ± 7534.788  us/op
IndexIngestionBenchmark.addRows  avgt   50  372124.122 ± 1868.009  us/op
IndexMergeBenchmark.merge    avgt   50  6510862.831 ±  570281.634  us/op
IndexMergeBenchmark.mergeV9  avgt   50  5984500.801 ± 407132.211  us/op
IndexPersistBenchmark.persist    avgt   50  2032628.564 ± 307633.219  us/op
IndexPersistBenchmark.persistV9  avgt   50  1563141.337 ± 419760.184  us/op

GroupByBenchmark.queryIncrementalIndex               avgt   50   279854.065 ±  1773.672  us/op
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50   331237.471 ±  2641.921  us/op
GroupByBenchmark.queryQueryableIndex                 avgt   50   209293.283 ±  2293.477  us/op
SearchBenchmark.queryIncrementalIndex                avgt   50   270488.942 ±  3490.482  us/op
SearchBenchmark.queryQueryableIndex                  avgt   50    19156.616 ±   220.211  us/op
SelectBenchmark.queryIncrementalIndex                avgt   50    54925.129 ±   753.863  us/op
SelectBenchmark.queryQueryableIndex                  avgt   50    82525.845 ±   448.819  us/op
TimeseriesBenchmark.queryIncrementalIndex            avgt   50  1484743.161 ± 16800.148  us/op
TimeseriesBenchmark.queryQueryableIndex              avgt   50   144074.131 ±   659.014  us/op
TopNBenchmark.queryIncrementalIndex                  avgt   50   153446.734 ±  3896.064  us/op
TopNBenchmark.queryQueryableIndex                    avgt   50    19949.470 ±   229.467  us/op

Master

IncrementalIndexReadBenchmark.read  avgt   50  337834.112 ± 8681.444  us/op
IndexIngestionBenchmark.addRows  avgt   50  369308.625 ± 1433.732  us/op
IndexMergeBenchmark.merge    avgt   50  6925118.764 ±  608238.491  us/op
IndexMergeBenchmark.mergeV9  avgt   50  6030957.421 ± 605258.191  us/op
IndexPersistBenchmark.persist    avgt   50  1913073.976 ± 294931.368  us/op
IndexPersistBenchmark.persistV9  avgt   50  1685776.164 ± 362693.979  us/op

GroupByBenchmark.queryIncrementalIndex               avgt   50   284695.292 ±  2361.383  us/op
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50   335850.517 ±  3017.035  us/op
GroupByBenchmark.queryQueryableIndex                 avgt   50   210372.953 ±  2339.907  us/op
SearchBenchmark.queryIncrementalIndex                avgt   50   275442.862 ±  6831.624  us/op
SearchBenchmark.queryQueryableIndex                  avgt   50    18897.948 ±   116.609  us/op
SelectBenchmark.queryIncrementalIndex                avgt   50    56354.015 ±   610.019  us/op
SelectBenchmark.queryQueryableIndex                  avgt   50    83975.301 ±   558.047  us/op
TimeseriesBenchmark.queryIncrementalIndex            avgt   50  1513934.307 ± 18005.737  us/op
TimeseriesBenchmark.queryQueryableIndex              avgt   50   141366.282 ±   741.409  us/op
TopNBenchmark.queryIncrementalIndex                  avgt   50   150933.221 ±  3435.319  us/op
TopNBenchmark.queryQueryableIndex                    avgt   50    20091.945 ±   156.330  us/op

As discussed here, #2760 (diff), this patch changes the locking on the String dimension dictionary. Prior to this patch, the dimension value dictionaries during indexing synchronize on a common object, "dimensionDescs", the list of all dimension descriptors in IncrementalIndex. This patch changes the lock to a per-dimension object.

Benchmarks for a multithreaded groupby are show below (one thread per segment):

DimIntf, shared lock

1 segment
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  351056.654 ± 2384.127  us/op

2 segments
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  382566.960 ± 4662.214  us/op

4 segments
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  552082.477 ± 21863.935  us/op
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  471890.881 ± 8869.641  us/op
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  479048.159 ± 11562.504  us/op

8 segments
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  690188.705 ± 22504.089  us/op

16 segments
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  1338778.157 ± 27091.804  us/op

24 segments
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  1957562.595 ± 34669.820  us/op

DimIntf, individual dimension lock (what this PR uses)

1 segment
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  332612.615 ± 3968.138  us/op

2 segments
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  370438.654 ± 3404.577  us/op

4 segments
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  449136.648 ± 7435.263  us/op

8 segments
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  650122.886 ± 14151.922  us/op

16 segments
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  1159378.708 ± 23919.462  us/op

24 segments
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  1698422.435 ± 30193.147  us/op

Master

1 segment
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  355411.783 ± 4904.367  us/op

2 segments
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  386838.313 ± 3954.246  us/op

4 segments
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  466161.049 ± 8987.380  us/op
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  479591.747 ± 9621.430  us/op

8 segments
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  705942.535 ± 21316.947  us/op

16 segments
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  1357101.745 ± 30982.550  us/op

24 segments
GroupByBenchmark.queryIncrementalIndexMultiThreaded  avgt   50  1994296.053 ± 26432.891  us/op

@jon-wei
Copy link
Contributor Author

jon-wei commented Mar 31, 2016

rebasing, there are some conflicts with #2753

@jon-wei
Copy link
Contributor Author

jon-wei commented Mar 31, 2016

rebased

@gianm Updated StringDimensionHandler with your recent IndexMerger changes; it had most of the null handling code already since StringDimensionMerger was based off IndexMergerV9, just had to add this part to closeWritersV8():

https://github.com/jon-wei/druid/blob/dimension_interface/processing/src/main/java/io/druid/segment/StringDimensionHandler.java#L752

if (o2 == null) {
return 1;
}
return ObjectUtils.compare(o1, o2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ObjectUtils.compare(o1,o2) handles null as well so the null checks before are redundant or replace this with just
o1.compareTo(o2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@himanshug changed this and ENCODED_COMPARATOR to use o1.compareTo(o2)

@jon-wei
Copy link
Contributor Author

jon-wei commented Apr 25, 2016

Added some benchmark numbers comparing performance between this patch and master, as of 4/25/2016

@xvrl
Copy link
Member

xvrl commented Apr 26, 2016

@jon-wei based on your numbers it looks like most of the queries are unaffected, however I do see a 20% increase in SearchBenchmark.queryIncrementalIndex and between 10% and 15% increase in GroupByBenchmark.queryIncrementalIndex and GroupByBenchmark.queryQueryableIndex.

It would be good to investigate those in more detail and see if we can reduce the impact on those queries. 10% is a pretty significant drop in performance.

Everything else is either faster, or within +/- 5%, which is with error bands. You might want to run your benchmark on EC2 instead of a laptop to get those errors bands narrower (see https://metamarkets.com/2015/effect-of-frequency-governor-on-java-benchmarking/)

@jon-wei
Copy link
Contributor Author

jon-wei commented Apr 28, 2016

@xvrl I've been digging into the performance slowdown in GroupBy and Search.

I've made some improvements to GroupByEngine (haven't committed changes to this PR yet).

The increase shown in SearchBenchmark was found to ultimately come from an increase in the time it takes to read rows from an IncrementalIndex.

I've narrowed the cause there to the change from int[][] to Comparable[][] in various places, but still looking into how to improve the performance there.

* @param adapters List of adapters to be merged.
* @throws IOException
*/
public void mergeAcrossSegments(List<IndexableAdapter> adapters) throws IOException;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is called, would addColumnValue be called?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addColumnValue wouldn't be called within mergeAcrossSegments, it is called later when iterating through the merged Rowboat iterable

"mergeSegmentIndices" might be a better name than "mergeAcrossSegments"...

@fjy fjy modified the milestones: 0.9.2, 0.9.1 May 3, 2016
@jon-wei jon-wei force-pushed the dimension_interface branch 4 times, most recently from 5fcb521 to 99dd062 Compare May 10, 2016 01:59
@jon-wei
Copy link
Contributor Author

jon-wei commented May 10, 2016

Updated this to address PR comments, will post benchmark results when I have them available.

@jon-wei jon-wei force-pushed the dimension_interface branch 3 times, most recently from 1474059 to 7adf0c3 Compare May 12, 2016 02:09
@jon-wei
Copy link
Contributor Author

jon-wei commented May 12, 2016

Updated description with new benchmark results as of May 12.

The description also mentions a change in the locking behavior of the dimension value dictionary, with related benchmark results.

@jon-wei
Copy link
Contributor Author

jon-wei commented Jul 1, 2016

Going to break this PR up even further, closing this for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants