Add support maxRowsPerSegment for auto compaction #6780

Merged: 9 commits, Jan 10, 2019
Changes from 5 commits
@@ -97,6 +97,7 @@ public void setup()
null,
null,
null,
null,
null
)
);
15 changes: 13 additions & 2 deletions docs/content/configuration/index.md
@@ -806,10 +806,11 @@ A description of the compaction config is:
|`keepSegmentGranularity`|Set [keepSegmentGranularity](../ingestion/compaction.html) to true for compactionTask.|no (default = true)|
|`taskPriority`|[Priority](../ingestion/tasks.html#task-priorities) of compact task.|no (default = 25)|
|`inputSegmentSizeBytes`|Total input segment size of a compactionTask.|no (default = 419430400)|
|`targetCompactionSizeBytes`|The target segment size of compaction. The actual size of a compact segment might be slightly larger or smaller than this value.|no (default = 419430400)|
|`targetCompactionSizeBytes`|The target segment size of compaction. The actual size of a compact segment might be slightly larger or smaller than this value. This configuration cannot be used together with `maxRowsPerSegment`.|no (default = 419430400 if `maxRowsPerSegment` is not specified)|
|`maxRowsPerSegment`|Max number of rows per segment after compaction. This configuration cannot be used together with `targetCompactionSizeBytes`.|no|
|`maxNumSegmentsToCompact`|Max number of segments to compact together.|no (default = 150)|
|`skipOffsetFromLatest`|The offset for searching segments to be compacted. It is strongly recommended to set this for realtime dataSources.|no (default = "P1D")|
|`tuningConfig`|Tuning config for compact tasks. See below [Compact Task TuningConfig](#compact-task-tuningconfig).|no|
|`tuningConfig`|Tuning config for compact tasks. See [Compaction TuningConfig](#compaction-tuningconfig).|no|
|`taskContext`|[Task context](../ingestion/tasks.html#task-context) for compact tasks.|no|

An example of a compaction config is:
@@ -822,6 +823,16 @@

For realtime dataSources, it's recommended to set `skipOffsetFromLatest` to some sufficiently large value to avoid frequent compact task failures.
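
Putting the table above together, an auto compaction config using the new `maxRowsPerSegment` option might look like the following sketch (the dataSource name and values are illustrative, not taken from this PR):

```json
{
  "dataSource": "wikiticker",
  "taskPriority": 25,
  "inputSegmentSizeBytes": 419430400,
  "maxRowsPerSegment": 5000000,
  "maxNumSegmentsToCompact": 150,
  "skipOffsetFromLatest": "P1D"
}
```

Note that `targetCompactionSizeBytes` is omitted: per the table above, it cannot be set together with `maxRowsPerSegment`.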

##### Compaction TuningConfig

|Property|Description|Required|
|--------|-----------|--------|
|`maxRowsInMemory`|See [tuningConfig for indexTask](../ingestion/native_tasks.html#tuningconfig)|no (default = 1000000)|
|`maxTotalRows`|See [tuningConfig for indexTask](../ingestion/native_tasks.html#tuningconfig)|no (default = 20000000)|
|`indexSpec`|See [IndexSpec](../ingestion/native_tasks.html#indexspec)|no|
|`maxPendingPersists`|See [tuningConfig for indexTask](../ingestion/native_tasks.html#tuningconfig)|no (default = 0 (meaning one persist can be running concurrently with ingestion, and none can be queued up))|
|`pushTimeout`|See [tuningConfig for indexTask](../ingestion/native_tasks.html#tuningconfig)|no (default = 0)|
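
As an illustration of the table above, a compaction `tuningConfig` overriding a few of these properties might look like this sketch (values are hypothetical):

```json
{
  "maxRowsInMemory": 1000000,
  "maxTotalRows": 20000000,
  "maxPendingPersists": 0,
  "pushTimeout": 0
}
```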

## Overlord

For general Overlord Node information, see [here](../design/indexing-service.html).
7 changes: 4 additions & 3 deletions docs/content/ingestion/compaction.md
@@ -46,9 +46,10 @@ Compaction tasks merge all segments of the given interval. The syntax is:
|`id`|Task id|No|
|`dataSource`|DataSource name to be compacted|Yes|
|`interval`|Interval of segments to be compacted|Yes|
|`dimensions`|Custom dimensionsSpec. compaction task will use this dimensionsSpec if exist instead of generating one. See below for more details.|No|
|`dimensionsSpec`|Custom dimensionsSpec. The compaction task will use this dimensionsSpec if it exists, instead of generating one. See below for more details.|No|
|`metricsSpec`|Custom metricsSpec. Compaction task will use this metricsSpec if specified rather than generating one.|No|
|`keepSegmentGranularity`|If set to true, compactionTask will keep the time chunk boundaries and merge segments only if they fall into the same time chunk.|No (default = true)|
|`targetCompactionSizeBytes`|Target segment size after comapction. Cannot be used with `targetPartitionSize`, `maxTotalRows`, and `numShards` in tuningConfig.|No|
|`targetCompactionSizeBytes`|Target segment size after compaction. Cannot be used with `maxRowsPerSegment`, `maxTotalRows`, and `numShards` in tuningConfig.|No|
|`tuningConfig`|[Index task tuningConfig](../ingestion/native_tasks.html#tuningconfig)|No|
|`context`|[Task context](../ingestion/locking-and-priority.html#task-context)|No|

@@ -64,7 +65,7 @@ An example of a compaction task is

This compaction task reads _all segments_ of the interval `2017-01-01/2018-01-01` and results in new segments.
Note that intervals of the input segments are merged into a single interval of `2017-01-01/2018-01-01` no matter what the segmentGranularity was.
To control the number of result segments, you can set `targetPartitionSize` or `numShards`. See [indexTuningConfig](../ingestion/native_tasks.html#tuningconfig) for more details.
To control the number of result segments, you can set `maxRowsPerSegment` or `numShards`. See [indexTuningConfig](../ingestion/native_tasks.html#tuningconfig) for more details.
To merge each day's worth of data into separate segments, you can submit multiple `compact` tasks, one for each day. They will run in parallel.
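
As a sketch of the row-count-based approach described above (the dataSource name and values are hypothetical), a compaction task could bound segment size by rows instead of bytes:

```json
{
  "type": "compact",
  "dataSource": "wikiticker",
  "interval": "2017-01-01/2018-01-01",
  "keepSegmentGranularity": true,
  "tuningConfig": {
    "type": "index",
    "maxRowsPerSegment": 5000000,
    "maxRowsInMemory": 25000
  }
}
```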

A compaction task internally generates an `index` task spec for performing compaction work with some fixed parameters.
12 changes: 6 additions & 6 deletions docs/content/ingestion/native_tasks.md
@@ -152,11 +152,11 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
|property|description|default|required?|
|--------|-----------|-------|---------|
|type|The task type, this should always be `index_parallel`.|none|yes|
|targetPartitionSize|Used in sharding. Determines how many rows are in each segment.|5000000|no|
|maxRowsPerSegment|Used in sharding. Determines how many rows are in each segment.|5000000|no|
|maxRowsInMemory|Used in determining when intermediate persists to disk should occur. Normally the user does not need to set this, but depending on the nature of the data, if rows are small in terms of bytes, the user may not want to store a million rows in memory, and this value should be set.|1000000|no|
|maxBytesInMemory|Used in determining when intermediate persists to disk should occur. Normally this is computed internally and the user does not need to set it. This value represents the number of bytes to aggregate in heap memory before persisting. It is based on a rough estimate of memory usage, not actual usage. The maximum heap memory usage for indexing is maxBytesInMemory * (2 + maxPendingPersists).|1/6 of max JVM memory|no|
|maxTotalRows|Total number of rows in segments waiting to be pushed. Used in determining when intermediate pushing should occur.|20000000|no|
|numShards|Directly specify the number of shards to create. If this is specified and 'intervals' is specified in the granularitySpec, the index task can skip the determine intervals/partitions pass through the data. numShards cannot be specified if targetPartitionSize is set.|null|no|
|numShards|Directly specify the number of shards to create. If this is specified and 'intervals' is specified in the granularitySpec, the index task can skip the determine intervals/partitions pass through the data. numShards cannot be specified if maxRowsPerSegment is set.|null|no|
|indexSpec|defines segment storage format options to be used at indexing time, see [IndexSpec](#indexspec)|null|no|
|maxPendingPersists|Maximum number of persists that can be pending but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. Maximum heap memory usage for indexing scales with maxRowsInMemory * (2 + maxPendingPersists).|0 (meaning one persist can be running concurrently with ingestion, and none can be queued up)|no|
|forceExtendableShardSpecs|Forces use of extendable shardSpecs. Experimental feature intended for use with the [Kafka indexing service extension](../development/extensions-core/kafka-ingestion.html).|false|no|
@@ -337,7 +337,7 @@ An example of the result is
},
"tuningConfig": {
"type": "index_parallel",
"targetPartitionSize": 5000000,
"maxRowsPerSegment": 5000000,
"maxRowsInMemory": 1000000,
"maxTotalRows": 20000000,
"numShards": null,
@@ -454,7 +454,7 @@ The Local Index Task is designed to be used for smaller data sets. The task exec
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsPerSegment" : 5000000,
"maxRowsInMemory" : 1000000
}
}
@@ -491,11 +491,11 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
|property|description|default|required?|
|--------|-----------|-------|---------|
|type|The task type, this should always be "index".|none|yes|
|targetPartitionSize|Used in sharding. Determines how many rows are in each segment.|5000000|no|
|maxRowsPerSegment|Used in sharding. Determines how many rows are in each segment.|5000000|no|
|maxRowsInMemory|Used in determining when intermediate persists to disk should occur. Normally the user does not need to set this, but depending on the nature of the data, if rows are small in terms of bytes, the user may not want to store a million rows in memory, and this value should be set.|1000000|no|
|maxBytesInMemory|Used in determining when intermediate persists to disk should occur. Normally this is computed internally and the user does not need to set it. This value represents the number of bytes to aggregate in heap memory before persisting. It is based on a rough estimate of memory usage, not actual usage. The maximum heap memory usage for indexing is maxBytesInMemory * (2 + maxPendingPersists).|1/6 of max JVM memory|no|
|maxTotalRows|Total number of rows in segments waiting to be pushed. Used in determining when intermediate pushing should occur.|20000000|no|
|numShards|Directly specify the number of shards to create. If this is specified and 'intervals' is specified in the granularitySpec, the index task can skip the determine intervals/partitions pass through the data. numShards cannot be specified if targetPartitionSize is set.|null|no|
|numShards|Directly specify the number of shards to create. If this is specified and 'intervals' is specified in the granularitySpec, the index task can skip the determine intervals/partitions pass through the data. numShards cannot be specified if maxRowsPerSegment is set.|null|no|
|partitionDimensions|The dimensions to partition on. Leave blank to select all dimensions. Only used with `forceGuaranteedRollup` = true, will be ignored otherwise.|null|no|
|indexSpec|defines segment storage format options to be used at indexing time, see [IndexSpec](#indexspec)|null|no|
|maxPendingPersists|Maximum number of persists that can be pending but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. Maximum heap memory usage for indexing scales with maxRowsInMemory * (2 + maxPendingPersists).|0 (meaning one persist can be running concurrently with ingestion, and none can be queued up)|no|
2 changes: 1 addition & 1 deletion docs/content/tutorials/tutorial-batch.md
@@ -98,7 +98,7 @@ which has been configured to read the `quickstart/tutorial/wikiticker-2015-09-12
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsPerSegment" : 5000000,
"maxRowsInMemory" : 25000,
"forceExtendableShardSpecs" : true
}
4 changes: 2 additions & 2 deletions docs/content/tutorials/tutorial-compaction.md
@@ -74,7 +74,7 @@ We have included a compaction task spec for this tutorial datasource at `quickst
"interval": "2015-09-12/2015-09-13",
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsPerSegment" : 5000000,
"maxRowsInMemory" : 25000,
"forceExtendableShardSpecs" : true
}
@@ -85,7 +85,7 @@ This will compact all segments for the interval `2015-09-12/2015-09-13` in the `

The parameters in the `tuningConfig` control how many segments will be present in the compacted set of segments.

In this tutorial example, only one compacted segment will be created, as the 39244 rows in the input is less than the 5000000 `targetPartitionSize`.
In this tutorial example, only one compacted segment will be created, as the 39244 rows in the input are fewer than the 5000000 `maxRowsPerSegment`.

Let's submit this task now:

4 changes: 2 additions & 2 deletions docs/content/tutorials/tutorial-ingestion-spec.md
@@ -564,7 +564,7 @@ As an example, let's add a `tuningConfig` that sets a target segment size for th
```json
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000
"maxRowsPerSegment" : 5000000
}
```

@@ -623,7 +623,7 @@ We've finished defining the ingestion spec, it should now look like the followin
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000
"maxRowsPerSegment" : 5000000
}
}
}
2 changes: 1 addition & 1 deletion docs/content/tutorials/tutorial-rollup.md
@@ -99,7 +99,7 @@ We'll ingest this data using the following ingestion task spec, located at `quic
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsPerSegment" : 5000000,
"maxRowsInMemory" : 25000,
"forceExtendableShardSpecs" : true
}
2 changes: 1 addition & 1 deletion docs/content/tutorials/tutorial-transform-spec.md
@@ -114,7 +114,7 @@ We will ingest the sample data using the following spec, which demonstrates the
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsPerSegment" : 5000000,
"maxRowsInMemory" : 25000,
"forceExtendableShardSpecs" : true
}
2 changes: 1 addition & 1 deletion examples/quickstart/tutorial/compaction-final-index.json
@@ -4,7 +4,7 @@
"interval": "2015-09-12/2015-09-13",
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsPerSegment" : 5000000,
"maxRowsInMemory" : 25000,
"forceExtendableShardSpecs" : true
}
2 changes: 1 addition & 1 deletion examples/quickstart/tutorial/compaction-init-index.json
@@ -56,7 +56,7 @@
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsPerSegment" : 5000000,
"maxRowsInMemory" : 25000,
"forceExtendableShardSpecs" : true
}
2 changes: 1 addition & 1 deletion examples/quickstart/tutorial/deletion-index.json
@@ -56,7 +56,7 @@
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsPerSegment" : 5000000,
"maxRowsInMemory" : 25000,
"forceExtendableShardSpecs" : true
}
2 changes: 1 addition & 1 deletion examples/quickstart/tutorial/retention-index.json
@@ -56,7 +56,7 @@
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsPerSegment" : 5000000,
"maxRowsInMemory" : 25000,
"forceExtendableShardSpecs" : true
}
2 changes: 1 addition & 1 deletion examples/quickstart/tutorial/rollup-index.json
@@ -43,7 +43,7 @@
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsPerSegment" : 5000000,
"maxRowsInMemory" : 25000,
"forceExtendableShardSpecs" : true
}
2 changes: 1 addition & 1 deletion examples/quickstart/tutorial/transform-index.json
@@ -65,7 +65,7 @@
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsPerSegment" : 5000000,
"maxRowsInMemory" : 25000,
"forceExtendableShardSpecs" : true
}
2 changes: 1 addition & 1 deletion examples/quickstart/tutorial/updates-append-index.json
@@ -51,7 +51,7 @@
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsPerSegment" : 5000000,
"maxRowsInMemory" : 25000,
"forceExtendableShardSpecs" : true
}
2 changes: 1 addition & 1 deletion examples/quickstart/tutorial/updates-append-index2.json
@@ -41,7 +41,7 @@
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsPerSegment" : 5000000,
"maxRowsInMemory" : 25000,
"forceExtendableShardSpecs" : true
}
2 changes: 1 addition & 1 deletion examples/quickstart/tutorial/updates-init-index.json
@@ -41,7 +41,7 @@
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsPerSegment" : 5000000,
"maxRowsInMemory" : 25000,
"forceExtendableShardSpecs" : true
}
2 changes: 1 addition & 1 deletion examples/quickstart/tutorial/updates-overwrite-index.json
@@ -41,7 +41,7 @@
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsPerSegment" : 5000000,
"maxRowsInMemory" : 25000,
"forceExtendableShardSpecs" : true
}
2 changes: 1 addition & 1 deletion examples/quickstart/tutorial/wikipedia-index.json
@@ -56,7 +56,7 @@
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsPerSegment" : 5000000,
"maxRowsInMemory" : 25000,
"forceExtendableShardSpecs" : true
}
@@ -58,7 +58,7 @@ public void testSerdeWithDefaults() throws Exception

Assert.assertNotNull(config.getBasePersistDirectory());
Assert.assertEquals(1000000, config.getMaxRowsInMemory());
Assert.assertEquals(5_000_000, config.getMaxRowsPerSegment());
Assert.assertEquals(5_000_000, config.getMaxRowsPerSegment().intValue());
Assert.assertEquals(null, config.getMaxTotalRows());
Assert.assertEquals(new Period("PT10M"), config.getIntermediatePersistPeriod());
Assert.assertEquals(0, config.getMaxPendingPersists());
@@ -94,7 +94,7 @@ public void testSerdeWithNonDefaults() throws Exception

Assert.assertEquals(new File("/tmp/xxx"), config.getBasePersistDirectory());
Assert.assertEquals(100, config.getMaxRowsInMemory());
Assert.assertEquals(100, config.getMaxRowsPerSegment());
Assert.assertEquals(100, config.getMaxRowsPerSegment().intValue());
Assert.assertNotEquals(null, config.getMaxTotalRows());
Assert.assertEquals(1000, config.getMaxTotalRows().longValue());
Assert.assertEquals(new Period("PT1H"), config.getIntermediatePersistPeriod());
@@ -134,7 +134,7 @@ public void testConvert()
KafkaIndexTaskTuningConfig copy = (KafkaIndexTaskTuningConfig) original.convertToTaskTuningConfig();

Assert.assertEquals(1, copy.getMaxRowsInMemory());
Assert.assertEquals(2, copy.getMaxRowsPerSegment());
Assert.assertEquals(2, copy.getMaxRowsPerSegment().intValue());
Assert.assertNotEquals(null, copy.getMaxTotalRows());
Assert.assertEquals(10L, copy.getMaxTotalRows().longValue());
Assert.assertEquals(new Period("PT3S"), copy.getIntermediatePersistPeriod());
@@ -59,7 +59,7 @@ public void testSerdeWithDefaults() throws Exception

Assert.assertNotNull(config.getBasePersistDirectory());
Assert.assertEquals(1000000, config.getMaxRowsInMemory());
Assert.assertEquals(5_000_000, config.getMaxRowsPerSegment());
Assert.assertEquals(5_000_000, config.getMaxRowsPerSegment().intValue());
Assert.assertEquals(new Period("PT10M"), config.getIntermediatePersistPeriod());
Assert.assertEquals(0, config.getMaxPendingPersists());
Assert.assertEquals(new IndexSpec(), config.getIndexSpec());
@@ -105,7 +105,7 @@ public void testSerdeWithNonDefaults() throws Exception

Assert.assertEquals(new File("/tmp/xxx"), config.getBasePersistDirectory());
Assert.assertEquals(100, config.getMaxRowsInMemory());
Assert.assertEquals(100, config.getMaxRowsPerSegment());
Assert.assertEquals(100, config.getMaxRowsPerSegment().intValue());
Assert.assertEquals(new Period("PT1H"), config.getIntermediatePersistPeriod());
Assert.assertEquals(100, config.getMaxPendingPersists());
Assert.assertEquals(true, config.isReportParseExceptions());