Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

emit maxLag/avgLag in KafkaSupervisor #6587

Merged
merged 2 commits into from
Nov 28, 2018
Merged

emit maxLag/avgLag in KafkaSupervisor #6587

merged 2 commits into from
Nov 28, 2018

Conversation

QiuMM
Copy link
Member

@QiuMM QiuMM commented Nov 7, 2018

Currently, the KafkaSupervisor just emit the total lag which is not enough. Some index tasks may have larger lag than others(e.g failure of the tasks can lead to some tasks have very large lag) , emit the max lag can let me monitor this thing. By the way, also emit the average lag.

@@ -129,7 +129,9 @@ emission period.|
|`ingest/handoff/count`|Number of handoffs that happened.|dataSource, taskId, taskType.|Varies. Generally greater than 0 once every segment granular period if cluster operating normally|
|`ingest/sink/count`|Number of sinks not handoffed.|dataSource, taskId, taskType.|1~3|
|`ingest/events/messageGap`|Time gap between the data time in event and current system time.|dataSource, taskId, taskType.|Greater than 0, depends on the time carried in event |
|`ingest/kafka/lag`|Applicable for Kafka Indexing Service. Total lag between the offsets consumed by the Kafka indexing tasks and latest offsets in Kafka brokers across all partitions. Minimum emission period for this metric is a minute.|dataSource.|Greater than 0, should not be a very high number |
|`ingest/kafka/maxLag`|Applicable for Kafka Indexing Service. Max lag between the offsets consumed by the Kafka indexing tasks and latest offsets in Kafka brokers across all partitions. Minimum emission period for this metric is a minute.|dataSource.|Greater than 0, should not be a very high number |
|`ingest/kafka/totalLag`|Applicable for Kafka Indexing Service. Total lag between the offsets consumed by the Kafka indexing tasks and latest offsets in Kafka brokers across all partitions. Minimum emission period for this metric is a minute.|dataSource.|Greater than 0, should not be a very high number |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be kept as ingest/kafka/lag for backwards compatibility

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, done.

@QiuMM QiuMM changed the title emit maxLag/totalLag/avgLag in KafkaSupervisor emit maxLag/avgLag in KafkaSupervisor Nov 9, 2018
@gianm
Copy link
Contributor

gianm commented Nov 13, 2018

How about adding a metric ingest/kafka/partitionLag with a dimension "partition" on it? Then you could take the average or max of those to get the average/max per-partition lag. But you could also look at the lag for each individual partition, which would be useful too.

Maybe include the task ID as a dimension too?

@QiuMM
Copy link
Member Author

QiuMM commented Nov 14, 2018

@gianm for metrics, I think it's enough to emit max lag, average lag and total lag across all partitions. In my scenario, some kafka topics may have 1000 partitions, so there will be lots of metric events if emit lag for each individual partition. If people want to look at the lag for each individual partition, they can get status of a supervisor through endpoint /druid/indexer/v1/supervisor/{id}/status. (overlord console ui has enabled this status endpoint).

@QiuMM QiuMM added this to the 0.13.1 milestone Nov 22, 2018
@gianm gianm assigned gianm and unassigned jon-wei Nov 27, 2018
@gianm
Copy link
Contributor

gianm commented Nov 28, 2018

@QiuMM Sure, that sounds ok to me. Thanks for describing your reasoning.

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@gianm gianm merged commit c5405bb into apache:master Nov 28, 2018
jsun98 added a commit to jsun98/druid that referenced this pull request Dec 1, 2018
dclim pushed a commit that referenced this pull request Dec 21, 2018
* created seekablestream classes

* created seekablestreamsupervisor class

* first attempt to integrate kafa indexing service to use SeekableStream

* seekablestream bug fixes

* kafkarecordsupplier

* integrated kafka indexing service with seekablestream

* implemented resume/suspend and refactored some package names

* moved kinesis indexing service into core druid extensions

* merged some changes from kafka supervisor race condition

* integrated kinesis-indexing-service with seekablestream

* unite tests for kinesis-indexing-service

* various bug fixes for kinesis-indexing-service

* refactored kinesisindexingtask

* finished up more kinesis unit tests

* more bug fixes for kinesis-indexing-service

* finsihed refactoring kinesis unit tests

* removed KinesisParititons and KafkaPartitions to use SeekableStreamPartitions

* kinesis-indexing-service code cleanup and docs

* merge #6291

merge #6337

merge #6383

* added more docs and reordered methods

* fixd kinesis tests after merging master and added docs in seekablestream

* fix various things from pr comment

* improve recordsupplier and add unit tests

* migrated to aws-java-sdk-kinesis

* merge changes from master

* fix pom files and forbiddenapi checks

* checkpoint JavaType bug fix

* fix pom and stuff

* disable checkpointing in kinesis

* fix kinesis sequence number null in closed shard

* merge changes from master

* fixes for kinesis tasks

* capitalized <partitionType, sequenceType>

* removed abstract class loggers

* conform to guava api restrictions

* add docker for travis other modules test

* address comments

* improve RecordSupplier to supply records in batch

* fix strict compile issue

* add test scope for localstack dependency

* kinesis indexing task refactoring

* comments

* github comments

* minor fix

* removed unneeded readme

* fix deserialization bug

* fix various bugs

* KinesisRecordSupplier unable to catch up to earliest position in stream bug fix

* minor changes to kinesis

* implement deaggregate for kinesis

* Merge remote-tracking branch 'upstream/master' into seekablestream

* fix kinesis offset discrepancy with kafka

* kinesis record supplier disable getPosition

* pr comments

* mock for kinesis tests and remove docker dependency for unit tests

* PR comments

* avg lag in kafkasupervisor #6587

* refacotred SequenceMetadata in taskRunners

* small fix

* more small fix

* recordsupplier resource leak

* revert .travis.yml formatting

* fix style

* kinesis docs

* doc part2

* more docs

* comments

* comments*2

* revert string replace changes

* comments

* teamcity

* comments part 1

* comments part 2

* comments part 3

* merge #6754

* fix injection binding

* comments

* KinesisRegion refactor

* comments part idk lol

* can't think of a commit msg anymore

* remove possiblyResetDataSourceMetadata() for IncrementalPublishingTaskRunner

* commmmmmmmmmments

* extra error handling in KinesisRecordSupplier getRecords

* comments

* quickfix

* typo

* oof
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants