Appenderators, DataSource metadata, KafkaIndexTask #2220
Conversation
I am so excited!

[wip] because TODOs remain, but would really appreciate comments
  toolbox.getEmitter().emit(metricBuilder.build("segment/txn/success", 1));
} else {
  toolbox.getEmitter().emit(metricBuilder.build("segment/txn/failure", 1));
}
i understand the usefulness of segment/txn/failure but not sure how segment/txn/success is going to be useful.
maybe you want to confirm that txns are actually happening? I dunno? It seemed like if one existed then the other should too
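For what it's worth, emitting both counters lets downstream monitoring normalize failures into a rate, which failures alone cannot provide. A minimal stand-alone sketch, with `LongAdder` as an illustrative substitute for Druid's metrics emitter (all names here are hypothetical):

```java
import java.util.concurrent.atomic.LongAdder;

public class TxnMetrics
{
  // Stand-ins for the segment/txn/success and segment/txn/failure counters;
  // LongAdder here is an illustrative substitute for Druid's emitter.
  static final LongAdder success = new LongAdder();
  static final LongAdder failure = new LongAdder();

  static void recordTxn(boolean committed)
  {
    (committed ? success : failure).increment();
  }

  // With both counters available, a dashboard can compute a failure *rate*;
  // a failure count on its own has no denominator.
  static double failureRate()
  {
    final long total = success.sum() + failure.sum();
    return total == 0 ? 0.0 : (double) failure.sum() / total;
  }
}
```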
did a scan, looks like a step in the right direction, waiting to see the impl for SegmentAllocator. also, i believe, we should set the "scope" for this PR to at least result in a working (even if experimental) version of no-window period realtime ingestion with tranquility.
@himanshug SegmentAllocator has an implementation in KafkaIndexTask. Do you have a suggestion for a better name for kafka-indexing-service? I just called it that because it has Kafka stuff for the indexing service. Hmm IMO this PR should be scoped to be the base for tranquility working and for kafka working. It actually doesn't fully achieve either one but it was getting big enough that I thought it made sense to cut it off at this point and do the rest in follow on PRs. I think the follow on PRs would be:
we can rename kafka-indexing-service to just kafka-indexing, or maybe I'm overthinking? regarding scope, I would move the kafka module in a separate PR and do (2) in this PR itself. That said, I will let you take the final call on that.
@himanshug kafka-indexing sounds cool Hmm, I would prefer to have this PR have the stuff it has for now, mostly because this stuff is "done" (ish) and the follow-on stuff is not done :). I did start working on (2) though, so depending on how long this PR is open for, we could potentially include that instead of the kafka stuff. Out of curiosity why do you prefer to have the tranquility stuff here instead of the kafka stuff?
"Out of curiosity why do you prefer to have the tranquility stuff here instead of the kafka stuff?" also, I didn't realize that (2) was going to be "tranquility specific" code, i thought that was core druid change and not a new module.
Haha, ok :) I actually had been doing the tranquility stuff first and then took a detour through the kafka stuff as I think it is a lot simpler, and wanted to get some kind of initial thing working. The tranquility stuff ended up getting kind of complicated. I plan to go back to that soon though. But in the meantime, some review on this stuff would be great.
final long sleepMillis = computeNextRetrySleep(++nTry);
log.warn(e, "Failed publishAll (try %d), retrying in %,dms.", nTry, sleepMillis);
Thread.sleep(sleepMillis);
}
In various places where we have long-running stuff, we need to check for thread interrupt status and finish early, so that processes stop properly.
an interrupt will cause publishAll to throw an InterruptedException (due to the Thread.sleep), is that enough?
what if the Exception e caught here is an InterruptedException?
good point, will fix.
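A sketch of the fix being agreed on here: catch InterruptedException before the broad Exception handler and rethrow it, so a shutdown request is not swallowed by the retry loop. publishAll's signature and the backoff policy below are hypothetical stand-ins, not Druid's actual code:

```java
import java.util.concurrent.Callable;

public class PublishRetry
{
  // Illustrative backoff only; the real computeNextRetrySleep policy may differ.
  static long computeNextRetrySleep(int nTry)
  {
    return Math.min(1_000L, 10L * nTry);
  }

  static void publishAllWithRetries(Callable<Void> publishAll) throws Exception
  {
    int nTry = 0;
    while (true) {
      try {
        publishAll.call();
        return;
      }
      catch (InterruptedException e) {
        // Rethrow instead of retrying: otherwise the broad catch below would
        // swallow the interrupt and the task could never stop properly.
        throw e;
      }
      catch (Exception e) {
        final long sleepMillis = computeNextRetrySleep(++nTry);
        // (a real implementation would log.warn(e, ...) here)
        Thread.sleep(sleepMillis); // an interrupt during sleep also exits the loop
      }
    }
  }
}
```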
log.makeAlert("Unknown query type, [%s]", query.getClass())
   .addData("dataSource", query.getDataSource())
   .emit();
return new NoopQueryRunner<>();
this should probably be an exception instead of responding with empty result.
sounds good, will change to ISE
👍
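A minimal sketch of the agreed change: instead of emitting an alert and returning a NoopQueryRunner (silently yielding empty results), fail fast with an IllegalStateException. The lookup map and runner interface here are hypothetical stand-ins for Druid's query machinery:

```java
import java.util.Map;

public class QueryDispatch
{
  // Hypothetical stand-in for Druid's QueryRunner.
  interface QueryRunner { String run(); }

  static final Map<String, QueryRunner> RUNNERS = Map.<String, QueryRunner>of(
      "timeseries", () -> "timeseries results",
      "groupBy", () -> "groupBy results"
  );

  static QueryRunner runnerFor(String queryType)
  {
    final QueryRunner runner = RUNNERS.get(queryType);
    if (runner == null) {
      // Previously: emit an alert and return new NoopQueryRunner<>().
      // Throwing makes the misconfiguration visible to the caller.
      throw new IllegalStateException("Unknown query type [" + queryType + "]");
    }
    return runner;
  }
}
```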
handoffNotifier.start();

final FiniteAppenderatorDriverMetadata metadata = objectMapper.convertValue(
    appenderator.startJob(),
appenderator.startJob() might give null in many cases.
that's ok, convertValue turns null into null, and then there's a null check in this method.
ok, didn't know that convertValue could handle null.
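The pattern being described — a possibly-null startJob() result flowing through a null-tolerant conversion into a later null check — can be sketched like this. This is a pure-Java analogy; the real code uses Jackson's convertValue, which (as the thread notes) maps null to null:

```java
import java.util.function.Function;

public class StartJobSketch
{
  // Hypothetical stand-in for appenderator.startJob(), which may return null
  // (e.g. no previously committed metadata to restore).
  static Object startJob()
  {
    return null;
  }

  // Analogy for objectMapper.convertValue: null in, null out.
  static <T, R> R convertValue(T value, Function<T, R> converter)
  {
    return value == null ? null : converter.apply(value);
  }

  static String restoredMetadataOrDefault()
  {
    final String metadata = convertValue(startJob(), Object::toString);
    // The null check mentioned in the thread lives downstream of the
    // conversion, so a null startJob() result is handled safely.
    return metadata == null ? "no restored metadata" : metadata;
  }
}
```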
@gianm had some comments but overall looks good and from what I see this PR does not change anything in RealtimeIndexTask so it should have no impact on existing ingestion mechanism. regarding the discussion points..
Appenderators are a way of getting more control over the ingestion process than a Plumber allows. The idea is that existing Plumbers could be implemented using Appenderators, but you could also implement things that Plumbers can't do. FiniteAppenderatorDrivers help simplify indexing a finite stream of data. Also:
- Sink: Ability to consider itself "finished" vs "still writable".
- Sink: Ability to return the number of rows contained within the sink.
Geared towards supporting transactional inserts of new segments. This involves an interface "DataSourceMetadata" that allows combining of partially specified metadata (useful for partitioned ingestion). DataSource metadata is stored in a new "dataSource" table.
@himanshug updated with most of your review comments addressed, but I am not entirely sure what you mean on the thread w/ #2220 (comment).
Reads a specific offset range from specific partitions, and can use dataSource metadata transactions to guarantee exactly-once ingestion. Each task has a finite lifecycle, so it is expected that some process will be supervising existing tasks and creating new ones when needed.
👍 @gianm looks good overall, can you pls cleanup your commit history as needed.
@himanshug I had it split up as 6 commits on purpose (I think the 6 commits are pretty distinct), do you think I should combine some of them?
i did say, "as needed" in there. if it's already done then, great. 👍
cool, just checking 😄
Appenderators, DataSource metadata, KafkaIndexTask
See also #1642. FYI we are running this in our cluster as of 2016-03-02, things seem to work so far…
Set of three related things (each one a separate commit):
kafka-indexing-service core extension
Includes KafkaIndexTask, which reads a specific offset range from specific partitions, and can use dataSource metadata transactions to guarantee exactly-once ingestion.
Each task has a finite lifecycle, so it is expected that a process will be supervising existing tasks and creating new ones when needed. @dclim is working on this process in https://github.com/dclim/druid/tree/kafka-supervisor.
This extension requires the other two features (DataSource metadata and Appenderators).
DataSource metadata
Geared towards supporting transactional inserts of new segments. This involves an interface `DataSourceMetadata` that allows combining of partially specified metadata (useful for partitioned ingestion). It also involves changes to the `SegmentInsertAction` to allow it to take a `startMetadata` and `endMetadata` for compare-and-swap. DataSource metadata is stored in a new "dataSource" table.
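A toy illustration of the compare-and-swap contract just described: the insert succeeds only if the stored dataSource metadata still equals startMetadata, in which case it is replaced by endMetadata. A synchronized static field stands in for the "dataSource" table and plain strings stand in for real DataSourceMetadata; both are assumptions of this sketch, not Druid's implementation:

```java
import java.util.Objects;

public class DataSourceMetadataCas
{
  // Stand-in for the row in the "dataSource" table; null means no metadata
  // has ever been committed for this dataSource.
  private static String stored = null;

  // Mirrors the startMetadata/endMetadata contract: commit endMetadata only
  // if the stored metadata still matches what the writer observed at start.
  static synchronized boolean insertSegments(String startMetadata, String endMetadata)
  {
    if (!Objects.equals(stored, startMetadata)) {
      return false; // another writer advanced the metadata; the transaction fails
    }
    stored = endMetadata;
    return true;
  }
}
```

Retrying after a failed swap means re-reading the current metadata and re-deciding what to ingest, which is what gives the exactly-once behavior described for the KafkaIndexTask.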
Appenderators
Like Plumbers, but different. Appenderators are a way of getting more control over the ingestion process than a Plumber allows. They are less ambitious than Plumbers, but more flexible. In particular, they offer facilities to deal with:
But they do not do any of these things:
So you can think of Appenderators as a way of separating out the mechanical functionality of Plumbers from their decision-making processes. The idea is that existing Plumbers could be implemented using Appenderators, but you could also use them to implement workflows that the existing Plumbers can't support.
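One way to picture that separation (a conceptual toy, not Druid's actual Appenderator interface): the appenderator exposes only mechanical operations, and the caller supplies the decision-making, here a trivial "persist every N rows" policy of the kind a Plumber would have baked in:

```java
import java.util.ArrayList;
import java.util.List;

public class AppenderatorSketch
{
  // Hypothetical, heavily simplified version of the Appenderator idea:
  // mechanics only, no persist/merge/handoff policy of its own.
  interface Appenderator
  {
    void add(String row);
    int rowCount();
    void persist();
    int persistCount();
  }

  static class InMemoryAppenderator implements Appenderator
  {
    private final List<String> rows = new ArrayList<>();
    private int persists = 0;

    @Override public void add(String row) { rows.add(row); }
    @Override public int rowCount() { return rows.size(); }
    @Override public void persist() { persists++; }
    @Override public int persistCount() { return persists; }
  }

  // Caller-side policy: the "decision-making" half, persisting every
  // maxRowsInMemory rows. Different callers can plug in different policies.
  static void ingest(Appenderator appenderator, List<String> rows, int maxRowsInMemory)
  {
    for (String row : rows) {
      appenderator.add(row);
      if (appenderator.rowCount() % maxRowsInMemory == 0) {
        appenderator.persist();
      }
    }
  }
}
```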
Discussion?
Some open questions (and reasons this is still marked Discuss):
`AppenderatorImpl` actually has a lot of functionality and code overlap with RealtimePlumber (the query-runner stuff is particularly annoying since it is mostly similar but also kinda different). I think the main stumbling block is that the RealtimePlumber and the AppenderatorImpl have different persist directory layouts (mostly because the RealtimePlumber never has to have more than one shard per interval, but the AppenderatorImpl might) and so the code will need to be able to migrate existing data.

BTW, I think if we do either of these things, they should be in future PRs rather than this one.