New (key|value).multi.type option for Avro serialization #680

Merged · 5 commits · Jan 12, 2018

Conversation

@ept (Contributor) commented Dec 1, 2017

In some situations, an application needs to store events of several different types in the same Kafka topic. In particular, when developing a data model in an Event Sourcing style, you might have several kinds of event that affect the state of an entity. For example, for a customer entity there may be customerCreated, customerAddressChanged, customerEnquiryReceived, customerInvoicePaid, etc. events, and the application may require that those events are always read in the same order. Thus, they need to go in the same Kafka partition (to maintain ordering).

The Avro schema registry currently assumes a 1:1 mapping between Kafka topics and Avro schemas, making it difficult to support scenarios like the one above. Users who want several event types in the same topic currently either have to put them in one big Avro union (which works, but gets unwieldy very quickly), or turn off the registry's schema compatibility checking (which would be unfortunate, since the compatibility check is very valuable).

This patch introduces two new boolean config settings, key.multi.type and value.multi.type. When set to true, they allow the key (or value, respectively) of a message to be any Avro record type. The schema of the type is stored in the schema registry as usual; however, instead of
using <topic>-key or <topic>-value as subject, the fully-qualified name of the record type is used as subject.

This has the effect that a Kafka producer will happily accept any mixture of Avro record types and publish them to the same topic. Since the schema registry's ID for a schema is globally unique, the binary message encoding does not need to change, and consumers also handle the mixture of record types without change. When a schema is changed, the registry checks compatibility with previous schemas of the same fully-qualified type name; different record types can be evolved independently without any interference.
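For illustration, here is a minimal producer configuration sketch using the settings as proposed here (a fragment in the style of the snippets later in this thread; broker and registry addresses are assumptions, and the config names were later revised during review, see below):

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");          // assumed broker address
props.put("schema.registry.url", "http://localhost:8081"); // assumed registry address
props.put("key.serializer", KafkaAvroSerializer.class.getName());
props.put("value.serializer", KafkaAvroSerializer.class.getName());
// As proposed in this patch: allow any Avro record type as the message value;
// the record's fully-qualified name is then used as the registry subject.
props.put("value.multi.type", "true");

Producer<String, GenericRecord> producer = new KafkaProducer<>(props);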

@ghost commented Dec 1, 2017

It looks like @ept hasn't signed our Contributor License Agreement yet.

The purpose of a CLA is to ensure that the guardian of a project's outputs has the necessary ownership or grants of rights over all contributions to allow them to distribute under the chosen licence. (Wikipedia)

You can read and sign our full Contributor License Agreement here.

Once you've signed, reply with [clabot:check] to prove it.

Appreciation of efforts,

clabot

@ept (Contributor, Author) commented Dec 1, 2017

[clabot:check]

@ghost commented Dec 1, 2017

@confluentinc It looks like @ept just signed our Contributor License Agreement. 👍

Always at your service,

clabot

@mageshn (Member) commented Dec 4, 2017

@ept thanks for your patch. For the most part it looks good; I'm just thinking out loud here about whether it would help to generalize this. We have other scenarios where users want to share the same schema across topics, and your solution could address those as well. So maybe we could name the configs something like key.subject.name.strategy and value.subject.name.strategy. The default strategy could always use topic-key and topic-value. Let me know your thoughts.

@ept (Contributor, Author) commented Dec 11, 2017

@mageshn Thanks for the suggestion — I think that's a good idea, so I've implemented the key.subject.name.strategy and value.subject.name.strategy configs. They currently have three valid settings:

  • default: <topic>-key or <topic>-value as before
  • "type" setting: uses the fully-qualified Avro record name as subject
  • "topic-type" setting: uses the Kafka topic concatenated with the fully-qualified Avro record name as subject

Both the "type" and "topic-type" settings allow multiple event types in the same topic; the difference is just the scope at which the schema compatibility check is performed (per-topic per-type, or globally per-type).

@rhauch (Member) left a comment

This is a really great improvement. Thanks, @ept. I do have one suggestion below:

throw new SerializationException("Unknown value for "
+ AbstractKafkaAvroSerDeConfig.VALUE_SUBJECT_NAME_STRATEGY + ": "
+ valueSubjectNameStrategy);
}
@rhauch (Member) commented on the diff:

This method is called frequently, and doing all these string comparisons in each call is less than ideal. What do you think about creating a functional interface like:

public interface SubjectNamingStrategy {
    String getSubjectName(String topic, boolean isKey, Object value);
}

with a separate implementation for each strategy (see below). The strategy for keys and values could be instantiated once in configureClientProperties(...) above:

if ("topic-key".equals(keySubjectNameStrategy)) {
    keySubjectStrategy = new TopicNamingStrategy();
} else if ("type".equals(keySubjectNameStrategy)) {
    keySubjectStrategy = new RecordSchemaNamingStrategy("");
} else if ("topic-type".equals(keySubjectNameStrategy)) {
    keySubjectStrategy = new RecordSchemaNamingStrategy("topic-");
} else {
    throw new SerializationException("Unknown value for "
                + AbstractKafkaAvroSerDeConfig.KEY_SUBJECT_NAME_STRATEGY + ": "
                + keySubjectNameStrategy);
}
if ("topic-value".equals(valueSubjectNameStrategy)) {
    keySubjectStrategy = new TopicNamingStrategy();
} else if ("type".equals(valueSubjectNameStrategy)) {
    valueSubjectStrategy = new RecordSchemaNamingStrategy("");
} else if ("topic-type".equals(valueSubjectNameStrategy)) {
    valueSubjectStrategy = new RecordSchemaNamingStrategy("topic-");
} else {
    throw new SerializationException("Unknown value for "
                + AbstractKafkaAvroSerDeConfig.VALUE_SUBJECT_NAME_STRATEGY + ": "
                + valueSubjectNameStrategy);
}

One benefit is that an incorrect value for the configuration property is detected right away. However, the big benefit is that the getSubjectName member method that is called frequently by subclasses can be far more efficient by delegating to the correct strategy and forgoing all of the string comparisons:

protected String getSubjectName(String topic, boolean isKey, Object value) {
    return isKey
        ? keySubjectStrategy.getSubjectName(topic, isKey, value)
        : valueSubjectStrategy.getSubjectName(topic, isKey, value);
}

Each SubjectNamingStrategy implementation would be quite straightforward. For example, the TopicNamingStrategy might be implemented as follows:

protected static class TopicNamingStrategy implements SubjectNamingStrategy {
    public String getSubjectName(String topic, boolean isKey, Object value) {
        if (isKey) {
            return topic + "-key";
        }
        return topic + "-value";
    }
}

while another RecordSchemaNamingStrategy implementation could handle both the type and topic-type options by optionally including the topic name:

protected static class RecordSchemaNamingStrategy implements SubjectNamingStrategy {
    private final boolean includeTopic;
    public RecordSchemaNamingStrategy(boolean includeTopic) {
        this.includeTopic = includeTopic;
    }
    public String getSubjectName(String topic, boolean isKey, Object value) {
        // Null is passed through unserialized, since it has special meaning in
        // log-compacted Kafka topics.
        if (value == null) {
            return null;
        }

        if (value instanceof GenericContainer) {
            Schema schema = ((GenericContainer) value).getSchema();
            if (schema.getType() == Schema.Type.RECORD) {
                String fullName = schema.getFullName();
                return includeTopic ? topic + "-" + fullName : fullName;
            }
        }

        // isKey is only used to produce a more helpful error message
        if (isKey) {
            throw new SerializationException("In configuration "
                + AbstractKafkaAvroSerDeConfig.KEY_SUBJECT_NAME_STRATEGY
                + ", the message key must be an Avro record");
        } else {
            throw new SerializationException("In configuration "
                + AbstractKafkaAvroSerDeConfig.VALUE_SUBJECT_NAME_STRATEGY
                + ", the message value must be an Avro record");
        }
    }
}

@mageshn (Member) commented Dec 20, 2017:

IMO, I would prefer making the Strategy interface public so that users can plug in their own topic-to-subject mapping strategy if they need to. Essentially, the config becomes a class name.

@ept (Contributor, Author) commented Jan 5, 2018

@rhauch @mageshn Happy new year! I have updated the patch as you suggested, using different classes to implement the different subject-name choosing strategies. The configuration is now a fully-qualified Java classname, so that people can easily plug in their own strategies if desired. Could you let me know if it looks good now?
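For example, a sketch of what a user-supplied strategy might look like (the TenantPrefixStrategy class and its naming scheme are hypothetical; the interface name and configure() method follow the diff fragments quoted below, and the package is assumed):

import java.util.Map;
import io.confluent.kafka.serializers.subject.SubjectNameStrategy; // package assumed

// Hypothetical custom strategy: prefix every subject with a tenant id.
public class TenantPrefixStrategy implements SubjectNameStrategy {
    @Override
    public void configure(Map<String, ?> config) {
        // Nothing to configure in this sketch.
    }

    @Override
    public String getSubjectName(String topic, boolean isKey, Object value) {
        return "tenant-a." + topic + (isKey ? "-key" : "-value");
    }
}

It would then be selected by setting value.subject.name.strategy (or key.subject.name.strategy) to TenantPrefixStrategy.class.getName().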

@rhauch (Member) left a comment

@ept, happy new year to you! Thanks for the changes. I have one really minor question below -- otherwise this looks great!

Approving as is in case it's difficult to find succinct and clear text to add.

TopicNameStrategy.class.getName();
public static final String KEY_SUBJECT_NAME_STRATEGY_DOC =
"Determines how to construct the subject name under which the key schema is registered "
+ "with the schema registry";
@rhauch (Member) commented on the diff:

Minor nit: perhaps the key and value doc strings could say that by default the older topic naming behavior/strategy is used. It's not essential, but if it can be said clearly and succinctly it might help people understand that the behavior won't change if they use the default.

@ept (Contributor, Author) replied:

Added clarification of default behaviour in bf574fe.

@rhauch (Member) commented Jan 5, 2018

BTW, not sure if these pass locally, but the build is failing with NPEs in the following tests:

  • io.confluent.connect.avro.AvroConverterTest.testPrimitive
  • io.confluent.connect.avro.AvroConverterTest.testVersionExtracted
  • io.confluent.connect.avro.AvroConverterTest.testVersionMaintained
  • io.confluent.connect.avro.AvroConverterTest.testComplex

For example:

java.lang.NullPointerException
at io.confluent.connect.avro.AvroConverterTest.testVersionMaintained(AvroConverterTest.java:210)

@ept (Contributor, Author) commented Jan 5, 2018

Whoops, sorry about the test failures. I had only run the tests on the kafka-avro-serializer module, not the rest. Pushed 6a9092a to fix the build.

@@ -41,6 +43,8 @@

private static final Map<String, Schema> primitiveSchemas;
protected SchemaRegistryClient schemaRegistry;
protected SubjectNameStrategy keySubjectNameStrategy = new TopicNameStrategy();
A reviewer (Member) commented on the diff:

Nit: the default is already specified in the config def, so this is not necessary.

@ept (Contributor, Author) replied:

It is necessary: if the field is not initialised here, we get the NPEs that @rhauch complained about in tests for the kafka-connect-avro-converter module. It might be that those tests aren't properly configuring the SerDe, but I didn't want to get into debugging tests that are unrelated to the feature at hand.

public class RecordNameStrategy implements SubjectNameStrategy {

@Override
public void configure(Map<String, ?> config) {
A reviewer (Member) commented on the diff:

I'm not seeing this being invoked after instance creation. Since this is a public API, users would possibly expect it to work if they use it. We should either invoke it or not extend Configurable in the interface.

Another reviewer (Member) replied:

This is invoked here, right?

I think it's useful to extend Configurable -- it's too hard with Java 7 to add it later when we need it.

The first reviewer (Member) replied:

That's right. I missed that it is using getConfiguredInstance. LGTM
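A sketch of that wiring, assuming the standard semantics of Kafka's AbstractConfig.getConfiguredInstance (it instantiates the configured class and, if the instance implements Configurable, calls configure() with the original config map):

SubjectNameStrategy strategy = config.getConfiguredInstance(
    AbstractKafkaAvroSerDeConfig.KEY_SUBJECT_NAME_STRATEGY,
    SubjectNameStrategy.class);
// Roughly equivalent to:
//   SubjectNameStrategy s = clazz.newInstance();
//   if (s instanceof Configurable) {
//     ((Configurable) s).configure(config.originals());
//   }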

@ept (Contributor, Author) commented Jan 9, 2018

Could someone at Confluent please tell me why the Jenkins check is failing? The branch builds fine for me locally, and jenkins.confluent.io is not accessible to the public.

@mageshn (Member) commented Jan 10, 2018

@ept It's failing due to some FindBugs errors that are already fixed in master. Can you try rebasing your branch?

@ept (Contributor, Author) commented Jan 12, 2018

@mageshn Ok, rebased onto master.

@arnaudbos commented:
Hi,
What release of schema-registry (and which docker image tag) is this patch planned for?
Thanks for the hard work 🤓

@defpearlpilot commented:
I second @arnaudbos's question. What is the release timeline for this?

@mageshn (Member) commented Feb 15, 2018

This will be released with the upcoming 4.1 release. I don't have exact timelines but should be tentatively around end of March or early April.

@tPl0ch commented Feb 18, 2018

@mageshn Are there any plans to port this behaviour to the REST proxy too? Currently the same restriction applies there, since only a single key/value schema is allowed per record batch.

@eventSourcerer commented Feb 20, 2018

This looks like a very nice improvement; our event schema is getting HUGE ;-)

Will this integrate with Kafka Streams? Currently we provide a concrete keySerde and valueSerde per topic there...

@jxbes commented Mar 14, 2018

I second @eventSourcerer 's question. How will we provide the key/value serde when reading a topic into a stream?

@gphilipp commented Mar 23, 2018

Users who want several event types in the same topic currently either have to put them in one big Avro union (which works, but gets unwieldy very quickly).

Is there an example of this somewhere (with a union type as the root element of the schema) ?

@Yahampath commented:

Hi, is there an example of this anywhere? If so, please comment below.

@arnaudbos commented May 24, 2018

I don't think Avro supports this out of the box.
What you'd have to do is use a union-typed field.
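For example, a minimal sketch of that approach with Avro's SchemaBuilder (all record and field names are hypothetical): a wrapper record whose payload field is a union of the concrete event types. A bare union is also accepted by Avro as a top-level schema, though tooling support for that varies.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import java.util.Arrays;

Schema created = SchemaBuilder.record("customerCreated").namespace("example")
    .fields().requiredString("customerId").endRecord();
Schema addressChanged = SchemaBuilder.record("customerAddressChanged").namespace("example")
    .fields().requiredString("newAddress").endRecord();

// A wrapper record with a union-typed "payload" field:
Schema envelope = SchemaBuilder.record("customerEvent").namespace("example")
    .fields()
    .name("payload").type().unionOf()
        .type(created).and()
        .type(addressChanged)
    .endUnion().noDefault()
    .endRecord();

// Alternatively, a union as the root schema itself:
Schema rootUnion = Schema.createUnion(Arrays.asList(created, addressChanged));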

Now that 4.1 is out, can we use the multi-schema feature? I didn't see it mentioned in the release note.

We implemented our own version of schema-registry based on etcd (we had etcd at hand) because we needed this feature before it was released. It doesn't support schema compatibility enforcement, though, and rather than adding that on top of our own we'd like to migrate whenever possible.

Edit: Yes it has been released in 4.1.0, see changelog and docs.

@buntyray commented:

Where can I get this patch for version 3.3.0?

@teabot (Contributor) commented Jul 16, 2019

Why is the Avro union not a better fit for this? The criticism of the union was that it creates unwieldy schemas, which I assume is a design-time concern. However, I believe this is alleviated through the use of IDL imports to compose multiple smaller schemas into a single union.

By inventing a new out-of-band method of type composition, the utility of Avro compatibility checks is reduced and the burden on consumers increased. It unnecessarily creates another method of encoding/decoding records onto a topic. There is no longer a single contract for the whole topic, and now, to validate a consumer's ability to read, one needs to:

  • list all subjects
  • filter subjects by topic prefix
  • extract schema for each subject

Under this scheme, the 'total schema' for the topic is opaque to the consumer and this feels misaligned with the principles of shared schemas and registries.

@rayokota (Member) commented Jul 8, 2020

@teabot , unions can now be used with schema references to store multiple event types in the same topic:

https://www.confluent.io/blog/multiple-event-types-in-the-same-kafka-topic/
