
Clear the CallbackStatus entry from the map in FlushlessEventProducerHandler #843

Merged 14 commits into linkedin:master on Jul 27, 2021

Conversation

vmaheshw
Collaborator

ISSUE:
Stuck partitions were reported twice in prod-lor1 mm.one in the last week. In both instances, either a bad Kafka broker was abruptly removed or there were many URPs (under-replicated partitions).

Root cause:

a. The initial suspicion was that Kafka failed to ack() some of the send() calls, resulting in stuck partitions.
b. The logs contained errors that should never appear:
2021/07/04 13:34:58.325 ERROR [FlushlessEventProducerHandler] [kafka-producer-network-thread | DatastreamLiKafkaProducer] [brooklin-service] [] Internal error: checkpoints should progress in increasing order. Resolved checkpoint as 966284416 which is less than current checkpoint of 966284432
2021/07/04 13:34:58.325 ERROR [FlushlessEventProducerHandler] [kafka-producer-network-thread | DatastreamLiKafkaProducer] [brooklin-service] [] Internal error: checkpoints should progress in increasing order. Resolved checkpoint as 966284363 which is less than current checkpoint of 966284416

This log means the ACK received was for an offset smaller than the in-memory checkpoint in callbackStatus. Either Kafka delivered out-of-order offsets, or the in-flight message set contained out-of-order messages for some reason.

  1. What actually happened: we did not clear the callbackStatus object for the topic partition that was rewound to the last checkpoint, so it kept all the older in-memory state and continued working on top of it.

 

| Action | In-Flight Message Set | ACK Queue | Checkpoint | Poll() Offset | Notes |
| --- | --- | --- | --- | --- | --- |
| (initial state) | {3, 5, 6, 7} | {4} | 3 | 8 | Poll() offset is the next offset the consumer will poll |
| Send failure for 5 | {3, 5, 6, 7} | {4} | 3 | 3 | Seek to last checkpoint |
| Send 3, 4, 5, 6, 7 | {3, 5, 6, 7, 4} | {4} | 3 | 8 | Add offsets to the in-flight set; previously acked 4 lands at the tail |
| Ack for 3 | {5, 6, 7, 4} | {3, 4} → {} | 5 | 8 | Move the checkpoint to 5, since all ACKs are smaller than the first in-flight message |
| Send failure for 4 | {5, 6, 7, 4} | {} | 5 | 5 | Seek to last checkpoint |
| Send 5, 6, 7, 8 | {5, 6, 7, 4, 8} | {} | 5 | 9 | Add sent offsets to the in-flight set |
| Ack 7 | {5, 6, 4, 8} | {7} | 5 | 9 | Add 7 to the ACK queue, since 7 > 5 (first in-flight message) |
| Ack 6 | {5, 4, 8} | {6, 7} | 5 | 9 | Add 6 to the ACK queue, since 6 > 5 (first in-flight message) |
| Ack 5 | {4, 8} | {5, 6, 7} | 5 | 9 | Add 5 to the ACK queue, since 5 > 4 (first in-flight message) |
| | | | | | The checkpoint remains stuck because 4 was never sent again |

 

 

So this stuck-partition issue occurs when there are two consecutive {send() failure, seekToLastCheckpoint()} cycles without the task thread restarting. This is the main root cause of the stuck partitions seen during shutdown, when brokers were being removed on the Kafka side.
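The transitions in the table can be reproduced with a small model. This is a simplified sketch, not the actual Brooklin CallbackStatus class: it assumes an insertion-ordered in-flight set, a sorted ack set, and a checkpoint that only advances while every pending ack is below the oldest in-flight offset.

```java
import java.util.LinkedHashSet;
import java.util.TreeSet;

// Simplified model of the per-partition state (not the actual Brooklin
// CallbackStatus class): insertion-ordered in-flight set, sorted ack set,
// and a checkpoint that only advances while every pending ack is below
// the OLDEST in-flight offset -- oldest by insertion order, not by value.
class CheckpointModel {
    private final LinkedHashSet<Long> inFlight = new LinkedHashSet<>();
    private final TreeSet<Long> acked = new TreeSet<>();
    private long checkpoint = -1;

    void send(long offset) {
        // LinkedHashSet ignores duplicate adds but keeps the ORIGINAL
        // position, so a re-sent offset that was acked earlier lands at
        // the tail -- the out-of-order state from the table.
        inFlight.add(offset);
    }

    void ack(long offset) {
        inFlight.remove(offset);
        acked.add(offset);
        if (!inFlight.isEmpty() && acked.last() < inFlight.iterator().next()) {
            checkpoint = acked.last() + 1;  // next offset to resume from
            acked.clear();
        }
    }

    long checkpoint() { return checkpoint; }
}
```

Replaying the table's two failure/seek cycles against this model leaves the checkpoint stuck at 5, because the stale offset 4 sits at the head of the in-flight set and is never acked.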

 

Solution:

The ordering of the in-flight message set is critical in this algorithm, and because the older state was not cleared, it produced undesired transient states.

Option 1: Clear the callbackStatus entry for any topic partition that is rewound to an older checkpoint.
Option 2: Use a different data structure that maintains ordering and removes duplicates in the in-flight message set. A priority queue handles the ordering, but would have to be extended to reject duplicate additions.

For more deterministic behavior, my recommendation is Option 1.
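Option 1 can be sketched as follows. The class and helper names here are illustrative stand-ins (only `clear` and `_callbackStatusMap` come from the PR itself); `SourcePartition` is reduced to a minimal value-based key:

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of Option 1 (illustrative, not the exact Brooklin code): on rewind,
// remove the partition's CallbackStatus so no stale in-flight/ack state
// survives the seekToLastCheckpoint().
class FlushlessHandlerSketch {
    // Minimal stand-in for Brooklin's SourcePartition key: value-based
    // equals/hashCode so remove() finds the entry for a freshly built key.
    static final class SourcePartition {
        final String source; final int partition;
        SourcePartition(String source, int partition) {
            this.source = source;
            this.partition = partition;
        }
        @Override public boolean equals(Object o) {
            return o instanceof SourcePartition
                && ((SourcePartition) o).source.equals(source)
                && ((SourcePartition) o).partition == partition;
        }
        @Override public int hashCode() { return Objects.hash(source, partition); }
    }

    static final class CallbackStatus { /* in-flight set, ack set, checkpoint */ }

    private final Map<SourcePartition, CallbackStatus> _callbackStatusMap =
        new ConcurrentHashMap<>();

    void recordSend(String source, int partition) {
        _callbackStatusMap.computeIfAbsent(
            new SourcePartition(source, partition), k -> new CallbackStatus());
    }

    /** Clear the source-partition entry from the _callbackStatusMap. */
    public void clear(String source, int partition) {
        _callbackStatusMap.remove(new SourcePartition(source, partition));
    }

    boolean hasState(String source, int partition) {
        return _callbackStatusMap.containsKey(new SourcePartition(source, partition));
    }
}
```

The next send() after the rewind then starts from a fresh CallbackStatus, so the in-flight ordering matches the re-sent offsets exactly.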

jzakaryan
jzakaryan previously approved these changes Jul 26, 2021
Collaborator

@jzakaryan jzakaryan left a comment

LGTM. Nit: update the comment in the test.

```java
/**
 * Clear the source-partition entry from the _callbackStatusMap
 */
public void clear(String source, int partition) {
  _callbackStatusMap.remove(new SourcePartition(source, partition));
}
```
Collaborator
Just for clarification: will this result in the correct key getting removed from the map? Does SourcePartition override hashCode()?

Collaborator Author

SourcePartition extends Pair underneath and uses the default hashCode() at the Object level. SourcePartition has been in use as the key in this HashMap for a long time.
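For context on why the question matters: HashMap.remove() with a freshly constructed key only finds the entry if the key type defines value-based equals()/hashCode(). A hypothetical contrast (not the real SourcePartition):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical key classes illustrating the reviewer's concern:
// remove(new Key(...)) only works with value-based equals()/hashCode().
class KeyDemo {
    // Identity-based key: inherits Object.equals/hashCode, so two
    // instances with the same fields are NOT equal.
    static final class IdentityKey {
        final String source; final int partition;
        IdentityKey(String s, int p) { source = s; partition = p; }
    }

    // Value-based key, like a Pair: same fields means equal keys.
    static final class ValueKey {
        final String source; final int partition;
        ValueKey(String s, int p) { source = s; partition = p; }
        @Override public boolean equals(Object o) {
            return o instanceof ValueKey
                && ((ValueKey) o).source.equals(source)
                && ((ValueKey) o).partition == partition;
        }
        @Override public int hashCode() { return Objects.hash(source, partition); }
    }

    static boolean removableWithFreshIdentityKey() {
        Map<IdentityKey, String> m = new HashMap<>();
        m.put(new IdentityKey("t", 0), "status");
        return m.remove(new IdentityKey("t", 0)) != null;  // identity equality: miss
    }

    static boolean removableWithFreshValueKey() {
        Map<ValueKey, String> m = new HashMap<>();
        m.put(new ValueKey("t", 0), "status");
        return m.remove(new ValueKey("t", 0)) != null;  // value equality: hit
    }
}
```

So the clear() above is correct only because the SourcePartition key behaves like the value-based variant.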

somandal
somandal previously approved these changes Jul 26, 2021
@vmaheshw vmaheshw dismissed stale reviews from somandal and jzakaryan via 85a4e21 July 26, 2021 21:27
Collaborator

@shrinandthakkar shrinandthakkar left a comment


lgtm!

@vmaheshw vmaheshw merged commit 7f7c7c5 into linkedin:master Jul 27, 2021
@vmaheshw vmaheshw deleted the fixFlushlessProducer branch July 27, 2021 16:41
vmaheshw added a commit to vmaheshw/brooklin that referenced this pull request Mar 1, 2022
Clear the CallbackStatus entry from the map in FlushlessEventProducerHandler (linkedin#843)