Eliminate Periodic Realtime Segment Metadata Queries: Task Now Publish Schema for Seamless Coordinator Updates #5

Closed
findingrish wants to merge 14 commits

@findingrish findingrish commented Nov 28, 2023

Description

Issue: apache#14989

The initial step in optimizing segment metadata was to centralize the construction of table schema in the Coordinator (apache#14985). Subsequently, our goal is to eliminate the requirement for regularly executing queries to obtain segment schema information. This task encompasses addressing both realtime and finalized segments.

This modification specifically addresses the issue with realtime segments. Tasks will now routinely communicate the schema for realtime segments during the segment announcement process. The Coordinator will identify the schema alongside the segment announcement and subsequently update the schema for realtime segments in the metadata cache.

Design

Task

  • Periodically, the StreamAppenderator.SinkSchemaAnnouncer will compute sink schema changes and announce them to the DataSegmentAnnouncer.
  • New APIs have been introduced in DataSegmentAnnouncer to receive sink schema information and manage schema cleanup when a task is closed.
  • A new POJO named SegmentSchemas has been added to carry schema information for multiple segments in a single announcement.
  • A new implementation of DataSegmentChangeRequest has been introduced, named SegmentSchemasChangeRequest.
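
The task-side idea of computing sink schema changes between periodic announcements can be sketched roughly as follows. This is only an illustration, not the code in this PR: `SinkSchemaDeltaTracker`, its method names, and the use of plain strings for column types are all hypothetical, and the real SinkSchemaAnnouncer and SegmentSchemas classes may be structured quite differently.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Druid class): remembers the schema last announced
// for each segment and reports only what changed since then.
class SinkSchemaDeltaTracker {
    // segmentId -> (column name -> column type) as of the previous announcement
    private final Map<String, Map<String, String>> lastAnnounced = new HashMap<>();

    /**
     * Returns the columns that are new, or whose type changed, since the previous
     * call for this segment. The first call for a segment returns its full schema.
     */
    Map<String, String> computeDelta(String segmentId, Map<String, String> currentSchema) {
        Map<String, String> previous =
            lastAnnounced.getOrDefault(segmentId, Collections.emptyMap());
        Map<String, String> delta = new LinkedHashMap<>();
        for (Map.Entry<String, String> column : currentSchema.entrySet()) {
            if (!column.getValue().equals(previous.get(column.getKey()))) {
                delta.put(column.getKey(), column.getValue());
            }
        }
        // Store a copy so later mutations by the caller do not affect the snapshot.
        lastAnnounced.put(segmentId, new LinkedHashMap<>(currentSchema));
        return delta;
    }

    /** Forgets a segment, for example as part of schema cleanup when a task closes. */
    void remove(String segmentId) {
        lastAnnounced.remove(segmentId);
    }
}
```

In this sketch an empty delta means there is nothing new to announce for that segment in the current cycle.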

Coordinator

  • Modifications have been made to the HttpServerInventoryView to handle schema information.
  • The CoordinatorSegmentMetadata cache has been updated to incorporate schema changes. Changes have also been made to the refresh logic to eliminate the need for executing segment metadata queries for realtime segments.
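
On the Coordinator side, the two changes above (absorbing announced schemas into the cache, and skipping segment metadata queries for realtime segments during refresh) might look something like the sketch below. `RealtimeSchemaCache` and every method on it are hypothetical names chosen for illustration; the sketch assumes an announcement is either a full schema or a delta to merge, which the description does not spell out.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache (not a Druid class) holding schemas announced by tasks.
class RealtimeSchemaCache {
    // segmentId -> (column name -> column type)
    private final Map<String, Map<String, String>> schemas = new ConcurrentHashMap<>();

    /** Applies an announced schema: a delta is merged, a full schema replaces the entry. */
    void applyAnnouncedSchema(String segmentId, Map<String, String> columns, boolean isDelta) {
        schemas.compute(segmentId, (id, existing) -> {
            Map<String, String> merged = (isDelta && existing != null)
                ? new LinkedHashMap<>(existing)
                : new LinkedHashMap<>();
            merged.putAll(columns);
            return merged;
        });
    }

    /** Returns the cached schema for a segment, or an empty map if none was announced. */
    Map<String, String> getSchema(String segmentId) {
        Map<String, String> schema = schemas.get(segmentId);
        return schema == null ? Collections.emptyMap() : Collections.unmodifiableMap(schema);
    }

    /**
     * Refresh filter: only segments that are not realtime still need a segment
     * metadata query, because realtime schemas arrive with the announcements.
     */
    List<String> segmentsNeedingMetadataQuery(Collection<String> pending, Set<String> realtimeSegments) {
        List<String> toQuery = new ArrayList<>();
        for (String segmentId : pending) {
            if (!realtimeSegments.contains(segmentId)) {
                toQuery.add(segmentId);
            }
        }
        return toQuery;
    }
}
```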

Testing

  • Added unit tests.
  • Tested locally with the Wikipedia dataset and Kafka-based ingestion.

Potential side effects

TBA

Limitations

Currently, this feature doesn't work with ZooKeeper-based segment announcement.

Upgrade considerations

The general upgrade order should be followed. The new code is behind a feature flag, so it is compatible with existing setups. If only centralized table schema building (apache#14985) is enabled, without this feature's flag, realtime segments will continue to be refreshed using segment metadata queries to the Indexer/Task.

Release notes

This experimental feature removes the need to periodically issue a SegmentMetadataQuery to the Indexer/Task to retrieve the schema of realtime segments. It is currently gated behind two feature flags and should be enabled only for proof-of-concept (PoC) or testing purposes. To activate it, configure both of the following settings in the common configurations: druid.coordinator.centralizedTableSchema.enabled and druid.coordinator.centralizedTableSchema.announceRealtimeSegmentSchema. Note that the feature flag druid.coordinator.centralizedTableSchema.announceRealtimeSegmentSchema is temporary and will be removed in a subsequent update.
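
For reference, enabling both flags might look like the fragment below. It assumes the two properties are plain boolean flags and that the common configuration lives in common.runtime.properties, the usual location for settings shared across Druid services; neither detail is stated explicitly in this description.

```properties
# Assumption: both flags are booleans and default to off.
druid.coordinator.centralizedTableSchema.enabled=true
# Temporary flag; expected to be removed in a later update.
druid.coordinator.centralizedTableSchema.announceRealtimeSegmentSchema=true
```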

@findingrish findingrish changed the title Changes to publish realtime segment schema changes in segment announcement step Publish realtime segment schema changes in segment announcement step Nov 30, 2023
@findingrish findingrish changed the title Publish realtime segment schema changes in segment announcement step Publish realtime segment schema in segment announcement step Nov 30, 2023
@findingrish findingrish changed the title Publish realtime segment schema in segment announcement step Publish realtime segment schema in segment announcement flow Nov 30, 2023
@findingrish findingrish changed the title Publish realtime segment schema in segment announcement flow Eliminate Periodic Realtime Segment Metadata Queries: Task Now Publish Schema for Seamless Coordinator Updates Dec 3, 2023
@findingrish findingrish closed this Dec 4, 2023