Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip refresh for unused segments in metadata cache #16990

Merged
merged 6 commits into from
Sep 12, 2024

Conversation

findingrish
Copy link
Contributor

@findingrish findingrish commented Sep 3, 2024

Parent issue: #14989

In the metadata cache, unused segment could be marked for refresh leading to unnecessary metadata queries.

One such scenario is:

  • Compaction task generates new set of segments for an interval leading to existing segments being marked as unused.
  • Those segments are being unloaded from historical's.
  • MetadataCache receives segment removal callback from historical's and marks the datasource for refresh.
  • The segment and schema poll updates the schema cache with schema for used segments.
  • The refresh call doesn't find schema for some segments and marks them to be refreshed in the next cycle.
  • In the next refresh call metadata query is issued for unused segments.

This PR tries to place a check for unused segments before marking it for refresh.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

ImmutableDruidDataSource druidDataSource =
sqlSegmentsMetadataManager.getImmutableDataSourceWithUsedSegments(segmentId.getDataSource());

if (druidDataSource != null && druidDataSource.getSegment(segmentId) != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the segment if the segment is unused when we receive a callback from the inventory view.?

Copy link
Contributor

@kfaraz kfaraz Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whenever SqlSegmentsMetadataManager polls the database and returns the latest list of used segments, we should check that list and remove any segments from cache which are not in that list.

This removal should not be triggered based on any callback from server inventory view. Metadata store should be the source of truth on whether a segment is used or not.

Also, I was under the impression that this feature is already included. Can you confirm, @findingrish ?

Copy link
Contributor Author

@findingrish findingrish Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whenever SqlSegmentsMetadataManager polls the database and returns the latest list of used segments, we should check that list and remove any segments from cache which are not in that list.

This doesn't happen currently. Callbacks from the inventory view is the source of truth for the metadata cache.
The reason being that segments in the metadata cache are representative of the available segments loaded on various data servers in the cluster.

I feel that this would be an expensive operation as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Callbacks from the inventory view is the source of truth for the metadata cache.

I am not sure I understand. Isn't the schema coming from the metadata store?

Copy link
Contributor Author

@findingrish findingrish Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand. Isn't the schema coming from the metadata store?

That is partly correct, we still rely on metadata query to fetch schema for segments published from MSQ job.
The datasource schema is still computed from the segments which are available.

Copy link
Contributor

@kfaraz kfaraz Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, my understanding is that we should be doing the following:

  • Every minute (or less), coordinator polls the metadata store to get the list of used segments and their schemas
  • Coordinator populates its cache with this information
  • Coordinator gets rid of any segment not found in the above list
  • If any used segment that the Coordinator received in the latest poll does not have the schema populated, we fire a metadata query to get that schema and populate it back into the DB.
  • Subsequent polls will get schema for that segment also.
  • If any service requests schema for a used segment, coordinator sends back an OK response
  • If any service requests schema for an unused segment, coordinator sends back a NOT FOUND response.

Let me know if this aligns with what we need and what we currently have.


Traditionally, coordinator has been aware of "used" segments only (both overshadowed and non-overshadowed). We should continue to maintain those semantics.

Coordinator does also maintain an inventory view which might show unused segments on some server. But in the immediately next run, the Coordinator unloads the unused segments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can base the schema building logic entirely on segments polled from the database since realtime segment schema isn't present in the metadata database.

Also, in the present mechanism datasource schema is rebuilt only on segment addition/removal. In the proposed approach the datasource schema would be rebuilt after each poll leading to overhead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the segment if the segment is unused when we receive a callback from the inventory view.?

@cryptoe I think there could be a scenario where this would still fail.

  • Segment s1 got added to foo and the cache tries to refresh it.
  • The cache is building the schema for foo and in parallel s2 was marked unused.
  • The caches marks s2 to be refreshed in the next cycle.
  • Lets assume that the segment removal callback didn't make it to the Coordinator until the next refresh cycle got triggered.
  • s2 an unused segment will be refreshed in that cycle.

Copy link
Contributor

@kfaraz kfaraz Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can base the schema building logic entirely on segments polled from the database since realtime segment schema isn't present in the metadata database.

That's a fair point. How does the coordinator currently distinguish between a realtime segment and a published segment?

Also, in the present mechanism datasource schema is rebuilt only on segment addition/removal. In the proposed approach the datasource schema would be rebuilt after each poll leading to overhead?

Don't we already do that? If not, how is the schema polled from the DB used?

Copy link
Contributor Author

@findingrish findingrish Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the coordinator currently distinguish between a realtime segment and a published segment?

Both segments are added to the timeline, the difference is in refreshing the segment schema.
Traditionally, a realtime segment would be refreshed every minute by issuing metadata query to tasks serving the segment whereas a published segment would be refreshed only once after the segment has been added.

In the current scheme, schema for published segment is obtained from the database whereas realtime segment schema is periodically pushed to the Coordinator from tasks.

Don't we already do that? If not, how is the schema polled from the DB used?

No we don't do this as of now. The segment schema polled from the database is cached and used on-demand ie. when a segment is added or datasource schema is being rebuilt.

@cryptoe cryptoe merged commit a18f582 into apache:master Sep 12, 2024
88 of 90 checks passed
@cryptoe cryptoe added this to the 31.0.0 milestone Sep 12, 2024
cecemei pushed a commit to cecemei/druid that referenced this pull request Sep 12, 2024
* Skip refresh for unused segments in metadata cache

* Cover the condition where a used segment missing schema is marked for refresh

* Fix test
cecemei pushed a commit to cecemei/druid that referenced this pull request Sep 13, 2024
* Skip refresh for unused segments in metadata cache

* Cover the condition where a used segment missing schema is marked for refresh

* Fix test
cecemei pushed a commit to cecemei/druid that referenced this pull request Sep 14, 2024
* Skip refresh for unused segments in metadata cache

* Cover the condition where a used segment missing schema is marked for refresh

* Fix test
cecemei added a commit to cecemei/druid that referenced this pull request Sep 14, 2024
findingrish added a commit to findingrish/druid that referenced this pull request Sep 16, 2024
* Skip refresh for unused segments in metadata cache

* Cover the condition where a used segment missing schema is marked for refresh

* Fix test
findingrish added a commit to findingrish/druid that referenced this pull request Sep 16, 2024
* Skip refresh for unused segments in metadata cache

* Cover the condition where a used segment missing schema is marked for refresh

* Fix test
abhishekrb19 pushed a commit that referenced this pull request Sep 18, 2024
* Skip refresh for unused segments in metadata cache

* Cover the condition where a used segment missing schema is marked for refresh

* Fix test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants