
split KillUnusedSegmentsTask to processing in smaller chunks #14642

Merged
10 commits merged into apache:master on Jul 31, 2023

Conversation

@jasonk000 (Contributor) commented Jul 23, 2023

Related to #14131, #14634, and #14639.

Description

Split the KillUnusedSegmentsTask to process segments in smaller batches. The segment nuke runs inside a critical section in the TaskLockbox, which blocks most activity in the overlord, including allocation of new segments, while the nuke is running. If the nuke runs for a long time, it can cause cluster stability issues.

This change relieves pressure on the TaskLockbox by splitting the nuke work into smaller chunks, which allows other tasks to interleave their use of the lockbox.

Key changed/added classes in this PR
  • KillUnusedSegmentsTask: partitions the segments into batches of size 10000 for the nuke (a minimal sketch of the batching loop follows below).
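
A minimal, hypothetical sketch of the batching loop described above; the interface and method names are stand-ins, not Druid's actual task or action APIs:

import java.util.List;

// Hypothetical, self-contained illustration of the batching idea; the real
// KillUnusedSegmentsTask and SegmentNukeAction in Druid have different signatures.
class NukeBatchingSketch
{
  // Stand-in for submitting a SegmentNukeAction through the task action client.
  interface SegmentNuker
  {
    void nuke(List<String> segmentIds);
  }

  static void nukeInBatches(List<String> unusedSegmentIds, SegmentNuker nuker, int batchSize)
  {
    // Each nuke() call holds the TaskLockbox critical section for only one small
    // batch, so other overlord work can interleave between batches.
    for (int start = 0; start < unusedSegmentIds.size(); start += batchSize) {
      int end = Math.min(start + batchSize, unusedSegmentIds.size());
      nuker.nuke(unusedSegmentIds.subList(start, end));
    }
  }
}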

This PR has:

  • been self-reviewed.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

Release notes

A new optional parameter has been introduced to the kill task description, which allows the kill to be performed in batches rather than all at once. This should improve stability when issuing larger and longer-running kill tasks.
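
For illustration only, a manually submitted kill task spec using the new parameter might look like the following; the parameter name batchSize is taken from the commit messages later in this thread, the datasource and interval are placeholders, and the exact default should be checked against the Druid docs:

{
  "type": "kill",
  "dataSource": "my_datasource",
  "interval": "2018-01-01/2020-01-01",
  "batchSize": 10000
}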

@jasonk000 jasonk000 requested a review from kfaraz July 23, 2023 20:09
@jasonk000 (Contributor, Author) commented Jul 23, 2023

@kfaraz @TSFenwick looking for input from you both on this, currently it's a draft:

  1. should the 1000 value be configurable, and if so, what's the best mechanism to introduce this?
  2. currently there's no TC for SegmentNukeAction; is it warranted to introduce one here?

I intend to run this on one of our clusters during the week next week. cc @maytasm.

@kfaraz (Contributor) commented Jul 24, 2023

This makes sense, @jasonk000 .

  • The batch size need not be configurable and can probably even be a little higher.
  • You can try to benchmark and determine the optimal batch size during your testing.
  • Also, instead of having a task action return type of void, we can return the number of segments actually deleted. This becomes more important now since we are breaking up the transaction.

@kfaraz (Contributor) commented Jul 24, 2023

Also, I was looking at the SegmentNukeAction, and the KillUnusedSegmentsTask is the only place where it is used.

Now the kill task essentially does the following:

  • For a given interval, mark segments as unused
  • Fetch all the unused segments in that interval
  • Delete all the fetched unused segments

Since we already have the interval, I wonder if we even need to fire multiple DELETE statements and if just a single one with the right WHERE clause on start, end, datasource and used would suffice. There are probably some nuances here such as the IntervalMode used while retrieving the unused segments, etc, but nothing that cannot be wired up.

Since we are trying to optimize this code path, I think we should explore that angle as well.
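
For illustration, the single-statement alternative might look roughly like the plain-JDBC sketch below. The table and column names follow Druid's default segments metadata table but are assumptions here, and a real implementation would go through Druid's metadata storage coordinator rather than raw JDBC:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical sketch: delete all unused segments of a datasource within an
// interval using one DELETE, instead of one DELETE per segment id.
class IntervalDeleteSketch
{
  static int deleteUnusedSegmentsInInterval(Connection conn, String dataSource, String start, String end)
      throws SQLException
  {
    // Segment start/end are stored as ISO-8601 strings, so lexicographic comparison
    // matches chronological order when the formats are identical.
    String sql = "DELETE FROM druid_segments "
        + "WHERE dataSource = ? AND \"start\" >= ? AND \"end\" <= ? AND used = false";
    try (PreparedStatement stmt = conn.prepareStatement(sql)) {
      stmt.setString(1, dataSource);
      stmt.setString(2, start);
      stmt.setString(3, end);
      return stmt.executeUpdate();  // number of segment rows removed
    }
  }
}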

@maytasm (Contributor) commented Jul 24, 2023

Druid has a feature where the Coordinator will periodically submit tasks to kill unused segments (this is enabled by druid.coordinator.kill.on). When this feature is enabled, there is also a config, druid.coordinator.kill.maxSegments, which controls the maximum number of segments to kill per kill task submission. The default value is 100.
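
For reference, a minimal sketch of enabling this in the Coordinator runtime properties; only the two properties named above are taken from this thread, and related kill settings exist that should be checked in the docs:

druid.coordinator.kill.on=true
druid.coordinator.kill.maxSegments=100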

Just some thoughts I have:

  • If we have already sped up the segment nuke by batching the SQL metadata deleteSegments calls ((perf) server: batch SQL Metadata deleteSegments #14639), do we still need this?
  • If the automatic kill task (druid.coordinator.kill.on) is our paved path, should we encourage users to use that (and set druid.coordinator.kill.maxSegments, etc.) rather than manually submitting kill tasks? If users manually submit kill tasks, they would not have any of the guardrails that the automatic kill task provides (such as limits on how many segments go into one kill task, protection against killing an interval they did not intend, a datasource whitelist/blacklist, etc.).
  • Do we need the segment nuke action to return the segments actually deleted?
  • What happens to the caller of the segment nuke action (i.e. KillTask) if some batches succeeded and some batches failed? The number of failed/succeeded segments may not be enough; we may need the actual ids of the segments, right?
  • If we do go ahead with this change, I think not having a new config is better.

@kfaraz (Contributor) commented Jul 24, 2023

Yes, @maytasm , I agree. With the recent improvements made in #14639 and the automatic Coordinator kill being a thing, I am not sure if we would really gain a lot of perf benefits here in most cases.

But since we cannot prevent users from submitting a kill task (it is allowed by the console after all), it would make sense to improve this action, especially since it is done in a critical section of the TaskLockbox.

In my opinion, the best way forward is to check if we can fire a single DELETE with the right filter on interval, rather than a batch of DELETEs. It would not require us to have any new configs or worry about partial success.

cc: @jasonk000

@jasonk000 (Contributor, Author) commented Jul 24, 2023

Since we already have the interval, I wonder if we even need to fire multiple DELETE statements and if just a single one with the right WHERE clause on start, end, datasource and used would suffice. There are probably some nuances here such as the IntervalMode used while retrieving the unused segments, etc, but nothing that cannot be wired up.

This makes sense to me, though I'm not entirely clear on the segment rules around this. If we did this, it would change the scope a bit: we take in an (unordered) Set, and it's not obvious at this layer that we have all the entries in the Interval, so we might delete unintended files. I would think for best results the API needs to change from deleteSegments() to deleteSegmentsInInterval()?

If we already speed up segment nuke by batching the SQL Metadata deleteSegments, do we still need this?

Yes. Today, if we issue a task for "delete 10 million segments", it will lock up the overlord/cluster. Even though S3 and SQL batching will make it much, much faster, it won't solve the cluster stability problem; it just raises the segment count needed to lock it up. This lock being held too long brings most overlord-guided activity -- including ingestion -- to a halt, since ingestion tasks can't get segments allocated and are forced to wait.

If automatic kill task (druid.coordinator.kill.on) is our paved path, should we encourage users to use that (and set the druid.coordinator.kill.maxSegments, etc) over manually submitting kill task? If user manually submit kill task, they would not have any guardrail that automatic kill task provide (such as too many segments in one kill task, killing interval that they did not intended, datasource whitelist/blacklist, etc).

I agree, though we can easily imagine a situation where a user doesn't realise the automatic kill task exists and needs to do a catch-up (me!), or a change in requirements to reduce or change storage configuration means a user no longer needs 24 months of data but only 12 months. In these situations, a user should be able to issue a task and have the system stay stable.

Do we need the segment nuke action to return the segment

It could, but it doesn't at the moment. Would you use it in KillTask or just for logging?

What happen to the caller of the segment nuke action (i.e. KillTask) if some batches succeeded and some batches failed? The number of failed/succeeded may not be enough and we may need the actual id of the segments right?

This is an unresolved problem today: the SQL delete happens inside a transaction which is committed before the S3 deletes are issued. So, if any files fail to delete from S3, the segments are still gone from SQL. I'm not sure if there's a reconciliation process.

If we do go ahead with this change, I think not having a new config is better.

Makes sense to me too.

@jasonk000 (Contributor, Author) commented:
Just to be clear, I don't see this as a "performance" thing, but rather, a "stability" thing. Issuing a large kill task currently will block the overlord from operating. With this patch, the overlord will continue operating.

@maytasm (Contributor) commented Jul 24, 2023

  • I believe KillUnusedSegments actually converts the interval to segments before passing them to SegmentNukeAction (KillUnusedSegments uses RetrieveUnusedSegmentsAction to find unused segments for a given datasource + interval). So we actually started with an Interval. I am not sure if there is any reason we are using segment ids rather than the interval (as suggested by @kfaraz).
  • I am convinced we should make this change. KillTask can still be issued manually, and the number of segments in each KillTask can easily be very large without the user realizing it.
  • For returning the segments actually deleted from SegmentNukeAction, I was thinking we can use this to determine which segments we want to delete from S3 (skip deleting from S3 if SegmentNukeAction failed to remove the segment from the metadata store).
  • If any files fail to delete from S3, the segments are still gone from SQL -> I think this is less of a problem than a file being deleted from S3 but still remaining in SQL. If files fail to delete from S3, it does not affect Druid, since Druid considers them deleted (gone forever). The S3 leak can be reconciled with an S3 retention policy that deletes very old files (with a window much greater than Druid's retention, to clean up the leak). If a file is deleted from S3 but still remains in SQL, Druid allows the user to bring back unused segments (by marking unused segments as used), which would cause problems because the actual files are no longer in deep storage!

@jasonk000 (Contributor, Author) commented:
After discussion with @maytasm (thanks!), we think it's best to move the batching up from the SegmentNuke action into the KillUnusedSegmentsTask. This allows us to batch while also preserving the existing behaviour of "when the nuke is OK, remove the S3 files; otherwise leave the S3 files intact". Batching only in the SegmentNuke layer would have made the failure-handling logic overly complicated.
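
A hypothetical sketch of the resulting per-batch ordering, using stand-in interfaces rather than Druid's actual classes: each batch is first removed from the metadata store, and its deep-storage files are deleted only if that nuke succeeded.

import java.util.List;

// Hypothetical stand-ins showing the per-batch ordering; not Druid's real APIs.
class PerBatchKillOrderingSketch
{
  interface MetadataNuker
  {
    void nuke(List<String> segmentIds) throws Exception;  // delete one batch from the metadata store
  }

  interface DeepStorageKiller
  {
    void kill(List<String> segmentIds) throws Exception;  // delete the same batch's files from deep storage
  }

  static void killBatches(List<List<String>> batches, MetadataNuker nuker, DeepStorageKiller killer)
      throws Exception
  {
    for (List<String> batch : batches) {
      // If the nuke throws, we stop before touching deep storage, so files whose
      // metadata rows were not removed stay intact.
      nuker.nuke(batch);
      killer.kill(batch);
    }
  }
}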

@jasonk000 (Contributor, Author) commented Jul 25, 2023

@maytasm @kfaraz how do we feel now about this PR? I've focused the change on the one specific problem area, and added some extra comments to explain the code a bit more thoroughly.

There's an opportunity that I'm ignoring here for the moment, similar to @kfaraz's suggestion, which is that RetrieveUnusedSegmentsAction could limit the number of segments retrieved, or divide the interval and iterate over it in smaller blocks. It looks like RetrieveUnusedSegmentsAction is also used by MoveTask, ArchiveTask, and RestoreTask in addition to KillUnusedSegmentsTask.

Processing in smaller chunks allows the task execution to yield the TaskLockbox lock,
which allows the overlord to continue being responsive to other tasks and users while
this particular kill task is executing.
@jasonk000 jasonk000 force-pushed the segment-nuke-chunk-processing branch from d08fcff to a372da0 Compare July 27, 2023 22:54
@jasonk000 jasonk000 requested a review from maytasm July 27, 2023 22:57
@jasonk000 jasonk000 changed the title split segment-nuke action to smaller chunks split KillUnusedSegmentsTask to processing in smaller chunks Jul 27, 2023
@jasonk000 jasonk000 marked this pull request as ready for review July 27, 2023 23:03
@kfaraz (Contributor) left a comment:

Thanks for the changes, @jasonk000!
I have left some comments. Could you please also update the PR description to reflect the latest set of changes?

@maytasm (Contributor) left a comment:

LGTM.
I have some minor comments but am +1 on the design/approach.

@kfaraz (Contributor) left a comment:

I agree with the overall approach given that we want the change to have a small footprint while reducing the lock burden on Overlord.

I have left some minor comments. +1 after those are addressed.

* provide a longer explanation for kill task batchSize parameter

* add logging details for kill batch progress

* javadoc and field name changes
@jasonk000 (Contributor, Author) commented:
Sample logging output during testing:

2023-07-29T21:15:04,585 INFO [main] org.apache.druid.indexing.common.task.KillUnusedSegmentsTask - Marked 4 segments as unused.
2023-07-29T21:15:04,586 INFO [main] org.apache.druid.metadata.IndexerSQLMetadataStorageCoordinator - Found 4 unused segments for dataSource for interval 2018-01-01T00:00:00.000Z/2020-01-01T00:00:00.000Z.
2023-07-29T21:15:04,586 INFO [main] org.apache.druid.indexing.common.task.KillUnusedSegmentsTask - kill starting: id [kill_dataSource_kmgianbe_2018-01-01T00:00:00.000Z_2020-01-01T00:00:00.000Z_2023-07-29T21:15:04.580Z] dataSource [dataSource] interval [2018-01-01T00:00:00.000Z/2020-01-01T00:00:00.000Z], total segments [4], batches [2], ([3] segments per batch)
2023-07-29T21:15:04,587 INFO [main] org.apache.druid.metadata.IndexerSQLMetadataStorageCoordinator - Deleted [3] segments from metadata storage for dataSource [dataSource].
2023-07-29T21:15:04,587 INFO [main] org.apache.druid.metadata.IndexerSQLMetadataStorageCoordinator - Deleted [1] segments from metadata storage for dataSource [dataSource].
2023-07-29T21:15:04,587 INFO [main] org.apache.druid.indexing.common.task.KillUnusedSegmentsTask - kill complete: id [kill_dataSource_kmgianbe_2018-01-01T00:00:00.000Z_2020-01-01T00:00:00.000Z_2023-07-29T21:15:04.580Z] dataSource [dataSource] interval [2018-01-01T00:00:00.000Z/2020-01-01T00:00:00.000Z], total segments [4], batches [2]

@jasonk000 (Contributor, Author) commented:
added logging and javadoc changes in d3d5a71 and 3fc9e96

@kfaraz (Contributor) left a comment:

Minor logging and doc suggestions.

jasonk000 and others added 5 commits July 31, 2023 09:16
  • better description of the task parameters (Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>)
  • better description of batch size parameter (Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>)
  • …indexing/common/task/KillUnusedSegmentsTask.java: updated druid logging style (Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>)
  • …indexing/common/task/KillUnusedSegmentsTask.java: updated druid logging style (Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>)
  • …indexing/common/task/KillUnusedSegmentsTask.java: updated druid logging style (Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>)
@jasonk000 jasonk000 merged commit 44d5c1a into apache:master Jul 31, 2023
74 checks passed
@jasonk000 jasonk000 deleted the segment-nuke-chunk-processing branch August 3, 2023 22:08
@abhishekagarwal87 (Contributor) commented:
@jasonk000 - can you please add some release notes to the PR description?

@jasonk000 (Contributor, Author) commented:
@abhishekagarwal87 added some notes to the description.

@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023