Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release all segments of a table in releaseAndRemoveAllSegments method #12297

Merged
merged 1 commit into from
Feb 2, 2024

Conversation

jackjlli
Copy link
Contributor

@jackjlli jackjlli commented Jan 22, 2024

This PR attempts to asynchronously release all the segments of a table in releaseAndRemoveAllSegments method. The reason of doing that by submitting multiple threads is that the table could have too many open segments, which will take time to release them one by one (especially for consuming segments). Thus, introduce the fixed thread pool to release them in order to speed up the stopping behavior.

The releaseAndRemoveAllSegments method can be invoked by two scenarios:

  1. server needs to be shut down. In this case, it's fine to spawn thread to async close segments.
  2. table gets deleted on a running server. In this case, it doesn't matter whether the consuming segments of the target table can be closed gracefully. Thus, it makes sense to use a fixed thread pool to close them at best effort.

This PR also adds the fixed thread pool to shut down table managers asynchronously. This can also help speed up the shutdown process for the server.

@codecov-commenter
Copy link

codecov-commenter commented Jan 22, 2024

Codecov Report

Attention: 18 lines in your changes are missing coverage. Please review.

Comparison is base (17db0fd) 61.69% compared to head (a796baa) 61.73%.
Report is 3 commits behind head on master.

Files Patch % Lines
...server/starter/helix/HelixInstanceDataManager.java 0.00% 13 Missing ⚠️
.../pinot/core/data/manager/BaseTableDataManager.java 61.53% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12297      +/-   ##
============================================
+ Coverage     61.69%   61.73%   +0.03%     
  Complexity      207      207              
============================================
  Files          2424     2424              
  Lines        132340   132520     +180     
  Branches      20436    20483      +47     
============================================
+ Hits          81651    81808     +157     
- Misses        44693    44715      +22     
- Partials       5996     5997       +1     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 61.67% <30.76%> (+26.65%) ⬆️
java-21 61.60% <30.76%> (+0.02%) ⬆️
skip-bytebuffers-false 61.70% <30.76%> (+0.01%) ⬆️
skip-bytebuffers-true 61.55% <30.76%> (-0.01%) ⬇️
temurin 61.73% <30.76%> (+0.03%) ⬆️
unittests 61.72% <30.76%> (+0.03%) ⬆️
unittests1 46.91% <61.53%> (+<0.01%) ⬆️
unittests2 27.69% <0.00%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jackjlli jackjlli force-pushed the stop-all-segments-in-releaseAndRemoveAllSegments branch from cb5a820 to 3ed9d68 Compare January 22, 2024 23:39
@siddharthteotia
Copy link
Contributor

  • Are we going to start an ExecutorService every time ?
  • Did we take a thread dump to understand why the segment release was blocked / what was it waiting on ?

@jackjlli
Copy link
Contributor Author

Are we going to start an ExecutorService every time?

The releaseAndRemoveAllSegments method is only invoked when shutting down the whole pinot-server. So within the lifecycle of a java process the executor will be started only once.

Did we take a thread dump to understand why the segment release was blocked / what was it waiting on?

No we didn't take a thread dump but agree that we should have taken one. Basically, it timed out while waiting for the consumer thread to exit. The consumer thread may still be in consumeLoop() method, or the consumer thread was still busy processing the fetched events. But right now since the stop() method is called for each segment at a time, and the timeout for waiting for a segment to stop is 10 mins, it will make the whole server to wait for just 1 segment to stop, while other consuming segments are still functioning. Same for other tables in the same pinot-server that hasn't gotten the chance to stop. That's why this PR tries to invoke the stop call async so that all the segments of a table can be stopped together for server shutdown.

@jackjlli jackjlli force-pushed the stop-all-segments-in-releaseAndRemoveAllSegments branch from 3ed9d68 to 2abd892 Compare January 23, 2024 04:04
@siddharthteotia
Copy link
Contributor

Another qq - Will the same release method be called if the table is dropped / deleted ?

I ask because we start the executor service with threads = numSegments

When the call is being made during deployment / shutdown, it is fine to spawn these many threads which could be a very high multiple of numCores.

But during steady state, if this code is invoked then spawning so many threads can affect query performance of other tables on the host.

@siddharthteotia
Copy link
Contributor

LGTM overall.

@mcvsubbu / @vvivekiyer / @somandal - in case one of you wants to take a look

@siddharthteotia
Copy link
Contributor

Also, it will be good to update the PR description with a bit more details on how we noticed the problem.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest using a feature flag to guard this enhancement for the following reasons:

  • The change will make the segment release async, so not sure if it has side effect
  • It can consume extra CPU resources thus impacting query performance

If the intention is to accelerate the server shutdown, we can consider parallel the table data manager shutdown in HelixInstanceDataManager.shutDown(), at which moment we know server is not serving queries

@jackjlli
Copy link
Contributor Author

If the intention is to accelerate the server shutdown, we can consider parallel the table data manager shutdown in HelixInstanceDataManager.shutDown(), at which moment we know server is not serving queries.

Yes, the intention is to accelerate the server shutdown. Paralleling table data manager shutdown can help in one way, whereas there is another scenario that there is some certain table with too many segments which require to be shut down one by one (and unfortunately this is the scenario we met). I can add the parallelism to the table level shutdown as well as changing the thread pool to a fixed sized thread pool.

@jackjlli jackjlli force-pushed the stop-all-segments-in-releaseAndRemoveAllSegments branch from 2abd892 to 4f05fd7 Compare February 1, 2024 06:56
@jackjlli
Copy link
Contributor Author

jackjlli commented Feb 1, 2024

Comments have been addressed in the latest push and the description is updated. Please take a look. Thanks!

@jackjlli jackjlli force-pushed the stop-all-segments-in-releaseAndRemoveAllSegments branch from 4f05fd7 to a796baa Compare February 1, 2024 18:50
@jackjlli jackjlli merged commit 87c089f into master Feb 2, 2024
19 checks passed
@jackjlli jackjlli deleted the stop-all-segments-in-releaseAndRemoveAllSegments branch February 2, 2024 18:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants