Support task resource tracking in OpenSearch #3982

ketanv3 · 2022-07-22T11:58:03Z

Description

Reopens changes from #2639 (reverted in #3046) to add a framework for task resource tracking.
Currently, SearchTask and SearchShardTask support resource tracking, but the framework can be extended to any other task in the future.

Changes since #2639:

- Replaced the usage of AutoQueueAdjustingExecutorBuilder with ResizableExecutorBuilder
- Fixed a race-condition when Task is unregistered before its threads are stopped
- Resolved merge conflicts
- Fixed broken tests

Signed-off-by: Ketan Verma <ketan9495@gmail.com>

Issues Resolved

#1179

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2022-07-22T12:25:40Z

Gradle Check (Jenkins) Run Completed with:

RESULT: UNSTABLE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/908/
CommitID: d23c7f8b6da7842fc413ab1a0f0c61df3708a875

github-actions · 2022-07-23T12:21:48Z

Gradle Check (Jenkins) Run Completed with:

RESULT: SUCCESS ✅
URL: https://build.ci.opensearch.org/job/gradle-check/936/
CommitID: 8ec9d57c3c11b1cd98bf278a755a2fb9bbe4ed79

codecov-commenter · 2022-07-23T12:22:36Z

Codecov Report

Merging #3982 (a09a60a) into main (740f75d) will increase coverage by 0.20%.
The diff coverage is 80.97%.

@@             Coverage Diff              @@
##               main    #3982      +/-   ##
============================================
+ Coverage     70.50%   70.71%   +0.20%     
- Complexity    56848    57025     +177     
============================================
  Files          4583     4586       +3     
  Lines        273931   274133     +202     
  Branches      40158    40175      +17     
============================================
+ Hits         193146   193864     +718     
+ Misses        64561    63987     -574     
- Partials      16224    16282      +58

Impacted Files	Coverage Δ
...rg/opensearch/common/settings/ClusterSettings.java	`91.89% <ø> (ø)`
...org/opensearch/action/support/TransportAction.java	`55.42% <50.00%> (-0.28%)`	⬇️
...ster/node/tasks/list/TransportListTasksAction.java	`47.61% <66.66%> (+3.17%)`	⬆️
...a/org/opensearch/threadpool/TaskAwareRunnable.java	`69.56% <69.56%> (ø)`
...opensearch/tasks/TaskResourceTrackingListener.java	`75.00% <75.00%> (ø)`
...erver/src/main/java/org/opensearch/tasks/Task.java	`80.00% <78.04%> (+2.22%)`	⬆️
.../opensearch/tasks/TaskResourceTrackingService.java	`79.76% <79.76%> (ø)`
...ensearch/common/util/concurrent/ThreadContext.java	`93.33% <80.00%> (+1.14%)`	⬆️
.../org/opensearch/action/search/SearchShardTask.java	`75.00% <100.00%> (+8.33%)`	⬆️
.../java/org/opensearch/action/search/SearchTask.java	`75.00% <100.00%> (+2.27%)`	⬆️
... and 500 more

Bukhtawar

Thanks @ketanv3 for the changes. What additional delay might the await mechanism introduce? We might need to run benchmarks for this change.

ketanv3 · 2022-07-23T13:42:41Z

> Thanks @ketanv3 for the changes. What additional delay might the await mechanism introduce? We might need to run benchmarks for this change.

Yes, I'm working towards running performance benchmarks.

github-actions · 2022-07-26T21:24:58Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/1012/
CommitID: a7c210aa9535664a7f22f2ccc86644544a595eb1

github-actions · 2022-07-29T06:45:58Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/1094/
CommitID: dcdaf6e7ac68e2980c5b08a34a2d84db0eacc071

github-actions · 2022-07-31T09:47:46Z

Gradle Check (Jenkins) Run Completed with:

RESULT: SUCCESS ✅
URL: https://build.ci.opensearch.org/job/gradle-check/1145/
CommitID: a09a60a

ketanv3 · 2022-07-31T10:22:30Z

Comparing the accuracy of two approaches to marking a task as unregistered: (1) not waiting for the task's threads to complete, and (2) using a callback to keep track of active threads.

An existing integration test was reused to perform a large number of search requests with predictable CPU/memory usage. Measurements were taken for:

- Number of times thread usage was reported before task tracking was stopped (thread usage accounted).
- Number of times thread usage was reported after task tracking was stopped (thread usage lost).

Both tests executed 500 search queries and reported resource usage for ~9020 tasks.

|                        | approach 1 | approach 2 |
|------------------------|-----------:|-----------:|
| thread usage accounted |       2641 |       4392 |
| thread usage lost      |       3721 |        237 |
| tasks completed        |       9020 |       9021 |
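
To put the table in proportion, the share of thread usage reports that arrived after tracking had stopped can be computed from the two counters. The helper below is only an illustrative sketch (the class and method names are hypothetical, not part of the PR); it works out to roughly 58% lost for approach 1 and roughly 5% lost for approach 2.

```java
// Hypothetical helper for reading the table above; not part of the PR.
class UsageLossSketch {
    // Fraction of thread usage reports that arrived after tracking stopped.
    static double lostFraction(long accounted, long lost) {
        return (double) lost / (accounted + lost);
    }

    public static void main(String[] args) {
        // Counter values copied from the table above.
        System.out.println("approach 1 lost fraction: " + lostFraction(2641, 3721));
        System.out.println("approach 2 lost fraction: " + lostFraction(4392, 237));
    }
}
```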

Based on these results, approach (2) has been used for the implementation as it gives better accuracy.
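
A minimal sketch of the idea behind approach (2), assuming the task keeps a counter of everything that still holds it open (its own registration plus each worker thread) and fires completion callbacks when the counter reaches zero. `TrackedTaskSketch` and its methods are hypothetical stand-ins for illustration; they are not the actual `Task` API from this PR.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical stand-in for a task; this is NOT org.opensearch.tasks.Task.
class TrackedTaskSketch {
    // Holders (the task's registration plus worker threads) keeping the task open.
    private final AtomicInteger activeHolders = new AtomicInteger();
    private final List<Consumer<TrackedTaskSketch>> completionListeners = new CopyOnWriteArrayList<>();

    void addCompletionListener(Consumer<TrackedTaskSketch> listener) {
        completionListeners.add(listener);
    }

    // Called on registration and whenever a thread starts work for the task.
    void acquire() {
        activeHolders.incrementAndGet();
    }

    // Called on unregistration and whenever a thread finishes; the last holder
    // to release triggers the completion callbacks.
    void release() {
        if (activeHolders.decrementAndGet() == 0) {
            for (Consumer<TrackedTaskSketch> listener : completionListeners) {
                listener.accept(this);
            }
        }
    }

    int holderCount() {
        return activeHolders.get();
    }

    public static void main(String[] args) {
        TrackedTaskSketch task = new TrackedTaskSketch();
        task.addCompletionListener(t -> System.out.println("tracking complete"));
        task.acquire(); // task registered
        task.acquire(); // worker thread starts
        task.release(); // task unregistered while the worker is still running: no callback
        task.release(); // worker stops: callback fires here
    }
}
```

With this shape, usage reported by a thread that outlives the unregister call is still counted, because completion is deferred until that thread releases the task.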

github-actions · 2022-08-01T19:20:03Z

Gradle Check (Jenkins) Run Completed with:

RESULT: UNSTABLE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/1200/
CommitID: 4e774762f74ea05f5d3854ec7061dc1b0884b6a0

ketanv3 · 2022-08-01T19:22:57Z

Benchmark results

Used c5.2xlarge EC2 instance-type
Used ./gradlew localDistro to generate a distribution from arbitrary commits
Used the default configs to launch a cluster
Used nyc_taxis workload

Baseline commit: 740f75d
Contender commit: a09a60a

opensearch-benchmark compare --baseline 740f75d2051 --contender a09a60acbbb

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/


Comparing baseline
  TestExecution ID: 740f75d2051
  TestExecution timestamp: 2022-07-31 13:29:54
  TestProcedure: append-no-conflicts
  ProvisionConfigInstance: external

with contender
  TestExecution ID: a09a60acbbb
  TestExecution timestamp: 2022-08-01 09:25:35
  TestProcedure: append-no-conflicts
  ProvisionConfigInstance: external

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                        Metric |                     Task |    Baseline |   Contender |     Diff |   Unit |
|--------------------------------------------------------------:|-------------------------:|------------:|------------:|---------:|-------:|
|                    Cumulative indexing time of primary shards |                          |       126.4 |     126.926 |   0.5262 |    min |
|             Min cumulative indexing time across primary shard |                          |       126.4 |     126.926 |   0.5262 |    min |
|          Median cumulative indexing time across primary shard |                          |       126.4 |     126.926 |   0.5262 |    min |
|             Max cumulative indexing time across primary shard |                          |       126.4 |     126.926 |   0.5262 |    min |
|           Cumulative indexing throttle time of primary shards |                          |           0 |           0 |        0 |    min |
|    Min cumulative indexing throttle time across primary shard |                          |           0 |           0 |        0 |    min |
| Median cumulative indexing throttle time across primary shard |                          |           0 |           0 |        0 |    min |
|    Max cumulative indexing throttle time across primary shard |                          |           0 |           0 |        0 |    min |
|                       Cumulative merge time of primary shards |                          |     52.5814 |      54.486 |  1.90467 |    min |
|                      Cumulative merge count of primary shards |                          |         208 |         212 |        4 |        |
|                Min cumulative merge time across primary shard |                          |     52.5814 |      54.486 |  1.90467 |    min |
|             Median cumulative merge time across primary shard |                          |     52.5814 |      54.486 |  1.90467 |    min |
|                Max cumulative merge time across primary shard |                          |     52.5814 |      54.486 |  1.90467 |    min |
|              Cumulative merge throttle time of primary shards |                          |     1.37742 |     1.66843 |  0.29102 |    min |
|       Min cumulative merge throttle time across primary shard |                          |     1.37742 |     1.66843 |  0.29102 |    min |
|    Median cumulative merge throttle time across primary shard |                          |     1.37742 |     1.66843 |  0.29102 |    min |
|       Max cumulative merge throttle time across primary shard |                          |     1.37742 |     1.66843 |  0.29102 |    min |
|                     Cumulative refresh time of primary shards |                          |    0.690283 |    0.595333 | -0.09495 |    min |
|                    Cumulative refresh count of primary shards |                          |          76 |          80 |        4 |        |
|              Min cumulative refresh time across primary shard |                          |    0.690283 |    0.595333 | -0.09495 |    min |
|           Median cumulative refresh time across primary shard |                          |    0.690283 |    0.595333 | -0.09495 |    min |
|              Max cumulative refresh time across primary shard |                          |    0.690283 |    0.595333 | -0.09495 |    min |
|                       Cumulative flush time of primary shards |                          |     1.51388 |     1.44558 |  -0.0683 |    min |
|                      Cumulative flush count of primary shards |                          |          31 |          33 |        2 |        |
|                Min cumulative flush time across primary shard |                          |     1.51388 |     1.44558 |  -0.0683 |    min |
|             Median cumulative flush time across primary shard |                          |     1.51388 |     1.44558 |  -0.0683 |    min |
|                Max cumulative flush time across primary shard |                          |     1.51388 |     1.44558 |  -0.0683 |    min |
|                                       Total Young Gen GC time |                          |      64.188 |      66.636 |    2.448 |      s |
|                                      Total Young Gen GC count |                          |       17754 |       18124 |      370 |        |
|                                         Total Old Gen GC time |                          |           0 |           0 |        0 |      s |
|                                        Total Old Gen GC count |                          |           0 |           0 |        0 |        |
|                                                    Store size |                          |     24.3704 |     24.3655 | -0.00482 |     GB |
|                                                 Translog size |                          | 5.12227e-08 | 5.12227e-08 |        0 |     GB |
|                                        Heap used for segments |                          |           0 |           0 |        0 |     MB |
|                                      Heap used for doc values |                          |           0 |           0 |        0 |     MB |
|                                           Heap used for terms |                          |           0 |           0 |        0 |     MB |
|                                           Heap used for norms |                          |           0 |           0 |        0 |     MB |
|                                          Heap used for points |                          |           0 |           0 |        0 |     MB |
|                                   Heap used for stored fields |                          |           0 |           0 |        0 |     MB |
|                                                 Segment count |                          |          30 |          30 |        0 |        |
|                                                Min Throughput |                    index |      139816 |      140101 |  284.462 | docs/s |
|                                               Mean Throughput |                    index |      140748 |      141834 |  1086.13 | docs/s |
|                                             Median Throughput |                    index |      140712 |      141860 |  1147.85 | docs/s |
|                                                Max Throughput |                    index |      141949 |      143104 |   1155.2 | docs/s |
|                                       50th percentile latency |                    index |      459.04 |     461.438 |  2.39825 |     ms |
|                                       90th percentile latency |                    index |     641.768 |     667.389 |  25.6212 |     ms |
|                                       99th percentile latency |                    index |      1401.7 |     1471.98 |  70.2789 |     ms |
|                                     99.9th percentile latency |                    index |     2154.13 |     2254.37 |  100.243 |     ms |
|                                    99.99th percentile latency |                    index |     2801.73 |      2947.4 |   145.67 |     ms |
|                                      100th percentile latency |                    index |     2950.47 |     3561.21 |  610.743 |     ms |
|                                  50th percentile service time |                    index |      459.04 |     461.438 |  2.39825 |     ms |
|                                  90th percentile service time |                    index |     641.768 |     667.389 |  25.6212 |     ms |
|                                  99th percentile service time |                    index |      1401.7 |     1471.98 |  70.2789 |     ms |
|                                99.9th percentile service time |                    index |     2154.13 |     2254.37 |  100.243 |     ms |
|                               99.99th percentile service time |                    index |     2801.73 |      2947.4 |   145.67 |     ms |
|                                 100th percentile service time |                    index |     2950.47 |     3561.21 |  610.743 |     ms |
|                                                    error rate |                    index |           0 |           0 |        0 |      % |
|                                                Min Throughput | wait-until-merges-finish |  0.00545798 |  0.00717392 |  0.00172 |  ops/s |
|                                               Mean Throughput | wait-until-merges-finish |  0.00545798 |  0.00717392 |  0.00172 |  ops/s |
|                                             Median Throughput | wait-until-merges-finish |  0.00545798 |  0.00717392 |  0.00172 |  ops/s |
|                                                Max Throughput | wait-until-merges-finish |  0.00545798 |  0.00717392 |  0.00172 |  ops/s |
|                                      100th percentile latency | wait-until-merges-finish |      183218 |      139394 | -43824.2 |     ms |
|                                 100th percentile service time | wait-until-merges-finish |      183218 |      139394 | -43824.2 |     ms |
|                                                    error rate | wait-until-merges-finish |           0 |           0 |        0 |      % |
|                                                Min Throughput |                  default |     3.01589 |     3.01582 |   -7e-05 |  ops/s |
|                                               Mean Throughput |                  default |     3.02592 |      3.0258 | -0.00011 |  ops/s |
|                                             Median Throughput |                  default |     3.02365 |     3.02349 | -0.00016 |  ops/s |
|                                                Max Throughput |                  default |     3.04569 |     3.04548 | -0.00022 |  ops/s |
|                                       50th percentile latency |                  default |      5.7814 |     5.91026 |  0.12886 |     ms |
|                                       90th percentile latency |                  default |     6.43334 |     6.83153 |  0.39818 |     ms |
|                                       99th percentile latency |                  default |     8.34059 |     10.2823 |  1.94167 |     ms |
|                                      100th percentile latency |                  default |      9.3404 |     11.1879 |  1.84747 |     ms |
|                                  50th percentile service time |                  default |     3.23102 |     3.30837 |  0.07735 |     ms |
|                                  90th percentile service time |                  default |     3.64748 |     3.97348 |  0.32599 |     ms |
|                                  99th percentile service time |                  default |     5.69504 |     7.69124 |   1.9962 |     ms |
|                                 100th percentile service time |                  default |      6.9198 |     8.15025 |  1.23045 |     ms |
|                                                    error rate |                  default |           0 |           0 |        0 |      % |
|                                                Min Throughput |                    range |    0.703708 |    0.703334 | -0.00037 |  ops/s |
|                                               Mean Throughput |                    range |    0.706099 |    0.705482 | -0.00062 |  ops/s |
|                                             Median Throughput |                    range |    0.705548 |    0.704986 | -0.00056 |  ops/s |
|                                                Max Throughput |                    range |    0.711012 |    0.709893 | -0.00112 |  ops/s |
|                                       50th percentile latency |                    range |     230.942 |     228.752 | -2.18983 |     ms |
|                                       90th percentile latency |                    range |     232.469 |      232.27 | -0.19822 |     ms |
|                                       99th percentile latency |                    range |     281.604 |      268.33 | -13.2736 |     ms |
|                                      100th percentile latency |                    range |     282.024 |     274.873 | -7.15022 |     ms |
|                                  50th percentile service time |                    range |     224.228 |     221.927 | -2.30112 |     ms |
|                                  90th percentile service time |                    range |     225.436 |     225.098 | -0.33821 |     ms |
|                                  99th percentile service time |                    range |     274.534 |     261.367 | -13.1669 |     ms |
|                                 100th percentile service time |                    range |     274.651 |     268.094 |  -6.5567 |     ms |
|                                                    error rate |                    range |           0 |           0 |        0 |      % |
|                                                Min Throughput |      distance_amount_agg |     2.01208 |     2.01214 |    6e-05 |  ops/s |
|                                               Mean Throughput |      distance_amount_agg |     2.01986 |     2.01997 |  0.00011 |  ops/s |
|                                             Median Throughput |      distance_amount_agg |     2.01805 |     2.01815 |   0.0001 |  ops/s |
|                                                Max Throughput |      distance_amount_agg |     2.03565 |      2.0359 |  0.00025 |  ops/s |
|                                       50th percentile latency |      distance_amount_agg |     5.11124 |     5.33593 |  0.22468 |     ms |
|                                       90th percentile latency |      distance_amount_agg |     5.49016 |      5.5372 |  0.04703 |     ms |
|                                       99th percentile latency |      distance_amount_agg |     5.91635 |     5.86624 | -0.05011 |     ms |
|                                      100th percentile latency |      distance_amount_agg |     5.94142 |     6.09461 |  0.15319 |     ms |
|                                  50th percentile service time |      distance_amount_agg |     1.83493 |     1.90533 |   0.0704 |     ms |
|                                  90th percentile service time |      distance_amount_agg |     2.09502 |     2.09756 |  0.00254 |     ms |
|                                  99th percentile service time |      distance_amount_agg |     2.26754 |     2.35835 |  0.09081 |     ms |
|                                 100th percentile service time |      distance_amount_agg |     2.44594 |     2.45766 |  0.01172 |     ms |
|                                                    error rate |      distance_amount_agg |           0 |           0 |        0 |      % |
|                                                Min Throughput |            autohisto_agg |     1.50055 |     1.50018 | -0.00037 |  ops/s |
|                                               Mean Throughput |            autohisto_agg |     1.50088 |     1.50029 | -0.00059 |  ops/s |
|                                             Median Throughput |            autohisto_agg |     1.50081 |     1.50027 | -0.00055 |  ops/s |
|                                                Max Throughput |            autohisto_agg |     1.50158 |      1.5005 | -0.00108 |  ops/s |
|                                       50th percentile latency |            autohisto_agg |     432.987 |     447.084 |  14.0965 |     ms |
|                                       90th percentile latency |            autohisto_agg |     440.025 |     454.874 |   14.849 |     ms |
|                                       99th percentile latency |            autohisto_agg |     445.834 |     462.858 |  17.0243 |     ms |
|                                      100th percentile latency |            autohisto_agg |     452.361 |     466.995 |  14.6342 |     ms |
|                                  50th percentile service time |            autohisto_agg |     430.369 |      445.16 |  14.7912 |     ms |
|                                  90th percentile service time |            autohisto_agg |     437.996 |     452.532 |  14.5358 |     ms |
|                                  99th percentile service time |            autohisto_agg |       443.3 |     460.938 |  17.6376 |     ms |
|                                 100th percentile service time |            autohisto_agg |     450.203 |     463.866 |  13.6633 |     ms |
|                                                    error rate |            autohisto_agg |           0 |           0 |        0 |      % |
|                                                Min Throughput |       date_histogram_agg |     1.50276 |     1.50321 |  0.00044 |  ops/s |
|                                               Mean Throughput |       date_histogram_agg |     1.50448 |     1.50523 |  0.00075 |  ops/s |
|                                             Median Throughput |       date_histogram_agg |     1.50409 |     1.50477 |  0.00068 |  ops/s |
|                                                Max Throughput |       date_histogram_agg |     1.50791 |     1.50925 |  0.00134 |  ops/s |
|                                       50th percentile latency |       date_histogram_agg |     460.007 |      446.69 | -13.3169 |     ms |
|                                       90th percentile latency |       date_histogram_agg |     469.084 |     455.775 | -13.3096 |     ms |
|                                       99th percentile latency |       date_histogram_agg |     478.717 |      467.45 | -11.2664 |     ms |
|                                      100th percentile latency |       date_histogram_agg |     491.838 |     468.051 | -23.7876 |     ms |
|                                  50th percentile service time |       date_histogram_agg |     457.659 |     444.775 | -12.8836 |     ms |
|                                  90th percentile service time |       date_histogram_agg |     466.225 |     453.912 | -12.3126 |     ms |
|                                  99th percentile service time |       date_histogram_agg |     475.816 |     465.039 |  -10.777 |     ms |
|                                 100th percentile service time |       date_histogram_agg |     488.723 |     466.383 | -22.3394 |     ms |
|                                                    error rate |       date_histogram_agg |           0 |           0 |        0 |      % |


-------------------------------
[INFO] SUCCESS (took 0 seconds)
-------------------------------

github-actions · 2022-08-01T19:58:28Z

Gradle Check (Jenkins) Run Completed with:

RESULT: SUCCESS ✅
URL: https://build.ci.opensearch.org/job/gradle-check/1203/
CommitID: d89de65

Bukhtawar · 2022-08-02T06:22:08Z

server/src/main/java/org/opensearch/tasks/Task.java

+    public void addResourceTrackingCompletionListener(NotifyOnceListener<Task> listener) {
+        resourceTrackingCompletionListeners.add(listener);
+    }


We shouldn't allow addResourceTrackingCompletionListener when the count is zero; otherwise it's possible that the newly added listener is never called.

Fair point, updated.

Though there is still a rare possibility of a race condition:

1. Task resource tracking is completed (num threads = 0), and existing completion listeners are invoked.
2. Delayed thread execution starts for the task (num threads = 1).
3. A new completion listener added at this point may succeed.
4. Delayed thread execution stops for the task (num threads = 0).
5. The new completion listener is invoked.

To solve this, we may have to bring back the isResourceTrackingCompleted atomic boolean into the Task. It's not a concern at the moment as listeners are added during Task creation, not in the middle of execution.

On a different note, this may not even be a problem because a (delayed) thread is still a part of the task, and the newly added listener would just receive the more recent/accurate usage stats.

Yeah, we also haven't strictly synchronised adding listeners and invoking them, so I am fine with this limitation as long as it doesn't overcomplicate the use case.
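
The guard discussed in this thread can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: `GuardedListenerSketch`, its plain `Runnable` listeners, and the choice to signal rejection with an `IllegalStateException` are all hypothetical.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; names and behaviour are assumptions, not the merged code.
class GuardedListenerSketch {
    private final AtomicInteger activeThreads = new AtomicInteger();
    private final List<Runnable> completionListeners = new CopyOnWriteArrayList<>();

    // Refuse registration when nothing is being tracked, so a listener is never
    // silently parked on a task that will not notify it.
    void addCompletionListener(Runnable listener) {
        if (activeThreads.get() > 0) {
            completionListeners.add(listener);
        } else {
            throw new IllegalStateException("resource tracking not started or already completed");
        }
    }

    void threadStarted() {
        activeThreads.incrementAndGet();
    }

    // The last thread to stop invokes the registered listeners.
    void threadStopped() {
        if (activeThreads.decrementAndGet() == 0) {
            completionListeners.forEach(Runnable::run);
        }
    }
}
```

The check and the add are deliberately not atomic here, which mirrors the limitation accepted above: a delayed thread can still reopen the window between completion and a late registration. The sketch also does not stop a listener from running more than once, whereas the diff above uses `NotifyOnceListener<Task>`.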

Bukhtawar

Thanks @ketanv3 for the changes, one minor comment

github-actions · 2022-08-02T07:15:10Z

Gradle Check (Jenkins) Run Completed with:

RESULT: SUCCESS ✅
URL: https://build.ci.opensearch.org/job/gradle-check/1246/
CommitID: 2027602

opensearch-trigger-bot · 2022-08-02T08:01:20Z

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-3982-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 5eac54d4ade73f6d0ed80b0f2408a104a98e3232
# Push it to GitHub
git push --set-upstream origin backport/backport-3982-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-3982-to-2.x.

* Support task resource tracking in OpenSearch
* Reopens changes from opensearch-project#2639 (reverted in opensearch-project#3046) to add a framework for task resource tracking. Currently, SearchTask and SearchShardTask support resource tracking but it can be extended to any other task.
* Fixed a race-condition when Task is unregistered before its threads are stopped
* Improved error handling and simplified task resource tracking completion listener
* Avoid registering listeners on already completed tasks

Signed-off-by: Ketan Verma <ketan9495@gmail.com>

ketanv3 marked this pull request as ready for review July 23, 2022 13:17

ketanv3 requested review from a team and reta as code owners July 23, 2022 13:17

Bukhtawar reviewed Jul 23, 2022

Bukhtawar reviewed Jul 25, 2022

Bukhtawar reviewed Jul 27, 2022

ketanv3 added 4 commits July 31, 2022 01:10

Fixed a race-condition when Task is unregistered before its threads are stopped

3190e44

Signed-off-by: Ketan Verma <ketan9495@gmail.com>

Replaced await with callbacks to mark task resource tracking completion

61cbf79

Signed-off-by: Ketan Verma <ketan9495@gmail.com>

Improved error handling for callbacks and relaxed assertions

a09a60a

Signed-off-by: Ketan Verma <ketan9495@gmail.com>

ketanv3 force-pushed the feature/resource-tracking-framework branch from dcdaf6e to a09a60a Compare July 31, 2022 09:22

nssuresh2007 reviewed Aug 1, 2022

nssuresh2007 approved these changes Aug 1, 2022

Bukhtawar reviewed Aug 1, 2022

Improved error handling and simplified task resource tracking completion listener

d89de65

Signed-off-by: Ketan Verma <ketan9495@gmail.com>

ketanv3 force-pushed the feature/resource-tracking-framework branch from 4e77476 to d89de65 Compare August 1, 2022 19:25

ketanv3 requested a review from Bukhtawar August 2, 2022 04:03

Bukhtawar reviewed Aug 2, 2022

Bukhtawar approved these changes Aug 2, 2022

Avoid registering listeners on already completed tasks

2027602

Signed-off-by: Ketan Verma <ketan9495@gmail.com>

Bukhtawar merged commit 5eac54d into opensearch-project:main Aug 2, 2022

Bukhtawar added the backport 2.x Backport to 2.x branch label Aug 2, 2022

ketanv3 mentioned this pull request Aug 2, 2022

[Backport 2.x]Support task resource tracking in OpenSearch (#3982) #4087

Merged

ohltyler mentioned this pull request Aug 4, 2022

Bump version to 2.2 opensearch-project/anomaly-detection#627

Merged

PritLadani added a commit to PritLadani/OpenSearch that referenced this pull request Sep 6, 2022

Backporting RTF

20c43cd

Backporting pull requests opensearch-project#2089 and opensearch-project#3982 Signed-off-by: PritLadani <pritkladani@gmail.com>

This was referenced Nov 9, 2022

Search Query Runtime Cost Calculation #5174

Open

Add search backpressure cancellation at the coordinator level #5173

Closed

ketanv3 mentioned this pull request Dec 27, 2022

[BUG] OpenSearch sort query performance regression #5534

Closed

andrross mentioned this pull request Feb 2, 2023

[Backport 2.x] Support task resource tracking in OpenSearch #3021

Closed

CaptainDredge mentioned this pull request May 29, 2023

[RFC] Shard Indexing backpressure mechanism should also protect from any CPU contention on nodes #7638

Open
