Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix timeout exception and Add Integ test for Master task throttling #4588

Conversation

dhwanilpatel
Copy link
Contributor

@dhwanilpatel dhwanilpatel commented Sep 26, 2022

Signed-off-by: Dhwanil Patel dhwanip@amazon.com

Description

  • Fix Timeout exception, if task is getting timedout while doing retry we would want to throw ClusterEventTimeoutException.
  • Added Integ tests for the throttling change to cover below scenario.
    • Tasks are getting submitted to remote master node (multi node cluster)
    • Tasks are getting submitted to local master node (single node cluster)
    • Tasks are getting timedout while retrying.

Issues Resolved

[#479 ]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Comment on lines +141 to +142
* Here we will assert the client behaviour that client's request is not
* failed with throttling exception but timeout exception.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you pls help me understand why this is the right behavior ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This throttling is introduced for internal task submission, cx don't have to be aware of it.

Without throttling, such tasks will wait in master's pending task queue. If tasks gets timeout in queue itself, then master return ClusterProcessEventTimeoutException.

With throttling we are making those tasks wait on data node itself instead of master node. So let's say tasks was throttled throughout the time and gets timedout, it is the same as task is waiting in master queue without throttling.

From cx perspective, for both case master node is busy and not able to process the event in given timeout. For that we should throw timeout exception only.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer Throttling exception as the task didn't wait in master's pending task queue , but was outright rejected . The customers which are submitting the task should be aware of the difference between two . This would help us in the future if we want to take different actions on each of those. What do you think ?

Copy link
Contributor Author

@dhwanilpatel dhwanilpatel Oct 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of now we can keep it as internal, we will go incremental and expose this to cx later. As we will need to expose various retries configuration as well for tuning. Till then cx can treat it as timeout exception only.

I will create followup issue for this.

Copy link
Member

@shwetathareja shwetathareja Oct 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 lets create a different issue to expose ThrottlingException to user. Also need to cross check if this would require changes in the clients

Copy link
Contributor Author

@dhwanilpatel dhwanilpatel Oct 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created followup issue for it: #4724

Comment on lines +221 to +229
/**
* If tasks gets timed out in retrying on throttling,
* it should send cluster event timeout exception.
*/
@Override
public Exception getTimeoutException(Exception e) {
return new ProcessClusterEventTimeoutException(request.masterNodeTimeout, actionName);
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are pretty much discarding the underlying exception which is a serious impairment to debuggability. Can we wrap exception causes instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For debuggability, RetrayableAction will add the underlying exception in suppressed exception list and which will be logged on data node. So in data node logs, throttling exception will be logged. Master will also logs on throttling exceptions.

We just want to keep user experience consistent, so to user it will be thrown as ProcessClusterEventTimeoutException.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add another overloaded constructor for and keep the underlying cause as is.

 public ProcessClusterEventTimeoutException(TimeValue timeValue, String source, Exception ex) {
        super("failed to process cluster event (" + source + ") within " + timeValue, ex);
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually when tasks is getting timed out due to throttling, those throttling exception is returned as suppressed exception in process cluster event timeout exception. RetryableAction does this for us, it stores upto last 4 exceptions and add it in suppressed exception.

From that user will know that task is getting timed out due to throttling till we expose the throttling exception to user.

Do we still need to add underlying cause in event timeout exception?

sample exception which is returned to user:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "process_cluster_event_timeout_exception",
        "reason" : "failed to process cluster event (indices:admin/mapping/put) within 30s",
        "suppressed" : [
          {
            "type" : "cluster_manager_throttling_exception",
            "reason" : "Throttling Exception : Limit exceeded for put-mapping"
          },
          {
            "type" : "cluster_manager_throttling_exception",
            "reason" : "Throttling Exception : Limit exceeded for put-mapping"
          },
          {
            "type" : "cluster_manager_throttling_exception",
            "reason" : "Throttling Exception : Limit exceeded for put-mapping"
          }
        ]
      }
    ],
    "type" : "process_cluster_event_timeout_exception",
    "reason" : "failed to process cluster event (indices:admin/mapping/put) within 30s",
    "suppressed" : [
      {
        "type" : "cluster_manager_throttling_exception",
        "reason" : "Throttling Exception : Limit exceeded for put-mapping"
      },
      {
        "type" : "cluster_manager_throttling_exception",
        "reason" : "Throttling Exception : Limit exceeded for put-mapping"
      },
      {
        "type" : "cluster_manager_throttling_exception",
        "reason" : "Throttling Exception : Limit exceeded for put-mapping"
      }
    ]
  },
  "status" : 503
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice should this be 429 instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have created followup issue (#4724) for exposing throttling exception to user and we will make it as 429.

@github-actions
Copy link
Contributor

github-actions bot commented Oct 9, 2022

Gradle Check (Jenkins) Run Completed with:

…' into throttling-IT

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented Oct 9, 2022

Gradle Check (Jenkins) Run Completed with:

@codecov-commenter
Copy link

Codecov Report

Merging #4588 (389d1e8) into feature/master-task-throttling (1212afd) will increase coverage by 0.00%.
The diff coverage is 82.40%.

@@                        Coverage Diff                        @@
##             feature/master-task-throttling    #4588   +/-   ##
=================================================================
  Coverage                             70.71%   70.72%           
+ Complexity                            57644    57621   -23     
=================================================================
  Files                                  4665     4666    +1     
  Lines                                276614   276704   +90     
  Branches                              40297    40300    +3     
=================================================================
+ Hits                                 195621   195701   +80     
  Misses                                64737    64737           
- Partials                              16256    16266   +10     
Impacted Files Coverage Δ
...stermanager/TransportClusterManagerNodeAction.java 83.96% <0.00%> (-0.80%) ⬇️
...search/cluster/service/ClusterManagerTaskKeys.java 0.00% <0.00%> (ø)
...main/java/org/opensearch/script/ScriptService.java 71.65% <0.00%> (-0.45%) ⬇️
...oredscripts/TransportDeleteStoredScriptAction.java 55.55% <50.00%> (+5.55%) ⬆️
.../storedscripts/TransportPutStoredScriptAction.java 55.55% <50.00%> (+5.55%) ⬆️
...ing/delete/TransportDeleteDanglingIndexAction.java 9.58% <50.00%> (+1.13%) ⬆️
...dmin/indices/rollover/TransportRolloverAction.java 51.21% <50.00%> (-0.04%) ⬇️
...ster/metadata/MetadataCreateDataStreamService.java 61.53% <50.00%> (-0.31%) ⬇️
...arch/persistent/PersistentTasksClusterService.java 48.88% <50.00%> (+0.63%) ⬆️
...cluster/metadata/MetadataIndexTemplateService.java 82.98% <83.33%> (+1.15%) ⬆️
... and 523 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Collaborator

@gbbafna gbbafna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please create a follow up issue for changing the final exception type . Apart from that, LGTM.

Copy link
Member

@shwetathareja shwetathareja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments otherwise looks good. Thanks @dhwanilpatel

* Throttling exception and data node is performing retries on it.
*
*/
public void testThrottlingForRemoteMaster() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor]: replace Master with "ClusterManager" at all places applicable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. Updated.


@Override
public void onResponseSent(long requestId, String action, Exception error) {
if (action.contains("mapping")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here it is expecting all mapping requests to be throttled, which isnt the case. I am missing something here.

Copy link
Contributor Author

@dhwanilpatel dhwanilpatel Oct 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all requests will be throttled, it will be on random basis.

Here we have overridden TransportServiceMessageListener of cluster manager node to get count of throttled requests. This function will be called whenever exception is returned from cluster manager and we will calculate the throttled mapping requests over here. Based on throttled count we will assert master is getting (totalRequest + throttledRequests).

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@dhwanilpatel dhwanilpatel force-pushed the throttling-IT branch 2 times, most recently from 6a6f9ad to 55ed984 Compare October 19, 2022 13:58
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@Bukhtawar
Copy link
Collaborator

We need a changelog for the build to succeed

@Bukhtawar Bukhtawar merged commit 58e5ed0 into opensearch-project:feature/master-task-throttling Oct 20, 2022
dhwanilpatel added a commit to dhwanilpatel/OpenSearch-1 that referenced this pull request Nov 2, 2022
…pensearch-project#4588)

* Fix timeout exception and Add Integ test for Master task throttling

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Bukhtawar pushed a commit that referenced this pull request Nov 4, 2022
Basic Throttler Framework / Exponential Basic back off policy. 
Add basic thorttler/exponential backoff policy for retry/Defination o… #3527
Changes required in Master node to perform throttling.
Master node changes for master task throttling #3882
Changes required in Data node to perform retry on throttling.
Data node changes for master task throttling #4204
Provide support for all task type in throttling framework.
Onboarding of few task types to throttling #4542
Integration Tests (Fix timeout exception and Add Integ test for Master task throttling #4588

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
dhwanilpatel added a commit to dhwanilpatel/OpenSearch-1 that referenced this pull request Nov 4, 2022
…t#4986)

Basic Throttler Framework / Exponential Basic back off policy.
Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527
Changes required in Master node to perform throttling.
Master node changes for master task throttling opensearch-project#3882
Changes required in Data node to perform retry on throttling.
Data node changes for master task throttling opensearch-project#4204
Provide support for all task type in throttling framework.
Onboarding of few task types to throttling opensearch-project#4542
Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
dhwanilpatel added a commit to dhwanilpatel/OpenSearch-1 that referenced this pull request Nov 4, 2022
…t#4986)

Basic Throttler Framework / Exponential Basic back off policy.
Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527
Changes required in Master node to perform throttling.
Master node changes for master task throttling opensearch-project#3882
Changes required in Data node to perform retry on throttling.
Data node changes for master task throttling opensearch-project#4204
Provide support for all task type in throttling framework.
Onboarding of few task types to throttling opensearch-project#4542
Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
dhwanilpatel added a commit to dhwanilpatel/OpenSearch-1 that referenced this pull request Nov 4, 2022
…t#4986)

Basic Throttler Framework / Exponential Basic back off policy.
Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527
Changes required in Master node to perform throttling.
Master node changes for master task throttling opensearch-project#3882
Changes required in Data node to perform retry on throttling.
Data node changes for master task throttling opensearch-project#4204
Provide support for all task type in throttling framework.
Onboarding of few task types to throttling opensearch-project#4542
Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
dhwanilpatel added a commit to dhwanilpatel/OpenSearch-1 that referenced this pull request Nov 4, 2022
…t#4986)

Basic Throttler Framework / Exponential Basic back off policy.
Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527
Changes required in Master node to perform throttling.
Master node changes for master task throttling opensearch-project#3882
Changes required in Data node to perform retry on throttling.
Data node changes for master task throttling opensearch-project#4204
Provide support for all task type in throttling framework.
Onboarding of few task types to throttling opensearch-project#4542
Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
dhwanilpatel added a commit to dhwanilpatel/OpenSearch-1 that referenced this pull request Nov 4, 2022
…t#4986)

Basic Throttler Framework / Exponential Basic back off policy.
Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527
Changes required in Master node to perform throttling.
Master node changes for master task throttling opensearch-project#3882
Changes required in Data node to perform retry on throttling.
Data node changes for master task throttling opensearch-project#4204
Provide support for all task type in throttling framework.
Onboarding of few task types to throttling opensearch-project#4542
Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
dhwanilpatel added a commit to dhwanilpatel/OpenSearch-1 that referenced this pull request Nov 4, 2022
…t#4986)

Basic Throttler Framework / Exponential Basic back off policy.
Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527
Changes required in Master node to perform throttling.
Master node changes for master task throttling opensearch-project#3882
Changes required in Data node to perform retry on throttling.
Data node changes for master task throttling opensearch-project#4204
Provide support for all task type in throttling framework.
Onboarding of few task types to throttling opensearch-project#4542
Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Bukhtawar pushed a commit that referenced this pull request Nov 4, 2022
* Add basic thorttler/exponential backoff policy for retry/Defination of throttling exception (#3856)

* Corrected Java doc for Throttler

* Changed the default behaviour of Throttler to return Optional

* Removed generics from Throttler and used String as key

* Ignore backport / autocut / dependabot branches for gradle checks on push

* Master node changes for master task throttling (#3882)

* Data node changes for master task throttling (#4204)

* Onboarding of few task types to throttling (#4542)

* Fix timeout exception and Add Integ test for Master task throttling (#4588)

* Complete TODO for version change and removed unused classes(Throttler and Semaphore) (#4846)

* Remove V1 version from throttling testcase

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Bukhtawar pushed a commit that referenced this pull request Nov 4, 2022
Basic Throttler Framework / Exponential Basic back off policy.
Add basic thorttler/exponential backoff policy for retry/Defination o… #3527
Changes required in Master node to perform throttling.
Master node changes for master task throttling #3882
Changes required in Data node to perform retry on throttling.
Data node changes for master task throttling #4204
Provide support for all task type in throttling framework.
Onboarding of few task types to throttling #4542
Integration Tests (Fix timeout exception and Add Integ test for Master task throttling #4588

Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants