
Refactor multipart download to a more async model #10349

Merged (2 commits) Oct 4, 2023

Conversation

@andrross (Member) commented Oct 4, 2023

The previous approach of kicking off the stream requests for all parts
of a file did not work well for very large files. For example, a 20GiB
file uploaded in 16MiB parts will consist of 1200+ parts. When we
attempted to initiate streaming for all parts concurrently, some parts
would hit a client timeout after 2 minutes without being able to get a
connection due to the other parts not having been completed in that time
frame. This refactoring adds yet another layer of indirection in order
to allow the code that is actually writing the destination file to
control the rate at which streams are started. This should allow for
downloading files consisting of arbitrarily many parts at any connection
speed.
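
To illustrate the approach (a minimal sketch with hypothetical names, not the actual OpenSearch classes): each part becomes a lazy supplier of a stream future, and the writer initiates the next request only once it is ready to consume the bytes. The real change presumably keeps a bounded number of parts in flight rather than strictly one at a time; the point is that stream starts are driven by the writer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical sketch: in-flight connections are bounded by the writer's
// consumption rate rather than by the number of parts in the file.
final class PullBasedPartWriter {
    private final Queue<Supplier<CompletableFuture<InputStream>>> pendingParts;
    private final OutputStream destination;

    PullBasedPartWriter(Queue<Supplier<CompletableFuture<InputStream>>> pendingParts,
                        OutputStream destination) {
        this.pendingParts = pendingParts;
        this.destination = destination;
    }

    void writeAll() throws IOException {
        Supplier<CompletableFuture<InputStream>> next;
        while ((next = pendingParts.poll()) != null) {
            // The part's stream request starts here, when the writer is ready
            // for it, so no part waits on a connection long enough to trip
            // the client's 2-minute timeout.
            try (InputStream part = next.get().join()) {
                part.transferTo(destination);
            }
        }
    }
}
```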

This commit also wires in the download rate limiter so that the
`indices.recovery.max_bytes_per_sec` setting is properly honored.
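
For context, a sketch of how such a byte-rate limit is typically enforced on the write path using Lucene's `RateLimiter.SimpleRateLimiter` (the primitive behind OpenSearch's recovery throttling); the wiring below is illustrative, not the exact code from this change:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.lucene.store.RateLimiter;

final class RateLimitedCopy {
    // Copies a stream while pausing periodically so the average throughput
    // stays at or below the limiter's configured MB/sec.
    static void copy(InputStream in, OutputStream out, RateLimiter limiter) throws IOException {
        byte[] buffer = new byte[8192];
        long bytesSincePause = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            bytesSincePause += n;
            if (bytesSincePause > limiter.getMinPauseCheckBytes()) {
                limiter.pause(bytesSincePause); // sleeps to enforce the rate
                bytesSincePause = 0;
            }
        }
    }
}
```

For example, `copy(partStream, fileOut, new RateLimiter.SimpleRateLimiter(40.0))` would cap the copy at roughly 40 MB/sec.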

This PR supersedes #10284

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • GitHub issue/PR created in OpenSearch documentation repo for the required public documentation changes (#[Issue/PR number])

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot commented Oct 4, 2023

Compatibility status:

Checks if related components are compatible with change a9b8b63

Incompatible components

Skipped components

Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git]


@kotwanikunal (Member) commented
#9115


codecov bot commented Oct 4, 2023

Codecov Report

Merging #10349 (a9b8b63) into main (a0cb344) will decrease coverage by 0.10%.
Report is 1 commit behind head on main.
The diff coverage is 87.27%.

@@             Coverage Diff              @@
##               main   #10349      +/-   ##
============================================
- Coverage     71.21%   71.11%   -0.10%     
+ Complexity    58337    58295      -42     
============================================
  Files          4832     4830       -2     
  Lines        274828   274840      +12     
  Branches      40043    40048       +5     
============================================
- Hits         195708   195444     -264     
- Misses        62728    63071     +343     
+ Partials      16392    16325      -67     
Files Coverage Δ
...rg/opensearch/repositories/s3/S3BlobContainer.java 80.74% <100.00%> (+2.15%) ⬆️
...bstore/AsyncMultiStreamEncryptedBlobContainer.java 59.18% <100.00%> (+1.73%) ⬆️
...arch/common/blobstore/stream/read/ReadContext.java 100.00% <100.00%> (ø)
...rg/opensearch/common/settings/ClusterSettings.java 92.85% <ø> (ø)
...c/main/java/org/opensearch/index/IndexService.java 75.43% <100.00%> (-0.44%) ⬇️
...java/org/opensearch/index/shard/StoreRecovery.java 57.25% <ø> (+0.82%) ⬆️
...earch/index/store/RemoteSegmentStoreDirectory.java 91.09% <100.00%> (+1.16%) ⬆️
...ndex/store/RemoteSegmentStoreDirectoryFactory.java 96.15% <100.00%> (+3.84%) ⬆️
...ices/replication/RemoteStoreReplicationSource.java 90.47% <100.00%> (ø)
server/src/main/java/org/opensearch/node/Node.java 85.58% <100.00%> (ø)
... and 7 more

... and 442 files with indirect coverage changes

@andrross (Member, Author) commented Oct 4, 2023

I'm going to merge this to unblock subsequent PRs, but will follow up with @Bukhtawar and @gbbafna and address any comments or concerns.

@kotwanikunal merged commit 28f185b into opensearch-project:main Oct 4, 2023
13 checks passed
@opensearch-trigger-bot (Contributor) commented
The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-10349-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 28f185b347a3333c8670ca1a7bd7d0a85fed14e9
# Push it to GitHub
git push --set-upstream origin backport/backport-10349-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-10349-to-2.x.

@andrross deleted the multipart-async branch October 4, 2023 22:16
andrross added a commit to andrross/OpenSearch that referenced this pull request Oct 4, 2023
andrross added a commit that referenced this pull request Oct 5, 2023
andrross added a commit to andrross/documentation-website that referenced this pull request Oct 5, 2023
Related to opensearch-project/OpenSearch#10349

Signed-off-by: Andrew Ross <andrross@amazon.com>
deshsidd pushed a commit to deshsidd/OpenSearch that referenced this pull request Oct 9, 2023
vikasvb90 pushed a commit to vikasvb90/OpenSearch that referenced this pull request Oct 10, 2023
kolchfa-aws pushed a commit to opensearch-project/documentation-website that referenced this pull request Oct 10, 2023
Related to opensearch-project/OpenSearch#10349

Signed-off-by: Andrew Ross <andrross@amazon.com>
vagimeli pushed a commit to opensearch-project/documentation-website that referenced this pull request Oct 13, 2023
Related to opensearch-project/OpenSearch#10349

Signed-off-by: Andrew Ross <andrross@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
harshavamsi pushed a commit to harshavamsi/documentation-website that referenced this pull request Oct 31, 2023
vagimeli pushed a commit to opensearch-project/documentation-website that referenced this pull request Dec 21, 2023
Related to opensearch-project/OpenSearch#10349

Signed-off-by: Andrew Ross <andrross@amazon.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
Labels: backport 2.x (Backport to 2.x branch), backport-failed, skip-changelog, v2.11.0 (Issues and PRs related to version 2.11.0)