s3store: Experiment with optimistic part uploads to S3 #1084

Open
Acconut opened this issue Feb 12, 2024 · 2 comments

Acconut commented Feb 12, 2024

Right now, an incoming PATCH request is split into multiple parts. Each part is saved on disk (or optionally in memory) and then queued to be uploaded to S3. These uploads can be performed in parallel. However, it seems as if this process can be very CPU-intensive (see #924 and many other reports about performance concerns of the s3store).
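
A rough sketch of this buffering step, with made-up names and signatures (this is not the actual tusd code, just an illustration of the extra read/write pass):

```go
package main

import (
	"io"
	"os"
)

// bufferNextPart drains up to one part from the PATCH request body into a
// temporary file so that its exact size is known before the upload to S3 is
// started. This extra read/write pass is where a lot of the CPU and disk
// overhead comes from.
func bufferNextPart(body io.Reader, partSize int64) (*os.File, int64, error) {
	tmp, err := os.CreateTemp("", "tusd-s3-part-")
	if err != nil {
		return nil, 0, err
	}

	n, err := io.CopyN(tmp, body, partSize)
	if err != nil && err != io.EOF {
		tmp.Close()
		os.Remove(tmp.Name())
		return nil, 0, err
	}

	// Rewind so the upload goroutine can read the part from the beginning.
	if _, err := tmp.Seek(0, io.SeekStart); err != nil {
		tmp.Close()
		os.Remove(tmp.Name())
		return nil, 0, err
	}
	return tmp, n, nil
}
```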

I wonder if we can skip the step of reading the parts into a temporary buffer (on disk or in memory) altogether. The reason for first reading into a buffer is that we want to know the size of the part before uploading it to S3. The UploadPart and PutObject calls in S3 require the Content-Length header to be present, so we must know the size before starting an object/part upload (I get this understanding from their docs and my past experiments; maybe this is not true anymore). Thus, we have to supply some part size when starting the upload to S3.
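
This is roughly what the subsequent UploadPart call looks like, assuming a recent aws-sdk-go-v2 (field types differ between SDK versions, so treat the exact signature as an assumption): the part size has to be passed up front via ContentLength, which is why the buffering step exists.

```go
package main

import (
	"context"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// uploadBufferedPart sends a part whose size is already known because the
// data was buffered beforehand. ContentLength must be supplied when the
// request is started.
func uploadBufferedPart(ctx context.Context, client *s3.Client, bucket, key, uploadID string, partNumber int32, part *os.File, partLength int64) error {
	_, err := client.UploadPart(ctx, &s3.UploadPartInput{
		Bucket:        aws.String(bucket),
		Key:           aws.String(key),
		UploadId:      aws.String(uploadID),
		PartNumber:    aws.Int32(partNumber),
		Body:          part,
		ContentLength: aws.Int64(partLength),
	})
	return err
}
```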

An alternative approach is to estimate the part size when we start receiving data from the PATCH request. If we know the total tus upload size and the request's size, we can estimate what size the next part will have. If not, we just assume that a full part will be filled. Then, we open a new UploadPart request to S3 with our estimated Content-Length and start piping the PATCH request's content directly into the request to S3. If the amount of data read from the incoming request is smaller than the Content-Length that we provided to S3, the upload will fail and we have to fall back to uploading the data into a .part object using PutObject, so that it can be used for resuming the upload in the next PATCH request. However, this would mean that we have to keep some data in memory, so we can fall back to storing the temporary data in a .part object. Maybe in the end, this is not much different from just buffering the parts in memory.
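
A sketch of how that optimistic path could look, again with hypothetical names and assuming a recent aws-sdk-go-v2; whether the SDK accepts a non-seekable body for UploadPart without additional configuration is one of the things this experiment would have to verify:

```go
package main

import (
	"bytes"
	"context"
	"io"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// optimisticUploadPart streams the request body straight into UploadPart
// using the estimated size as Content-Length. Everything forwarded to S3 is
// also copied into an in-memory buffer, so that a failed optimistic upload
// can be persisted as a ".part" object for the next PATCH request. That
// buffer is exactly the memory cost mentioned at the end of the paragraph
// above.
func optimisticUploadPart(ctx context.Context, client *s3.Client, body io.Reader, bucket, key, uploadID string, partNumber int32, estimatedSize int64) error {
	var fallback bytes.Buffer
	tee := io.TeeReader(io.LimitReader(body, estimatedSize), &fallback)

	_, err := client.UploadPart(ctx, &s3.UploadPartInput{
		Bucket:        aws.String(bucket),
		Key:           aws.String(key),
		UploadId:      aws.String(uploadID),
		PartNumber:    aws.Int32(partNumber),
		Body:          tee,
		ContentLength: aws.Int64(estimatedSize),
	})
	if err == nil {
		return nil
	}

	// Fewer bytes arrived than announced (or the upload failed for another
	// reason): fall back to persisting what we received as a ".part" object.
	// A real implementation would also have to drain any remaining request
	// body into the buffer before doing this.
	_, err = client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key + ".part"),
		Body:   bytes.NewReader(fallback.Bytes()),
	})
	return err
}
```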

Acconut self-assigned this Feb 12, 2024

namandureja commented Apr 18, 2024

Hi @Acconut, this is an interesting find. I was studying the tusd implementation and also thought of this as an issue. Do you have any updates on this experimentation?

Acconut commented Apr 19, 2024

There has been no further investigation since I opened this issue. I think it's still interesting, but I am more concerned about the memory usage here. In the comment above I wrote:

However, this would mean that we have to keep some data in memory, so we can fall back to storing the temporary data in a .part object.

If the part size is 50 MB, then 100 concurrent uploads would already consume 5 GB of memory. When we want to scale tusd to support many parallel uploads, this might lead to issues.
