Improve S3 read performance by not copying buffer #284

Merged: 10 commits merged into piskvorky:master from the no-buffer-copying branch on Apr 19, 2019

Conversation

@aperiodic (Contributor) commented Apr 10, 2019

Add ByteBuffer Class

This class encapsulates the buffer functionality implemented in
smart_open.s3.BufferedInputBase and in
smart_open.http.BufferedInputBase as a standalone class. It uses the
read implementation introduced to the S3 BufferedInputBase in the
previous commit, to minimize copying of the buffer.

Because the use case for these ByteBuffers is to amortize the network
transfer overhead costs of reading data from S3 or via HTTP, it's better
to read a consistent number of bytes when filling up the buffer to
consistently amortize that overhead cost, rather than filling the buffer
up to some a priori capacity. To that end, the ByteBuffer constructor
takes a `chunk_size` that sets the number of bytes to read when filling
the buffer, rather than a fixed buffer capacity (which could easily
lead to reading a different number of bytes from the reader if the
caller ever filled the buffer when it was not empty). However, the
caller can still ask for fewer than `chunk_size` bytes to be read during
filling using the `fill` method's `size` argument.
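
To make the behavior concrete, here is a minimal sketch of a buffer along these lines. It is an illustration of the idea only, not the actual smart_open.bytebuffer implementation; the method names, defaults, and return values shown are assumptions.

```python
import io


class ByteBuffer(object):
    """Sketch of a chunk-oriented read buffer as described above (illustrative only)."""

    def __init__(self, chunk_size=io.DEFAULT_BUFFER_SIZE):
        self._chunk_size = chunk_size  # how many bytes fill() asks for by default
        self._bytes = b''
        self._pos = 0  # start of the unread portion of self._bytes

    def __len__(self):
        """Number of unread bytes currently held in the buffer."""
        return len(self._bytes) - self._pos

    def read(self, size=-1):
        """Return up to `size` unread bytes; only the returned slice is copied."""
        if size < 0:
            part = self._bytes[self._pos:]
        else:
            part = self._bytes[self._pos:self._pos + size]
        self._pos += len(part)
        return part

    def fill(self, reader, size=-1):
        """Read `size` bytes (default: chunk_size) from a file-like reader."""
        size = size if size >= 0 else self._chunk_size
        new_bytes = reader.read(size)
        # Compact once per fill: drop the already-read prefix so the buffer
        # does not grow without bound, then append the freshly read bytes.
        self._bytes = self._bytes[self._pos:] + new_bytes
        self._pos = 0
        return len(new_bytes)
```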

Change the BufferedInputBase class (and descendants) in both
smart_open.s3 and smart_open.http to use instances of ByteBuffer for
their read buffers, instead of bytestrings. This simplifies their
implementation, while still avoiding copying on reads to maintain the
94% performance improvement for many small reads on an S3 file
introduced in the first version of this PR.
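
For orientation, here is a sketch of how a reader class along these lines might use such a buffer on its read path. The attribute names (`_buffer`, `_raw_reader`, `_eof`) and the end-of-file handling are assumptions made for illustration, not the actual smart_open code.

```python
def read(self, size=-1):
    """Illustrative read path for a buffered reader built on the ByteBuffer sketch."""
    if size == 0:
        return b''
    if size < 0:
        # Read everything that is left: the buffered bytes plus the rest of the stream.
        return self._buffer.read() + self._raw_reader.read()
    # Top the buffer up one chunk at a time until it can satisfy the request.
    while len(self._buffer) < size and not self._eof:
        filled = self._buffer.fill(self._raw_reader)
        self._eof = (filled == 0)
    return self._buffer.read(size)
```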

Initial, Now Discarded Implementation

Previously, the buffer of an S3 file opened in 'rb' mode was split into
two copies whenever its `read()` method was called: one copy was
returned to the caller, and the other copy became the new buffer. This
led to poor performance when performing many small reads with the buffer
set to a size large enough to amortize the cost of transferring data
from S3.

In our use case, we set the buffer to 4 MiB when reading a 16.5 MiB file
containing 694k serialized protobuf messages, each prefixed by its
byte length. This required calling the S3 file's `read()` method nearly
1.4 million times, which in turn meant that the up-to-4 MiB buffer needed
to be copied and garbage collected the same number of times. All of that
copying and garbage collecting alone took 152 seconds, out of a total of
179 seconds to download and parse the file.

Eliminate the copying on each `read()` by adding an instance attribute
`_buffer_pos` to smart_open.s3's BufferedInputBase and its descendants,
in order to track the current position within the buffer. Change
`read()` to use `_buffer_pos` to create the slice of the buffer to
return to the caller, and then advance `_buffer_pos` by the size of that
slice.

Add two new internal instance methods to BufferedInputBase to accommodate
the fact that the buffer now contains already-read data. The first,
`_len_remaining_buffer()`, calculates how many unread bytes remain in
the buffer (previously this was simply the length of the buffer); it is
now used everywhere that `len(self._buffer)` was previously used in
BufferedInputBase. The second method, `_remaining_buffer()`, returns
a copy of the unread portion of the buffer; it is used in
BufferedInputBase's `readline()` method to find the next unread newline.

With these changes, that same 16.5 MiB protobuf file with 694k messages
can be downloaded and parsed in 10 seconds, a 94% improvement in
performance!
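
For illustration, the core of that position-tracking approach might look like the method bodies below. This is a sketch in the spirit of the description above, not the code that was ultimately merged (the ByteBuffer class replaced it).

```python
# Keep one bytestring buffer plus a read position, and slice out only the part
# being returned instead of splitting the buffer into two new bytestrings on
# every read().

def read(self, size):
    part = self._buffer[self._buffer_pos:self._buffer_pos + size]
    self._buffer_pos += len(part)  # no second copy of the remainder is made
    return part

def _len_remaining_buffer(self):
    return len(self._buffer) - self._buffer_pos

def _remaining_buffer(self):
    return self._buffer[self._buffer_pos:]
```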

@piskvorky requested a review from mpenkov on April 10, 2019 at 20:39
@aperiodic (Contributor, Author) left a comment:

Left a comment about creating a new class to encapsulate the bytestring-based buffer functionality.

Dan Lidral-Porter added 3 commits April 12, 2019 21:55
Change the tests in the `size` argument default-setting conditions for
smart_open.s3.BufferedInputBase from `size > 0` to `size >= 0`. The
first was arguably a genuine bug (though the class never called it that
way): a caller asking for zero bytes would get up to a whole
buffer_size's worth back. The second is a semantics change: if a caller
really wants to fill the buffer with zero bytes, we'll let them.
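
As a concrete illustration of that change (hypothetical names, not the exact smart_open code):

```python
def _effective_size(size, buffer_size):
    # Old test: `size > 0`, so an explicit request for 0 bytes fell through to
    # the default and could return up to a whole buffer_size's worth of data.
    # New test: `size >= 0`, so asking for zero bytes really means zero.
    return size if size >= 0 else buffer_size
```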
@mpenkov (Collaborator) left a comment:

Looks very good. Please have a look at the comments I left.

@aperiodic (Contributor, Author) commented:

OK, I'll address these comments on Monday, between 16:00 and 23:00 UTC.

Dan Lidral-Porter added 4 commits April 15, 2019 10:42
Move the smart_open.bytebuffer module's docstring to be the ByteBuffer
class's docstring instead, and replace the module docstring with a short
summary of its contents.

Explicitly document all the parameters for ByteBuffer methods.
It would be beneficial to fill ByteBuffers with lists that contain
bytestrings, in order to support experimentation at the REPL and to make
doctest examples more succinct. To that end, change the implementation
of its `fill` method to use a for loop to iterate over the byte source
when it's not a file-like object, and break out of the loop once enough
bytes have been read (rather than a while loop that uses `next`, which
doesn't work on lists).

Now that `fill` accepts lists in addition to file-like objects and
iterators, change the byte source argument name from the clunky
`reader_or_iterable` to `source`, and explain what it's allowed to be in
the parameter section of the docstring.

Add a test to ensure that filling from lists works.
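
A rough sketch of a `fill` along these lines, accepting either a file-like reader or an iterable of bytestrings such as a plain list; it continues the earlier illustrative ByteBuffer sketch and is not the merged implementation.

```python
def fill(self, source, size=-1):
    """Fill from a file-like object or from any iterable of bytestrings."""
    size = size if size >= 0 else self._chunk_size
    if hasattr(source, 'read'):
        new_bytes = source.read(size)
    else:
        # A for loop works for lists as well as iterators; `next()` alone would
        # not, because plain lists are not iterators.
        parts, total = [], 0
        for part in source:
            parts.append(part)
            total += len(part)
            if total >= size:
                break
        new_bytes = b''.join(parts)
    self._bytes = self._bytes[self._pos:] + new_bytes
    self._pos = 0
    return len(new_bytes)
```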
@aperiodic closed this on Apr 15, 2019
@aperiodic (Contributor, Author) commented:

Sorry, hit the wrong button.

I've addressed all of your latest comments except the doctest example in the ByteBuffer class docstring. I wasn't able to get the same example to pass in both Python 2 and 3; you can see the commit I pushed up to another branch for the details.

@aperiodic reopened this on Apr 15, 2019
@mpenkov (Collaborator) commented Apr 16, 2019:

Don't worry about Py2/3 compatibility in the doctests. As long as it passes under Py3 and the example clearly states that, it's OK. More in-depth Py2/3 support details get handled in the unit tests.

So, please add your doctests to this PR.

This only passes as a doctest in Python 3 because of the bytestring
literals in some of the expected values. Doctest compares the printed
string representations of values rather than the values themselves, and
in Python 2 those values are plain strings whose output lacks the `b`
prefix, so those lines fail.
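
For example, a doctest along these lines (hypothetical, not necessarily the exact example in the merged docstring) shows the issue: under Python 3 the final output matches `b'hello'`, while under Python 2 the same value prints as `'hello'` without the `b` prefix, so the textual comparison fails. The `8` returned by `fill` assumes it reports the number of bytes read.

```python
>>> import io
>>> from smart_open.bytebuffer import ByteBuffer
>>> buf = ByteBuffer(chunk_size=8)
>>> buf.fill(io.BytesIO(b'hello world'))
8
>>> buf.read(5)
b'hello'
```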
@aperiodic (Contributor, Author) commented:

Sorry, I force-pushed because I accidentally pushed up the earlier version of the doctest commit that was lacking the note about it only passing in Python 3. Hope it's not too much of an issue.

@aperiodic (Contributor, Author) commented:

Is this ready to merge? I believe I've addressed all your comments (I don't know why GitHub still shows changes requested given all those comments have been resolved).

@mpenkov (Collaborator) commented Apr 18, 2019:

From your description, the performance gains are huge, but our test benchmarks don't pick them up. This is likely because the benchmarks are lacking. It would be good to add a test case to our integration tests (see integration-tests/test_s3.py) to prevent future regressions. Can you please have a look at this?

@aperiodic (Contributor, Author) commented:

I've got a benchmark; I'm iterating on the size so it demonstrates the improvement without taking an excessively long time (I'm shooting for a minute or so).

Add the `test_s3_performance_small_reads` benchmark to the S3
integration tests. This benchmark creates a one-megabyte test file
filled with 65,536 (64 Ki) "messages", each prefixed with a byte
indicating its length. The file is downloaded into a buffer of the
same size, and then each length byte and message is read one at a time
from the buffer, for a total of 131,072 `read()` calls.

Before the changes in this PR, downloading and reading the file this way
took 4.97 seconds on average. After this PR's changes, it takes only
1.61 seconds, a 68% improvement in performance. The magnitude of
improvement increases as the size of the buffer increases, or the size
of the messages decreases.
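
A hypothetical sketch of such a benchmark appears below. The bucket URL and helper names are placeholders, and configuring the 1 MiB read buffer is omitted because the exact parameter depends on the smart_open version; the actual test added to integration-tests/test_s3.py may differ.

```python
import smart_open

_URL = 's3://my-test-bucket/small_reads_benchmark'  # placeholder location
_MESSAGE = b'x' * 15     # 15-byte payload + 1 length byte = 16 bytes per record
_NUM_MESSAGES = 65536    # 16 bytes * 65,536 records = 1 MiB in total


def _write_test_file(url):
    with smart_open.open(url, 'wb') as fout:
        for _ in range(_NUM_MESSAGES):
            fout.write(bytes([len(_MESSAGE)]) + _MESSAGE)


def _read_length_prefixed(url):
    with smart_open.open(url, 'rb') as fin:
        while True:
            length_byte = fin.read(1)   # one read() call for the length prefix
            if not length_byte:
                break
            fin.read(length_byte[0])    # one read() call for the message body


def test_s3_performance_small_reads(benchmark):
    # `benchmark` is the pytest-benchmark fixture, consistent with the result
    # tables shown in the comment below.
    _write_test_file(_URL)
    benchmark(_read_length_prefixed, _URL)
```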
@aperiodic (Contributor, Author) commented:

Results from running the benchmark on the current master branch:

-------------------------------------------------- benchmark: 1 tests --------------------------------------------------
Name (time in s)                       Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------
test_s3_performance_small_reads     4.8049  5.2253  4.9774  0.1563  4.9470  0.1763       2;0  0.2009       5           1
------------------------------------------------------------------------------------------------------------------------

Results from running the benchmark on this PR branch:

-------------------------------------------------- benchmark: 1 tests --------------------------------------------------
Name (time in s)                       Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------
test_s3_performance_small_reads     1.5846  1.6942  1.6350  0.0407  1.6360  0.0519       2;0  0.6116       5           1
------------------------------------------------------------------------------------------------------------------------

@mpenkov merged commit 2ea4155 into piskvorky:master on Apr 19, 2019
@mpenkov (Collaborator) commented Apr 19, 2019:

@aperiodic Congrats on your first merged PR to smart_open 🥇 Awesome work!

@aperiodic (Contributor, Author) commented:

Thank you, and thanks for the review! I enjoyed contributing.

@aperiodic deleted the no-buffer-copying branch on April 22, 2021 at 23:07