Multireaderfilestream Redesign #4595

Open
wants to merge 22 commits into master

Conversation

dma1dma1 (Contributor) commented May 23, 2024

Reasons for making this change

The previous Multireaderfilestream design did not allow backwards seeks, which caused uploads of directories above a certain size (~15MB) to break: reading directory bundles requires backwards seeks of up to 20MiB. The new design doesn't use BytesBuffers; instead it stores raw bytes, from 32MiB behind the slowest reader up to a maximum of 64MiB ahead of the slowest reader. If a faster reader tries to read past this threshold, it sleeps until it is allowed to read more.
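As a rough illustration of the invariants described above, here is a minimal sketch using assumed names (LOOKBACK_LENGTH, MAX_THRESHOLD, must_wait, new_buffer_start) that may not match the PR's actual identifiers; distances are measured from the slowest reader's position, following the description:

# Illustrative constants mirroring the description; the PR's actual names/values may differ.
LOOKBACK_LENGTH = 32 * 1024 * 1024   # raw bytes kept behind the slowest reader (backwards-seek window)
MAX_THRESHOLD = 64 * 1024 * 1024     # how far past the slowest reader a faster reader may read

def must_wait(reader_pos: int, num_bytes: int, slowest_pos: int) -> bool:
    # A faster reader sleeps while this is True, i.e. while the requested
    # read would end more than MAX_THRESHOLD past the slowest reader.
    return (reader_pos + num_bytes) - slowest_pos > MAX_THRESHOLD

def new_buffer_start(slowest_pos: int, buffer_start: int) -> int:
    # Keep at most LOOKBACK_LENGTH raw bytes behind the slowest reader,
    # so backwards seeks within that window still succeed.
    return max(slowest_pos - LOOKBACK_LENGTH, buffer_start)

# Example: a reader at 70 MiB requesting 1 MiB while the slowest reader is at 4 MiB
# must wait, since it would end up 67 MiB ahead, past the 64 MiB threshold.
assert must_wait(70 * 2**20, 1 * 2**20, 4 * 2**20)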

Added Tests

UPLOAD TESTS

test_fileobj_tar_gz: basic directory bundle test, which did not exist previously. Checks that a basic directory bundle has the correct files and file sizes.

test_large_fileobj_tar_gz: Tests a large directory bundle past the size threshold limit of the previous Multireaderfilestream design. Checks that all files are uploaded.

test_large_fileobj_tar_gz2: Tests a large directory bundle past the size threshold limit of the previous Multireaderfilestream design, with multiple large files. Checks that all files are uploaded and that file sizes are correct.

test_upload_memory: Tests that the memory usage of Multireaderfilestream uploading stays under the expected 100MB (a sketch of one way such a check could be expressed appears after this list).

Manual speed test: achieves faster upload speeds than before with the larger 1MB chunk size, changed from 16KB.
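One way the memory check could be expressed, as a rough illustration only (this is an assumption about the approach; peak_memory_mb is a hypothetical helper, not necessarily how test_upload_memory measures usage):

import resource

def peak_memory_mb() -> float:
    # Peak resident set size of the current process in MB.
    # Note: ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# After driving an upload through the stream, a test could assert:
# assert peak_memory_mb() < 100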

MRFS TESTS

test_reader_distance: Tests that the 2 readers always stay within the expected thresholds of each other (see the test sketch after this list).

test_seek: Tests that backwards seeks within the lookback range work as expected.

test_toofar_seek: Tests that backwards seeks beyond the lookback range raise errors.
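A hedged sketch of what the reader-distance check could look like in test form, reusing the illustrative must_wait helper and constant from the sketch above (not the PR's real MultiReaderFileStream API):

import unittest

MAX_THRESHOLD = 64 * 1024 * 1024  # repeated from the earlier sketch so this is self-contained

def must_wait(reader_pos: int, num_bytes: int, slowest_pos: int) -> bool:
    return (reader_pos + num_bytes) - slowest_pos > MAX_THRESHOLD

class ReaderDistanceSketchTest(unittest.TestCase):
    def test_reader_distance(self):
        # A read ending more than MAX_THRESHOLD past the slowest reader must wait...
        self.assertTrue(must_wait(reader_pos=65 * 2**20, num_bytes=2**20, slowest_pos=0))
        # ...while a read ending within the threshold proceeds immediately.
        self.assertFalse(must_wait(reader_pos=62 * 2**20, num_bytes=2**20, slowest_pos=0))

if __name__ == '__main__':
    unittest.main()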

Related issues

Screenshots

Checklist

  • I've added a screenshot of the changes, if this is a frontend change
  • I've added and/or updated tests, if this is a backend change
  • I've run the pre-commit.sh script
  • I've updated docs, if needed

dma1dma1 marked this pull request as ready for review May 27, 2024 08:14
@@ -255,7 +255,7 @@ def write_fileobj(
     conn_str = os.environ.get('AZURE_STORAGE_CONNECTION_STRING', '')
     os.environ['AZURE_STORAGE_CONNECTION_STRING'] = bundle_conn_str
     try:
-        CHUNK_SIZE = 16 * 1024
+        CHUNK_SIZE = 1024 * 1024
Member:

Why increase?

Contributor (Author):

The upload speed with the smaller chunk size was too slow due to the sleep behavior that occurs on the faster reader, which is always the index reader.

Contributor (Author):

It seems like there's no super meaningful reason to keep chunk size smallish since the speed tradeoff is too large

            return
        self._buffer += s

    def read(self, index: int, num_bytes=0):  # type: ignore
Collaborator:

Why 0? If optional, use None, and add type hint Optional[int]

        return s

    def peek(self, index: int, num_bytes):  # type: ignore
        self._fill_buf_bytes(index, num_bytes)
        s = self._bufs[index].peek(num_bytes)
        while (self._pos[index] + num_bytes) - self._buffer_pos > self.MAX_THRESHOLD:
Collaborator:

Can we avoid duplicate code with read?

Contributor (Author):

I changed it to its current iteration; the logic is that a read is just a peek plus a position change. My concern is that this requires the lock to be released, which could mean another thread interposes during that release and does another read or peek. From what I can tell this shouldn't affect any logic, since the actual buffer read happens before the lock is released, but I could be wrong. It might cause some context switches, but with only 2 threads the effect should be minimal.
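A minimal, self-contained sketch of that structure, using hypothetical names (StreamSketch, _buffer, _pos, _lock) rather than the PR's actual class; it only illustrates the "read is a peek plus a position change" shape discussed here:

from threading import Lock

class StreamSketch:
    # Hypothetical illustration, not the PR's MultiReaderFileStream.
    def __init__(self, data: bytes, num_readers: int = 2):
        self._buffer = data               # shared raw-bytes buffer
        self._pos = [0] * num_readers     # per-reader positions
        self._lock = Lock()

    def peek(self, index: int, num_bytes: int) -> bytes:
        # Return bytes at this reader's position without advancing it.
        with self._lock:
            start = self._pos[index]
            return self._buffer[start:start + num_bytes]

    def read(self, index: int, num_bytes: int) -> bytes:
        # read() is a peek() plus a position change. The lock is released
        # between the peek and the position update (the concern raised above),
        # but the bytes are already copied out before that release.
        s = self.peek(index, num_bytes)
        with self._lock:
            self._pos[index] += len(s)
        return s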

num_bytes = len(self._bufs[index])
s = self._bufs[index].read(num_bytes)
self._pos[index] += len(s)
if num_bytes == None:
Collaborator:

Do we ever call this with None? If not, don't support it

Contributor (Author):

Don't think so, but this emulates the default read() behavior of other Python IO streams, where None means read to the end of the fileobj, or in our case the buffer.

Collaborator:

End of fileobj is semantically different from end of buffer (which is arbitrary), so I would avoid None so it doesn't set expectations of a behavior we don't support.

Contributor (Author):

Makes sense, I'll make the change.
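A hedged sketch of the shape that change could take (hypothetical signatures only; the PR's final code may differ):

from typing import Optional

class BeforeSketch:
    # num_bytes=None meant "read to the end of the buffer".
    def read(self, index: int, num_bytes: Optional[int] = None) -> bytes:
        raise NotImplementedError

class AfterSketch:
    # Following the suggestion: num_bytes is required, so the
    # "read everything left in the buffer" behavior is no longer offered.
    def read(self, index: int, num_bytes: int) -> bytes:
        raise NotImplementedError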

percyliang (Collaborator) left a comment:

This is great overall. Left a few comments so we can clean things up.
