
LineDecoder is accidentally quadratic: iter_lines() seems to hang forever #2422

Closed
gtedesco-r7 opened this issue Oct 26, 2022 · 4 comments · Fixed by #2423
Labels
perf Issues relating to performance

Comments

@gtedesco-r7

When calling Response.iter_lines(), things can seem to hang forever.

The problem is that LineDecoder is quadratic in its string-copying behaviour. If a 31MB chunk with 18,768 lines is passed to LineDecoder(), it takes 1m45s to process versus 0.1s for a simple text.splitlines(). This is readily reproduced by using LineDecoder() to decode a file such as /usr/share/dict/words (do not attempt; you may be waiting until the heat death of the universe).

It may be a bug somewhere else that iter_text() is returning chunks that are too big.

But either way, you probably want to reconsider the wisdom of slicing the string to chop off its beginning versus just keeping an index.

You can slice the string before returning from the decoder so it's only done once per chunk.
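The difference between the two approaches can be sketched as follows (a standalone illustration of the general technique, not httpx's actual code; the function names are made up):

```python
def lines_by_slicing(text: str) -> list[str]:
    """Quadratic: re-slicing the buffer copies the whole tail per line."""
    lines = []
    while "\n" in text:
        idx = text.index("\n")
        lines.append(text[:idx])
        text = text[idx + 1:]  # O(n) copy of the remaining buffer, every line
    return lines

def lines_by_index(text: str) -> list[str]:
    """Linear: advance an index into the buffer instead of copying it."""
    lines = []
    start = 0
    while True:
        idx = text.find("\n", start)
        if idx == -1:
            break
        lines.append(text[start:idx])
        start = idx + 1  # just move the index; no tail copy
    return lines
```

Both return the same complete lines (any trailing partial line is left over), but the index-based version does O(total length) work instead of O(lines × length).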

Anyhoo, thanks for the great work!

@gtedesco-r7
Author

gtedesco-r7 commented Oct 26, 2022

Just a quick follow-on: even if the chunks are already line-decoded, processing the 31MB file takes 6 seconds!

It would probably be a lot easier and a lot faster to just do something like:

if buf.endswith('\n'):
    yield from buf.splitlines()  # every line is complete; nothing left over
else:
    it = iter(buf.splitlines())  # iter() needed: splitlines() returns a list
    prev = next(it)
    for line in it:
        yield prev
        prev = line
    self.remainder = prev  # hold back the final, partial line

@gtedesco-r7
Author

gtedesco-r7 commented Oct 26, 2022

Actually, it's easier than that since splitlines() already returns a list.

    def decode(self, text: str) -> typing.List[str]:
        if self.buffer:
            text = self.buffer + text

        if text.endswith('\n'):
            lines = text.splitlines()
            self.buffer = ""
        else:
            lines = text.splitlines()
            self.buffer = lines.pop()  # hold back the trailing partial line

        return lines

This implementation seems to work correctly and outperforms the current one in all the bad cases (i.e. minutes down to milliseconds).

In the case where chunks are all tiny (3 bytes) and lines are long (up to 6KB), there appears to be a slight slowdown, but we're talking about 12s versus 13.4s. I think that's the overhead of calling text.splitlines() roughly 1,000 times in a row, on average, without yielding a complete line, while still paying to create a new list object on every call.
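For illustration, here is the proposed decode() wrapped in a minimal class and driven across a chunk boundary (a sketch of the idea above with a small guard added for empty input; the class name is made up and this is not the final httpx implementation):

```python
import typing

class ProposedLineDecoder:
    """Sketch of the buffer-and-splitlines approach from this comment."""

    def __init__(self) -> None:
        self.buffer = ""

    def decode(self, text: str) -> typing.List[str]:
        if self.buffer:
            text = self.buffer + text

        lines = text.splitlines()
        if text.endswith("\n"):
            self.buffer = ""
        else:
            # Guard added for empty input; the partial last line is held back.
            self.buffer = lines.pop() if lines else ""
        return lines

decoder = ProposedLineDecoder()
out = decoder.decode("alpha\nbet")   # "bet" is an incomplete line, buffered
out += decoder.decode("a\ngamma\n")  # completes "beta", then "gamma"
```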

giannitedesco added a commit to giannitedesco/httpx that referenced this issue Oct 26, 2022
Leading to enormous speedups when doing things such as
Response(...).iter_lines() as described on issue encode#2422
@giannitedesco
Contributor

Actually, the unit tests clued me in that text.splitlines() is no good here because the line delimiters need to be kept. Switched to re, and that makes the performance regression with tiny chunks and long lines go away (13s down to 9s on my test data). 🎉
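One way to split while keeping the delimiters looks like this (the pattern is illustrative only; the regex in the actual PR may differ). Note that str.splitlines() also accepts keepends=True, but it splits on additional Unicode line boundaries (e.g. \x85, \u2028), which may not be what a line-oriented HTTP stream wants:

```python
import re

# Illustrative pattern: a line body plus its CRLF/CR/LF terminator,
# or a trailing run of text with no terminator at all.
_LINE = re.compile(r".*?(?:\r\n|\r|\n)|.+", re.DOTALL)

def split_lines_keepends(text: str) -> list[str]:
    """Split text into lines, preserving each line's own terminator."""
    return _LINE.findall(text)
```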

@giannitedesco
Contributor

Ah, but okay, there are some wacky behaviours to do with CRLF and bare-CR line endings; I will work on the PR as I get some time 😂
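The awkward case is a chunk that ends in "\r": it might be a bare-CR line ending, or the first half of a "\r\n" pair split across chunks, and the decoder cannot tell until more text arrives, so the trailing "\r" has to be held back. A tiny standalone demonstration (the variable names are illustrative):

```python
# A "\r\n" terminator split across two chunks:
chunk_a = "first line\r"
chunk_b = "\nsecond line\n"

# Naive per-chunk splitting treats the "\r" as a complete line ending,
# then sees the "\n" as another one, yielding a spurious empty line:
naive = chunk_a.splitlines() + chunk_b.splitlines()

# Buffering the trailing "\r" until more text arrives gives the right answer:
buffered = (chunk_a + chunk_b).splitlines()
```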

@florimondmanca florimondmanca added the perf Issues relating to performance label Nov 6, 2022
tomchristie added a commit that referenced this issue Mar 16, 2023
…nt speed up (#2423)

* Replace quadratic algo in LineDecoder

Leading to enormous speedups when doing things such as
Response(...).iter_lines() as described on issue #2422

* Update httpx/_decoders.py

* Update _decoders.py

Handle text ending in `\r` more gracefully.
Return as much content as possible.

* Update test_decoders.py

* Update _decoders.py

* Update _decoders.py

* Update _decoders.py

* Update httpx/_decoders.py

Co-authored-by: cdeler <serj.krotov@gmail.com>

* Update _decoders.py

---------

Co-authored-by: Tom Christie <tom@tomchristie.com>
Co-authored-by: cdeler <serj.krotov@gmail.com>

3 participants