
LineDecoder is accidentally quadratic: iter_lines() seems to hang forever #2422

Closed
gtedesco-r7 opened this issue Oct 26, 2022 · 4 comments · Fixed by #2423
Labels
perf Issues relating to performance

Comments

@gtedesco-r7

When calling Response.iter_lines(), things can seem to hang forever.

The problem is that LineDecoder is quadratic in its string-copying behaviour. If a 31MB chunk with 18,768 lines is passed to LineDecoder(), it takes 1m45s to process versus 0.1s for a simple text.splitlines(). This is readily reproduced by using LineDecoder() to decode a file such as /usr/share/dict/words (do not attempt; you may be waiting until the heat death of the universe).

It may be a bug somewhere else that iter_text() is returning chunks that are too big.

But either way, you probably want to reconsider the wisdom of slicing the string to chop off its beginning versus just keeping an index.

You can slice the string before returning from the decoder so it's only done once per chunk.
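The difference between the two approaches can be sketched as follows (a standalone illustration of the general technique, not httpx's actual code; the function names are made up):

```python
def lines_by_slicing(text: str) -> list[str]:
    """Quadratic: re-slicing the buffer copies the whole tail per line."""
    lines = []
    while "\n" in text:
        idx = text.index("\n")
        lines.append(text[:idx])
        text = text[idx + 1:]  # O(n) copy of the remaining buffer, every line
    return lines

def lines_by_index(text: str) -> list[str]:
    """Linear: advance an index into the buffer instead of copying it."""
    lines = []
    start = 0
    while True:
        idx = text.find("\n", start)
        if idx == -1:
            break
        lines.append(text[start:idx])
        start = idx + 1  # just move the index; no tail copy
    return lines
```

Both return the same complete lines (any trailing partial line is left over), but the index-based version does O(total length) work instead of O(lines × length).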

Anyhoo, thanks for the great work!

@gtedesco-r7
Author

gtedesco-r7 commented Oct 26, 2022

Just a quick follow-on: even if the chunks are already line-decoded, processing the 31MB file takes 6 seconds!

It would probably be a lot easier and a lot faster to just do something like:

if buf.endswith('\n'):
    yield from buf.splitlines()  # every line is complete; nothing left over
else:
    it = iter(buf.splitlines())  # iter() needed: splitlines() returns a list
    prev = next(it)
    for line in it:
        yield prev
        prev = line
    self.remainder = prev  # hold back the final, partial line

@gtedesco-r7
Author

gtedesco-r7 commented Oct 26, 2022

Actually, it's easier than that since splitlines() already returns a list.

    def decode(self, text: str) -> typing.List[str]:
        if self.buffer:
            text = self.buffer + text

        if text.endswith('\n'):
            lines = text.splitlines()
            self.buffer = ""
        else:
            lines = text.splitlines()
            self.buffer = lines.pop()  # hold back the trailing partial line

        return lines

This implementation seems to work correctly and outperforms the current one in all the bad cases (i.e. minutes down to milliseconds).

In the case where chunks are all tiny (3 bytes) and lines are long (up to 6KB), there appears to be a slight slowdown, but we're talking about 12s versus 13.4s. I think that's the overhead of calling text.splitlines() roughly 1,000 times in a row, on average, without yielding a complete line, while still paying to create a new list object on every call.
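For illustration, here is the proposed decode() wrapped in a minimal class and driven across a chunk boundary (a sketch of the idea above with a small guard added for empty input; the class name is made up and this is not the final httpx implementation):

```python
import typing

class ProposedLineDecoder:
    """Sketch of the buffer-and-splitlines approach from this comment."""

    def __init__(self) -> None:
        self.buffer = ""

    def decode(self, text: str) -> typing.List[str]:
        if self.buffer:
            text = self.buffer + text

        lines = text.splitlines()
        if text.endswith("\n"):
            self.buffer = ""
        else:
            # Guard added for empty input; the partial last line is held back.
            self.buffer = lines.pop() if lines else ""
        return lines

decoder = ProposedLineDecoder()
out = decoder.decode("alpha\nbet")   # "bet" is an incomplete line, buffered
out += decoder.decode("a\ngamma\n")  # completes "beta", then "gamma"
```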

giannitedesco added a commit to giannitedesco/httpx that referenced this issue Oct 26, 2022
Leading to enormous speedups when doing things such as
Response(...).iter_lines() as described on issue encode#2422
@giannitedesco
Contributor

Actually, the unit tests clued me in that text.splitlines() is no good here because the line delimiters need to be kept. Switched to re, and that makes the performance regression with tiny chunks and long lines go away (13s down to 9s on my test data). 🎉
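One way to split while keeping the delimiters looks like this (the pattern is illustrative only; the regex in the actual PR may differ). Note that str.splitlines() also accepts keepends=True, but it splits on additional Unicode line boundaries (e.g. \x85, \u2028), which may not be what a line-oriented HTTP stream wants:

```python
import re

# Illustrative pattern: a line body plus its CRLF/CR/LF terminator,
# or a trailing run of text with no terminator at all.
_LINE = re.compile(r".*?(?:\r\n|\r|\n)|.+", re.DOTALL)

def split_lines_keepends(text: str) -> list[str]:
    """Split text into lines, preserving each line's own terminator."""
    return _LINE.findall(text)
```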

@giannitedesco
Contributor

Ah, but okay, there are some wacky behaviours to do with CRLF and bare-CR line endings; I will work on the PR as I get some time 😂
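The awkward case is a chunk that ends in "\r": it might be a bare-CR line ending, or the first half of a "\r\n" pair split across chunks, and the decoder cannot tell until more text arrives, so the trailing "\r" has to be held back. A tiny standalone demonstration (the variable names are illustrative):

```python
# A "\r\n" terminator split across two chunks:
chunk_a = "first line\r"
chunk_b = "\nsecond line\n"

# Naive per-chunk splitting treats the "\r" as a complete line ending,
# then sees the "\n" as another one, yielding a spurious empty line:
naive = chunk_a.splitlines() + chunk_b.splitlines()

# Buffering the trailing "\r" until more text arrives gives the right answer:
buffered = (chunk_a + chunk_b).splitlines()
```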

@florimondmanca florimondmanca added the perf Issues relating to performance label Nov 6, 2022
tomchristie added a commit that referenced this issue Mar 16, 2023
…nt speed up (#2423)

* Replace quadratic algo in LineDecoder

Leading to enormous speedups when doing things such as
Response(...).iter_lines() as described on issue #2422

* Update httpx/_decoders.py

* Update _decoders.py

Handle text ending in `\r` more gracefully.
Return as much content as possible.

* Update test_decoders.py

* Update _decoders.py

* Update _decoders.py

* Update _decoders.py

* Update httpx/_decoders.py

Co-authored-by: cdeler <serj.krotov@gmail.com>

* Update _decoders.py

---------

Co-authored-by: Tom Christie <tom@tomchristie.com>
Co-authored-by: cdeler <serj.krotov@gmail.com>

3 participants