LineDecoder is accidentally quadratic: iter_lines() seems to hang forever #2422
Just a quick follow-on: even if the chunks are already line-decoded, processing the 31MB file takes 6 seconds! It would probably be a lot easier and a lot faster to just do something like:

```python
if buf.endswith('\n'):
    yield from buf.splitlines()  # with nothing leftover
else:
    it = iter(buf.splitlines())  # iter() needed: splitlines() returns a list
    prev = next(it)
    for line in it:
        yield prev
        prev = line
    self.remainder = prev
```
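A self-contained sketch of the idea in the comment above, for illustration only: `iter_decoded_lines` is an invented name, and the hold-back of the last incomplete line is expressed with `lines.pop()` rather than the `prev`/`next` pattern, but the behaviour is intended to be the same.

```python
def iter_decoded_lines(chunks):
    """Yield complete lines from an iterable of text chunks, holding back
    the final (possibly partial) line until the next chunk arrives."""
    remainder = ""
    for chunk in chunks:
        buf = remainder + chunk
        remainder = ""
        if buf.endswith("\n"):
            yield from buf.splitlines()  # every line is complete
        else:
            lines = buf.splitlines()
            if lines:
                remainder = lines.pop()  # keep the unterminated tail
            yield from lines
    if remainder:
        yield remainder  # flush whatever is left at end of stream

print(list(iter_decoded_lines(["ab\ncd", "ef\n", "gh"])))
# -> ['ab', 'cdef', 'gh']
```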
Actually, it's easier than that:

```python
def decode(self, text: str) -> typing.List[str]:
    if self.buffer:
        text = self.buffer + text
    if text.endswith('\n'):
        lines = text.splitlines()
        self.buffer = ""
    else:
        lines = text.splitlines()
        self.buffer = lines.pop()
    return lines
```

This implementation seems to work correctly and outperforms in all the bad cases (i.e. minutes down to milliseconds). In the case where chunks are all tiny (3 bytes) and lines are long (up to 6KB), there appears to be a slight slowdown, but we're talking about 12s versus 13.4s. I think it's because of the overhead of doing a …
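To make the proposed `decode()` testable on its own, here is a hypothetical standalone wrapper. `SimpleLineDecoder` and `flush()` are invented names (this is not httpx's actual class), and a guard is added for the case where `text` contains no newline at all on an empty buffer, where a bare `lines.pop()` could raise on an empty chunk.

```python
class SimpleLineDecoder:
    """Illustrative stand-in for the decode() sketched above."""

    def __init__(self):
        self.buffer = ""

    def decode(self, text):
        if self.buffer:
            text = self.buffer + text
        if text.endswith("\n"):
            lines = text.splitlines()
            self.buffer = ""
        else:
            lines = text.splitlines()
            # Guard: an empty chunk with an empty buffer yields no lines.
            self.buffer = lines.pop() if lines else ""
        return lines

    def flush(self):
        # Drain any unterminated final line at end of stream.
        lines = [self.buffer] if self.buffer else []
        self.buffer = ""
        return lines

d = SimpleLineDecoder()
print(d.decode("hello\nwor"))  # -> ['hello']
print(d.decode("ld\n"))        # -> ['world']
```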
Leading to enormous speedups when doing things such as Response(...).iter_lines() as described on issue encode#2422
Actually, the unit tests clued me in that …
Ah, but okay, there are some wacky behaviours to do with CRLF and CR line-feeds. I will work on the PR as I get some time 😂
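The "wacky behaviours" are easy to demonstrate: `str.splitlines()` treats `\n`, `\r`, and `\r\n` all as single line breaks, so a `\r\n` pair that straddles a chunk boundary looks like two separate breaks if each chunk is split independently. This is a short illustrative demo, not httpx code:

```python
# splitlines() recognizes "\n", "\r" and "\r\n" (among others) as breaks:
print("a\nb".splitlines())    # -> ['a', 'b']
print("a\rb".splitlines())    # -> ['a', 'b']
print("a\r\nb".splitlines())  # -> ['a', 'b']

# The chunk-boundary pitfall: "a\r\nb" split as a whole gives two lines,
# but splitting the chunks "a\r" and "\nb" independently injects a
# spurious empty line, because each half is seen as its own break.
whole = "a\r\nb".splitlines()
chunked = "a\r".splitlines() + "\nb".splitlines()
print(whole)    # -> ['a', 'b']
print(chunked)  # -> ['a', '', 'b']
```

A decoder therefore has to hold back a trailing `\r` until it sees whether the next chunk begins with `\n`, which is the kind of special-casing the comment is alluding to.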
…nt speed up (#2423)

* Replace quadratic algo in LineDecoder, leading to enormous speedups when doing things such as `Response(...).iter_lines()`, as described in issue #2422
* Update `httpx/_decoders.py`
* Update `_decoders.py`: handle text ending in `\r` more gracefully; return as much content as possible
* Update `test_decoders.py`
* Update `_decoders.py`
* Update `_decoders.py`
* Update `_decoders.py`
* Update `httpx/_decoders.py` (Co-authored-by: cdeler <serj.krotov@gmail.com>)
* Update `_decoders.py`

Co-authored-by: Tom Christie <tom@tomchristie.com>
Co-authored-by: cdeler <serj.krotov@gmail.com>
When calling `Response.iter_lines()`, things can seem to hang forever. The problem is that `LineDecoder` is quadratic in its string-copying behaviour. If a 31MB chunk with 18,768 lines is passed to `LineDecoder()`, it takes 1m45s to process, versus 0.1s for a simple `text.splitlines()`. This is readily reproduced by simply using `LineDecoder()` to decode a file like `/usr/share/dict/words` (do not attempt; you may be waiting until the heat death of the universe).

It may be a bug somewhere else that `iter_text()` is returning chunks that are too big. But either way, you probably want to reconsider the wisdom of slicing the string to chop the beginning off it versus just keeping an index. You can slice the string before returning from the decoder so it's only done once per chunk.
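The slicing-versus-index point can be shown with a small illustrative micro-benchmark (not httpx code; sizes are arbitrary): chopping the consumed line off the front of a big string copies the entire remaining tail on every iteration, which is quadratic overall, while a cursor into an unchanged string copies only each line.

```python
import time

line = "x" * 5_000 + "\n"
buf = line * 1_000  # ~5 MB, 1,000 lines

# Quadratic: slice the consumed line off the front each iteration.
start = time.perf_counter()
s = buf
while s:
    idx = s.index("\n")
    s = s[idx + 1:]          # copies everything after the line
quadratic = time.perf_counter() - start

# Linear: keep an index into the unchanged string instead.
start = time.perf_counter()
pos = 0
while pos < len(buf):
    idx = buf.index("\n", pos)
    current = buf[pos:idx]   # copies only the line itself
    pos = idx + 1
linear = time.perf_counter() - start

print(f"front-slicing: {quadratic:.3f}s  cursor: {linear:.3f}s")
```

On this shape of input the cursor version is typically orders of magnitude faster, which is the same effect driving the minutes-to-milliseconds numbers reported above.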
Anyhoo, thanks for the great work!