Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change LineDecoder to match stdlib splitlines, resulting in significant speed up #2423

Merged
merged 20 commits into from
Mar 16, 2023

Conversation

giannitedesco
Copy link
Contributor

@giannitedesco giannitedesco commented Oct 26, 2022

Closes #2422

Leading to enormous speedups when doing things such as Response(...).iter_lines()

@giannitedesco giannitedesco force-pushed the fast-line-decode branch 7 times, most recently from 08ad3ce to 50c6811 Compare October 26, 2022 12:23
@tomchristie
Copy link
Member

A solution using the built-in splitlines would probably be preferable to our own regex.
It might be worth taking a look at what requests does here?

@giannitedesco
Copy link
Contributor Author

A solution using the built-in splitlines would probably be preferable to our own regex. It might be worth taking a look at what requests does here?

Actually requests is using splitlines, too. So should I just switch to that and update the failing tests?

@tomchristie
Copy link
Member

So should I just switch to that and update the failing tests?

Switching to a splitlines based approach, using the requests implementation as a reference point would be a good starting point...

@giannitedesco
Copy link
Contributor Author

giannitedesco commented Nov 4, 2022

Haha, yes, the first implementation I did with splitlines happens to be identical to theirs when refactored (ie. theirs is a generator, not based an object 😂 but, yeah, that broke the tests which is when I went off on a tangent to fix that.

@giannitedesco giannitedesco force-pushed the fast-line-decode branch 7 times, most recently from 9cbd511 to 27b880e Compare November 6, 2022 07:12
@giannitedesco
Copy link
Contributor Author

@florimondmanca @tomchristie is this acceptable? I suppose the benefit of this API is that if you do something like ''.join(resp.iter_lines()) you recover the original text and that means you could, for example, hash, encrypt, or compress the response body incrementally by lines without altering its contents? So, while incompatible, maybe it could be considered an improvement? But not sure the reasoning behind the previous normalization of line endings so who can say?

@tomchristie
Copy link
Member

is this acceptable?

No. I don't believe resolving this issue should require the tests to change.

@tomchristie
Copy link
Member

tomchristie commented Nov 21, 2022

Hrm. I'd not appreciated that we have different behaviour from requests here. That's interesting.

(Our output includes trailing \n delimiters. Theirs does not.)

@giannitedesco
Copy link
Contributor Author

giannitedesco commented Nov 23, 2022

Right, I feel like there are 3 options to what the behaviour could be:

  1. no newlines (ala requests)
  2. all newline types replaced with "\n" (ala httpx now)
  3. all newlines preserved (result of this patch)

1 and 3 can be done efficiently and simply with "".splitlines(), and only 3 allows for reconstructing the exact original document after the fact (and therefore supports, eg. incrementally hashing a document line by line). (edit: i feel option 3 is the superior option since it supports the most use-cases)

Preservation of the existing semantics could be done with a regex perhaps, but it's tricky to get right. Or maybe a straight python implementation that just doesn't start from the beginning every time, it wouldn't be as fast as the ones which call to C code, but it could at least be linear-time?

Lemme know which direction you want to take it and I can give it a go..

@tomchristie
Copy link
Member

Right, I feel like there are 3 options to what the behaviour could be.

I'd suggest that we've got the current behaviour wrong, and that we need a breaking API change here.
The sensible behaviour for iter_lines would be to match the same style as the sodlib's splitlines().
So, yeah, I think we want option (1) here. (Right?)

@ofek
Copy link
Contributor

ofek commented Dec 7, 2022

I think we want option (1) here. (Right?)

Yes imo

@florimondmanca
Copy link
Member

florimondmanca commented Dec 8, 2022

After reading my comment pointing at a behavior change, it’s true that I would rather be expecting iter_lines() to return lines without newline separators at the end. This is what splitlines() does, but also other languages like Rust’s str.lines() (AdventOfCode season…), or JS when doing .split(/\r?\n/), etc.

What are the ways this change could impact existing code? I don’t see many. If code is currently using iter_lines() and calling trim() on the result, they shouldn’t be impacted, only those trim calls would become unnecessary. People can’t be relying on the current behavior to reconstruct the original content either, so there shouldn’t be breakage on this use case either.

httpx/_decoders.py Outdated Show resolved Hide resolved
httpx/_decoders.py Outdated Show resolved Hide resolved
@tomchristie tomchristie added the api change PRs that contain breaking public API changes label Jan 4, 2023
@tomchristie
Copy link
Member

Noticed that this implementation wouldn't return until flush for input that was just a sequence of \r characters, so I've adapted it slightly.

Implementation update in d3c6a8e.

Test change in 46cad89...

    decoder = LineDecoder()
    assert decoder.decode("") == []
    # This will now return `["a", "", "b"]` and *only* buffer the last portion.
    assert decoder.decode("a\r\rb\rc\r") == ["a", "", "b"]  
    assert decoder.flush() == ["c"]

@tomchristie tomchristie requested a review from a team January 10, 2023 14:40
httpx/_decoders.py Outdated Show resolved Hide resolved
httpx/_decoders.py Outdated Show resolved Hide resolved
@cdeler
Copy link
Member

cdeler commented Jan 13, 2023

LGTM (with a small style suggestion which can be easily ignored)

@zanieb
Copy link
Contributor

zanieb commented Jan 13, 2023

Can we also update this pull request title to note the change in behavior since it's no longer just a change in algo complexity?

tomchristie and others added 2 commits January 13, 2023 12:08
Co-authored-by: cdeler <serj.krotov@gmail.com>
@florimondmanca florimondmanca changed the title Replace quadratic algo in LineDecoder Change LineDecoder to match stdlib splitlines, resulting in significant speed up Feb 4, 2023
@tomchristie
Copy link
Member

Okay, I think we can merge this in now.
The line ending behaviour change means that we should bump the version on this one.

@tomchristie tomchristie merged commit 85c5898 into encode:master Mar 16, 2023
@ofek
Copy link
Contributor

ofek commented Mar 16, 2023

This is very exciting, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change PRs that contain breaking public API changes bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LineDecoder is accidentally quadratic: iter_lines() seems to hang forever
6 participants