Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse \uXXXX escapes faster #1172

Merged
merged 1 commit into from
Aug 12, 2024
Merged

Parse \uXXXX escapes faster #1172

merged 1 commit into from
Aug 12, 2024

Conversation

purplesyringa
Copy link
Contributor

When ignoring War and Peace (in Russian), this increases performance from 640 MB/s to 1080 MB/s (+70%).

When parsing into String, the savings are moderate but still significant: 275 MB/s to 320 MB/s (+15%).

@purplesyringa
Copy link
Contributor Author

purplesyringa commented Aug 11, 2024

Is changing error precision (see the changed test) okay? I couldn't sidestep that without sacrificing performance.

@dtolnay
Copy link
Member

dtolnay commented Aug 11, 2024

Is changing error precision (see the changed test) okay? I couldn't sidestep that without sacrificing performance.

The code previously pointed to a specific one of the expected hex digits:

"\u0000\u00#0\u0000"
           ^

Preserving that is not important to me.

The new code always points to the 6th byte after the backslash.

"\u0000\u00#0\u0000"
             ^

Would it be costly to point to the current escape's backslash instead? This would make more sense and hopefully be as simple as offsetting the index by -6 in the error codepath.

"\u0000\u00#0\u0000"
       ^

@purplesyringa
Copy link
Contributor Author

I can do that easily in the str case but not the I/O case. Is that fine?

@purplesyringa
Copy link
Contributor Author

Hmm. Now that I think about it, I'm not sure if I understand how error location works. position() is typically called after a call to next(), so the error points at the byte after the wrong one.

So e.g. here serde-json reports the error immediately after consuming malformed UTF-8:

json/tests/test.rs

Lines 1080 to 1083 in cf771a0

(
&[b'"', 159, 146, 150, b'"'],
"invalid unicode code point at line 1 column 5",
),

And here we say "line 2" after consuming the erroneous \n:

json/tests/test.rs

Lines 1108 to 1111 in cf771a0

(
&[b'"', b'\n', b'"'],
"control character (\\u0000-\\u001F) found while parsing a string at line 2 column 0",
),

I think throwing up pointing after an invalid escape (i.e. the current behavior) would be more consistent.

When ignoring *War and Peace* (in Russian), this increases performance
from 640 MB/s to 1080 MB/s (+70%).

When parsing into String, the savings are moderate but still
significant: 275 MB/s to 320 MB/s (+15%).
@dtolnay
Copy link
Member

dtolnay commented Aug 12, 2024

I'm not sure if I understand how error location works. position() is typically called after a call to next(), so the error points at the byte after the wrong one.

Ideally the position returned by position() is supposed to point at the byte most recently produced by next(). Separately there is peek_position() to return the position of the next byte not yet returned by next().

Sometimes there are codepaths where the byte which caused an error might have come from either next() or peek(), so the error position won't always be right in that case. Sometimes it's possible to move the error position computation earlier, so that it knows whether position() or peek_position() is the one to use. It is not intended in general that the reported error positions always refer to the next byte after the bad one.

json/src/read.rs

Lines 39 to 57 in b4bc643

/// Position of the most recent call to next().
///
/// The most recent call was probably next() and not peek(), but this method
/// should try to return a sensible result if the most recent call was
/// actually peek() because we don't always know.
///
/// Only called in case of an error, so performance is not important.
#[doc(hidden)]
fn position(&self) -> Position;
/// Position of the most recent call to peek().
///
/// The most recent call was probably peek() and not next(), but this method
/// should try to return a sensible result if the most recent call was
/// actually next() because we don't always know.
///
/// Only called in case of an error, so performance is not important.
#[doc(hidden)]
fn peek_position(&self) -> Position;

@dtolnay
Copy link
Member

dtolnay commented Aug 12, 2024

I can do that easily in the str case but not the I/O case. Is that fine?

That would be fine. Something approximate would also be fine, like column.saturating_sub(6) in the I/O case, which will be correct almost all of the time with the exception of something grossly malformed like:

"\u00
0"

Copy link
Member

@dtolnay dtolnay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good!

I am interested in error position improvements (not just for \u but the other ones you called out too) but that can happen separately if you are interested in looking into it.

@dtolnay dtolnay merged commit d8921cd into serde-rs:master Aug 12, 2024
13 checks passed
@purplesyringa
Copy link
Contributor Author

Thanks!

@purplesyringa purplesyringa deleted the faster-hex branch August 12, 2024 20:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants