Parse \uXXXX escapes faster #1172

purplesyringa · 2024-08-11T19:19:05Z

When ignoring War and Peace (in Russian), this increases performance from 640 MB/s to 1080 MB/s (+70%).

When parsing into String, the savings are moderate but still significant: 275 MB/s to 320 MB/s (+15%).

purplesyringa · 2024-08-11T19:19:50Z

Is changing error precision (see the changed test) okay? I couldn't sidestep that without sacrificing performance.
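
This thread does not show the new decoding code, so the following is only a rough illustration of the trade-off being asked about. It assumes a table-driven decoder (an assumption, not something confirmed here) that looks up all four hex digits and checks validity once at the end. The names `hex_value`, `HEX`, and `decode_four_hex` are hypothetical. Because there is a single validity check, the decoder knows that some digit was bad but not which one.

```rust
/// Map an ASCII byte to its hex value, or -1 if it is not a hex digit.
const fn hex_value(b: u8) -> i16 {
    match b {
        b'0'..=b'9' => (b - b'0') as i16,
        b'a'..=b'f' => (b - b'a' + 10) as i16,
        b'A'..=b'F' => (b - b'A' + 10) as i16,
        _ => -1,
    }
}

/// 256-entry lookup table built at compile time (hypothetical layout).
const HEX: [i16; 256] = {
    let mut table = [0i16; 256];
    let mut i = 0;
    while i < 256 {
        table[i] = hex_value(i as u8);
        i += 1;
    }
    table
};

/// Decode four hex digits at once. Returns None if any digit is invalid,
/// without recording which of the four it was.
fn decode_four_hex(d: [u8; 4]) -> Option<u16> {
    let a = HEX[d[0] as usize];
    let b = HEX[d[1] as usize];
    let c = HEX[d[2] as usize];
    let e = HEX[d[3] as usize];
    // -1 has every bit set, so the OR is negative if any lookup failed.
    if (a | b | c | e) < 0 {
        return None;
    }
    Some(((a as u16) << 12) | ((b as u16) << 8) | ((c as u16) << 4) | (e as u16))
}
```

With this shape, `decode_four_hex(*b"00#0")` yields `None` just like any other malformed group, so an error raised afterwards can only refer to the escape as a whole.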

dtolnay · 2024-08-11T21:18:22Z

> Is changing error precision (see the changed test) okay? I couldn't sidestep that without sacrificing performance.

The code previously pointed to a specific one of the expected hex digits:

"\u0000\u00#0\u0000"
           ^

Preserving that is not important to me.

The new code always points to the 6th byte after the backslash.

"\u0000\u00#0\u0000"
             ^

Would it be costly to point to the current escape's backslash instead? This would make more sense and hopefully be as simple as offsetting the index by -6 in the error codepath.

"\u0000\u00#0\u0000"
       ^
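
For reference, the three candidate locations in the diagrams above can be computed as follows. This is a minimal sketch over a byte slice, not code from the PR; `EscapeError` and `first_bad_unicode_escape` are hypothetical names, and the indices are 0-based byte offsets into the quoted input.

```rust
/// The three candidate error locations, as 0-based byte indices.
#[derive(Debug, PartialEq)]
struct EscapeError {
    /// Start of the offending escape.
    backslash: usize,
    /// First non-hex byte among the four digits.
    bad_digit: usize,
    /// Six bytes past the backslash.
    after_escape: usize,
}

/// Hypothetical helper: locate the first malformed `\uXXXX` escape.
/// Truncated escapes at the end of input are not handled (returns None).
fn first_bad_unicode_escape(input: &[u8]) -> Option<EscapeError> {
    let mut i = 0;
    while i + 1 < input.len() {
        if input[i] != b'\\' {
            i += 1;
            continue;
        }
        if input[i + 1] != b'u' {
            // Some other two-byte escape such as `\n` or `\\`; skip it.
            i += 2;
            continue;
        }
        let digits = input.get(i + 2..i + 6)?;
        if let Some(offset) = digits.iter().position(|b| !b.is_ascii_hexdigit()) {
            return Some(EscapeError {
                backslash: i,
                bad_digit: i + 2 + offset,
                after_escape: i + 6,
            });
        }
        i += 6;
    }
    None
}
```

For the input in the diagrams, the backslash is at index 7, the bad digit at index 11, and the byte after the escape at index 13, so the proposed location is the current one minus 6.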

purplesyringa · 2024-08-11T21:25:01Z

I can do that easily in the str case but not the I/O case. Is that fine?

purplesyringa · 2024-08-11T21:34:22Z

Hmm. Now that I think about it, I'm not sure if I understand how error location works. position() is typically called after a call to next(), so the error points at the byte after the wrong one.

So e.g. here serde-json reports the error immediately after consuming malformed UTF-8:

json/tests/test.rs, lines 1080 to 1083 in cf771a0:

    (
        &[b'"', 159, 146, 150, b'"'],
        "invalid unicode code point at line 1 column 5",
    ),

And here we say "line 2" after consuming the erroneous \n:

json/tests/test.rs, lines 1108 to 1111 in cf771a0:

    (
        &[b'"', b'\n', b'"'],
        "control character (\\u0000-\\u001F) found while parsing a string at line 2 column 0",
    ),

I think reporting the error at the position just after an invalid escape (i.e. the current behavior) would be more consistent.

dtolnay · 2024-08-12T20:21:00Z

> I'm not sure if I understand how error location works. position() is typically called after a call to next(), so the error points at the byte after the wrong one.

Ideally the position returned by position() is supposed to point at the byte most recently produced by next(). Separately there is peek_position() to return the position of the next byte not yet returned by next().

Sometimes there are codepaths where the byte which caused an error might have come from either next() or peek(), so the error position won't always be right in that case. Sometimes it's possible to move the error position computation earlier, so that it knows whether position() or peek_position() is the one to use. It is not intended in general that the reported error positions always refer to the next byte after the bad one.

json/src/read.rs, lines 39 to 57 in b4bc643:

    /// Position of the most recent call to next().
    ///
    /// The most recent call was probably next() and not peek(), but this method
    /// should try to return a sensible result if the most recent call was
    /// actually peek() because we don't always know.
    ///
    /// Only called in case of an error, so performance is not important.
    #[doc(hidden)]
    fn position(&self) -> Position;

    /// Position of the most recent call to peek().
    ///
    /// The most recent call was probably peek() and not next(), but this method
    /// should try to return a sensible result if the most recent call was
    /// actually next() because we don't always know.
    ///
    /// Only called in case of an error, so performance is not important.
    #[doc(hidden)]
    fn peek_position(&self) -> Position;
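
To make the distinction concrete, here is a minimal sketch of a slice-backed reader with both methods. It is an illustration written for this thread, not the library's implementation: `SliceCursor` and `position_of_index` are hypothetical names, and the line/column convention (line starts at 1, column counts bytes consumed on the current line) is an assumption chosen to match the two test expectations quoted earlier.

```rust
/// Line/column pair; a stand-in for the `Position` type in the quoted trait.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Position {
    line: usize,
    column: usize,
}

/// Hypothetical slice-backed cursor, for illustration only.
struct SliceCursor<'a> {
    bytes: &'a [u8],
    /// Index of the byte that the next call to `next()` will return.
    index: usize,
}

impl<'a> SliceCursor<'a> {
    fn new(bytes: &'a [u8]) -> Self {
        SliceCursor { bytes, index: 0 }
    }

    /// Consume and return the next byte, if any.
    fn next(&mut self) -> Option<u8> {
        let byte = self.bytes.get(self.index).copied();
        if byte.is_some() {
            self.index += 1;
        }
        byte
    }

    /// Look at the next byte without consuming it.
    fn peek(&self) -> Option<u8> {
        self.bytes.get(self.index).copied()
    }

    /// Line and column reached after consuming `bytes[..i]`.
    fn position_of_index(&self, i: usize) -> Position {
        let mut pos = Position { line: 1, column: 0 };
        for &b in &self.bytes[..i] {
            if b == b'\n' {
                pos.line += 1;
                pos.column = 0;
            } else {
                pos.column += 1;
            }
        }
        pos
    }

    /// Position of the byte most recently returned by `next()`.
    fn position(&self) -> Position {
        self.position_of_index(self.index)
    }

    /// Position of the byte that `peek()` would return.
    fn peek_position(&self) -> Position {
        self.position_of_index((self.index + 1).min(self.bytes.len()))
    }
}
```

Under these assumptions, consuming `"` and then `\n` leaves `position()` at line 2 column 0, which is the location in the control-character test message, while `peek_position()` is one byte further along.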

dtolnay · 2024-08-12T20:24:46Z

> I can do that easily in the str case but not the I/O case. Is that fine?

That would be fine. Something approximate would also be fine, like column.saturating_sub(6) in the I/O case, which will be correct almost all of the time with the exception of something grossly malformed like:

"\u00
0"
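
A minimal sketch of that approximation, assuming a plain line/column pair (the `LineCol` type and `approx_escape_start` function are hypothetical, not part of the library):

```rust
/// Hypothetical line/column pair used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct LineCol {
    line: usize,
    column: usize,
}

/// Approximate the start of a `\uXXXX` escape from the position reached
/// after reading it, by stepping back six columns on the same line.
/// `saturating_sub` clamps at column 0 instead of underflowing.
fn approx_escape_start(after_escape: LineCol) -> LineCol {
    LineCol {
        line: after_escape.line,
        column: after_escape.column.saturating_sub(6),
    }
}
```

When the whole escape sits on one line, this steps back exactly six bytes. In the malformed example above, the escape spans a newline, so the reader ends up on line 2 near column 1 and the result clamps to line 2 column 0, even though the backslash is on line 1. That is the rare inexact case being accepted.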

dtolnay

This looks good!

I am interested in error position improvements (not just for \u but the other ones you called out too) but that can happen separately if you are interested in looking into it.

purplesyringa · 2024-08-12T20:27:51Z

Thanks!

Parse \uXXXX escapes faster

86d0e11

purplesyringa force-pushed the faster-hex branch from e10a88a to 86d0e11 on August 12, 2024 09:02

dtolnay approved these changes Aug 12, 2024

dtolnay merged commit d8921cd into serde-rs:master Aug 12, 2024
13 checks passed

purplesyringa deleted the faster-hex branch August 12, 2024 20:35
