
Efficient deserialization for Buffer<u8> #103

Merged — 7 commits, merged on Mar 18, 2023

Conversation

@jleibs (Contributor) commented Feb 22, 2023

This implements ArrowDeserialize for Buffer as part of #101

There are a few hoops to jump through: BinaryArray<i32> normally exposes an iterator over &[u8] slices, whereas we want to expose the Buffer primitive directly so we can take advantage of the Arc-backed sharing of arrow Buffers.
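
For illustration, a rough sketch of the kind of usage this enables, assuming Buffer<u8> maps to arrow's Binary type by default (the real trait impls live in the diff; this is only a sketch):

```rust
use arrow2::array::{Array, BinaryArray};
use arrow2::buffer::Buffer;
use arrow2_convert::deserialize::TryIntoCollection;

fn example() -> arrow2::error::Result<()> {
    // A BinaryArray whose values all live in one contiguous buffer.
    let array: Box<dyn Array> =
        Box::new(BinaryArray::<i32>::from_slice(["hello", "arrow"]));

    // Copying path: each element is materialized as an owned Vec<u8>.
    let owned: Vec<Vec<u8>> = array.as_ref().try_into_collection()?;

    // Zero-copy path (this PR): each element is a Buffer<u8> that just bumps
    // the Arc on the shared values buffer and records an offset/length.
    let shared: Vec<Buffer<u8>> = array.as_ref().try_into_collection()?;

    assert_eq!(owned[0].as_slice(), b"hello".as_slice());
    assert_eq!(shared[1].as_slice(), b"arrow".as_slice());
    Ok(())
}
```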

@jleibs marked this pull request as draft February 22, 2023 00:59
@jleibs marked this pull request as ready for review February 22, 2023 01:27
@codecov-commenter commented Feb 22, 2023

Codecov Report

Merging #103 (14e772c) into main (b2fd8f8) will decrease coverage by 0.05%.
The diff coverage is 91.30%.


@@            Coverage Diff             @@
##             main     #103      +/-   ##
==========================================
- Coverage   94.63%   94.58%   -0.05%     
==========================================
  Files           8        8              
  Lines        1565     1588      +23     
==========================================
+ Hits         1481     1502      +21     
- Misses         84       86       +2     
Impacted Files                        Coverage Δ
arrow2_convert/src/field.rs           100.00% <ø> (ø)
arrow2_convert/src/deserialize.rs      98.78% <91.30%> (-1.22%) ⬇️


@emilk (Contributor) commented Feb 22, 2023

It could be nice to cover Buffer<u8> in arrow2_convert/benches/bench.rs too

@jleibs (Contributor, Author) commented Feb 22, 2023

> It could be nice to cover Buffer<u8> in arrow2_convert/benches/bench.rs too

Interestingly, these benches suggest this work may not have been strictly necessary: Vec<u8> shows the same constant-time behavior as Buffer<u8>. Curious whether that is unique to the bench or a more general optimization.
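
For context, a minimal criterion sketch of the kind of comparison involved is below. The actual arrow2_convert/benches/bench.rs layout and the data shape behind these numbers aren't shown in this thread, so the array construction and element sizes here are assumptions; the real harness also sets a Throughput so criterion reports elem/s:

```rust
use arrow2::array::{Array, BinaryArray};
use arrow2::buffer::Buffer;
use arrow2_convert::deserialize::TryIntoCollection;
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};

fn bench_deserialize(c: &mut Criterion) {
    let mut group = c.benchmark_group("deserialize");
    for n in [10usize, 100, 1_000, 10_000] {
        // A BinaryArray with `n` small elements (hypothetical test data).
        let array: Box<dyn Array> =
            Box::new(BinaryArray::<i32>::from_slice(vec![b"0123456789".as_slice(); n]));

        group.throughput(Throughput::Elements(n as u64));

        group.bench_function(BenchmarkId::new("BufferU8", n), |b| {
            // Zero-copy: each element is an Arc-backed view into the values buffer.
            b.iter(|| {
                let v: Vec<Buffer<u8>> = array.as_ref().try_into_collection().unwrap();
                v
            })
        });

        group.bench_function(BenchmarkId::new("VecU8", n), |b| {
            // Copying: each element is materialized as an owned Vec<u8>.
            b.iter(|| {
                let v: Vec<Vec<u8>> = array.as_ref().try_into_collection().unwrap();
                v
            })
        });
    }
    group.finish();
}

criterion_group!(benches, bench_deserialize);
criterion_main!(benches);
```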

deserialize/BufferU8/10 time:   [61.707 ns 61.950 ns 62.262 ns]
                        thrpt:  [160.61 Melem/s 161.42 Melem/s 162.06 Melem/s]
                        
deserialize/VecU8/10    time:   [70.875 ns 71.151 ns 71.433 ns]
                        thrpt:  [139.99 Melem/s 140.55 Melem/s 141.09 Melem/s]

deserialize/BufferU32/10
                        time:   [162.66 ns 163.88 ns 165.04 ns]
                        thrpt:  [60.593 Melem/s 61.019 Melem/s 61.478 Melem/s]

deserialize/VecU32/10   time:   [169.86 ns 171.55 ns 173.81 ns]
                        thrpt:  [57.535 Melem/s 58.291 Melem/s 58.871 Melem/s]

deserialize/BufferU8/100
                        time:   [61.751 ns 61.865 ns 61.995 ns]
                        thrpt:  [1.6130 Gelem/s 1.6164 Gelem/s 1.6194 Gelem/s]

deserialize/VecU8/100   time:   [71.856 ns 72.057 ns 72.305 ns]
                        thrpt:  [1.3830 Gelem/s 1.3878 Gelem/s 1.3917 Gelem/s]

deserialize/BufferU32/100
                        time:   [160.29 ns 161.14 ns 161.90 ns]
                        thrpt:  [617.67 Melem/s 620.59 Melem/s 623.88 Melem/s]

deserialize/VecU32/100  time:   [214.01 ns 215.51 ns 217.32 ns]
                        thrpt:  [460.16 Melem/s 464.02 Melem/s 467.26 Melem/s]

deserialize/BufferU8/1000
                        time:   [61.111 ns 61.215 ns 61.338 ns]
                        thrpt:  [16.303 Gelem/s 16.336 Gelem/s 16.364 Gelem/s]

deserialize/VecU8/1000  time:   [73.965 ns 74.617 ns 75.273 ns]
                        thrpt:  [13.285 Gelem/s 13.402 Gelem/s 13.520 Gelem/s]

deserialize/BufferU32/1000
                        time:   [159.60 ns 160.39 ns 161.10 ns]
                        thrpt:  [6.2075 Gelem/s 6.2348 Gelem/s 6.2655 Gelem/s]

deserialize/VecU32/1000 time:   [659.44 ns 660.73 ns 662.17 ns]
                        thrpt:  [1.5102 Gelem/s 1.5135 Gelem/s 1.5164 Gelem/s]

deserialize/BufferU8/10000
                        time:   [61.588 ns 61.714 ns 61.856 ns]
                        thrpt:  [161.67 Gelem/s 162.04 Gelem/s 162.37 Gelem/s]

deserialize/VecU8/10000 time:   [70.506 ns 70.661 ns 70.800 ns]
                        thrpt:  [141.24 Gelem/s 141.52 Gelem/s 141.83 Gelem/s]

deserialize/BufferU32/10000
                        time:   [160.37 ns 161.42 ns 162.49 ns]
                        thrpt:  [61.544 Gelem/s 61.950 Gelem/s 62.354 Gelem/s]

deserialize/VecU32/10000
                        time:   [5.5109 µs 5.5266 µs 5.5450 µs]
                        thrpt:  [1.8034 Gelem/s 1.8094 Gelem/s 1.8146 Gelem/s]

@ncpenke (Collaborator) commented Mar 18, 2023

Thanks for the change!

@ncpenke merged commit 6450acb into DataEngineeringLabs:main on Mar 18, 2023