Remove Reader::scratch_buffer field + resulting breaking API changes. #421

Open · wants to merge 26 commits into base: next
Conversation

anforowicz
Contributor

PTAL?

This mostly follows what we've discussed in #417.

If we merge this, then some coordination may be required for releasing a new version of the png crate and then updating the image crate:

  • I am happy to volunteer to update the image crate after a new version of the png crate is released. OTOH, I am not sure how to get notified when this happens?
  • I am not sure when exactly it would be desirable to release a new version of the png crate. It may be worth waiting until some additional changes get merged:
    • additional performance-related improvements (e.g. PRs that stem from the "copy avoidance series of commits")
    • additional performance-related breaking-changes (e.g. I think that Read => BufRead is on the table - I want to try measuring this after flushing out other changes from the "copy avoidance series of commits")

@anforowicz
Contributor Author

Hmmm... after having already submitted the PR, I now started to wonder:

  • Maybe the remaining API should just be called next_row (rather than next_interlaced_row)

  • Maybe I should simplify the description in CHANGES.md to something like:

      ```
      * Breaking API changes:
         - Removing the `Row` and `InterlacedRow` structs
         - Removing the `Reader::next_interlaced_row` method
         - Changing the signature of the `Reader::next_row` method
      ```
    

@fintelia
Contributor

fintelia commented Nov 2, 2023

I would like to eventually make this API change, but for now I wonder if it would be simpler to make next_frame call self.next_pass() and self.next_interlaced_row_impl() directly. The scratch buffer variable would still exist, but it would only be allocated/used if the user specifically requested rows one-by-one.
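A rough sketch of that alternative (a sketch only; the struct layout and helper names are guesses at the crate's internals, not real API):

```
struct Reader {
    // Stays empty unless the caller actually uses the row-by-row API.
    scratch_buffer: Vec<u8>,
}

impl Reader {
    fn next_frame(&mut self, buf: &mut [u8]) {
        let _ = buf;
        // Would drive the internal helpers directly, e.g.:
        //   while self.next_pass() { self.next_interlaced_row_impl(buf); }
        // decoding straight into `buf`; `self.scratch_buffer` stays untouched.
    }

    fn next_row(&mut self) -> &[u8] {
        if self.scratch_buffer.is_empty() {
            // Allocated lazily, only for row-by-row callers.
            self.scratch_buffer.resize(4096, 0);
        }
        // ... decode the next row into `self.scratch_buffer` ...
        &self.scratch_buffer
    }
}
```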

@anforowicz
Contributor Author

One motivation for the changes in this PR is to improve the performance of image::codecs::png::PngReader which calls into next_row. Changes to next_frame wouldn't help with that.

It seems that sometimes decoding row-by-row (or at least in smaller-than-frame chunks) may be desirable if pixels need some kind of post-processing (e.g. applying gamma, alpha premultiplication, etc.) and the user of the png crate wants to do such post-processing while the image data is in the L1 cache. image::codecs::png::PngReader is compatible with that idea, but next_frame would add considerably more L1 cache pressure. ("seems" because I don't have hard data and/or benchmarks to confirm this. But it does seem like a valid idea. I think.)
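For illustration, here is how that row-by-row pattern looks against the current (pre-PR) png API; the gamma look-up table is just a stand-in for whatever post-processing a caller actually needs:

```
use std::fs::File;
use std::io::BufReader;

fn decode_with_row_postprocessing(
    path: &str,
    gamma_lut: &[u8; 256],
) -> Result<Vec<u8>, png::DecodingError> {
    let decoder = png::Decoder::new(BufReader::new(File::open(path).unwrap()));
    let mut reader = decoder.read_info()?;
    let mut output = Vec::with_capacity(reader.output_buffer_size());
    while let Some(row) = reader.next_row()? {
        // Post-process each row while its pixels are (hopefully) still in L1.
        output.extend(row.data().iter().map(|&b| gamma_lut[b as usize]));
    }
    Ok(output)
}
```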

@anforowicz
Contributor Author

anforowicz commented Nov 2, 2023

Is the main concern the desire to avoid breaking changes?

One way to address this concern may be to instead add a new API (next_row2?) and the #[deprecated] attribute to the old APIs (structs and methods).

OTOH, IMO some breaking API changes may be required for performance one way or another. So maybe another way to proceed is to wait until those other breaking changes have PRs with measurements that show their impact for the noncompressed benches and then land all the breaking changes around the same time. (For example, I plan to send the Read => BufRead changes after first flushing out 2 less controversial PRs - so, in a week or two maybe, depending on how the other PRs are received :-))

@fintelia
Contributor

fintelia commented Nov 3, 2023

The goal is to bundle all the breaking changes we'd like to make into a single 0.18 release, rather than have a couple breaking releases in rapid succession. But that means that breaking changes will take longer to land than non-breaking changes. So if there are pieces that can be split out, it likely makes sense to merge those first.

These changes generally look good to me though, so I'm fine with letting them linger until we're ready to do a breaking release

@anforowicz
Contributor Author

The goal is to bundle all the breaking changes we'd like to make into a single 0.18 release, rather than have a couple breaking releases in rapid succession.
...
These changes generally look good to me though, so I'm fine with letting them linger until we're ready to do a breaking release

Ack. That sounds good and is totally reasonable.

Just as an FYI, let me point out that currently I think that I may want to make the following 3 breaking changes:

  • This PR (Reader::next_row taking &mut [u8] instead of returning &[u8]) - see the signature sketch after this list
  • Requiring that the reader passed to the png crate is already BufRead and not just Read (this is in the spirit of copy avoidance but wasn't explicitly covered in my recent measurements). Like this PR, that other PR will probably also require follow-up changes in the image crate.
  • Changing ZlibStream::decompress (and ZlibStream::finish_compressed_chunks) so it takes &mut [u8] instead of taking and appending to &mut Vec<u8>. This has been a part of the copy-avoidance-series-of-commits and has been covered by my recent measurements. This is technically a breaking API change, but this is a fairly low-level API that I think/hope shouldn't see that much usage (or at least should see less usage than the other changes above).
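For concreteness, a sketch of the first and third signature changes (the "after" return type for next_row borrows from the read_row proposal later in this thread; the exact final shapes are still open):

```
// 1. `Reader::next_row` - before: the row borrows the reader's internal buffer.
//    pub fn next_row(&mut self) -> Result<Option<Row<'_>>, DecodingError>
//    After (proposed): the caller provides the buffer, so no internal copy is needed.
//    pub fn next_row(&mut self, row: &mut [u8]) -> Result<Option<InterlaceInfo>, DecodingError>

// 3. `ZlibStream::decompress` - before: appends to a caller-provided Vec.
//    pub fn decompress(&mut self, data: &[u8], image_data: &mut Vec<u8>) -> ...
//    After (proposed): writes into a caller-provided slice.
//    pub fn decompress(&mut self, data: &[u8], image_data: &mut [u8]) -> ...
```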

I think that it's probably desirable to first flush out 3-4 other PRs with non-breaking changes (starting with the noncompressed benchmarks). Flushing those out will help to highlight/magnify the relative improvement from the breaking changes (at least from the last 2; the current PR isn't directly covered by the benchmarks in the png crate and I am hesitant to expand the scope/shape of the benchmarks just to justify the PR).

@anforowicz
Contributor Author

Status update:

fintelia and others added 19 commits January 28, 2024 17:44
This commit moves `expand_paletted_into_rgb8` and
`expand_paletted_into_rgba8` (and their unit tests) into a separate
`transform/palette.rs` module.  This prepares room for encapsulating
extra complexity in this module in follow-up commits, where we will
start to precompute and memoize some data when creating a `TransformFn`.

This commit just moves the code around - it should have no impact on
correctness or performance.
The `PLTE` chunk's size should be a multiple of 3 (since it contains RGB
entries - 3 bytes per entry).

Additionally, taking 10000 samples in the `bench_create_fn` benchmarks
is a bit excessive after memoization.
This commit changes the `TransformFn` type alias from `fn(...)` into
`Box<dyn Fn(...)>`.  This allows the `TransformFn` to store some
precomputed, memoized state that we plan to add in follow-up commits.

In theory this commit may have a negative performance impact, but in the
grand scheme of things it disappears into the measurement noise.  In
particular, when there is no state, `Box` shouldn't allocate.
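A simplified illustration of the change (argument types reduced; the real `TransformFn` operates on the crate's row and `Info` types):

```
// Before: a plain function pointer -- no room for captured state.
type TransformFnOld = fn(&[u8], &mut [u8]);

// After: a boxed closure that can own precomputed, memoized state.
type TransformFnNew = Box<dyn Fn(&[u8], &mut [u8])>;

fn main() {
    // A zero-capture closure is a zero-sized type, so this `Box::new` does
    // not actually allocate -- the stateless case stays cheap.
    let stateless: TransformFnNew = Box::new(|input, output| {
        output[..input.len()].copy_from_slice(input);
    });
    let mut out = [0u8; 3];
    stateless(b"abc", &mut out);
    assert_eq!(&out, b"abc");
}
```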
Before this commit `expand_paletted_into_rgba8` would:

* Perform 2 lookups - `palette.get(i)` and `trns.get(i)`
* Check via `unwrap_or` if `i` was within the bounds of `palette`/`trns`

This commit introduces `create_rgba_palette` which combines `palette`
and `trns` into a fixed-size `[[u8;4]; 256]` look-up table (called
`rgba_palette` in the code).  After this commit
`expand_paletted_into_rgba8` only needs to perform a single look-up and
doesn't need to check the bounds.  This helps to improve the expansion
time by 60+%:

- expand_paletted(exec)/trns=yes/src_bits=4/src_size=5461:
  [-60.208% -60.057% -59.899%] (p = 0.00 < 0.05)
- expand_paletted(exec)/trns=yes/src_bits=8/src_size=5461:
  [-77.520% -77.407% -77.301%] (p = 0.00 < 0.05)

`expand_paletted_into_rgb8` performs only a single lookup before and
after this commit, but avoiding bounds checks still helps to improve the
expansion time by ~12%:

- expand_paletted(exec)/trns=no/src_bits=4/src_size=5461:
  [-12.357% -12.005% -11.664%] (p = 0.00 < 0.05)
- expand_paletted(exec)/trns=no/src_bits=8/src_size=5461:
  [-13.135% -12.584% -12.092%] (p = 0.00 < 0.05)

Understandably, this commit regresses the time of `create_transform_fn`.
Future commits will reduce this regression by a factor of 2-4:

- expand_paletted(ctor)/plte=256/trns=256:
  [+3757.2% +3763.8% +3770.5%] (p = 0.00 < 0.05)
- expand_paletted(ctor)/plte=224/trns=32:
  [+3807.3% +3816.2% +3824.6%] (p = 0.00 < 0.05)
- expand_paletted(ctor)/plte=16/trns=1:
  [+1672.0% +1675.0% +1678.1%] (p = 0.00 < 0.05)
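A minimal sketch of the look-up-table construction this commit describes (the real `create_rgba_palette` may differ in details):

```
fn create_rgba_palette(palette: &[u8], trns: &[u8]) -> [[u8; 4]; 256] {
    // Opaque black by default; indices beyond the actual palette size now
    // hit one of these defaults instead of needing a per-pixel bounds check.
    let mut rgba_palette = [[0, 0, 0, 0xFF]; 256];
    for (i, rgb) in palette.chunks_exact(3).take(256).enumerate() {
        rgba_palette[i][..3].copy_from_slice(rgb);
    }
    for (i, &alpha) in trns.iter().take(256).enumerate() {
        rgba_palette[i][3] = alpha;
    }
    rgba_palette
}

fn expand_paletted_into_rgba8(input: &[u8], output: &mut [u8], rgba_palette: &[[u8; 4]; 256]) {
    // A single look-up per pixel; `idx as usize` is always < 256, so the
    // compiler can elide the bounds check on the fixed-size table.
    for (&idx, chunk) in input.iter().zip(output.chunks_exact_mut(4)) {
        chunk.copy_from_slice(&rgba_palette[idx as usize]);
    }
}
```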
Before this commit `expand_into_rgb8` would copy 3 bytes at a time into
the output.  After this commit it copies 4 bytes at a time (possibly
clobbering pixels that will be populated during the next iteration -
this is ok).  This improved the performance as follows:

expand_paletted(exec)/trns=no/src_bits=8/src_size=5461
time:   [-23.852% -23.593% -23.319%] (p = 0.00 < 0.05)
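Roughly, the 4-byte trick looks like this (a sketch, not the crate's exact code):

```
fn expand_paletted_into_rgb8(input: &[u8], output: &mut [u8], rgba_palette: &[[u8; 4]; 256]) {
    // `output` holds 3 bytes per input pixel.
    let mut i = 0;
    for &idx in input {
        if i + 4 <= output.len() {
            // Write 4 bytes for a 3-byte pixel; the spare byte lands where
            // the next pixel starts and is overwritten on the next iteration.
            // A fixed-size 4-byte copy is cheaper than a 3-byte one.
            output[i..i + 4].copy_from_slice(&rgba_palette[idx as usize]);
        } else {
            // Last pixel: no room for the trick, copy exactly 3 bytes.
            output[i..i + 3].copy_from_slice(&rgba_palette[idx as usize][..3]);
        }
        i += 3;
    }
}
```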
This improves the performance as follows:

- expand_paletted(ctor)/plte=256/trns=256
  [-40.581% -40.396% -40.211%] (p = 0.00 < 0.05)
- expand_paletted(ctor)/plte=224/trns=32
  [-24.070% -23.840% -23.592%] (p = 0.00 < 0.05)

Small palettes are mostly unaffected:

- expand_paletted(ctor)/plte=16/trns=1
  [-0.2525% +0.0338% +0.3239%] (p = 0.81 > 0.05)
Remove remaining uses of miniz_oxide for decoding
waywardmonkeys and others added 7 commits June 29, 2024 11:39
…te-usages

Remove usages of `extern crate`
This commit is desirable because it avoids copying image data across
intermediate buffers.  This commit was motivated by the data gathered in
image-rs#416 (comment).
The commit results in the following performance gains seen in the
recently introduced `row-by-row/128x128-4k-idat` benchmark:

- time: [-18.401% -17.807% -17.192%] (p = 0.00 < 0.05)
- time: [-9.4276% -8.8789% -8.2860%] (p = 0.00 < 0.05)
- time: [-12.389% -11.780% -11.181%] (p = 0.00 < 0.05)

Fixes image-rs#417
@anforowicz
Contributor Author

@fintelia, can you PTAL again?

Process notes

I see that this PR has been rebased on top of https://github.com/image-rs/image-png/tree/next, but this branch is 24 commits behind https://github.com/image-rs/image-png/tree/master, and this makes GitHub quite confused. In particular, it is not quite correct that in this PR “anforowicz wants to merge 26 commits” (as GitHub says at the top) - I just want to merge the 2 commits here: master...anforowicz:image-png:scratch-buffer-removal: b7d0c06 and 29e90cc. So, maybe you want to rebase on top of the master branch again?

I am also not sure what the right process is for landing breaking API changes. I went ahead and bumped the version to 0.18.0 in CHANGES.md and Cargo.toml but please shout if I shouldn’t be doing that just yet.

Performance Impact

Performance of next_row matters mostly in scenarios that require some post-processing (e.g. doing a gamma correction or rgba8=>bgra8 conversion). Reasons for post-processing after decoding each row (rather than after decoding the whole image) include 1) incremental/progressive/partial decoding scenarios (this matters both for interlaced and non-interlaced images) and 2) desire to do the post-processing while the row pixels are still in L1 cache. The performance gains are mostly limited to scenarios where post-processing can happen in-place (otherwise, the memory copy would still happen as part of the post-processing) - assuming that rgba8 images can be post-processed in place this covers around 54% (rgba8 images) to 89% (rgba8 + indexed images) of scenarios involving web images (percentages are based on the data here).

This PR adds a benchmark that covers next_row of a generated, non-compressed image (i.e. focusing on parsing and memory copies instead of decompression overhead). The PR improves the benchmark results by 8-17% (see the commit description for more details).

In theory the changes in this PR can also result in a small performance gain in some non-incremental decoding scenarios in next_frame - avoiding an extra copy of 7th-pass pixels of interlaced images (decoding these pixels directly into the target buffer instead of calling adam7::expand_pass). I haven’t implemented these changes, because interlaced images seem to only happen in 1-2% of web images (see the data here).

Breaking Changes

At this point this PR is the only breaking change that I plan to do (given that the BufReader-related changes don’t give clear performance gains as discussed elsewhere). I think this means that the earlier comment “to bundle all the breaking changes” doesn’t apply anymore.

I note that a while ago the image crate stopped calling the next_row function (see the commit here). This reduces the impact / scope of the proposed breaking changes (although image is just one crate).

If the breaking API changes are nevertheless concerning, then we can also keep the old APIs (maybe marking them as #[deprecated]) and introduce a separate new API. If we want to do that, then we’d have to decide what to call the new API:

  • The old APIs are pub fn next_row(&mut self) -> Result<Option<Row<'_>>, DecodingError> and pub fn next_interlaced_row ….
  • Maybe the new API can look like pub fn read_row(&mut self, row: &mut [u8]) -> Result<Option<InterlaceInfo>, DecodingError> (i.e. naming it read_row instead of next_row). I don’t quite like the asymmetry between next_frame and read_row, but this should work.
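If we went the read_row route, usage might look like this (hypothetical: `read_row` is the proposal above, not a shipped method):

```
fn consume_rows<R: std::io::Read>(
    reader: &mut png::Reader<R>,
) -> Result<(), png::DecodingError> {
    let width = reader.info().width;
    let mut row = vec![0u8; reader.output_line_size(width)];
    // `read_row` is the proposed API sketched in this comment.
    while let Some(_interlace_info) = reader.read_row(&mut row)? {
        // `row` now holds one decoded row; post-process it in place here.
    }
    Ok(())
}
```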

@fintelia
Contributor

I'm supportive of this PR, but I'm too burned out right now to review everything and manage a major release. As far as the BufRead changes, the last image crate release did add a BufRead bound for decoders so at least that aspect no longer applies.

@anforowicz
Contributor Author

anforowicz commented Aug 16, 2024

I'm supportive of this PR, but I'm too burned out right now to review everything and manage a major release. As far as the BufRead changes, the last image crate release did add a BufRead bound for decoders so at least that aspect no longer applies.

Ack. My thinking is gradually evolving toward keeping Reader.next_row for now:

  • This avoids a breaking change. And even if we wanted to make the breaking changes, it probably would be desirable to initially mark next_row as #[deprecated] and only later (maybe after another major release?) remove it.
  • Reader.next_row has reasonable performance if the caller unconditionally post-processes and copies (because in this case no memory copy is saved). And in such a scenario having Reader.next_row is a nice convenience.

So, now I am leaning toward abandoning the current PR and introducing a separate, new API instead in #493.

WDYT?
