Parallel decoding #161
Conversation
I'm not sure if the task strategy is correct yet. It doesn't seem very 'rayon idiomatic'.
I'd be more comfortable with the task strategy if we allocate a fixed number of parallel resources (that is, prepare buffers for some determined degree of parallelism that depends on the required buffer sizes vs. available allocation size vs. rayon's available core count), and have an iterator defining the chunks+target buffers that need processing. If those task-resources are in a `Vec<_>`, then we can use `flat_map_iter` to actually execute the task. Each task could then create an 'iterator' that polls from the shared chunk iterator (by some means of synchronization). That would guarantee a precise upper bound on the allocated buffer size while also preserving those allocations between chunks.
Can you imagine something like this working?
```rust
let condvar = std::sync::Condvar::new();
let bytes_allocated = std::sync::Mutex::new(Ok(0));
```
Somewhat questionable to me? Blocking a rayon worker seems very unintuitive, and after the jpeg-decoder experience there should be more comments on correctness/forward progress.
Added a bunch of comments. The general idea is that taking the lock should never deadlock because all threads are careful to never do any blocking operations while holding the lock. We do sleep on the condition variable from the calling thread, but never from one of rayon's worker threads. As long as the worker threads make forward progress the calling thread should eventually be woken up.
```rust
let condvar = std::sync::Condvar::new();
let bytes_allocated = std::sync::Mutex::new(Ok(0));
for x in 0..chunks_across {
    rayon::in_place_scope_fifo(|s| -> TiffResult<()> {
```
We're already starting a new scope for each x coordinate; can't we ensure the allocation requirement by ensuring that at most `n` tasks run in parallel, where `n` is `intermediate_buffer_size / chunk_bytes.iter().max()`? Or an adaptive version instead.
I'd also be more comfortable if tasks were generated as an iterator instead of this somewhat complex case distinction of different loops, if that makes sense. In particular this makes it easier to transition to using some of the rayon parallel iterators in the future?
The two loops are needed to work around a borrow checker issue for tiled images (for stripped images the outer loop executes exactly one iteration).
It may be possible to replace the inner loop with an iterator that generates the tasks. However, I think that would be tricky to implement while still doing file reads and chunk expands in parallel
@HeroicKatora how do you feel about the current state? If you'd prefer we iterate more, I can split out the rayon-parallelized expand_chunk function into a separate PR to let us merge the other code movement/refactoring parts. That would reduce the potential for merge conflicts with other concurrent PRs.
Sounds good to split it up. Consolidating chunk is a very good and orthogonal change compared to the prior explicit line/tile differentiation.
Ok, I've rebased this on the main branch, so now this just includes the parallel decoding part. I also looked into other ways of using rayon for decoding. Unfortunately, we're rather constrained by the lack of a …
This PR adds parallel decoding via an optional (but enabled by default) dependency on rayon.