
Add fold_mut alternative to Iterator trait #76746

Closed
wants to merge 3 commits into from

Conversation

mlodato517
Contributor

This PR
Adds a fold_mut alternative to fold for the Iterator trait
Closes #76725

Why?
Most of the background is in #76725, but the TL;DR is that it might be faster and might be more ergonomic in some cases.

Questions:

  1. I've never contributed library code here before - once I added the #[unstable] attribute, the tests started failing. Do I need to use nightly Rust? Do I need to specify the feature config somewhere? Do I need that specified for the tests and benchmarks?
  2. Are those docs sufficient? I was offloading everything to fold(), which is maybe a little lazy.
  3. Do we want all those benchmarks? Is it bloating the library/causing unnecessary time increase in the benchmarks?
  4. Should this be implemented for other structs that have custom fold implementations like VecDeque, etc.?
  5. Should we track a separate issue to improve the performance of fold itself? I imagine that, in general, the runtimes should be the same
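
For concreteness, here is a runnable sketch of what's being proposed, written as a hypothetical extension trait (outside of core you can't add methods to Iterator directly); the default body mirrors the PR's while let implementation:

```rust
// Hypothetical sketch of the proposed `fold_mut`: the closure takes the
// accumulator by `&mut` and returns nothing, instead of taking and
// returning it by value as `fold` does.
trait FoldMut: Iterator {
    fn fold_mut<B, F>(mut self, init: B, mut f: F) -> B
    where
        Self: Sized,
        F: FnMut(&mut B, Self::Item),
    {
        let mut accum = init;
        while let Some(x) = self.next() {
            f(&mut accum, x);
        }
        accum
    }
}

// Blanket impl so every iterator gets the method.
impl<I: Iterator> FoldMut for I {}

fn evens(nums: &[i32]) -> Vec<i32> {
    // No trailing `v` needed in the closure body.
    nums.iter().fold_mut(Vec::new(), |v, &i| {
        if i % 2 == 0 {
            v.push(i);
        }
    })
}
```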

@rust-highfive
Collaborator

r? @joshtriplett

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 15, 2020
@mlodato517 mlodato517 changed the title Add fold_mut alternative Add fold_mut alternative to Iterator trait Sep 15, 2020
@workingjubilee
Member

workingjubilee commented Sep 15, 2020

Needs to be updated to pass x.py check / x.py test tidy.

warning: build failed, waiting for other jobs to finish...
error[E0658]: use of unstable library feature 'iterator_fold_mut'
    --> library/core/tests/iter.rs:3229:30
     |
3229 |     let result = nums.iter().fold_mut(Vec::new(), |v, i| {
     |                              ^^^^^^^^
     |
     = note: see issue #76725 for more information
     = help: add `#![feature(iterator_fold_mut)]` to the crate attributes to enable

Yes, Rust is built with the nightly compiler (actually, the full bootstrap process initiated by x.py starts with the beta, usually).
You may want to read https://rustc-dev-guide.rust-lang.org/implementing_new_features.html

@jyn514 jyn514 added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. A-iterators Area: Iterators labels Sep 15, 2020
@mlodato517
Contributor Author

Sorry, I'm not 100% sure what to do here - I ran rustup default nightly, added #![feature(iterator_fold_mut)] in library/core/tests/iter.rs, and ran ./x.py c -i library/core/, and I see

error[E0658]: use of unstable library feature 'iterator_fold_mut'
    --> library/core/tests/iter.rs:3230:30
     |
3230 |     let result = nums.iter().fold_mut(Vec::new(), |v, i| {
     |                              ^^^^^^^^
     |
     = note: see issue #76725 <https://github.com/rust-lang/rust/issues/76725> for more information
     = help: add `#![feature(iterator_fold_mut)]` to the crate attributes to enable

What's the right way to set this up so that feature is enabled?

Also, I'm seeing

error: an inner attribute is not permitted in this context
    --> library/core/tests/iter.rs:3226:1
     |
3226 | #![feature(iterator_fold_mut)]
     | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     |
     = note: inner attributes, like `#![no_std]`, annotate the item enclosing them, and are usually found at the beginning of source files. Outer attributes, like `#[test]`, annotate the item following them.

in the doctest. I copied that from fold_first, and the compiler doesn't complain about that one - do you know what I'm missing there?

I wasn't sure if I needed to add it to the list in rustc_feature/src/active.rs but I tried it (I hope with the correct syntax) and that didn't help either, nor did adding it to 🤔

Is there a good commit/PR for me to track the right way to do things? I was originally following #65222 but I don't see anything particularly sneaky.

@leonardo-m

Currently the stdlib lacks some essential iterators (present in Itertools). Every function added to iterators increases cognitive load, and time to find the right function, and memory load to remember its purpose. I am not saying this shouldn't fit in stdlib, but please consider the costs.

    let mut accum = init;
    while let Some(x) = self.next() {
        f(&mut accum, x);
    }
Member

I think the default should be implemented in terms of fold, so it automatically takes advantage of all the iterators that already specialize this path.

self.fold(init, |mut acc, x| { f(&mut acc, x); acc })

Even better if that isolates the closure generics like #62429, so it only depends on Self::Item rather than all of Self (plus B and F of course).
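
That suggestion sketched out as a hypothetical extension trait (so the snippet is self-contained and runnable outside of core):

```rust
// cuviper's suggestion: define `fold_mut` on top of `fold`, so it
// automatically benefits from every iterator's specialized `fold`.
trait FoldMutViaFold: Iterator {
    fn fold_mut<B, F>(self, init: B, mut f: F) -> B
    where
        Self: Sized,
        F: FnMut(&mut B, Self::Item),
    {
        // Thread the accumulator through `fold`, but hand the closure a
        // mutable reference and return the value ourselves.
        self.fold(init, |mut acc, x| {
            f(&mut acc, x);
            acc
        })
    }
}

impl<I: Iterator> FoldMutViaFold for I {}

fn sum_squares(n: u32) -> u32 {
    (1..=n).fold_mut(0, |acc, x| *acc += x * x)
}
```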

Member

Try with inputs like Chain or FlatMap to see the performance difference of custom fold.

Contributor Author

I think the default should be implemented in terms of fold, so it automatically takes advantage of all the iterators that already specialize this path.

I like this idea! I tried it here and the results seem to indicate that this brought back the performance "issue" (assuming the benchmark is built right, Criterion is doing its thing correctly, and I'm interpreting the results correctly):
[criterion plot]

Should I push forward with that? I was originally making fold_mut for the performance but we could instead make it about the shape of the closure and just wait on improvements to fold if we think that's valuable

Try with inputs like Chain or FlatMap to see the performance difference of custom fold.

Good call, I'll whip up a few of these when I get the chance!

Contributor Author
@mlodato517 mlodato517 Sep 16, 2020

Eureka! I added some benchmarks (with terrible names and with terrible duplication that should be replaced with a macro) in 2921101 and it appears that fold_mut is much slower when using chain and flat_map!

[benchmark results plot]

Unfortunately they're a little bit out of order - I will definitely pick better names if we decide to go through with this PR but basically for chain and flat_map, fold_mut is like 50%-100% slower.

For a non-chain-non-flat-map fold on a hashmap, they're about the same (except fold_mut has a huge variance so I'm not sure about that there).

For a non-chain-non-flat-map fold on a number, fold_mut is slower now (interesting that it has now switched; I guess that means they're basically "the same within precision"?).

For a non-chain-non-flat-map fold on a Vec, fold_mut is faster by some huge margin that is suspicious.

I think maybe I'll port these benchmarks back to my other repo so I can use criterion for (maybe) more consistent data? Or should we just pull the plug on this? Or redirect the effort towards the ergonomics and not worry about performance?

Member

it appears that fold_mut is much slower when using chain and flat_map!

That's expected because of the while let (essentially a for loop), as cuviper said/implied. If you just replace the (conceptual) for loop with a .for_each( I suspect you'd recover the performance:

         let mut accum = init;
-        while let Some(x) = self.next() {
-            f(&mut accum, x);
-        }
+        self.for_each(|x| f(&mut accum, x));

(insert comment about #62429 here and how that shouldn't be the implementation in core, but it'd be fine for benchmarking)
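
That for_each-based variant, as a standalone sketch (a hypothetical extension trait again - the default for_each is itself just a fold with a () accumulator, so this also picks up specialized folds on adaptors like Chain and FlatMap):

```rust
// `fold_mut` implemented via `for_each`, per scottmcm's diff above.
trait FoldMutExt: Iterator {
    fn fold_mut<B, F>(self, init: B, mut f: F) -> B
    where
        Self: Sized,
        F: FnMut(&mut B, Self::Item),
    {
        let mut accum = init;
        // The closure mutably borrows `accum`; ownership never leaves scope.
        self.for_each(|x| f(&mut accum, x));
        accum
    }
}

impl<I: Iterator> FoldMutExt for I {}

fn chained_sum() -> i32 {
    // `Chain` is exactly the case where the `while let` version was slow.
    (0..3).chain(3..5).fold_mut(0, |acc, x| *acc += x)
}
```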

Contributor Author

Hmm that went a little over my head, let me see if I understand:

  1. #62429 ("Reduce the genericity of closures in the iterator traits") introduced some specialized iterator methods (fold among them) for certain iterators, reducing the number of types the iterator closure is generic over
  2. By implementing a separate fold_mut that doesn't use fold, fold_mut misses out on the specialized fold implementations for those iterators
  3. By switching to for_each, fold_mut may recover the performance because, I assume, there are similarly specialized for_each implementations for several iterators

Interested to know if any of those are in the right ballpark!

And for expedience (maybe), assuming those points are in the correct ballpark, may I ask:

  1. how does the custom implementation, which reduces the generic types on the closure, improve performance? I can understand the comments about reducing the number of monomorphizations and thus code size, but it's not immediately obvious how this plays into runtime performance. Or is it just smaller = faster here? Or is it that, with fewer trait bounds, the compiler is able to do some additional optimization?
  2. how should this be reflected in this PR? My original hope was to have a fold that was "closer" to a "zero cost abstraction". It's seeming more and more like that isn't super possible (except with maybe the for_each construction above?). Should I bail on the "performance" of fold_mut and double down on the ergonomics of the closure by defining fold_mut in terms of fold and then fold_mut will maybe become more of a zero cost abstraction as fold is specialized on other iterators like Vec::IntoIter or something?

Member
@cuviper cuviper Sep 17, 2020

The concerns of #62429 are mostly separate from the concerns of fold specialization. The overlap is just that using fold or for_each often introduces a generic closure.

The point of #62429 is that closures inherit all of the generic type parameters of the surrounding scope. In your example that would be B, F, and Self -- but the closure doesn't really need the whole iterator type, just Self::Item. So you can end up generating the same closure for <B, F, vec::IntoIter<i32>>, <B, F, Cloned<hash_set::Iter<'_, i32>>>, <B, F, Map<Chain<Range<i32>, Once<i32>>, map-closure{...}>>, etc., when our closure could be just <B, F, Item = i32>. This is a compile-time concern for excessive monomorphizations, and can also make it exceed type-length limits. There has been some compiler work to trim unused parameters, but it still doesn't know how to reduce that Self to Self::Item. So the trick in #62429 is to use a "closure factory" function with tighter control over the generic parameters.
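
A rough sketch of that "closure factory" trick (the helper name is made up; the real ones live in core's iterator adaptors): the factory's generics mention only the accumulator and item types, never the iterator type Self, so one monomorphized closure can serve many iterator types.

```rust
// The generic parameters here are only `B` (accumulator) and `T` (item),
// not the iterator type, so `<B, F, Item = i32>` is shared across
// `vec::IntoIter<i32>`, `Range<i32>`, etc.
fn mut_fold_adapter<B, T>(mut f: impl FnMut(&mut B, T)) -> impl FnMut(B, T) -> B {
    move |mut acc, x| {
        f(&mut acc, x);
        acc
    }
}

fn demo() -> i32 {
    // Two very different iterator types, one adapter closure shape.
    let a = (0..4).fold(0, mut_fold_adapter(|acc, x| *acc += x));
    let b = vec![10, 20].into_iter().fold(0, mut_fold_adapter(|acc, x| *acc += x));
    a + b
}
```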

The fold specialization is more about runtime performance, which is why it changes your benchmarks. For example, Chain::next() has to check its state every time it is called, whether to pull an item from its first or second iterator, whereas Chain::fold() can just fold all of the first iterator and then all of the second. The default for_each is just a fold with an empty () accumulator, which is why it also benefits here.
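
A simplified model of that difference (not the real core implementation): a next-based loop over Chain must branch on which half it's in for every item, while the specialized fold just runs each half to completion back to back.

```rust
// What `Chain::fold` effectively does, modeled over a pair of Vecs:
// exhaust the first source, then the second, with no per-item state check.
fn chain_fold<T, B>(
    first: Vec<T>,
    second: Vec<T>,
    init: B,
    mut f: impl FnMut(B, T) -> B,
) -> B {
    let mut acc = init;
    for x in first {
        acc = f(acc, x); // fold all of the first iterator...
    }
    for x in second {
        acc = f(acc, x); // ...then all of the second
    }
    acc
}
```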

Contributor Author

Ahh, thank you for that explanation!

The fold specialization is more about runtime performance, which is why it changes your benchmarks. For example, Chain::next() has to check its state every time it is called, whether to pull an item from its first or second iterator, whereas Chain::fold() can just fold all of the first iterator and then all of the second.

Awesome, this makes a ton of sense

The default for_each is just a fold with an empty () accumulator, which is why it also benefits here.

Woah, did not see that coming! So I originally read that thinking, "Great, I'll rewrite fold_mut using for_each as mentioned above and I'll get the benefits of fold improvements plus maybe Rust can tell in for_each that moving and reassigning () is a no-op and we'll keep the performance improvements on 'simple' iterators".

I'm running out of benchmark steam (is there a better way to do this than running ./x.py bench -i library/core/ --test-args mark_bench_fold? - it takes about a half hour on my machine) but gave this a shot - I defined three methods

  1. fold_mut which uses while let
  2. fold_mut_fold which uses fold under the hood
  3. fold_mut_each which uses for_each under the hood

And ran 8 benchmarks. 4 were operating on (0..100000).map(black_box) (named _simple) and 4 were operating on (0i64..1000000).chain(0..1000000).map(black_box) (named _chain). Each test was (hopefully) calculating the sum of all the even numbers:

test iter::mark_bench_fold_chain                                ... bench:   2,287,252 ns/iter (+/- 160,890)
test iter::mark_bench_fold_chain_mut                            ... bench:   1,969,051 ns/iter (+/- 75,847)
test iter::mark_bench_fold_chain_mut_each                       ... bench:   2,516,875 ns/iter (+/- 123,425)
test iter::mark_bench_fold_chain_mut_fold                       ... bench:   2,363,194 ns/iter (+/- 123,658)
test iter::mark_bench_fold_simple                               ... bench:      57,271 ns/iter (+/- 3,691)
test iter::mark_bench_fold_simple_mut                           ... bench:      58,071 ns/iter (+/- 3,410)
test iter::mark_bench_fold_simple_mut_each                      ... bench:     540,887 ns/iter (+/- 5,221)
test iter::mark_bench_fold_simple_mut_fold                      ... bench:      57,627 ns/iter (+/- 4,896)

so I think, based on all these shifty benchmarks moving around so much, maybe these are all within statistical uncertainty (combined with whatever my computer is doing at any given time). This was super fun to play around with, but I think I'm going to bow out of the "maybe fold_mut will be faster!" argument :-D

Member

maybe Rust can tell in for_each that moving and reassigning () is a no-op

It's a zero sized type (ZST), so there's literally nothing to do in codegen.

I'm really surprised that it did poorly in your benchmark, but my guess is that there was some unlucky inlining (or lack thereof), especially if parts of that benchmark got split across codegen-units.

/// assert_eq!(counts[&'a'], 5);
/// ```
#[inline]
#[unstable(feature = "iterator_fold_mut", issue = "76725")]
Member

This issue needs to be a real tracking issue, not the feature request.
See other instances of issues labeled C-tracking-issue and T-libs.

Contributor Author

Hmm I created #76751 but I don't obviously see how to add T-libs to it 🤔

Member

It got tagged but in the future you can use @rustbot modify labels: +(label), -(label)


@scottmcm
Member

scottmcm commented Sep 15, 2020

It's not obvious to me that this is materially different from a for_each with a FnMut closure.

To look at the example in the issue,

iter.fold(Vec::new(), |mut v, x| {
  v.push(x);
  v // this line is a little weird
})

How does the proposed fold_mut compare to just doing something like this instead?

let mut v = Vec::new();
iter.for_each(|x| v.push(x));
let v = v;

EDIT after thinking some more:

To me, the point of fold is that the accumulator is owned. Emulating a fold with a normal .for_each is annoying because you're stuck doing an Option::take dance (or similar) to get ownership of the original item. But when one only needs by-reference access to the accumulator (as in fold_mut), then I'd say it's better to just write it by capturing it in the closure, rather than manually threading it down through all the calls and always returning it.

Now I wonder if this could be made a clippy lint, especially given the optimization gap that you've identified here...


@workingjubilee
Member

It is different because it enables my predilection of doing peculiar things with method chains. 😉

Perhaps that is not enough to justify the addition of a new method, but I would argue the included benchmarks should land (with the relevant code moved next to the benchmarks if necessary) as something to test against, and a future FIXME I guess, because this overhead should in fact be optimized away. I write code in similar patterns fairly often, and I know I have seen many others do so as well, even without my prompting.

@mlodato517
Contributor Author

It's not obvious to me that this is materially different from a for_each with a FnMut closure.

To look at the example in the issue,

iter.fold(Vec::new(), |mut v, x| {
  v.push(x);
  v // this line is a little weird
})

How does the proposed fold_mut compare to just doing something like this instead?

let mut v = Vec::new();
iter.for_each(|x| v.push(x));
let v = v;

I assume the performance is the same (because I assume for_each is the same as a for loop, which I've already benched). I think the main benefit is hiding the mutability inside the closure, along with fold being a common idiom to reach for. I imagine most folds could be rewritten with for_eachs, but I could be missing some ownership nuance there.

Generally, I'd prefer a faster fold over a fold_mut and I didn't know if someone would look at fold and fold_mut and go, "Aha! We must not be doing X when compiling fold! We can just turn this knob here and look at that, fold is twice as fast across the board!".

@mlodato517
Contributor Author

So would the plan forward be to add benchmarks and add the fold_mut implementation directly in the benchmark to show that fold needs to be changed in some way and we leave the issue open for optimizations on fold?

@cuviper
Member

cuviper commented Sep 15, 2020

I imagine most folds could be rewritten with for_eachs but I could be missing some ownership nuance there.

If the operation needs to consume the accumulator to produce the next one, you need fold. You can't consume captured values in a for_each closure because that makes it FnOnce, or else you have to play ugly games with Option::take.
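
A sketch of that "ugly game" for a step that genuinely consumes its accumulator (String concatenation with +, which takes the left operand by value; the function name is made up):

```rust
// Capturing a plain `String` and writing `acc + w` would move `acc` out of
// the closure, making it `FnOnce` - which `for_each` rejects. Wrapping the
// accumulator in an `Option` and using `take` works around that.
fn concat_words(words: &[&str]) -> String {
    let mut acc = Some(String::new());
    words.iter().copied().for_each(|w| {
        let s = acc.take().unwrap() + w; // move the String out of the Option
        acc = Some(s);                   // ...and put it back for the next item
    });
    acc.unwrap()
}
```

With fold this is simply `words.iter().copied().fold(String::new(), |s, w| s + w)`, which is cuviper's point: by-value accumulation is fold's home turf.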

@mlodato517
Contributor Author

Yeah I'd be interested in seeing how people generally use fold. I haven't rusted much but I've only ever written inline closures. I imagine there are cases where the user wants to leverage a pre-existing function that doesn't take references and already returns the folded value and I'd be interested in seeing how prevalent each style is!

@scottmcm
Member

Well, it wasn't hard to find a fold-a-string example in the compiler. I don't know what it says, though, since it also turned out to be a place where rewriting to the mutable closure was simple and (negligibly) shorter:

         fn tokens_to_string(tokens: &[TokenType]) -> String {
             let mut i = tokens.iter();
             // This might be a sign we need a connect method on `Iterator`.
-            let b = i.next().map_or(String::new(), |t| t.to_string());
-            i.enumerate().fold(b, |mut b, (i, a)| {
+            let mut b = i.next().map_or(String::new(), |t| t.to_string());
+            i.enumerate().for_each(|(i, a)| {
                 if tokens.len() > 2 && i == tokens.len() - 2 {
                     b.push_str(", or ");
                 } else if tokens.len() == 2 && i == tokens.len() - 2 {
                     b.push_str(" or ");
                 } else {
                     b.push_str(", ");
                 }
                 b.push_str(&a.to_string());
-                b
-            })
+            });
+            b
         }

         let mut expected = edible

Passing methods directly seems pretty rare; it seems more common to end up putting them in a closure anyway, like the following few examples I found quickly:

self.attrs.iter().fold(self.span, |acc, attr| acc.to(attr.span))
.fold(Fingerprint::ZERO, |a, b| a.combine_commutative(b));
exprs.iter().rev().fold(succ, |succ, expr| self.propagate_through_expr(&expr, succ))

@mlodato517
Contributor Author

Ah, thanks for doing the digging! Definitely looks cleaner with the method in there. So maybe fold_mut isn't particularly useful? Do we care to add it for the performance difference or just punt and leave this as a feature request to improve fold? Or in the docs we could note that if the closure doesn't require ownership of the accumulator, it could be faster to use for_each? Or is that kind of a gross "suggest the wrong thing for the (questionably) right reason"?

I guess if anyone has a performance issue with fold they might discover this issue/PR and know what to do up until fold itself is improved?

@mlodato517
Contributor Author

After lots of benchmarking and discussing (and maybe learning 🤞), I'm feeling that fold_mut probably isn't faster than fold in any statistically relevant way. I'm still happy to push this through to offer an optional method for when a consuming closure is "awkward" but I'm having a hard time deciding if that's valuable enough to the standard library/community to add.

Looking for input - do we want this alternate method in the standard library?

@cuviper
Member

cuviper commented Sep 17, 2020

do we want this alternate method in the standard library?

I'm leaning toward "no", given that the current options are pretty short and reasonable.
Maybe it would fit the itertools crate?

@mlodato517 mlodato517 closed this Sep 18, 2020
@mlodato517 mlodato517 deleted the ml-fold-mut branch September 18, 2020 18:22
@mlodato517
Contributor Author

mlodato517 commented Sep 18, 2020

I was trying to figure out why the Vec benchmarks were consistently "in favor of" fold_mut while the number and HashMap benchmarks didn't care so much.

I began to wonder if it has to do with the size of the accumulator. I don't know how HashMap is represented, but Vec is roughly three usizes, which on a 64-bit machine is three times the size of an i64, so maybe the move that happens in each fold iteration has more data to shuffle around and slows down.

Did a super quick 'n dirty test here and it seems to hold up. It looks like fold gets slower as the accumulator gets larger and the movement in the closure gets more cumbersome (or some other reason).
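
A sketch of the hypothesis, with a made-up multi-word accumulator struct. This only demonstrates that the two shapes compute the same result - the suggested difference is in how often the accumulator gets moved, which shows up in codegen and timing rather than in output:

```rust
// A hypothetical accumulator several machine words wide, unlike a bare i64.
#[derive(Default, PartialEq, Debug)]
struct Big {
    sums: [u64; 4],
}

fn by_fold(n: u64) -> Big {
    // `fold` moves the accumulator into and out of the closure each iteration.
    (0..n).fold(Big::default(), |mut b, x| {
        b.sums[(x % 4) as usize] += x;
        b
    })
}

fn by_mut(n: u64) -> Big {
    // A plain loop mutates the accumulator in place; it is never moved.
    let mut b = Big::default();
    for x in 0..n {
        b.sums[(x % 4) as usize] += x;
    }
    b
}
```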

Labels
A-iterators Area: Iterators S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Successfully merging this pull request may close these issues.

Iterator::fold is a little slow compared to bare loop
9 participants