
[API Proposal]: Improve the performance of enumerating over interface types. #62266

Closed
geeknoid opened this issue Dec 2, 2021 · 18 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Runtime

Comments

@geeknoid
Member

geeknoid commented Dec 2, 2021

Background and motivation

Interface-based programming is encouraged as a best practice. Unfortunately, enumerating collections that are exposed through interface types triggers memory allocations and involves 2 interface calls per item iterated. In other words, enumerating an interface is slow:

List<int> l0 = new List<int>();
IList<int> l1 = new List<int>();

foreach (var x in l0)
{
    // speedy due to List<T>.Enumerator
}

foreach (var x in l1)
{
    // slow due to allocating an enumerator object, and making 2 interface calls
    // (IEnumerator<T>.MoveNext and IEnumerator<T>.Current) for each item
}

API Proposal

Please see https://github.com/geeknoid/AcceleratedEnumeration for a simple model that can deliver considerable perf benefits by avoiding allocations in many cases and cutting down the number of interface calls. The README file in that repo has perf numbers. The code I show provides benefits for most enumerations of IEnumerable, IList, IReadOnlyList, IDictionary, IReadOnlyDictionary, ISet, and IReadOnlySet.

The basic idea is to do something like:

foreach (var x in l.AcceleratedEnum())
{
}

The above works like normal enumeration, just faster. Although the model I show here requires explicit coding by the programmer to leverage, it would be conceivable to build similar tech directly into C# so that enumeration of interface types just gets faster by magic.

Note that my implementation also handles null collections, which is a nice simplification for the developer. You don't need to do an if check before you enumerate.
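To make the shape of the proposal concrete, here is a minimal sketch of how such an extension could work. This is not the repo's actual implementation; the `AcceleratedEnum` name comes from the proposal, but everything else (type names, the `IReadOnlyList<T>` fast path, the null handling) is illustrative. The key trick is a struct enumerable/enumerator pair, so `foreach` binds to the struct members directly instead of allocating a boxed `IEnumerator<T>`, and indexable collections get one interface call per item instead of two:

```csharp
using System.Collections.Generic;

public static class AcceleratedEnumerationSketch
{
    public static AcceleratedEnumerable<T> AcceleratedEnum<T>(this IEnumerable<T>? source)
        => new AcceleratedEnumerable<T>(source);
}

public readonly struct AcceleratedEnumerable<T>
{
    private readonly IEnumerable<T>? _source;
    public AcceleratedEnumerable(IEnumerable<T>? source) => _source = source;
    public Enumerator GetEnumerator() => new Enumerator(_source);

    public struct Enumerator
    {
        private readonly IReadOnlyList<T>? _list;  // fast path: one interface call (the indexer) per item
        private readonly IEnumerator<T>? _inner;   // fallback: ordinary interface enumerator
        private int _index;

        public Enumerator(IEnumerable<T>? source)
        {
            _index = -1;
            _list = source as IReadOnlyList<T>;
            // A null source enumerates as empty, so callers skip the null check.
            _inner = (_list is null && source is not null) ? source.GetEnumerator() : null;
        }

        public T Current => _list is not null ? _list[_index] : _inner!.Current;

        public bool MoveNext()
            => _list is not null ? ++_index < _list.Count
                                 : _inner is not null && _inner.MoveNext();
    }
}
```

A real implementation would also need to dispose the fallback enumerator and, as discussed below, the fast path skips `List<T>`'s version check, so mutation during enumeration goes undetected.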

API Usage

foreach (var x in l.AcceleratedEnum())
{
}

Alternative Designs

As I mentioned above, you can imagine building this ugly stuff straight into the C# compiler. In fact, putting this into the compiler can likely deliver perf benefits by using byref semantics for the enumerator structs, which would avoid copying overhead. This is not possible right now since the C# compiler doesn't know how to deal with enumerators by reference.

Risks

The design intentionally avoids extra error checks, so edge cases will throw different exceptions than in the non-accelerated case. In particular, the implementation of IEnumerator.Current when there isn't anything in the collection throws different exceptions. That could be remedied at the cost of a few cycles.

@geeknoid geeknoid added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Dec 2, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Dec 2, 2021
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@davidfowl
Member

@EgorBo @AndyAyersMS Would guarded devirt + DynamicPGO help here? Is that our best avenue?

@huoyaoyuan
Member

Note that your AcceleratedEnumerationExtensions.List changes the semantics of enumeration: it saves a lot by not checking for modification during enumeration.

@EgorBo
Member

EgorBo commented Dec 2, 2021

@davidfowl @geeknoid with DynamicPGO we devirtualize all calls here but are not able to avoid allocation (boxing) of an Enumerator (happens just once before the loop):
[codegen screenshot]

I suspect in order to also optimize that small allocation we need that @AndyAyersMS's idea to use escape analysis.

@geeknoid
Member Author

geeknoid commented Dec 2, 2021

Note that getting rid of the allocation, especially when enumerating empty collections, was the motivation for trying out this idea.

@geeknoid
Member Author

geeknoid commented Dec 2, 2021

Note that your AcceleratedEnumerationExtensions.List changes the semantics of enumeration: it saves a lot by not checking for modification during enumeration.

@huoyaoyuan Good point, I forgot to bring that up. I've always felt it was an architectural misstep to put versioning logic into such low-level abstractions as List and Dictionary, so there's no love lost for me there. But it does change the semantics relative to plain ol' enumeration.
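For readers following along, a minimal repro of the semantic difference under discussion: `List<T>`'s own enumerator tracks an internal version counter and throws if the list is mutated mid-enumeration, while an index-based fast path (as in the proposal) skips that check, so the same mutation would go undetected:

```csharp
using System;
using System.Collections.Generic;

var list = new List<int> { 1, 2, 3 };

bool threw = false;
try
{
    foreach (var x in list)
    {
        list.Add(x);            // mutation invalidates the enumerator
    }
}
catch (InvalidOperationException)
{
    threw = true;               // the standard enumerator detects the change
}
// threw == true
```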

@EgorBo
Member

EgorBo commented Dec 2, 2021

Note that getting rid of the allocation, especially when enumerating empty collections, was the motivation for trying out this idea.

Well, it's a small allocation that doesn't happen on every iteration of that foreach, so I personally don't think it's worth a special API when it could eventually be fixed in the JIT instead.

@geeknoid
Member Author

geeknoid commented Dec 2, 2021

I'm looking at this in the broader context of optimizing our service fleet. Small allocations done dozens or hundreds of times per request add up. We currently have ~300 uses of foreach throughout our code base, and we're just a set of libraries that services build on. So they'll add their own enumeration overhead on top.

I highly doubt this can be systematically addressed by the compiler. There are many cases where an interface is genuinely polymorphic in nature and so it cannot be devirtualized. Or the code can potentially be used in LINQ queries and so must remain general and allocate an IEnumerable, etc.

Enumeration is a very common pattern in code; dedicated optimizations to reduce the implicit overhead it induces are meaningful to the bottom line.

@ghost

ghost commented Dec 2, 2021

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.


@AndyAyersMS
Member

As @EgorBo says above, using PGO to de-abstract things like this is a long-term goal. It's not clear how much progress we'll make on this in the next release, as we are still working through .NET 7 plans.

@geeknoid if you have realistic polymorphic examples in mind, it would be nice to add those to your perf suite so we can see how well some new ideas we have in mind for PGO might cope.

@geeknoid
Member Author

geeknoid commented Dec 3, 2021

@AndyAyersMS I've got a few examples already in my benchmark. I have IEnumerable implemented on top of different concrete types, or IList implemented on top of different concrete types. If you then write code such as:

void DoSomething(IEnumerable<T> x)
{
}

DoSomething can be called with different implementations of IEnumerable. Now you're already generating different code for different instances of T, so conceivably you could generate different code for different concrete types implementing IEnumerable too. But I suspect a lot of this analysis will peter out at some point and leave you with persistent allocations and some interface based dispatch.

@davidfowl
Member

FWIW we have LOTS of examples of this in ASP.NET and in dotnet/runtime. I made a set of changes in .NET 6 to remove IEnumerable.GetEnumerator allocations when the backing collection is a T[] (which it always is when using dependency injection). ASP.NET Core uses abstract IList, IReadOnlyList, and IEnumerable a lot. Recently we've been taking advantage of CollectionsMarshal.AsSpan to remove enumerator allocations and speed up enumeration over List<T>.
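For context, a minimal example of the CollectionsMarshal.AsSpan technique mentioned above (the API lives in System.Runtime.InteropServices and shipped in .NET 5; the surrounding code is just illustrative usage):

```csharp
using System.Collections.Generic;
using System.Runtime.InteropServices;

List<int> list = new() { 1, 2, 3 };
int sum = 0;

// CollectionsMarshal.AsSpan exposes the list's backing array as a Span<int>,
// so the loop compiles down to plain indexed array access: no enumerator
// allocation, no interface calls, no version checks. Caveat: the span is
// invalidated if the list grows, so don't change the list's size while the
// span is in use.
foreach (int x in CollectionsMarshal.AsSpan(list))
{
    sum += x;
}
// sum == 6
```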

well, it's a small allocation that doesn't happen every iteration of that foreach so I personally don't think it is worth any special API while it could be eventually fixed in jit instead.

It's worth optimizing it if we can.

@davidfowl @geeknoid with DynamicPGO we devirtualize all calls here but are not able to avoid allocation (boxing) of an Enumerator (happens just once before the loop):

@EgorBo can you explain why that is? I don't quite understand the connection.

@EgorBo
Member

EgorBo commented Dec 3, 2021

@EgorBo can you explain why that is? I don't quite understand the connection.

It can be simplified to this (even without PGO):

static void Caller() => Callee(new FooStruct());

static void Callee(IFoo foo) => foo.SayHello();


public interface IFoo
{
    void SayHello();
}

public struct FooStruct : IFoo
{
    public void SayHello() => Console.WriteLine("Hello");
}

where codegen for Caller is:

; Method Program:Caller()
G_M23712_IG01:              ;; offset=0000H
       4883EC28             sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25

G_M23712_IG02:              ;; offset=0004H
       48B918035904FE7F0000 mov      rcx, 0x7FFE04590318      ; FooStruct
       E82DBD9D5F           call     CORINFO_HELP_NEWSFAST
       4883C008             add      rax, 8
       C60000               mov      byte  ptr [rax], 0
       3900                 cmp      dword ptr [rax], eax
       48B980550098FE010000 mov      rcx, 0x1FE98005580      ; "Hello"
       488B09               mov      rcx, gword ptr [rcx]
       E8D2FCFFFF           call     System.Console:WriteLine(System.String)
       90                   nop      
						;; bbWeight=1    PerfScore 9.00

G_M23712_IG03:              ;; offset=002FH
       4883C428             add      rsp, 40
       C3                   ret      
						;; bbWeight=1    PerfScore 1.25
; Total bytes of code: 52

so we were able to inline Callee, but the boxing is still there (and it's useless). It's something we (Andy) will eventually fix, I hope, e.g. via escape analysis (the multi-use boxing problem).

It's worth optimizing it if we can.

My point was that if we can optimize it in the JIT, we should do it in the JIT, because this API looks like a leaky abstraction / temporary workaround to me.

DoSomething can be called with different implementations of IEnumerable.

As Andy noted, it's something we can handle in the JIT too - via context-sensitive PGO and/or multiple guards (#59604)

@geeknoid
Member Author

geeknoid commented Dec 3, 2021

@EgorBo You mention "leaky abstraction", but I was thinking more in terms of "compiler support types". You can imagine the C# compiler transforming the foreach loop transparently, and users don't know what happened. I know that some heroics are possible in the JIT, but it cannot possibly deal with all cases, so it's useful to consider the full spectrum of possibilities.

@davidfowl The CollectionsMarshal.AsSpan thing is nice. But wouldn't it be even nicer if the compiler just did this transformation for you?

Note that even if you completely ignore the implementation of my proposal, there is a clear big win possible with very little change in the C# compiler: if you have a foreach loop over an IList or IReadOnlyList, it is substantially more efficient to implement the foreach by indexing rather than by using GetEnumerator. You go from 2 interface calls per item down to 1. You lose the "feature" that enumeration throws if a mutation occurs during iteration, so enabling this would presumably require a new compiler switch since the semantics are slightly different.
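A sketch of the transformation described above (the method name is just illustrative). For an IReadOnlyList<T>, indexing replaces GetEnumerator, cutting interface calls per item from two (MoveNext plus Current) to one (the indexer). Count is read once up front, so concurrent mutation goes undetected, which is exactly the semantic difference noted:

```csharp
using System.Collections.Generic;

static int SumByIndexing(IReadOnlyList<int> list)
{
    int sum = 0;
    int count = list.Count;        // one interface call, hoisted out of the loop
    for (int i = 0; i < count; i++)
    {
        sum += list[i];            // one interface call per item
    }
    return sum;
}
```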

@EgorBo
Member

EgorBo commented Dec 3, 2021

@geeknoid btw, your API noticeably regresses performance for cases where the underlying enumeration is not an array/list/readonlylist - it will do several casts, some of which aren't cheap (such as the cast to the covariant IReadOnlyCollection). I'd prefer PGO to do this work since it knows for sure what kinds of objects are hidden under the abstractions.

@AndyAyersMS
Member

@EgorBo have you tried running all these tests with PGO? Curious to see what the difference between the default enumeration and the accelerated enumeration is with PGO in play.

There is still a bunch of work ahead to fully de-abstract the simple case where PGO sees that one collection type predominates. I should sketch out what we eventually need to do. It is not necessarily heroics as each of the sub-pieces is useful on its own.

@geeknoid
Member Author

geeknoid commented Dec 4, 2021

Thanks to some help from @EgorBo , I updated my benchmark. The JIT was doing some things that gave an unfair advantage to the non-accelerated code. I cancelled those out now and it ends up making the accelerated enumeration technique look even more compelling overall.

If the JIT can achieve this, and perhaps more, it'll be great. In the meantime, we're definitely going to leverage my hack on our code base and get us those beautiful free cycles back :-)

@tannergooding tannergooding added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Jul 15, 2022
@tannergooding tannergooding added this to the Future milestone Jul 15, 2022
@tannergooding
Member

Going to close this, as providing a new form of enumeration within the BCL is an extremely complicated task that would require revving the whole ecosystem to support it.

Getting codegen improvements in the form of DPGO and guarded devirtualization is the most viable path forward here.

@ghost ghost locked as resolved and limited conversation to collaborators Sep 26, 2022
@tannergooding tannergooding removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Jun 24, 2024