Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FileStream rewrite: Caching the ValueTaskSource in AsyncWindowsFileStreamStrategy #51363

Merged
merged 15 commits into from
Apr 16, 2021

Conversation

carlossanlop
Copy link
Member

@carlossanlop carlossanlop commented Apr 16, 2021

Fixes #50972
Fixes #25074

When AsyncWindowsFileStreamStrategy is wrapped by a BufferedFileStreamStrategy, we need to make sure the ValueTaskSource instance is cached to reduce the number of allocations when calling ReadAsync or WriteAsync multiple times in a row.

This PR is a continuation of #50802, in which we switched from using TaskCompletionSource to IValueTaskSource.

Changes:

  • Moved the PreAllocatedOverlapped instance inside ValueTaskSource, so the latter becomes its owner. This was done because we are only supposed to have an instance of a PreAllocatedOverlapped if the ValueTaskSource was created from OnBufferedAllocated, which is a method called only by BufferedFileStreamStrategy right before writing or reading.
  • Removed MemoryValueTaskSource and moved the cases handled by it to ValueTaskSource.
  • Created a method that refreshes the value of the NativeOverlapped*. This is done every time we call ReadAsync/WriteAsync, to make sure we are pinning the memory passed by the user.

@ghost
Copy link

ghost commented Apr 16, 2021

Tagging subscribers to this area: @carlossanlop
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #50972

DAFT: Pending benchmarks.

When AsyncWindowsFileStreamStrategy is wrapped by a BufferedFileStreamStrategy, we need to make sure the ValueTaskSource instance is cached to reduce the number of allocations when calling ReadAsync or WriteAsync multiple times in a row.

This PR is a continuation of #50802, in which we switched from using TaskCompletionSource to IValueTaskSource.

Changes:

  • Moved the PreAllocatedOverlapped instance inside ValueTaskSource, so the latter becomes its owner. This was done because we are only supposed to have an instance of a PreAllocatedOverlapped if the ValueTaskSource was created from OnBufferedAllocated, which is a method called only by BufferedFileStreamStrategy right before writing or reading.
  • Removed MemoryValueTaskSource and moved the cases handled by it to ValueTaskSource.
  • Created a method that refreshes the value of the NativeOverlapped*. This is done every time we call ReadAsync/WriteAsync, to make sure we are pinning the memory passed by the user.
Author: carlossanlop
Assignees: carlossanlop
Labels:

area-System.IO

Milestone: 6.0.0

@adamsitnik
Copy link
Member

Initial benchmark results (base is #50802):

Method Toolchain fileSize userBufferSize options Mean Ratio Gen 0 Allocated
ReadAsync \cache\corerun.exe 1024 1024 Asynchronous 83.99 us 1.02 0.3360 5 KB
ReadAsync \base\corerun.exe 1024 1024 Asynchronous 81.98 us 1.00 0.6545 5 KB
WriteAsync \cache\corerun.exe 1024 1024 Asynchronous 487.70 us 1.02 - 5 KB
WriteAsync \true\corerun.exe 1024 1024 Asynchronous 476.89 us 1.00 - 5 KB
ReadAsync \cache\corerun.exe 1048576 512 Asynchronous 2,649.16 us 1.09 10.4167 127 KB
ReadAsync \true\corerun.exe 1048576 512 Asynchronous 2,421.94 us 1.00 8.9286 83 KB
WriteAsync \cache\corerun.exe 1048576 512 Asynchronous 4,527.33 us 1.11 15.6250 119 KB
WriteAsync \true\corerun.exe 1048576 512 Asynchronous 4,100.58 us 1.00 - 75 KB
ReadAsync \cache\corerun.exe 1048576 4096 Asynchronous 2,321.45 us 1.01 8.9286 71 KB
ReadAsync \true\corerun.exe 1048576 4096 Asynchronous 2,308.26 us 1.00 8.9286 69 KB
WriteAsync \cache\corerun.exe 1048576 4096 Asynchronous 4,191.59 us 1.03 - 97 KB
WriteAsync \true\corerun.exe 1048576 4096 Asynchronous 4,077.05 us 1.00 - 74 KB
ReadAsync_NoBuffering \cache\corerun.exe 1048576 16384 Asynchronous 730.67 us 1.00 - 18 KB
ReadAsync_NoBuffering \true\corerun.exe 1048576 16384 Asynchronous 732.58 us 1.00 - 17 KB
WriteAsync_NoBuffering \cache\corerun.exe 1048576 16384 Asynchronous 2,747.11 us 0.99 - 18 KB
WriteAsync_NoBuffering \true\corerun.exe 1048576 16384 Asynchronous 2,787.66 us 1.00 - 17 KB
ReadAsync \cache\corerun.exe 104857600 4096 Asynchronous 250,998.78 us 1.01 - 7,001 KB
ReadAsync \true\corerun.exe 104857600 4096 Asynchronous 249,280.08 us 1.00 - 6,801 KB
WriteAsync \cache\corerun.exe 104857600 4096 Asynchronous 382,734.73 us 1.04 1000.0000 9,205 KB
WriteAsync \true\corerun.exe 104857600 4096 Asynchronous 368,403.91 us 1.00 - 6,905 KB
ReadAsync_NoBuffering \cache\corerun.exe 104857600 16384 Asynchronous 80,072.69 us 0.97 - 1,750 KB
ReadAsync_NoBuffering \true\corerun.exe 104857600 16384 Asynchronous 82,376.92 us 1.00 - 1,700 KB
WriteAsync_NoBuffering \cache\corerun.exe 104857600 16384 Asynchronous 120,385.98 us 0.98 - 1,751 KB
WriteAsync_NoBuffering \true\corerun.exe 104857600 16384 Asynchronous 123,215.33 us 1.00 - 1,701 KB

it looks like we are allocating less when buffering is disabled, but more than before when it's enabled

Copy link
Member

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me, but we need to track and solve the allocation regression which is visible in the scenarios where buffering is enabled

…to the moment after _source.SetException|SetResult is called
@adamsitnik
Copy link
Member

@stephentoub your suggestions were great!

Method Job fileSize userBufferSize options Mean Ratio Gen 0 Allocated
ReadAsync after 1024 1024 Asynchronous 87.48 us 1.04 0.3613 5,240 B
ReadAsync before 1024 1024 Asynchronous 84.38 us 1.00 0.3472 5,216 B
WriteAsync after 1024 1024 Asynchronous 494.43 us 0.99 - 4,960 B
WriteAsync before 1024 1024 Asynchronous 501.52 us 1.00 - 4,936 B
CopyToFileAsync after 1024 ? None 513.31 us 1.01 - 5,593 B
CopyToFileAsync before 1024 ? None 510.34 us 1.00 - 5,593 B
CopyToFileAsync after 1024 ? Asynchronous 539.68 us 1.00 - 6,336 B
CopyToFileAsync before 1024 ? Asynchronous 541.15 us 1.00 - 6,320 B
ReadAsync after 1048576 512 Asynchronous 2,436.66 us 0.98 - 58,279 B
ReadAsync before 1048576 512 Asynchronous 2,480.83 us 1.00 8.9286 84,775 B
WriteAsync after 1048576 512 Asynchronous 4,142.67 us 0.92 - 50,029 B
WriteAsync before 1048576 512 Asynchronous 4,511.11 us 1.00 - 76,527 B
ReadAsync after 1048576 4096 Asynchronous 2,153.31 us 0.90 - 916 B
ReadAsync before 1048576 4096 Asynchronous 2,397.80 us 1.00 8.9286 70,241 B
WriteAsync after 1048576 4096 Asynchronous 3,976.06 us 0.97 - 27,596 B
WriteAsync before 1048576 4096 Asynchronous 4,100.25 us 1.00 - 75,562 B
ReadAsync_NoBuffering after 1048576 16384 Asynchronous 689.04 us 0.94 - 760 B
ReadAsync_NoBuffering before 1048576 16384 Asynchronous 733.21 us 1.00 - 17,864 B
WriteAsync_NoBuffering after 1048576 16384 Asynchronous 2,781.44 us 0.99 - 762 B
WriteAsync_NoBuffering before 1048576 16384 Asynchronous 2,808.65 us 1.00 - 17,866 B
CopyToFileAsync after 1048576 ? None 2,360.16 us 0.98 - 3,244 B
CopyToFileAsync before 1048576 ? None 2,405.43 us 1.00 - 3,244 B
CopyToFileAsync after 1048576 ? Asynchronous 2,593.22 us 1.01 - 2,044 B
CopyToFileAsync before 1048576 ? Asynchronous 2,568.19 us 1.00 - 3,924 B
ReadAsync after 104857600 4096 Asynchronous 247,853.42 us 0.95 - 1,368 B
ReadAsync before 104857600 4096 Asynchronous 259,775.67 us 1.00 - 6,963,952 B
WriteAsync after 104857600 4096 Asynchronous 358,502.58 us 0.97 - 2,258,912 B
WriteAsync before 104857600 4096 Asynchronous 371,005.92 us 1.00 - 7,070,648 B
ReadAsync_NoBuffering after 104857600 16384 Asynchronous 76,073.40 us 0.94 - 796 B
ReadAsync_NoBuffering before 104857600 16384 Asynchronous 81,316.51 us 1.00 - 1,741,292 B
WriteAsync_NoBuffering after 104857600 16384 Asynchronous 114,458.43 us 0.88 - 832 B
WriteAsync_NoBuffering before 104857600 16384 Asynchronous 130,608.83 us 1.00 - 1,741,328 B
CopyToFileAsync after 104857600 ? None 74,758.01 us 1.01 - 180,828 B
CopyToFileAsync before 104857600 ? None 73,939.22 us 1.00 - 180,828 B
CopyToFileAsync after 104857600 ? Asynchronous 86,110.81 us 0.98 - 2,220 B
CopyToFileAsync before 104857600 ? Asynchronous 87,546.69 us 1.00 - 219,524 B

@adamsitnik adamsitnik marked this pull request as ready for review April 16, 2021 16:30
@adamsitnik
Copy link
Member

@stephentoub I believe I have addressed all your feedback, PTAL one more time. I hope that we can merge it today and include it in Preview 4

@adamsitnik
Copy link
Member

@stephentoub we have addressed the feedback, please take a look. I am going to post the benchmark results in 20-30 minutes

@stephentoub
Copy link
Member

This now also fixes #25074

@adamsitnik
Copy link
Member

The results (see the Allocated column)

Method Job fileSize userBufferSize options Mean Ratio Allocated
ReadAsync after 1024 1024 Asynchronous 84.39 us 0.98 5,240 B
ReadAsync before 1024 1024 Asynchronous 85.86 us 1.00 5,216 B
WriteAsync after 1024 1024 Asynchronous 483.68 us 1.01 4,960 B
WriteAsync before 1024 1024 Asynchronous 478.92 us 1.00 4,936 B
CopyToFileAsync after 1024 ? None 492.53 us 1.01 5,593 B
CopyToFileAsync before 1024 ? None 489.18 us 1.00 5,593 B
CopyToFileAsync after 1024 ? Asynchronous 529.35 us 1.02 6,336 B
CopyToFileAsync before 1024 ? Asynchronous 521.85 us 1.00 6,311 B
ReadAsync after 1048576 512 Asynchronous 2,371.68 us 1.00 58,279 B
ReadAsync before 1048576 512 Asynchronous 2,372.78 us 1.00 84,775 B
WriteAsync after 1048576 512 Asynchronous 4,081.33 us 0.97 50,028 B
WriteAsync before 1048576 512 Asynchronous 4,214.84 us 1.00 76,517 B
ReadAsync after 1048576 4096 Asynchronous 2,138.36 us 0.92 913 B
ReadAsync before 1048576 4096 Asynchronous 2,332.66 us 1.00 70,241 B
WriteAsync after 1048576 4096 Asynchronous 3,951.54 us 0.95 27,562 B
WriteAsync before 1048576 4096 Asynchronous 4,153.60 us 1.00 75,562 B
ReadAsync_NoBuffering after 1048576 16384 Asynchronous 674.58 us 0.91 760 B
ReadAsync_NoBuffering before 1048576 16384 Asynchronous 740.07 us 1.00 17,864 B
WriteAsync_NoBuffering after 1048576 16384 Asynchronous 2,711.34 us 0.99 762 B
WriteAsync_NoBuffering before 1048576 16384 Asynchronous 2,736.59 us 1.00 17,866 B
CopyToFileAsync after 1048576 ? None 1,961.68 us 0.95 3,243 B
CopyToFileAsync before 1048576 ? None 2,080.32 us 1.00 3,244 B
CopyToFileAsync after 1048576 ? Asynchronous 2,283.20 us 1.08 2,044 B
CopyToFileAsync before 1048576 ? Asynchronous 2,161.78 us 1.00 3,924 B
ReadAsync after 104857600 4096 Asynchronous 228,777.04 us 0.93 1,056 B
ReadAsync before 104857600 4096 Asynchronous 245,615.52 us 1.00 6,963,952 B
WriteAsync after 104857600 4096 Asynchronous 353,822.32 us 0.96 2,257,976 B
WriteAsync before 104857600 4096 Asynchronous 370,913.51 us 1.00 7,070,648 B
ReadAsync_NoBuffering after 104857600 16384 Asynchronous 74,983.01 us 0.95 796 B
ReadAsync_NoBuffering before 104857600 16384 Asynchronous 79,064.04 us 1.00 1,741,292 B
WriteAsync_NoBuffering after 104857600 16384 Asynchronous 116,660.24 us 0.95 832 B
WriteAsync_NoBuffering before 104857600 16384 Asynchronous 123,323.67 us 1.00 1,741,328 B
CopyToFileAsync after 104857600 ? None 73,465.73 us 0.99 180,828 B
CopyToFileAsync before 104857600 ? None 74,040.22 us 1.00 180,828 B
CopyToFileAsync after 104857600 ? Asynchronous 85,207.68 us 0.98 2,220 B
CopyToFileAsync before 104857600 ? Asynchronous 87,037.81 us 1.00 219,524 B

@jeffhandley
Copy link
Member

Wowza some of those allocation improvements are incredible!

@Anipik Anipik merged commit 7878130 into dotnet:main Apr 16, 2021
@carlossanlop carlossanlop deleted the Caching branch April 16, 2021 23:07
@danmoseley
Copy link
Member

Nice.

@@ -1067,7 +1069,8 @@ private void EnsureBufferAllocated()

void AllocateBuffer() // logic kept in a separate method to get EnsureBufferAllocated() inlined
{
_strategy.OnBufferAllocated(_buffer = new byte[_bufferSize]);
_buffer = GC.AllocateUninitializedArray<byte>(_bufferSize,
pinned: true); // this allows us to avoid pinning when the buffer is used for the syscalls
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That correct, but pinned: true also allocates the array in Gen2 as side-effect so this may actually hurt real-world scenarios at the end..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can allocate it and use a GCHandle if that ends up being better. Previously it was pinned as part of a PreallocatedOverlapped.

(It's still not at all obvious when this newfangled POH should be used. )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also investigate not pinning at all, in which case the code this interacts with will just create a gchandle for each operation.

And/or look at using a pool array, but we'd want to ensure enough synchronization was in place to minimize erroneous usage causing us to return an array still in use. We do that in a few other streams.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a simple test:

using System;
using System.IO;

for (int i = 0; i < 100_000; i++)
{
    using (var f = new FileStream("test", FileMode.Create))
    {
        f.WriteByte(42);
    }
}
Console.WriteLine($"Allocated: {GC.GetTotalAllocatedBytes()} Gen2 GCs: {GC.CollectionCount(2)}");
  • .NET 5: Allocated: 442474096 Gen2 GCs: 0
  • This PR: Allocated: 448051624 Gen2 GCs: 103

It will be interesting to see whether these excessive Gen2 GCs hit performance gates of services trying .NET 6 previews.

It's still not at all obvious when this newfangled POH should be used.

Agree. It is very hard to use.

we'd want to ensure enough synchronization was in place to minimize erroneous usage causing us to return an array still in use

If you can cover all these cases, it may be better to use unmanaged buffer. It is pinned too, and it does not cause excessive Gen2 GCs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So allocating on the POH contributes to the gen2 budget. This is why we disable the buffer using size 1, that still works right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing in this file is used at all if buffer size is 1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking forward to taking another stab at optimizing static files in .NET 6

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can cover all these cases, it may be better to use unmanaged buffer. It is pinned too, and it does not cause excessive Gen2 GCs.

I'm going to start with a GCHandle and a normally allocated array. I believe in that case I can mostly restrict synchronization to the async code paths (plus disposal). If we use a native buffer, we'll need to protect the sync code paths as well. We can start with this and then see if it makes sense to use a pooled or native buffer as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, actually, I'm going to start by not pinning here at all (it'll then pin/unpin in the rest of the implementation per operation). If there's no measurable impact, we can stick with that for now.

@adamsitnik adamsitnik mentioned this pull request Apr 17, 2021
5 tasks
@ghost ghost locked as resolved and limited conversation to collaborators May 17, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
9 participants