-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Slice can be simplified does not produce the same code #47629
Comments
@stephentoub This was intentionally opened in dotnet/runtime. The assumption here is that while the IL is different, the JIT should be able to optimize both versions equally. See dotnet/csharplang#3879 and #47626 for previous discussions of this. |
@svick The C# compiler and the types it's generating use of are resulting in significantly more complex code for the JIT to then need to unwind. Such issues are covered by existing issues like dotnet/runtime#11848 and dotnet/runtime#11870. If this issue is about the JIT, it should just be closed. That said, the C# compiler is doing a disservice here by using Slice(start, count) rather than just Slice(start). The latter has less code in it, less checks to be performed, and less code required to invoke it. I don't know that the JIT would ever be able to make them identical; in theory it could, in practice that's a whole lot of analysis. This was all raised when the indexing feature was introduced, and it was decided by the language team that such level of optimization wasn't important. If the analyzer from dotnet/roslyn is encouraging this transformation and folks expect the resulting code to be identical, then the dotnet/roslyn compiler should be updated to abide. There's no guarantee the JIT will produce the same asm, otherwise. |
Note that suggestions do not, and have never, assured that you will get the exact same performance (or necessarily even the exact same behavior). The suggestion here is merely that, making you aware that you could write the above in a different fashion if you would prefer to do so. This is why it's only a suggestion, and not a warning (or an error). In this case, we are suggesting this because the type abides by the Slice-pattern that the language/compiler recognizes, and for most people the code may be clearer using the new dedicated syntax for this. If you don't want to use the new language form due to the potential for increased performance, you can suppress the issue here, or disable that feature entirely :) -- Now, as to the compiler. I believe @agocke worked on this. Andy, what are your thoughts on this:
Seems like we could likely take a reasonably safe change to allow that given that the risk of breakage would be quite low. Given the benefits @stephentoub mentions, it may be worth it. |
I don't think it's necessarily a bad idea, just wasn't strictly necessary to do as part of the feature. Certainly the produced IL could be improved, regardless of which method is being called, That it wasn't was purely a question of implementation budget, not technical possibility.
|
Also, there are comments around about the JIT being unable to "see through" the Range type. That's true, but that was a comment on the old design where the syntax required a |
To be clear, we're talking about a very micro-optimization that will have little bearing on the majority of code developers write; for most code bases, Here's the current asm difference... ; Program.Slice(System.Span`1<Byte>)
sub rsp,28
mov rax,[r8]
mov ecx,[r8+8]
cmp ecx,32
jb short M01_L00
add ecx,0FFFFFFCE
add rax,32
mov [rdx],rax
mov [rdx+8],ecx
mov rax,rdx
add rsp,28
ret
M01_L00:
call System.ThrowHelper.ThrowArgumentOutOfRangeException()
int 3
; Total bytes of code 43 and for ; Program.Range(System.Span`1<Byte>)
sub rsp,28
mov rax,[r8]
mov ecx,[r8+8]
lea r8d,[rcx+0FFCE]
mov r9d,r8d
add r9,32
mov ecx,ecx
cmp r9,rcx
ja short M01_L00
add rax,32
mov [rdx],rax
mov [rdx+8],r8d
mov rax,rdx
add rsp,28
ret
M01_L00:
call System.ThrowHelper.ThrowArgumentOutOfRangeException()
int 3
; Total bytes of code 54 |
In the end for me it's OK, because I know now that it's different to use one or the other. If would be nice if we talk about numbers. I can try to create a benchmark with the situation I currently have. Today we benchmarked our vectorized math operations when we switched from using regular arrays to Span. |
OK here's the benchmark. I have provided it for both, net5.0 and netcoreapp3.1. At a first look it doesn't seem to be much (~0.5ms) but we are doing this at least 1.000.000 times which would be 500.000ms, we is not negligible. As I said I can live with I'm also happy that the array-based method seems to be worse than the
Benchmark code attached in a situation we are currently using for image processing. using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Numerics;
namespace ConsoleApp8
{
class Program
{
static void Main(string[] args)
{
BenchmarkRunner.Run<VectorizedBenchmark>();
}
}
public class VectorizedBenchmark
{
int N = 2048 * 2048;
float[] DataA;
float[] DataB;
float[] Result;
[GlobalSetup]
public void GlobalSetup()
{
DataA = new float[N];
DataB = new float[N];
Result = new float[N];
for (int i = 0; i < N; i++)
{
DataA[i] = i;
DataB[i] = i;
}
}
[Benchmark]
public float[] AddSpanWithSlice()
{
int StepSize = Vector<float>.Count;
int L = N - (N % StepSize);
int i = 0;
ReadOnlySpan<float> A = DataA;
ReadOnlySpan<float> B = DataA;
Span<float> C = Result;
for (; i < L; i += StepSize)
{
(new Vector<float>(A.Slice(i)) + new Vector<float>(B.Slice(i))).CopyTo(C.Slice(i));
}
for (; i < N; i++)
{
Result[i] = DataA[i] + DataB[i];
}
return Result;
}
[Benchmark]
public float[] AddSpanWithRange()
{
int StepSize = Vector<float>.Count;
int L = N - (N % StepSize);
int i = 0;
ReadOnlySpan<float> A = DataA;
ReadOnlySpan<float> B = DataA;
Span<float> C = Result;
for (; i < L; i += StepSize)
{
(new Vector<float>(A[i..]) + new Vector<float>(B[i..])).CopyTo(C[i..]);
}
for (; i < N; i++)
{
Result[i] = DataA[i] + DataB[i];
}
return Result;
}
[Benchmark]
public float[] AddArrayWithSlice()
{
int StepSize = Vector<float>.Count;
int L = N - (N % StepSize);
int i = 0;
for (; i < L; i += StepSize)
{
(new Vector<float>(DataA, i) + new Vector<float>(DataB, i)).CopyTo(Result, i);
}
for (; i < N; i++)
{
Result[i] = DataA[i] + DataB[i];
}
return Result;
}
}
} |
I just wanted to point another benchmark on another system. I just wanted to point that there was a long way from coming from the "old" without-
Just for reference, here's is the code: using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
namespace ConsoleApp8
{
class Program
{
static void Main(string[] args)
{
BenchmarkRunner.Run<VectorizedBenchmark>();
}
}
public class VectorizedBenchmark
{
int N = 2048 * 2048;
float[] DataA;
float[] DataB;
float[] Result;
[GlobalSetup]
public void GlobalSetup()
{
DataA = new float[N];
DataB = new float[N];
Result = new float[N];
for (int i = 0; i < N; i++)
{
DataA[i] = i;
DataB[i] = i;
}
}
[Benchmark]
public float[] AddVectorizedSpanWithSlice()
{
int StepSize = Vector<float>.Count;
int L = N - (N % StepSize);
int i = 0;
ReadOnlySpan<float> A = DataA;
ReadOnlySpan<float> B = DataA;
Span<float> C = Result;
for (; i < L; i += StepSize)
{
(new Vector<float>(A.Slice(i)) + new Vector<float>(B.Slice(i))).CopyTo(C.Slice(i));
}
for (; i < N; i++)
{
Result[i] = DataA[i] + DataB[i];
}
return Result;
}
[Benchmark]
public float[] AddVectorizedSpanWithRange()
{
int StepSize = Vector<float>.Count;
int L = N - (N % StepSize);
int i = 0;
ReadOnlySpan<float> A = DataA;
ReadOnlySpan<float> B = DataA;
Span<float> C = Result;
for (; i < L; i += StepSize)
{
(new Vector<float>(A[i..]) + new Vector<float>(B[i..])).CopyTo(C[i..]);
}
for (; i < N; i++)
{
Result[i] = DataA[i] + DataB[i];
}
return Result;
}
[Benchmark]
public float[] AddVectorizedArray()
{
int StepSize = Vector<float>.Count;
int L = N - (N % StepSize);
int i = 0;
for (; i < L; i += StepSize)
{
(new Vector<float>(DataA, i) + new Vector<float>(DataB, i)).CopyTo(Result, i);
}
for (; i < N; i++)
{
Result[i] = DataA[i] + DataB[i];
}
return Result;
}
[Benchmark]
public float[] AddArrayClassic()
{
for (int i = 0; i < N; i++)
{
Result[i] = DataA[i] + DataB[i];
}
return Result;
}
[Benchmark]
public float[] AddArrayUnsafe()
{
unsafe
{
fixed (float* dataA = DataA)
fixed (float* dataB = DataB)
fixed (float* result = Result)
{
float* dataAptr = dataA, dataBptr = dataB, resultptr = result;
for (int i = 0; i < N; i++, dataAptr++, dataBptr++, resultptr++)
{
*resultptr = *dataAptr + *dataBptr;
}
}
}
return Result;
}
[Benchmark]
public float[] AddVectorizedArrayUnsafe()
{
int StepSize = Vector<float>.Count;
int L = N - (N % StepSize);
int i = 0;
ref float A = ref DataA[0];
ref float B = ref DataB[0];
ref float C = ref Result[0];
for (; i < L; i += StepSize)
{
var vectorA = Unsafe.ReadUnaligned<Vector<float>>(ref Unsafe.As<float, byte>(ref Unsafe.Add(ref A, i)));
var vectorB = Unsafe.ReadUnaligned<Vector<float>>(ref Unsafe.As<float, byte>(ref Unsafe.Add(ref A, i)));
var vectorC = vectorA + vectorB;
Unsafe.WriteUnaligned<Vector<float>>(ref Unsafe.As<float, byte>(ref Unsafe.Add(ref C, i)), vectorC);
}
for (; i < N; i++)
{
Result[i] = DataA[i] + DataB[i];
}
return Result;
}
}
} |
The slices of I wonder whether we should fix this issue by modifying only LocalRewriter or both it and Binder. |
Currently the IDE suggest with Span not to use slice but instead use the Range indexer (IDE0057: Slice can be simplified to this). So the original code:
is then translated to
the resulting generated code is different (which is OK)
and the result taken from sharplab:
but the JIT Asm seems to be different. I didn't make any performance comparisons, but for my environment it is crucial that the suggested rewrite does not alter the performance of the code.
The text was updated successfully, but these errors were encountered: