Expose RoundUpToPowerOf2 from BitOperations and optimize it #43135
Comments
Tagging subscribers to this area: @tannergooding, @pgovind, @jeffhandley
```csharp
namespace System.Numerics
{
    public static class BitOperations
    {
        public static uint RoundUpToPowerOf2(uint i);
        public static ulong RoundUpToPowerOf2(ulong i);
    }
}
```
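For reference, a C sketch of the classic bit-smearing "software fallback" that helpers like the internal one in ConcurrentQueue are typically built on; the function name is mine, not part of the proposed API:

```c
#include <stdint.h>

/* Hypothetical fallback mirroring the proposed RoundUpToPowerOf2(uint):
   smear the highest set bit of (v - 1) into all lower bits, then add 1.
   Edge cases under this formulation: 0 maps to 0 (0 - 1 smears to
   0xFFFFFFFF, and the +1 wraps), and inputs above 2^31 also wrap to 0. */
static uint32_t round_up_pow2(uint32_t v)
{
    v--;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    return v + 1;
}
```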
What's the behavior of …?
Is this already done? See runtime/src/libraries/System.Private.CoreLib/src/System/Numerics/BitOperations.cs, lines 83 to 86 in 21734a4.
As of last week, yes, but there's now a new TODO: see runtime/src/libraries/System.Private.CoreLib/src/System/Numerics/BitOperations.cs, lines 76 to 80 in 473f8f6.
For reference, the main two concerns are the behavior for an input of 0 and for inputs above the largest representable power of two, where the "logical" result of each implementation differs. It would be great if we can figure out trivial additions to get the expected result in both cases without significant overhead.
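To illustrate the two edge cases, here is a hedged C sketch of the naive formulation (helper names are mine; the `& 31` mask mimics C#'s shift-count masking, since an out-of-range shift is undefined in C):

```c
#include <stdint.h>

/* Leading-zero count that tolerates 0, since __builtin_clz(0) is
   undefined; hardware LZCNT returns 32 in that case. */
static int lzcnt32(uint32_t x) { return x ? __builtin_clz(x) : 32; }

/* Naive formulation: correct for 1 <= value <= 2^31, but the shift
   count reaches 32 for value == 0 and for any value above 2^31, so
   0 maps to 1 and everything above 2^31 also collapses to 1. */
static uint32_t round_up_pow2_naive(uint32_t value)
{
    int shift = 32 - lzcnt32(value - 1);
    return 1u << (shift & 31); /* & 31 mimics C# shift semantics */
}
```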
Also noting that I am slightly less concerned about overhead on the fallback path.
Is that the result that would be most useful for an algorithm (i.e., would it actually matter), and what do other platforms do? Just curious.
The most useful to the algorithm. It allows …. Other platforms typically use this as an internal helper, most often using the "software fallback" case, where they can just treat …. I would also be fine with saying ….
In all my uses of similar helpers, I've never thought about the boundary cases (what happens at 0 and at Max); that was always outside the possible scenarios. Typical use is rounding up a buffer size so that …. I think if we could just use ….
I agree, but we still need to document the expected behavior here and ensure it is tested, in case someone does pass these values in. We've done this for all our publicly exposed cross-platform helpers, and I believe we need to do the same here as well.
0 -> 1 and Max -> 0 seems right. 0 -> 0 would also be OK, but Max -> 1 seems odd.
For lzcnt, would this work:

```csharp
int shift = 32 - LeadingZeroCount(value - 1);
return (1u ^ (uint)(shift >> 5)) << shift;
```

Not sure if it will be faster than a branch, though, since the branch will always go the same way.
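Translated to C as a sketch (names are mine; the explicit `& 31` reproduces C#'s shift-count masking, and the helper avoids `__builtin_clz(0)`, which is undefined):

```c
#include <stdint.h>

/* Leading-zero count that tolerates 0 (__builtin_clz(0) is undefined). */
static int lzcnt32(uint32_t x) { return x ? __builtin_clz(x) : 32; }

/* Branch-free variant: when shift reaches 32 (value == 0 or
   value > 2^31), shift >> 5 becomes 1 and the XOR turns off the bit
   being shifted, so those cases yield 0 rather than a bogus power. */
static uint32_t round_up_pow2_nobranch(uint32_t value)
{
    int shift = 32 - lzcnt32(value - 1);
    return (1u ^ (uint32_t)(shift >> 5)) << (shift & 31);
}
```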
Did a benchmark:

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.19043.1052 (21H1/May2021Update)

```csharp
public IEnumerable<object[]> Values()
{
    yield return new object[] { 0u };
    yield return new object[] { 0x7FFF_FFFFu };
    yield return new object[] { 0x8000_0000u };
    yield return new object[] { 0xFFFF_FFFFu };
}

[Benchmark(Baseline = true), ArgumentsSource(nameof(Values))]
public uint NoBranch(uint value)
{
    int shift = 32 - BitOperations.LeadingZeroCount(value - 1);
    return (1u ^ (uint)(shift >> 5)) << shift;
}

[Benchmark, ArgumentsSource(nameof(Values))]
public uint Branch1(uint value)
{
    if (value > unchecked((uint)int.MinValue))
        return 0;
    return 1u << (32 - BitOperations.LeadingZeroCount(value - 1));
}

[Benchmark, ArgumentsSource(nameof(Values))]
public uint Branch2(uint value)
{
    if (value <= unchecked((uint)int.MinValue))
        return 1u << (32 - BitOperations.LeadingZeroCount(value - 1));
    return 0;
}
```

The no-branch approach wins for the "regular" cases. It also wins in code size.
Ah, if we use a 64-bit shift operation, it can eliminate some corner cases:

```csharp
return (uint)(0x1_0000_0000ul >> BitOperations.LeadingZeroCount(value - 1));
```

OK, this definitely wins.
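The same trick sketched in C (helper names are mine): because the leading-zero count is at most 32 and a 64-bit operand can legally be shifted by 32, no masking or branching is needed.

```c
#include <stdint.h>

/* Leading-zero count that tolerates 0 (__builtin_clz(0) is undefined). */
static int lzcnt32(uint32_t x) { return x ? __builtin_clz(x) : 32; }

/* 64-bit shift variant: 2^32 >> lzcnt(value - 1). The shift count is
   always in [0, 32], which is valid for a 64-bit operand. 0 maps to 0,
   powers of two map to themselves, and values above 2^31 truncate to 0. */
static uint32_t round_up_pow2_shift64(uint32_t value)
{
    return (uint32_t)(0x100000000ull >> lzcnt32(value - 1));
}
```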
Background and Motivation
Rounding a number up to the closest power of two greater than or equal to it is a relatively common operation to perform, especially for developers implementing custom collection types that require a power of two in one of their internal data structures (for reference, CoreCLR uses this method in the ConcurrentQueue type, here). This proposal has two main points: exposing the existing helper from the BitOperations class, and optimizing it.

Proposed API
Usage Examples
I'm using this same method in a couple of places in the Microsoft.Toolkit.HighPerformance package:

- In the StringPool type, to initialize the internal buckets for the cached string-s (to get a fast % op)
- In the ArrayPoolBufferWriter&lt;T&gt; type, to round up the size requested to ArrayPool&lt;T&gt; and avoid repeated new[] allocations

Within CoreCLR, there's also that ConcurrentQueue usage example I mentioned above.

Details
The implementation would include hardware-accelerated paths like LeadingZeroCount does, checking for Lzcnt, ArmBase, and then X86Base, or alternatively it would use the software fallback currently (always) used within ConcurrentQueue.

Notes
If the API proposal is approved I'd be happy to help out and make a PR for this 😄
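The hardware-vs-fallback split described under Details could be sketched in C roughly as follows; the compiler-builtin guard is illustrative only (the real implementation would check Lzcnt/ArmBase/X86Base support in C#), and the function name is mine:

```c
#include <stdint.h>

/* Hypothetical dispatch: use a hardware leading-zero count when the
   compiler provides one, else fall back to the bit-smearing loop. */
static uint32_t round_up_pow2(uint32_t v)
{
#if defined(__GNUC__) || defined(__clang__)
    /* Guard v <= 1 to avoid __builtin_clz(0), which is undefined. */
    return v <= 1 ? v : (uint32_t)(0x100000000ull >> __builtin_clz(v - 1));
#else
    v--;
    v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16;
    return v + 1;
#endif
}
```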