[proposal] fast path for general integer division for x64 #2439

MaxGraey · 2020-11-22T20:51:17Z

Feature

Latency and RCP for division depends on register sizes for most of x64 architectures. So Clang do one interesting trick which may speedup 64-bit division if high parts of operands equal to zero. Pseudocode:

idiv(a: i64, b: i64) -> i64 {
  if ((a | b) >> 32) return a / b; // full 64-bit division
  return i32(a) / i32(b) as i64; // 32-bit division
}

godbolt: https://godbolt.org/z/Tqqzs1

Is it make sense apply same optimization for cranelift only for x64 architecture?

Benefit

it may speedup div / rem over 2x for arguments without high parts with small constant overhead according to this table:

But it is worth excluding Zen1,2,3 architecture due to it uses a more modern scheme for division which doesn't dependent on register's width. Also it doesn't need for ARM.

cfallin · 2020-11-22T22:58:40Z

Hi @MaxGraey, thanks for bringing this up!

From an instruction-selector perspective, this would certainly be possible, though in general I want to be careful about adding complexity without evidence that it will bring benefit in some way. (It's perhaps possible for one case, but if the isel starts growing conditionals for microarchitecture and for optimized cases everywhere, it soon becomes unmaintainable if we don't have a framework for reasoning about machine costs in a more principled way.) Have you seen workloads where division latency in particular became a bottleneck?

MaxGraey · 2020-11-22T23:22:15Z

Have you seen workloads where division latency in particular became a bottleneck?

Div / Rem usually most expensive operations. There are many possibilities to reduce cost of this operations, most often this is done before instruction scheduling. But there is usually very little opportunity to speed up the most general case. In this case, I think this is interesting approach which shouldn't be too complex in term of implementation.

cfallin · 2020-11-22T23:54:39Z

Indeed, I agree that division is an expensive instruction! What I'm wondering is whether workloads exist where (i) the presence of divide instructions is a bottleneck, i.e., the latency is not hidden by other operations and is a visible portion of total runtime, and (ii) the workload uses 64-bit divides but values always have zeroed top-32-bits.

That data would show the benefit; the cost, in turn is (i) an additional comparison (logical-or, right-shift, branch-if-zero, so three instructions) in every divide's main code path, (ii) additional code size due to the comparison and the fastpath, (iii) complexity in isel, which imposes maintenance burden and a small compile time increase.

So, basically, I have a better handle on what the cost would be and I'm curious if we have any data points on the potential benefit. I'm hesitant to include something like this, because of the above downsides, but evidence of real upside could convince me :-)

MaxGraey · 2020-11-23T01:54:07Z

Oh, so you ask about real scenarios which could potentially benefit from this?

I guess all range of modern cryptography which based on modular arithmetic with small primes. For example 64-bit integer modulo intensively used in Montgomery reduction for example in modPow (x1 ^ x2 mod m) part. I guess @jedisct1 could tell much more than me. Another potential user case is multi-precision float points and integers which also use modular arithmetic

cfallin · 2020-11-23T02:33:06Z

Fair enough -- I think that we could investigate further. I'm still worried about the impact of the three additional instructions for every divide; so I'd want us to collect data on various numeric benchmarks (compression and media codecs come to mind) to make sure this impact isn't an issue. The benchmarking situation is looking to improve soon (bytecodealliance/rfcs#4) so we should be able to study this fairly easily!

@MaxGraey would you be willing to prototype it?

MaxGraey · 2020-11-23T02:40:04Z

historically, this was implemented exclusively for Intel Atom back in 2013, but later it is already used even for generic x64. I think the guys from LLVM have learned pretty well the specifics of this BypassSlowDivision transformation.

The benchmarking situation is looking to improve soon

Nice!

@MaxGraey would you be willing to prototype it?

Thanks for the suggestions, but I think this is a bit tricky and low-level case for me)

cfallin · 2020-11-23T02:52:34Z

Thanks for the link to the LLVM transform!

Re: overhead, one option would be to add a flag to our CPU-specific backend flags to enable this; then the embedder could enable on platforms where this helps.

I'll see if I can get to this at some point (lots of things on my plate at the moment) but anyone feel free to try this in the meantime!

bjorn3 · 2020-11-23T06:43:59Z

As an example where division had disproportionate effects: gimli-rs/gimli#476 made line debuginfo generation twice as fast by avoiding a division in the common case.

MaxGraey changed the title ~~[proposal] fast path for general integer division for x84/x64~~ [proposal] fast path for general integer division for x64 Nov 22, 2020

bnjbvr added cranelift Issues related to the Cranelift code generator cranelift:area:x64 Issues related to x64 codegen cranelift:goal:optimize-speed Focus area: the speed of the code produced by Cranelift. labels Jan 27, 2021

MaxGraey mentioned this issue Oct 12, 2021

[Costs] More precise costs for int div & rem WebAssembly/binaryen#4229

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[proposal] fast path for general integer division for x64 #2439

[proposal] fast path for general integer division for x64 #2439

MaxGraey commented Nov 22, 2020 •

edited

Loading

cfallin commented Nov 22, 2020

MaxGraey commented Nov 22, 2020 •

edited

Loading

cfallin commented Nov 22, 2020

MaxGraey commented Nov 23, 2020 •

edited

Loading

cfallin commented Nov 23, 2020

MaxGraey commented Nov 23, 2020 •

edited

Loading

cfallin commented Nov 23, 2020

bjorn3 commented Nov 23, 2020 •

edited

Loading

[proposal] fast path for general integer division for x64 #2439

[proposal] fast path for general integer division for x64 #2439

Comments

MaxGraey commented Nov 22, 2020 • edited Loading

Feature

Benefit

cfallin commented Nov 22, 2020

MaxGraey commented Nov 22, 2020 • edited Loading

cfallin commented Nov 22, 2020

MaxGraey commented Nov 23, 2020 • edited Loading

cfallin commented Nov 23, 2020

MaxGraey commented Nov 23, 2020 • edited Loading

cfallin commented Nov 23, 2020

bjorn3 commented Nov 23, 2020 • edited Loading

MaxGraey commented Nov 22, 2020 •

edited

Loading

MaxGraey commented Nov 22, 2020 •

edited

Loading

MaxGraey commented Nov 23, 2020 •

edited

Loading

MaxGraey commented Nov 23, 2020 •

edited

Loading

bjorn3 commented Nov 23, 2020 •

edited

Loading