Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[proposal] fast path for general integer division for x64 #2439

Open
MaxGraey opened this issue Nov 22, 2020 · 8 comments
Open

[proposal] fast path for general integer division for x64 #2439

MaxGraey opened this issue Nov 22, 2020 · 8 comments
Labels
cranelift:area:x64 Issues related to x64 codegen cranelift:goal:optimize-speed Focus area: the speed of the code produced by Cranelift. cranelift Issues related to the Cranelift code generator

Comments

@MaxGraey
Copy link
Contributor

MaxGraey commented Nov 22, 2020

Feature

Latency and RCP for division depends on register sizes for most of x64 architectures. So Clang do one interesting trick which may speedup 64-bit division if high parts of operands equal to zero. Pseudocode:

idiv(a: i64, b: i64) -> i64 {
  if ((a | b) >> 32) return a / b; // full 64-bit division
  return i32(a) / i32(b) as i64; // 32-bit division
}

godbolt: https://godbolt.org/z/Tqqzs1

Is it make sense apply same optimization for cranelift only for x64 architecture?

Benefit

it may speedup div / rem over 2x for arguments without high parts with small constant overhead according to this table:
comparision

But it is worth excluding Zen1,2,3 architecture due to it uses a more modern scheme for division which doesn't dependent on register's width. Also it doesn't need for ARM.

@MaxGraey MaxGraey changed the title [proposal] fast path for general integer division for x84/x64 [proposal] fast path for general integer division for x64 Nov 22, 2020
@cfallin
Copy link
Member

cfallin commented Nov 22, 2020

Hi @MaxGraey, thanks for bringing this up!

From an instruction-selector perspective, this would certainly be possible, though in general I want to be careful about adding complexity without evidence that it will bring benefit in some way. (It's perhaps possible for one case, but if the isel starts growing conditionals for microarchitecture and for optimized cases everywhere, it soon becomes unmaintainable if we don't have a framework for reasoning about machine costs in a more principled way.) Have you seen workloads where division latency in particular became a bottleneck?

@MaxGraey
Copy link
Contributor Author

MaxGraey commented Nov 22, 2020

Have you seen workloads where division latency in particular became a bottleneck?

Div / Rem usually most expensive operations. There are many possibilities to reduce cost of this operations, most often this is done before instruction scheduling. But there is usually very little opportunity to speed up the most general case. In this case, I think this is interesting approach which shouldn't be too complex in term of implementation.

@cfallin
Copy link
Member

cfallin commented Nov 22, 2020

Indeed, I agree that division is an expensive instruction! What I'm wondering is whether workloads exist where (i) the presence of divide instructions is a bottleneck, i.e., the latency is not hidden by other operations and is a visible portion of total runtime, and (ii) the workload uses 64-bit divides but values always have zeroed top-32-bits.

That data would show the benefit; the cost, in turn is (i) an additional comparison (logical-or, right-shift, branch-if-zero, so three instructions) in every divide's main code path, (ii) additional code size due to the comparison and the fastpath, (iii) complexity in isel, which imposes maintenance burden and a small compile time increase.

So, basically, I have a better handle on what the cost would be and I'm curious if we have any data points on the potential benefit. I'm hesitant to include something like this, because of the above downsides, but evidence of real upside could convince me :-)

@MaxGraey
Copy link
Contributor Author

MaxGraey commented Nov 23, 2020

Oh, so you ask about real scenarios which could potentially benefit from this?

I guess all range of modern cryptography which based on modular arithmetic with small primes. For example 64-bit integer modulo intensively used in Montgomery reduction for example in modPow (x1 ^ x2 mod m) part. I guess @jedisct1 could tell much more than me. Another potential user case is multi-precision float points and integers which also use modular arithmetic

@cfallin
Copy link
Member

cfallin commented Nov 23, 2020

Fair enough -- I think that we could investigate further. I'm still worried about the impact of the three additional instructions for every divide; so I'd want us to collect data on various numeric benchmarks (compression and media codecs come to mind) to make sure this impact isn't an issue. The benchmarking situation is looking to improve soon (bytecodealliance/rfcs#4) so we should be able to study this fairly easily!

@MaxGraey would you be willing to prototype it?

@MaxGraey
Copy link
Contributor Author

MaxGraey commented Nov 23, 2020

historically, this was implemented exclusively for Intel Atom back in 2013, but later it is already used even for generic x64. I think the guys from LLVM have learned pretty well the specifics of this BypassSlowDivision transformation.

The benchmarking situation is looking to improve soon

Nice!

@MaxGraey would you be willing to prototype it?

Thanks for the suggestions, but I think this is a bit tricky and low-level case for me)

@cfallin
Copy link
Member

cfallin commented Nov 23, 2020

Thanks for the link to the LLVM transform!

Re: overhead, one option would be to add a flag to our CPU-specific backend flags to enable this; then the embedder could enable on platforms where this helps.

I'll see if I can get to this at some point (lots of things on my plate at the moment) but anyone feel free to try this in the meantime!

@bjorn3
Copy link
Contributor

bjorn3 commented Nov 23, 2020

As an example where division had disproportionate effects: gimli-rs/gimli#476 made line debuginfo generation twice as fast by avoiding a division in the common case.

@bnjbvr bnjbvr added cranelift Issues related to the Cranelift code generator cranelift:area:x64 Issues related to x64 codegen cranelift:goal:optimize-speed Focus area: the speed of the code produced by Cranelift. labels Jan 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:x64 Issues related to x64 codegen cranelift:goal:optimize-speed Focus area: the speed of the code produced by Cranelift. cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

No branches or pull requests

4 participants