fft gpu optimization #79

Closed
wants to merge 14 commits
19 changes: 18 additions & 1 deletion halo2_proofs/Cargo.toml
@@ -37,8 +37,24 @@ rand = "0.8"
rand_core = { version = "0.6", default-features = false }
blake2b_simd = "1"
pairing = { git = 'https://github.com/appliedzkp/pairing', package = "pairing_bn256", "tag" = "v0.1.1"}
pairing-bn256 = {package = "pairing_ce", version = "0.24.*" }
subtle = "2.3"
cfg-if = "0.1"
ark-std = { version = "0.3", features = ["print-trace"] }

log = "0.4.8"
thiserror = "1.0.10"
crossbeam-channel = "0.5.0"
crossbeam = {version = "0.7", optional = true}
lazy_static = {version = "1", optional = true}
futures = {package = "futures", version = "0.3", default_features = false, features = ["executor"]}
num_cpus = "1"
env_logger = "0.8.1"

# gpu feature
rust-gpu-tools = { version = "0.3.0", optional = true }
ff-cl-gen = { path = "../../ff-cl-gen", version = "0.2.0", optional = true }
fs2 = { version = "0.4.3", optional = true }

# Developer tooling dependencies
plotters = { version = "0.3.0", optional = true }
@@ -62,11 +78,12 @@ rand_core = { version = "0.6", default-features = false, features = ["getrandom"
getrandom = { version = "0.2", features = ["js"] }

[features]
default = ["shplonk"]
default = ["shplonk", "gpu"]
CPerezz (Member) commented:
I'm not sure about adding this as default now.

We will need some time to update the CI infra to support GPU usage.
cc: @AronisAt79 @ntampakas @barryWhiteHat

bchyl (Author) commented:
yeah, thanks very much.

bchyl (Author) commented on Jul 4, 2022:

> I'm not sure about adding this as default now.
>
> We will need some time to update the CI infra to support GPU usage. cc: @AronisAt79 @ntampakas @barryWhiteHat

hi @CPerezz.
May I ask how long it will take for the CI infra to be ready?

In addition to updating the FFT data, last week we ported a multiexp patch, compatible with the pairing library, based on Filecoin's ec-gpu. The performance data is as follows:

running 1 test
Testing Multiexp for 1024 elements...
GPU took 10ms.
CPU took 6ms.
Speedup: x0.6
============================
Testing Multiexp for 2048 elements...
GPU took 8ms.
CPU took 5ms.
Speedup: x0.625
============================
Testing Multiexp for 4096 elements...
GPU took 8ms.
CPU took 6ms.
Speedup: x0.75
============================
Testing Multiexp for 8192 elements...
GPU took 10ms.
CPU took 12ms.
Speedup: x1.2
============================
Testing Multiexp for 16384 elements...
GPU took 12ms.
CPU took 22ms.
Speedup: x1.8333334
============================
Testing Multiexp for 32768 elements...
GPU took 16ms.
CPU took 40ms.
Speedup: x2.5
============================
Testing Multiexp for 65536 elements...
GPU took 25ms.
CPU took 83ms.
Speedup: x3.32
============================
Testing Multiexp for 131072 elements...
GPU took 34ms.
CPU took 169ms.
Speedup: x4.970588
============================
Testing Multiexp for 262144 elements...
GPU took 59ms.
CPU took 287ms.
Speedup: x4.8644066
============================
Testing Multiexp for 524288 elements...
GPU took 91ms.
CPU took 469ms.
Speedup: x5.1538463
============================
Testing Multiexp for 1048576 elements...
GPU took 152ms.
CPU took 864ms.
Speedup: x5.6842103
============================
Testing Multiexp for 2097152 elements...
GPU took 246ms.
CPU took 1707ms.
Speedup: x6.9390244
============================
Testing Multiexp for 4194304 elements...
GPU took 373ms.
CPU took 3431ms.
Speedup: x9.198391
============================
Testing Multiexp for 8388608 elements...
GPU took 574ms.
CPU took 6788ms.
Speedup: x11.825784
============================
Testing Multiexp for 16777216 elements...
GPU took 928ms.
CPU took 14723ms.
Speedup: x15.865302
============================
Testing Multiexp for 33554432 elements...
GPU took 1541ms.
test multiexp::tests::gpu_multiexp_consistency has been running for over 60 seconds
CPU took 32038ms.
Speedup: x20.790396
============================
Testing Multiexp for 67108864 elements...
GPU took 2997ms.
CPU took 65527ms.
Speedup: x21.864197
============================

Brechtpd commented:

Hi @bchyl, some of these GPU times for FFT/MSM look great! For the MSM, however, it seems you'll have to test with a larger number of elements: the current sizes are small enough that you don't yet see the expected ~linear increase in time (for small multiexps other overheads dominate, which makes sense but doesn't really matter).

I do have a question about the CPU times for MSM and FFTs. They seem to be extremely slow for some reason, much slower than even when running them on a normal desktop CPU, and you're running them on a very powerful machine so that doesn't make any sense. On a standard machine (8 CPU cores) you can do an FFT of 2^20 in roughly 0.25s and an MSM of 2^20 in less than 10 seconds (the exact numbers of course dependent on the specific implementation). The numbers make it look like you're running only on a single core or something. Am I misinterpreting the data or do you think there may be something up with the CPU performance numbers?

bchyl (Author) commented on Jul 5, 2022:

Hi @Brechtpd, thank you very much for the reference 8-core performance data.
We have double-checked our CPU numbers (the test machine has 80 cores) and updated the data above.

In general, the MSM results are consistent with yours: 2^20 elements in under 1s (864ms). For the FFT our result was 0.263s, essentially your reference value of 0.25s; it looks like CPU acceleration stops improving noticeably beyond a certain number of cores.

Brechtpd commented:

Great, thanks for the updated numbers! I think the limited scalability beyond a certain number of CPU cores is at least partially caused by the multi-threading approach followed by the current CPU FFT/MSM implementations. Not really sure how big the impact is of that with a CPU with that many cores though.

dev-graph = ["plotters", "tabbycat"]
gadget-traces = ["backtrace"]
sanity-checks = []
shplonk = []
gpu = ["rust-gpu-tools", "ff-cl-gen", "fs2", "lazy_static", "crossbeam", "futures/thread-pool"]
gwc = []

[lib]
31 changes: 31 additions & 0 deletions halo2_proofs/src/gpu/error.rs
@@ -0,0 +1,31 @@
#[cfg(feature = "gpu")]
use rust_gpu_tools::opencl;

#[derive(thiserror::Error, Debug)]
pub enum GPUError {
#[error("GPUError: {0}")]
Simple(&'static str),
#[cfg(feature = "gpu")]
#[error("OpenCL Error: {0}")]
OpenCL(#[from] opencl::GPUError),
#[cfg(feature = "gpu")]
#[error("GPU taken by a high priority process!")]
GPUTaken,
#[cfg(feature = "gpu")]
#[error("No kernel is initialized!")]
KernelUninitialized,
#[error("GPU accelerator is disabled!")]
GPUDisabled,
}

pub type GPUResult<T> = std::result::Result<T, GPUError>;

#[cfg(feature = "gpu")]
impl From<std::boxed::Box<dyn std::any::Any + std::marker::Send>> for GPUError {
fn from(e: std::boxed::Box<dyn std::any::Any + std::marker::Send>) -> Self {
match e.downcast::<Self>() {
Ok(err) => *err,
Err(_) => GPUError::Simple("An unknown GPU error happened!"),
}
}
}
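The `From<Box<dyn Any + Send>>` impl above recovers a typed error from a panic payload, e.g. one carried across a thread boundary by `std::thread::JoinHandle::join`. A minimal, dependency-free sketch of the same downcast pattern, using a hypothetical `MyError` enum standing in for `GPUError`:

```rust
use std::any::Any;

// Hypothetical stand-in for GPUError, to keep the sketch dependency-free.
#[derive(Debug, PartialEq)]
enum MyError {
    Simple(&'static str),
}

impl From<Box<dyn Any + Send>> for MyError {
    fn from(e: Box<dyn Any + Send>) -> Self {
        // If the payload is already a MyError, unbox it; otherwise fall back
        // to a generic variant, exactly as the GPUError impl does.
        match e.downcast::<Self>() {
            Ok(err) => *err,
            Err(_) => MyError::Simple("An unknown error happened!"),
        }
    }
}

fn main() {
    // A payload carrying a typed error round-trips intact.
    let payload: Box<dyn Any + Send> = Box::new(MyError::Simple("boom"));
    assert_eq!(MyError::from(payload), MyError::Simple("boom"));

    // Any other payload (e.g. a &str panic message) collapses to the fallback.
    let other: Box<dyn Any + Send> = Box::new("some string panic");
    assert_eq!(
        MyError::from(other),
        MyError::Simple("An unknown error happened!")
    );
}
```

The `Err(_)` arm is what makes this total: `downcast` returns the original box on type mismatch, so no payload is ever dropped silently.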
205 changes: 205 additions & 0 deletions halo2_proofs/src/gpu/fft.rs
@@ -0,0 +1,205 @@
use crate::gpu::{
error::{GPUError, GPUResult},
get_lock_name_and_gpu_range, locks, sources,
};
use group::ff::Field;
//use group::Group;
use crate::arithmetic::Group;
use crate::worker::THREAD_POOL;
use log::{error, info};
use rayon::join;
use rust_gpu_tools::*;
use std::cmp::min;
use std::{cmp, env};
use std::ops::MulAssign;

const LOG2_MAX_ELEMENTS: usize = 32; // At most 2^32 elements are supported.
const MAX_LOG2_RADIX: u32 = 9; // Radix512
const MAX_LOG2_LOCAL_WORK_SIZE: u32 = 8; // 256

pub struct SingleFFTKernel<G>
where
G: Group,
{
program: opencl::Program,
pq_buffer: opencl::Buffer<G::Scalar>,
omegas_buffer: opencl::Buffer<G::Scalar>,
priority: bool,
}

impl<G> SingleFFTKernel<G>
where
G: Group,
{
pub fn create(device: opencl::Device, priority: bool) -> GPUResult<SingleFFTKernel<G>> {
let src = sources::kernel::<G>(device.brand() == opencl::Brand::Nvidia);

let program = opencl::Program::from_opencl(device, &src)?;
let pq_buffer = program.create_buffer::<G::Scalar>(1 << MAX_LOG2_RADIX >> 1)?;
let omegas_buffer = program.create_buffer::<G::Scalar>(LOG2_MAX_ELEMENTS)?;

info!("FFT: Device: {}", program.device().name());

Ok(SingleFFTKernel {
program,
pq_buffer,
omegas_buffer,
priority,
})
}

/// Performs an FFT round
/// * `log_n` - Specifies log2 of the number of elements
/// * `log_p` - Specifies log2 of `p` (http://www.bealto.com/gpu-fft_group-1.html)
/// * `deg` - 1=>radix2, 2=>radix4, 3=>radix8, ...
/// * `max_deg` - The precalculated values `pq` and `omegas` are valid for radix degrees up to `max_deg`
fn radix_fft_round(
&mut self,
src_buffer: &opencl::Buffer<G::Scalar>,
dst_buffer: &opencl::Buffer<G::Scalar>,
log_n: u32,
log_p: u32,
deg: u32,
max_deg: u32,
) -> GPUResult<()> {
// if locks::PriorityLock::should_break(self.priority) {
// return Err(GPUError::GPUTaken);
// }

let n = 1u32 << log_n;
let local_work_size = 1 << cmp::min(deg - 1, MAX_LOG2_LOCAL_WORK_SIZE);
let global_work_size = (n >> deg) * local_work_size;
let kernel = self.program.create_kernel(
"radix_fft",
global_work_size as usize,
Some(local_work_size as usize),
);
kernel
.arg(src_buffer)
.arg(dst_buffer)
.arg(&self.pq_buffer)
.arg(&self.omegas_buffer)
.arg(opencl::LocalBuffer::<G::Scalar>::new(1 << deg))
.arg(n)
.arg(log_p)
.arg(deg)
.arg(max_deg)
.run()?;
Ok(())
}

/// Share some precalculated values between threads to boost the performance
fn setup_pq_omegas(&mut self, omega: &G::Scalar, n: usize, max_deg: u32) -> GPUResult<()> {
// Precalculate:
// [omega^(0/(2^(deg-1))), omega^(1/(2^(deg-1))), ..., omega^((2^(deg-1)-1)/(2^(deg-1)))]
let mut pq = vec![G::Scalar::zero(); 1 << max_deg >> 1];
let twiddle = omega.pow_vartime([(n >> max_deg) as u64]);
pq[0] = G::Scalar::one();
if max_deg > 1 {
pq[1] = twiddle;
for i in 2..(1 << max_deg >> 1) {
pq[i] = pq[i - 1];
pq[i].mul_assign(&twiddle);
}
}
self.pq_buffer.write_from(0, &pq)?;

// Precalculate [omega, omega^2, omega^4, omega^8, ..., omega^(2^31)]
let mut omegas = vec![G::Scalar::zero(); 32];
omegas[0] = *omega;
for i in 1..LOG2_MAX_ELEMENTS {
omegas[i] = omegas[i - 1].pow_vartime([2u64]);
}
self.omegas_buffer.write_from(0, &omegas)?;

Ok(())
}

/// Performs FFT on `a`
/// * `omega` - Special value `omega` is used for FFT over finite fields
/// * `log_n` - Specifies log2 of number of elements
pub fn radix_fft(&mut self, a: &mut [G::Scalar], omega: &G::Scalar, log_n: u32) -> GPUResult<()> {
let n = 1 << log_n;
let mut src_buffer = self.program.create_buffer::<G::Scalar>(n)?;
let mut dst_buffer = self.program.create_buffer::<G::Scalar>(n)?;

let max_deg = cmp::min(MAX_LOG2_RADIX, log_n);
self.setup_pq_omegas(omega, n, max_deg)?;

src_buffer.write_from(0, &*a)?;
let mut log_p = 0u32;
while log_p < log_n {
let deg = cmp::min(max_deg, log_n - log_p);
self.radix_fft_round(&src_buffer, &dst_buffer, log_n, log_p, deg, max_deg)?;
log_p += deg;
std::mem::swap(&mut src_buffer, &mut dst_buffer);
}

src_buffer.read_into(0, a)?;

Ok(())
}
}

pub struct MultiFFTKernel<G>
where
G: Group,
{
kernels: Vec<SingleFFTKernel<G>>,
_lock: locks::GPULock,
}

impl<G> MultiFFTKernel<G>
where
G: Group,
{
pub fn create(priority: bool) -> GPUResult<MultiFFTKernel<G>> {
let mut all_devices = opencl::Device::all();
let all_num = all_devices.len();
let (lock_index, gpu_range) = get_lock_name_and_gpu_range(all_num);

let lock = locks::GPULock::lock(lock_index);

let devices: Vec<&opencl::Device> = all_devices.drain(gpu_range).collect();

// use all of the GPUs
let kernels: Vec<_> = devices
.into_iter()
.map(|d| (d, SingleFFTKernel::<G>::create(d.clone(), priority)))
.filter_map(|(device, res)| {
if let Err(ref e) = res {
error!(
"Cannot initialize kernel for device '{}'! Error: {}",
device.name(),
e
);
}
res.ok()
})
.collect();

Ok(MultiFFTKernel {
kernels,
_lock: lock,
})
}

pub fn fft_multiple(
&mut self,
polys: &mut [&mut [G::Scalar]],
omega: &G::Scalar,
log_n: u32,
) -> GPUResult<()> {
use rayon::prelude::*;

for poly in polys.chunks_mut(self.kernels.len()) {
crate::worker::THREAD_POOL.install(|| {
poly.par_iter_mut()
.zip(self.kernels.par_iter_mut())
.for_each(|(p, kern)| kern.radix_fft(p, omega, log_n).unwrap())
});
}

Ok(())
}
}
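The driver loop in `radix_fft` decomposes a size-`2^log_n` FFT into a few large-radix rounds: each kernel launch consumes `deg = min(max_deg, log_n - log_p)` butterfly stages, then the source and destination buffers are swapped. A dependency-free sketch of just that scheduling logic, with the kernel launch replaced by bookkeeping:

```rust
use std::cmp;

const MAX_LOG2_RADIX: u32 = 9; // Radix-512, matching the constant above.

/// Returns the per-round radix degrees the driver loop would launch
/// for an FFT of 2^log_n elements.
fn fft_rounds(log_n: u32) -> Vec<u32> {
    let max_deg = cmp::min(MAX_LOG2_RADIX, log_n);
    let mut rounds = Vec::new();
    let mut log_p = 0u32;
    while log_p < log_n {
        // Each round handles up to max_deg butterfly stages at once,
        // so the number of kernel launches is ceil(log_n / max_deg).
        let deg = cmp::min(max_deg, log_n - log_p);
        rounds.push(deg);
        log_p += deg;
    }
    rounds
}

fn main() {
    // A 2^20 FFT runs as two radix-2^9 rounds plus one radix-2^2 round.
    assert_eq!(fft_rounds(20), vec![9, 9, 2]);
    // Small FFTs fit in a single round.
    assert_eq!(fft_rounds(8), vec![8]);
    // The degrees always sum back to log_n.
    assert_eq!(fft_rounds(27).iter().sum::<u32>(), 27);
}
```

Because the buffers are swapped after every round, the final result always ends up in `src_buffer`, which is why `radix_fft` reads the output back from there rather than from `dst_buffer`.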
76 changes: 76 additions & 0 deletions halo2_proofs/src/gpu/fft/fft.cl
@@ -0,0 +1,76 @@
uint bitreverse(uint n, uint bits) {
uint r = 0;
for(int i = 0; i < bits; i++) {
r = (r << 1) | (n & 1);
n >>= 1;
}
return r;
}

/*
* FFT algorithm is inspired from: http://www.bealto.com/gpu-fft_group-1.html
*/
__kernel void radix_fft(__global FIELD* x, // Source buffer
__global FIELD* y, // Destination buffer
__global FIELD* pq, // Precalculated twiddle factors
__global FIELD* omegas, // [omega, omega^2, omega^4, ...]
__local FIELD* u, // Local buffer to store intermediary values
uint n, // Number of elements
uint lgp, // Log2 of `p` (Read more in the link above)
uint deg, // 1=>radix2, 2=>radix4, 3=>radix8, ...
uint max_deg) // Maximum degree supported, according to `pq` and `omegas`
{
uint lid = get_local_id(0);
uint lsize = get_local_size(0);
uint index = get_group_id(0);
uint t = n >> deg;
uint p = 1 << lgp;
uint k = index & (p - 1);

x += index;
y += ((index - k) << deg) + k;

uint count = 1 << deg; // 2^deg
uint counth = count >> 1; // Half of count

uint counts = count / lsize * lid;
uint counte = counts + count / lsize;

// Compute powers of twiddle
const FIELD twiddle = FIELD_pow_lookup(omegas, (n >> lgp >> deg) * k);
FIELD tmp = FIELD_pow(twiddle, counts);
for(uint i = counts; i < counte; i++) {
u[i] = FIELD_mul(tmp, x[i*t]);
tmp = FIELD_mul(tmp, twiddle);
}
barrier(CLK_LOCAL_MEM_FENCE);

const uint pqshift = max_deg - deg;
for(uint rnd = 0; rnd < deg; rnd++) {
const uint bit = counth >> rnd;
for(uint i = counts >> 1; i < counte >> 1; i++) {
const uint di = i & (bit - 1);
const uint i0 = (i << 1) - di;
const uint i1 = i0 + bit;
tmp = u[i0];
u[i0] = FIELD_add(u[i0], u[i1]);
u[i1] = FIELD_sub(tmp, u[i1]);
if(di != 0) u[i1] = FIELD_mul(pq[di << rnd << pqshift], u[i1]);
}

barrier(CLK_LOCAL_MEM_FENCE);
}

for(uint i = counts >> 1; i < counte >> 1; i++) {
y[i*p] = u[bitreverse(i, deg)];
y[(i+counth)*p] = u[bitreverse(i + counth, deg)];
}
}

/// Multiplies all of the elements by `field`
__kernel void mul_by_field(__global FIELD* elements,
uint n,
FIELD field) {
const uint gid = get_global_id(0);
elements[gid] = FIELD_mul(elements[gid], field);
}
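The kernel's final store permutes indices through `bitreverse` because the decimation butterflies above leave results in bit-reversed order within each radix-`2^deg` block. A host-side Rust sketch of the same helper, useful for checking the permutation on the CPU:

```rust
/// Mirrors the OpenCL `bitreverse` helper: reverses the low `bits` bits of `n`.
fn bitreverse(mut n: u32, bits: u32) -> u32 {
    let mut r = 0u32;
    for _ in 0..bits {
        r = (r << 1) | (n & 1);
        n >>= 1;
    }
    r
}

fn main() {
    // 0b001 reversed within 3 bits is 0b100.
    assert_eq!(bitreverse(0b001, 3), 0b100);

    // Reversal is an involution: applying it twice restores the input.
    for n in 0..256 {
        assert_eq!(bitreverse(bitreverse(n, 8), 8), n);
    }

    // A radix-8 block (deg = 3) stores its 8 outputs in this order:
    let perm: Vec<u32> = (0..8).map(|i| bitreverse(i, 3)).collect();
    assert_eq!(perm, vec![0, 4, 2, 6, 1, 5, 3, 7]);
}
```

Since the permutation is its own inverse, applying it once at the output stage is enough to restore natural ordering; no separate un-shuffle pass is needed.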