Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API Design: StorageBuffers #249

Open
charles-r-earp opened this issue Nov 18, 2020 · 4 comments
Open

API Design: StorageBuffers #249

charles-r-earp opened this issue Nov 18, 2020 · 4 comments
Labels
mcp: rfc needed Issues that need an RFC before implementation. t: design Design of our rust-gpu language and std

Comments

@charles-r-earp
Copy link
Contributor

Summary

This proposal attempts to address #232, #180, and #8. StorageBuffers, aka Buffer Blocks, are the primary inputs / outputs to compute shaders, though they can also be used in other stages. Iterators are one of Rust's highlights, and I propose emulating them in gpu code.

Example: Scaled Add aka Saxpy

// Rust Cpu
fn scaled_add(x: &[f32], mut y: &mut [f32], alpha: f32) {
    for (x, mut y) in x.iter().copied().zip(y) {
        *y += alpha * x;
    }
}

// OpenCL 
__kernel void scaled_add(__global const float* x, __global float* y, float alpha, uint n) {
    uint gid = get_global_id(0);
    if gid < n {
        y[gid] += alpha * x[gid];
    }
}

// GLSL 
#version 450

layout(set=0, binding=0) buffer Input {
    float x[];    
}

layout(set=0, binding=1) buffer Output {
    float y[];    
}

layout(push_constant) uniform PushConsts {
    float alpha;
    uint n;
}

void main() {
    uint gid = gl_GlobalInvocationID.x;
    if gid < n {
        y[gid] += alpha * x[gid];
    }
}

Saxpy is trivially parallel, that is, it can be separated into n independent operations. There are plenty of other similar kinds of operations common in CUDA / OpenCL code, which do not require any synchronization / barriers. In fact, this is probably the most common case.

Possible Naive Implementation in rust-gpu

#[allow(unused_attributes)]
#[spirv(gl_compute)]
pub fn scaled_add(x: Buffer<[f32]>, mut y: BufferMut<[f32]>, alpha: f32, n: u32) {
    let gid = spirv_std::global_x();
    if gid < n {
        unsafe {
            *y.get_mut_unchecked(gid) = x.get_unchecked(gid);
        }    
    } 
} 

This is the most straightforward translation of the above to rust gpu code. If we neglect concerns about aliasing with other shaders, then the only potential failure mode would be that the user provided n is outside the bounds of either x or y. The programmer is responsible, by using unsafe, to ensure no aliasing within the shader.

The rayon crate allows for easy parallel processing on cpu's, with its ParallelIterator trait. It looks like this:

// rayon
fn scaled_add(x: &[f32], mut y: &mut [f32], alpha: f32) {
    use rayon::iter::ParallelIterator;
    for (mut y, x) in y.par_iter_mut().zip(x.iter().copied()) {
        *y += alpha * x;
    }
}

Rayon divides the work into n parts, where n is the number of workers. It knows that partioning a slice, even a mutable one, is safe. I propose a similar api for rust-gpu.

Proposal

Buffers: Runtime Arrays of T: Copy

  • GlobalBuffer: StorageBuffer
  • GlobalBufferMut: StorageBuffer, mutable
  • GroupBufferMut: Workgroup, mutable

Arrays: Like buffers, but with a const size, either const generics or array type ie <T = f32, const N: usize = 1024>, or <T = [f32; 1024]>

  • GlobalArray: StorageBuffer
  • GlobalArrayMut: StorageBuffer, mutable
  • GroupArrayMut: Workgroup, mutable

Matrices!?? Would probably want const generics, ie Matrix2<T, D1 = 100, D2 = 64>

  • An abstraction over Array, maybe even Buffer
  • Useful for iterating over dimensions, see ndarray
  • If we express the access as an iterator, then we can prevent aliases or perform appropriate synchronization
  • Long term idea

Blocks: A single T struct item, T: Copy

  • GlobalBlock: StorageBuffer
  • load() -> T
  • Potentially borrow() / as_ref() -> &T

If necessary, mutable Block fields could be accomplished via some sort of Mutex or ArcCell equivalent, but that would require relaxing the Copy requirement.

Arrays and Blocks are safer because the runtime can validate the inputs prior to launching. Start with blocks, then arrays, then buffers. Note that all of these require a special "Block" decorated struct wrapper, at least per SPIR-V specification. Slices are also tricky / not allowed in exports because of Rust's unstable abi.

Iterators

GlobalIterator trait

  • Like rayon::iter::ParallelIterator
  • Divides the work into global size pieces, and each invocation works on one piece
  • fn for_each(self, f: impl Fn(Self::Item))
  • fn enumerate()
  • fn enumerate_xyz() Same as enumerate, but provides the coordinates rather than the index
  • fn zip() for iterating over multiple buffers
  • others from Iterator if possible
  • some sort of hidden drive / next function
  • unsafe to implement, may even be sealed

GroupIterator trait

  • Like GlobalIterator, but iterates over local size

*Iter's are like slice::Iter, they iterate over a borrow.

IntoGlobalIterator / IntoGroupIterator

*IntoIter's: consume their container, but still yield borrowed values. This allows them to mutate the output, but they are consumed from the scope of the shader.

Unsafe

Buffers, Arrays, may have unsafe access to their internal Slice. It may not always be possible to prove that a program is safe, and it will take time to implement enough safe wrappers to fit every need.

Likewise, access to invocation specific values, like the global xyz, must be unsafe or even not allowed at all to ensure that the safe GlobalIterator construct is in fact safe. This means that the shader cannot get the u32 value, but could say, manipulate it mathmatically, and index a Buffer / slice / or pointer with it (this would require unsafe). The key thing is that it can't be read and it can't be used in control flow outside of the inner closure of for_each. This could be implemented as a wrapper, ie Idx(u32), which implements the appropriate traits.

Barriers

Barriers are emitted by *Iter and *IterMut iterators as required. IntoIter's should be able to ommit barriers, since they consume their inputs, so that they cannot be read / written to again within the shader.

Globally Const vs Per Invocation

For trivial cases like axpy, there is no need for the shader to access non-const memory (outside of the closure passed to for_each). The closure cannot mutate it's environment, or even borrow it, and nothing is returned from it. This prevents non-static control flow. Only the push_constants would be copied into the closure (via move).

Single Invocation operations

In some cases, it may be necessary to have only one invocation peform some work. This could potentially handled with a special SingleGlobalBufferIntoIter or the like.

Putting it all together

#[allow(unused_attributes)]
#[spirv(gl_compute)]
pub fn scaled_add(x: GlobalArray<f32, 100>, y: GlobalArrayMut<f32, 100>, alpha: f32) {
    y.into_global_iter_mut().zip(x)
        .for_each(|(mut y, x)| *y += alpha * x);
}
@charles-r-earp charles-r-earp added the mcp: proposed A major change to the compiler, that hasn't yet been approved. label Nov 18, 2020
@Jasper-Bekkers
Copy link
Contributor

Related to #216 so might be good to get @Tobski's eyes on this as well.

@Jasper-Bekkers
Copy link
Contributor

For trivial cases like axpy, there is no need for the shader to access non-const memory (outside of the closure passed to for_each). The closure cannot mutate it's environment, or even borrow it, and nothing is returned from it. This prevents non-static control flow. Only the push_constants would be copied into the closure (via move).

Would you mind elaborating "non-const memory" here?

I've had discussions with @Tobski before, and the iterator based approach to data access is actually quite appealing to us - especially having something like you've proposed for the simpler cases.

One extension we discussed back then was to have a UniformIndex-style type which would allow safe remapping operations on the equivalent of gl_globalinvocationid (at least in such a way that they wouldn't clobber other elements in the buffer). Kind of like this:

#[allow(unused_attributes)]
#[spirv(gl_compute)]
pub fn scaled_add(x: GlobalArray<f32, 100>, y: GlobalArrayMut<f32, 100>, alpha: f32) {
    y.into_global_iter_mut().map_idx(|idx| idx ^ 2).zip(x)
        .for_each(|(mut y, x)| *y += alpha * x);
}

@XAMPPRocky
Copy link
Member

We discussed this at the meeting today, and the consensus that we're generally in favour of idea of adding storage buffers, and we really like the style of using iterators for this task. However we think that there's still a lot of design work to be done in this, and would like to see a full RFC on this topic.

Additionally before we'd accept an RFC on a safe API for storage buffers, we'd like to first see an unsafe API for storage buffers available, that allows people to prototype and build their own safe abstractions before merging it into spirv-std.

@XAMPPRocky XAMPPRocky added mcp: rfc needed Issues that need an RFC before implementation. and removed mcp: proposed A major change to the compiler, that hasn't yet been approved. labels Nov 26, 2020
@charles-r-earp
Copy link
Contributor Author

For trivial cases like axpy, there is no need for the shader to access non-const memory (outside of the closure passed to for_each). The closure cannot mutate it's environment, or even borrow it, and nothing is returned from it. This prevents non-static control flow. Only the push_constants would be copied into the closure (via move).

Would you mind elaborating "non-const memory" here?

Hmmm, by non-const I mean the data in buffers, and global_id's, sizes etc. The idea is that the closure passed to for_each can't borrow anything, and doesn't return anything, and is the only way, at least safely, to access the data in buffers or the global id. If we know that at entry level that the code only has access to things that are uniform across all invocations, then control flow must also be uniform. The non uniform part is either in the closure, or expressed via the iterator, ie if the buffer is len 100 and the global size is 128, then the extra 28 are invalidated, rather than reading out of bounds. This acts like zipping a collection with a range.

@oisyn oisyn added the t: design Design of our rust-gpu language and std label Nov 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
mcp: rfc needed Issues that need an RFC before implementation. t: design Design of our rust-gpu language and std
Projects
None yet
Development

No branches or pull requests

4 participants