
CeedRequestWait is documented but not implemented #700

Open
FreddieWitherden opened this issue Jan 28, 2021 · 3 comments

@FreddieWitherden

I am currently reviewing libCEED as part of openjournals/joss-reviews#2945.

The method CeedRequestWait is documented and seems extremely interesting. However, looking through the source, it does not appear to be actually implemented. It might be prudent to relegate this method and its associated paraphernalia to a feature branch until it is ready for mainline. If nothing else, implementing it may turn up a blocker or reveal a simpler approach.

@jedbrown
Member

This is a good point. We envisioned this as a consistent way to handle CPU threads and GPU streams, but now I'm less confident that this is the right interface (at least for GPUs). We've recently been experimenting in PETSc with offloading the scalar math for Krylov methods, such that the host doesn't need to block. We might want to switch to a stream model, something like CeedOperatorApplyWithStream(CeedOperator op, CeedVector in, CeedVector out, Stream s), which would return as soon as the operation is enqueued, and thus support hiding kernel launch latency. Would you prefer this model? Other ideas?
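
For concreteness, a rough usage sketch of that kind of interface (the CeedStream type and the *WithStream/*Synchronize names below are purely hypothetical, not existing libCEED API):

```c
/* Hypothetical sketch only -- none of the *Stream* names below exist in
   libCEED today; they just illustrate the proposed enqueue-and-return model. */
CeedStream stream;
CeedStreamCreate(ceed, &stream);

/* Returns as soon as the work is enqueued on the stream, so kernel launch
   latency can be hidden behind subsequent host work. */
CeedOperatorApplyWithStream(op, u, v, stream);

/* ... other host work, or enqueue further operations on the same stream ... */

/* Block only when the result is actually needed on the host. */
CeedStreamSynchronize(stream);
CeedStreamDestroy(&stream);
```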

Cc: @YohannDudouit @nbeams

@FreddieWitherden
Author

I think the community is slowly moving away from streams/queues and towards more declarative, graph-based APIs. Good examples here are the (relatively recent) CUDA graph management APIs and the command-list approach taken by Intel's Level Zero (which is probably the gold standard in terms of HPC API design at the moment, albeit somewhat low level).

The difficulty I've always had with designing these kinds of interfaces (and we have our own queues in PyFR) is how to handle CPUs. If we execute the kernel immediately from the API call, then the call is not asynchronous (whereas it would be on a GPU if you launched it in a stream). This confuses users who expect the call to return immediately so they can go off and do some I/O while waiting for the kernel to finish. If you defer execution until a later wait/sync call, you may catch out users who expect forward progress to be made. (MPI libraries often run into this problem with non-blocking communication: if you're not probing/waiting, nothing happens.) Finally, if you spawn a thread and do the work there, you're increasing overhead and may be considered a bad citizen.
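
As a minimal illustration of the deferred-execution pitfall (purely a sketch, unrelated to any real backend), consider a CPU "command list" where nothing runs until the wait:

```c
#include <stddef.h>

typedef void (*Task)(void *ctx);

typedef struct {
  Task   fn[64];
  void  *ctx[64];
  size_t count;
} CommandList;

/* "Launching" a kernel only records it; the call returns immediately, but no
   forward progress is made. */
static void enqueue(CommandList *cl, Task fn, void *ctx) {
  cl->fn[cl->count]  = fn;
  cl->ctx[cl->count] = ctx;
  cl->count++;
}

/* All of the work actually happens here, on the calling thread, which is what
   surprises users who expected progress between enqueue and wait. */
static void wait_all(CommandList *cl) {
  for (size_t i = 0; i < cl->count; i++) cl->fn[i](cl->ctx[i]);
  cl->count = 0;
}
```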

The problem is even more fiddly when you're a library and need to integrate with the models of a large number of stakeholders.

@nbeams
Contributor

nbeams commented Jan 29, 2021

I agree this is a tricky issue for a library like libCEED, which may need to use various task/execution queue models (through backends targeting different architectures) while also being used by other libraries/applications.

Currently, the CUDA/HIP/MAGMA backends all use the NULL stream (and only the NULL stream). For the CUDA/HIP backends, this is baked into the kernel launch functions; see e.g. CeedRunKernelCuda. For the MAGMA backend, we do store a magma_queue_t in the backend data, just to avoid repeatedly creating/destroying the queue -- the queue is required for the MAGMA library calls used in the non-tensor basis routines -- but there's currently no way to set the queue to anything other than the NULL stream. (This is for two reasons: 1) there's no libCEED interface to tell the backend which stream you want, and 2) we want to use the same stream as the CUDA/HIP backends, since we delegate to them for QFunction and Operator actions.)

So the addition of a stream/queue/command list object could potentially require some far-reaching refactoring across all the CUDA/HIP and MAGMA backends. Not that that's a reason not to do it -- in fact, I think it could be nice to separate this somewhat from the backends themselves, rather than the implicit setup we have now across CUDA/HIP/MAGMA -- it's just something to keep in mind when envisioning how this setup could/should work.
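
To make "baked into the kernel launch functions" concrete, here is a simplified sketch (not the actual CeedRunKernelCuda source) of the difference between a launch helper that hard-codes the default stream and one that accepts a stream:

```c
#include <cuda.h>

/* Hard-codes the default (NULL) stream, as the current launch helpers
   effectively do -- callers have no way to supply a different stream. */
static CUresult RunKernelDefaultStream(CUfunction kernel, unsigned grid,
                                       unsigned block, void **args) {
  return cuLaunchKernel(kernel, grid, 1, 1, block, 1, 1,
                        0 /* shared mem */, NULL /* default stream */,
                        args, NULL);
}

/* A stream-aware variant would need the stream stored in the backend data and
   threaded through every call site -- hence the far-reaching refactoring. */
static CUresult RunKernelOnStream(CUfunction kernel, unsigned grid,
                                  unsigned block, void **args,
                                  CUstream stream) {
  return cuLaunchKernel(kernel, grid, 1, 1, block, 1, 1,
                        0 /* shared mem */, stream, args, NULL);
}
```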
