
CeedRequestWait is documented but not implemented #700

Open
FreddieWitherden opened this issue Jan 28, 2021 · 3 comments

@FreddieWitherden

I am currently reviewing libCEED as part of openjournals/joss-reviews#2945.

The method CeedRequestWait is documented and seems extremely interesting. However, looking through the source, it does not appear to be actually implemented. It might be prudent to relegate this method and its associated paraphernalia to a feature branch until it is ready for mainline. If nothing else, implementing it may turn up a blocker or reveal a simpler approach.

@jedbrown
Member

This is a good point. We envisioned this as a consistent way to handle CPU threads and GPU streams, but now I'm less confident that this is the right interface (at least for GPUs). We've recently been experimenting in PETSc with offloading the scalar math for Krylov methods, such that the host doesn't need to block. We might want to switch to a stream model, something like CeedOperatorApplyWithStream(CeedOperator op, CeedVector in, CeedVector out, Stream s), which would return as soon as the operation is enqueued, and thus support hiding kernel launch latency. Would you prefer this model? Other ideas?
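
For concreteness, a rough usage sketch of that kind of interface (the CeedStream type and the *WithStream/*Synchronize names below are purely hypothetical, not existing libCEED API):

```c
/* Hypothetical sketch only -- none of the *Stream* names below exist in
   libCEED today; they just illustrate the proposed enqueue-and-return model. */
CeedStream stream;
CeedStreamCreate(ceed, &stream);

/* Returns as soon as the work is enqueued on the stream, so kernel launch
   latency can be hidden behind subsequent host work. */
CeedOperatorApplyWithStream(op, u, v, stream);

/* ... other host work, or enqueue further operations on the same stream ... */

/* Block only when the result is actually needed on the host. */
CeedStreamSynchronize(stream);
CeedStreamDestroy(&stream);
```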

Cc: @YohannDudouit @nbeams

@FreddieWitherden
Author

I think the community is slowly moving away from streams/queues and towards more declarative, graph-based APIs. Good examples here are the (relatively recent) CUDA graph management APIs and the command-list approach taken by Intel's Level Zero (which is probably the gold standard in terms of HPC API design at the moment, albeit somewhat low level).

The difficulty I've always had with designing these kinds of interfaces (and we have our own queues in PyFR) is how to handle CPUs. If we execute the kernel immediately from the API call, then the call is not asynchronous (whereas it would be on a GPU if you launched it in a stream). This confuses users who expect the call to return immediately so they can go off and do some I/O while waiting for the kernel to finish. If you defer execution until a later wait/sync call, you may catch out users who expect forward progress to be made. (MPI libraries often run into this problem with non-blocking communication: if you're not probing/waiting, nothing happens.) Finally, if you spawn a thread and do the work there, you're increasing overhead and may be considered a bad citizen.
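
As a minimal illustration of the deferred-execution pitfall (purely a sketch, unrelated to any real backend), consider a CPU "command list" where nothing runs until the wait:

```c
#include <stddef.h>

typedef void (*Task)(void *ctx);

typedef struct {
  Task   fn[64];
  void  *ctx[64];
  size_t count;
} CommandList;

/* "Launching" a kernel only records it; the call returns immediately, but no
   forward progress is made. */
static void enqueue(CommandList *cl, Task fn, void *ctx) {
  cl->fn[cl->count]  = fn;
  cl->ctx[cl->count] = ctx;
  cl->count++;
}

/* All of the work actually happens here, on the calling thread, which is what
   surprises users who expected progress between enqueue and wait. */
static void wait_all(CommandList *cl) {
  for (size_t i = 0; i < cl->count; i++) cl->fn[i](cl->ctx[i]);
  cl->count = 0;
}
```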

The problem is even more fiddly when you're a library and need to integrate with the models of a large number of stakeholders.

@nbeams
Contributor

nbeams commented Jan 29, 2021

I agree this is a tricky issue for a library like libCEED, which may need to use various task/execution queue models (through backends targeting different architectures) while also being used by other libraries/applications.

Currently, the CUDA/HIP/MAGMA backends all use the NULL stream (and only the NULL stream). For the CUDA/HIP backends, this is baked into the kernel launch functions; see e.g. CeedRunKernelCuda. For the MAGMA backend, we do store a magma_queue_t in the backend data, just to avoid repeatedly creating/destroying the queue -- the queue is required for the MAGMA library calls used in the non-tensor basis routines -- but there's currently no way to set the queue to anything other than the NULL stream. (This is for two reasons: 1) there's no libCEED interface to tell the backend which stream you want, and 2) we want to use the same stream as the CUDA/HIP backends, since we delegate to them for QFunction and Operator actions.)

So the addition of a stream/queue/command list object could potentially require some far-reaching refactoring across all the CUDA/HIP and MAGMA backends. Not that that's a reason not to do it -- in fact, I think it could be nice to separate this somewhat from the backends themselves, rather than the implicit setup we have now across CUDA/HIP/MAGMA -- it's just something to keep in mind when envisioning how this setup could/should work.
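
To make "baked into the kernel launch functions" concrete, here is a simplified sketch (not the actual CeedRunKernelCuda source) of the difference between a launch helper that hard-codes the default stream and one that accepts a stream:

```c
#include <cuda.h>

/* Hard-codes the default (NULL) stream, as the current launch helpers
   effectively do -- callers have no way to supply a different stream. */
static CUresult RunKernelDefaultStream(CUfunction kernel, unsigned grid,
                                       unsigned block, void **args) {
  return cuLaunchKernel(kernel, grid, 1, 1, block, 1, 1,
                        0 /* shared mem */, NULL /* default stream */,
                        args, NULL);
}

/* A stream-aware variant would need the stream stored in the backend data and
   threaded through every call site -- hence the far-reaching refactoring. */
static CUresult RunKernelOnStream(CUfunction kernel, unsigned grid,
                                  unsigned block, void **args,
                                  CUstream stream) {
  return cuLaunchKernel(kernel, grid, 1, 1, block, 1, 1,
                        0 /* shared mem */, stream, args, NULL);
}
```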
