
Distributed Ensemble (MPI Support) #1090

Merged
merged 58 commits into master from distributed_ensemble on Dec 16, 2023

Conversation

Robadob
Member

@Robadob Robadob commented Jul 14, 2023

The implementation of MPI ensembles in this PR is designed so that each CUDAEnsemble has exclusive access to all of the GPUs available to it (or to those specified with the devices config). Ideally a user will launch 1 MPI worker per node, though 1 worker per GPU per node is also possible.
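For context, a minimal sketch of the intended usage, assuming the existing devices config member and a launch of one MPI worker per node (exact member names and launch flags may differ from the final API):

    // Sketch only; assumes Config().devices and a launch such as
    //   mpirun -n <nodes> --map-by ppr:1:node ./ensemble_example
    #include "flamegpu/flamegpu.h"

    int main(int argc, const char **argv) {
        flamegpu::ModelDescription model("example");
        model.newAgent("agent");  // trivial model body, for illustration only
        flamegpu::RunPlanVector runs(model, 100);
        runs.setSteps(10);
        flamegpu::CUDAEnsemble ensemble(model, argc, argv);
        ensemble.Config().devices = {0, 1};  // optional: restrict each worker to specific GPUs
        ensemble.simulate(runs);             // the 100 runs are divided between the MPI workers
        return 0;
    }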

It would be possible to use MPI shared-memory groups to identify workers on the same node and negotiate the division of GPUs (and/or for some workers to sit idle), but this has not been implemented.

Full notes on the identified edge cases are in the todo list below.


  • Setup CMake
  • Update CUDAEnsemble
    • Handle the 3 possible error configs
    • Do something about programmatic output logs (currently every copy returns the full RunLog vector, with entries it did not handle left empty of data; checking step counter > 0 is a hack, and this presumably also affects failed runs under the normal error modes)
      • Each runner has their own logs in RunLog (this requires an API break, as the RunLog vector does not identify the run index or RunPlan; it was initially assumed that the vector would be parallel to the RunPlanVector used as input). [Agreed with Paul 2023-07-26]
      • Rank 0 gets all the logs, others get empty
      • All runners get all logs
    • Catch and handle local MPI runs inside CUDAEnsemble? (we want users to avoid launching multiple MPI runners that have access to the same GPU)
    • Should we expose world_size/rank to HostAPI? (I don't think it's necessary)
  • Design a test case
  • Test on Mavericks/Waimu (single node)
  • Test on Bede (multi node)
  • Document (readme)
  • Document (userguide) Distributed Ensembles (MPI) FLAMEGPU2-docs#156
  • Update telemetry to account for MPI
  • Do something with the ensemble example? (e.g. set CUDAEnsemble::Config().mpi = false;)
  • Do we need to handle a race condition with RTC cache?

Closes #1073
Closes #1114


Edit for visibility (by @ptheywood): the next release notes need to make clear that this includes a breaking change to the return type of CUDAEnsemble::getLogs, from a std::vector to a std::map.
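A hedged sketch of what that break means for calling code, assuming the map is keyed by the run's index within the input RunPlanVector (the exact key type is not confirmed here):

    // Previously: const std::vector<flamegpu::RunLog> &logs = ensemble.getLogs();
    // Now the logs are keyed by run index, so under MPI each rank only holds
    // entries for the runs it actually executed.
    const std::map<unsigned int, flamegpu::RunLog> &logs = ensemble.getLogs();
    for (const auto &[run_index, log] : logs) {
        // process the step/exit log data for the run identified by run_index
    }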

@Robadob Robadob self-assigned this Jul 14, 2023
@Robadob Robadob force-pushed the distributed_ensemble branch 2 times, most recently from 3a1c2ce to 1987d9a Compare July 24, 2023 14:49
@Robadob
Member Author

Robadob commented Jul 24, 2023

I've created a simple test case that will either run on a local node (if the worker count <= device count) or across multiple nodes (I could probably extend this to ensure it's 1 worker per node, but that would require some MPI comms to set up the test).

The issue with MPI testing is that MPI_Init() and MPI_Finalize() can each only be called once per process. Because CUDAEnsemble automatically cleans up and triggers MPI_Finalize(), which waits for all runners to also call it, a second MPI test case cannot be run.

Perhaps this is an argument for Pete's CMake test magic, which as I understand it runs the test suite once per individual test. The alternative would be to add a backdoor that tells the ensemble not to finalize when it detects tests (and add some internal finalize equivalent to ensure sync).

Requires discussion/agreement.

The simplest option would be to provide a CUDAEnsemble config to disable auto finalize, and to expose a finalize wrapper to users.

The only possible use case I can see for a distributed ensemble calling CUDAEnsemble::simulate() multiple times would be a large genetic algorithm. If we wish to support that, it will be affected by this too.
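For reference, the once-per-process constraint is why the standard guards (plain MPI, independent of the FLAMEGPU API; helper names below are hypothetical) look like this sketch:

    #include <mpi.h>

    // MPI_Init() and MPI_Finalize() may each be called at most once per process,
    // so a library that manages MPI on the user's behalf has to guard both calls.
    void ensureMPIInit() {
        int initialised = 0;
        MPI_Initialized(&initialised);
        if (!initialised)
            MPI_Init(nullptr, nullptr);
    }
    void ensureMPIFinalize() {
        int finalised = 0;
        MPI_Finalized(&finalised);
        if (!finalised)
            MPI_Finalize();  // blocks until every rank has reached its own MPI_Finalize()
    }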


Changes for err test

Add this to FLAMEGPU_STEP_FUNCTION(model_step)

    // 'counter' is assumed to be defined elsewhere in the test's step function
    if (FLAMEGPU->getStepCounter() == 1 && counter % 13 == 0) {
        throw flamegpu::exception::VersionMismatch("Counter - %d", counter);
    }

Add this to the actual test body, adjusting the error level through Off, Slow and Fast:

        ensemble.Config().error_level = CUDAEnsemble::EnsembleConfig::Fast;

@Robadob
Member Author

Robadob commented Jul 24, 2023

https://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/

Setting up MPI to run across mav+waimu seems a bit involved; it is probably better to try Bede. I would hope that the fact it works on a single node is evidence that it will work across nodes, though.

@Robadob

This comment was marked as resolved.

@Robadob Robadob force-pushed the distributed_ensemble branch 5 times, most recently from 0ad57e6 to 423af91 Compare July 26, 2023 12:50
@Robadob Robadob requested a review from ptheywood July 26, 2023 12:50
@Robadob
Member Author

Robadob commented Jul 26, 2023

Happy for this to be tested on Bede and merged whilst I'm on leave. Functionality should be complete; we may just want to test on Bede and refine how we wish to test it (e.g. make it ctest exclusive and include an error handling test).

@Robadob

This comment was marked as resolved.

@Robadob

This comment was marked as resolved.

@Robadob Robadob marked this pull request as ready for review July 26, 2023 15:09
@ptheywood
Member

I'll review this and test it on Bede while you're on leave, and try to figure out a decent way to test it (and maybe move MPI_Finalize to cleanup or similar, though again that would mean it can only be tested once).

@Robadob
Member Author

Robadob commented Jul 27, 2023

As discussed with @ptheywood (on Slack), I will move MPI_Finalize() to cleanup() and replace it with MPI_Barrier() (to ensure synchronisation before all workers leave the call to CUDAEnsemble::simulate()).

This will require adjustments to the documentation and tests.
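A simplified sketch of that shape, assuming the existing cleanup() teardown path (function names here are hypothetical; the real code will differ in detail):

    #include <mpi.h>

    // At the end of CUDAEnsemble::simulate(): synchronise instead of finalising,
    // so simulate() can be called again by the same process (e.g. repeated ensembles).
    void endOfSimulateSync() {
        MPI_Barrier(MPI_COMM_WORLD);
    }

    // In cleanup() (or an equivalent teardown): finalise at most once per process.
    void teardownMPI() {
        int finalised = 0;
        MPI_Finalized(&finalised);
        if (!finalised)
            MPI_Finalize();
    }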

mondus
mondus previously requested changes Jul 28, 2023
Member

@mondus mondus left a comment


Need to add MPI to the requirements list in the README if the option is enabled.

@ptheywood
Member

Also need to consider how this will behave with telemetry: a flag indicating MPI use, the number of ranks(?), how to assemble the list of devices from each node, etc.

(this is a note mostly for me when I review this in the near future)

@Robadob
Member Author

Robadob commented Jul 28, 2023

I can throw in a telemetry envelope at the final barrier if desired, so that rank 0 receives all of the GPU names.
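A rough sketch of that kind of exchange, using a hypothetical gatherGPUNames() helper rather than the actual implementation:

    #include <mpi.h>
    #include <string>

    // Each rank sends its local GPU name string to rank 0, which concatenates them
    // for the telemetry payload; ranks with no devices can send an empty string.
    std::string gatherGPUNames(const std::string &local_names, int rank, int world_size) {
        if (rank != 0) {
            MPI_Send(local_names.c_str(), static_cast<int>(local_names.size()),
                     MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            return "";
        }
        std::string all_names = local_names;
        for (int src = 1; src < world_size; ++src) {
            MPI_Status status;
            MPI_Probe(src, 0, MPI_COMM_WORLD, &status);
            int len = 0;
            MPI_Get_count(&status, MPI_CHAR, &len);
            std::string buffer(len, '\0');
            MPI_Recv(&buffer[0], len, MPI_CHAR, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (!buffer.empty())
                all_names += (all_names.empty() ? "" : ", ") + buffer;
        }
        return all_names;
    }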

@Robadob
Member Author

Robadob commented Jul 28, 2023

I've now added MPI to the README requirements and ensured all tests pass both with local MPI and without MPI.

Robadob and others added 25 commits December 15, 2023 11:42
…king further reported errors.

Current changes *should* now report late errors on rank 0's local runners, but more code is needed to handle non-rank-0 late errors.
Fixed 1 lint issue; another requires some restructuring (to reduce the LoC in the main ensemble fn).
Moves MPI commands into a separate file.

Still need to lint and split the .h into .h + .cpp.

May also move the local error processing into a util fn.
Fix a race condition, and the logger always being killed on error.

Tests now consistently pass with mpirun -n 2
A single simple Python test case (it calls no MPI commands externally to make it more precise)
Co-authored-by: Peter Heywood <p.heywood@sheffield.ac.uk>
…/Size (they aren't getters).

Also marks them as static methods, as they do not do anything to the instance.
…rations

MPI ensembles can use multiple MPI ranks per node, evenly(ish) distributing GPUs across the ranks within each shared-memory system.
If more MPI ranks are used on a node than there are GPUs, the additional ranks do nothing and a warning is reported.

I.e. any number of MPI ranks can be launched, but only the sensible number will be used.

If the user specifies device indices, those devices will be load balanced; otherwise all visible devices within the node will be balanced.

Only one rank per node sends the device string back for telemetry; the others send back an empty string (as the assembleGPUsString method expects a message from every rank in the world).

If no valid CUDA devices are provided, an exception is raised.

Device allocation is implemented in a static method so it can be tested programmatically, without launching the test N times with different MPI configurations.
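A simplified sketch of that allocation, using a shared-memory communicator split (standard MPI calls, but a hypothetical function name; the PR's actual static method will differ):

    #include <set>
    #include <mpi.h>
    #include <cuda_runtime.h>

    // Split MPI_COMM_WORLD per shared-memory node, then spread the node's visible
    // devices across its local ranks. An empty set means this rank is surplus on
    // its node and should sit idle (with a warning).
    std::set<int> allocateDevicesForThisRank() {
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
        int local_rank = 0, local_size = 1;
        MPI_Comm_rank(node_comm, &local_rank);
        MPI_Comm_size(node_comm, &local_size);
        MPI_Comm_free(&node_comm);

        int device_count = 0;
        cudaGetDeviceCount(&device_count);

        std::set<int> devices;
        for (int d = local_rank; d < device_count; d += local_size)
            devices.insert(d);  // e.g. 4 GPUs, 2 local ranks -> {0, 2} and {1, 3}
        return devices;
    }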
@ptheywood ptheywood merged commit 75c6a5b into master Dec 16, 2023
24 checks passed
@ptheywood ptheywood deleted the distributed_ensemble branch December 16, 2023 12:06
Successfully merging this pull request may close these issues.

  • MPI: CMake 3.18+ support
  • Distributed CUDAEnsemble