
Distributed Ensemble (MPI Support) #1090

Merged
merged 58 commits into master from distributed_ensemble on Dec 16, 2023

Conversation

Robadob
Member

@Robadob Robadob commented Jul 14, 2023

The implementation of MPI ensembles in this PR is designed so that each CUDAEnsemble has exclusive access to all of the GPUs available to it (or to those specified with the devices config). Ideally a user will launch 1 MPI worker per node, though 1 worker per GPU per node is also possible.
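For context, a minimal sketch of the intended usage, assuming the existing devices config member and a launch of one MPI worker per node (exact member names and launch flags may differ from the final API):

    // Sketch only; assumes Config().devices and a launch such as
    //   mpirun -n <nodes> --map-by ppr:1:node ./ensemble_example
    #include "flamegpu/flamegpu.h"

    int main(int argc, const char **argv) {
        flamegpu::ModelDescription model("example");
        model.newAgent("agent");  // trivial model body, for illustration only
        flamegpu::RunPlanVector runs(model, 100);
        runs.setSteps(10);
        flamegpu::CUDAEnsemble ensemble(model, argc, argv);
        ensemble.Config().devices = {0, 1};  // optional: restrict each worker to specific GPUs
        ensemble.simulate(runs);             // the 100 runs are divided between the MPI workers
        return 0;
    }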

It would be possible to use MPI shared-memory groups to identify workers on the same node and negotiate the division of GPUs (and/or for some workers to sit idle), but this has not been implemented.

Full notes on the identified edge cases are in the todo list below.


  • Setup CMake
  • Update CUDAEnsemble
    • Handle the 3 possible error configs
    • Do something about programmatic output logs (currently every copy returns the full RunLog vector, with entries it did not handle left empty of data; checking step counter > 0 is a hack, and this presumably also affects failed runs under the normal error modes)
      • Each runner has their own logs in RunLog (this requires an API break, as the RunLog vector does not identify the run index or RunPlan; it was initially assumed that the vector would be parallel to the RunPlanVector used as input). [Agreed with Paul 2023-07-26]
      • Rank 0 gets all the logs, others get empty
      • All runners get all logs
    • Catch and handle local MPI runs inside CUDAEnsemble? (we want users to avoid launching multiple MPI runners that have access to the same GPU)
    • Should we expose world_size/rank to HostAPI? (I don't think it's necessary)
  • Design a test case
  • Test on Mavericks/Waimu (single node)
  • Test on Bede (multi node)
  • Document (readme)
  • Document (userguide) Distributed Ensembles (MPI) FLAMEGPU2-docs#156
  • Update telemetry to account for MPI
  • Do something with the ensemble example? (e.g. set CUDAEnsemble::Config().mpi = false;)
  • Do we need to handle a race condition with RTC cache?

Closes #1073
Closes #1114


Edit for visibility (by @ptheywood): the next release notes need to make clear that this includes a breaking change to the return type of CUDAEnsemble::getLogs, from a std::vector to a std::map.
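A hedged sketch of what that break means for calling code, assuming the map is keyed by the run's index within the input RunPlanVector (the exact key type is not confirmed here):

    // Previously: const std::vector<flamegpu::RunLog> &logs = ensemble.getLogs();
    // Now the logs are keyed by run index, so under MPI each rank only holds
    // entries for the runs it actually executed.
    const std::map<unsigned int, flamegpu::RunLog> &logs = ensemble.getLogs();
    for (const auto &[run_index, log] : logs) {
        // process the step/exit log data for the run identified by run_index
    }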

@Robadob Robadob self-assigned this Jul 14, 2023
@Robadob Robadob force-pushed the distributed_ensemble branch 2 times, most recently from 3a1c2ce to 1987d9a Compare July 24, 2023 14:49
@Robadob
Member Author

Robadob commented Jul 24, 2023

I've created a simple test case that will either run on a local node (if the worker count <= device count) or across multiple nodes (I could probably extend this to ensure it's 1 worker per node, but that would require some MPI comms to set up the test).

The issue with MPI testing is that MPI_Init() and MPI_Finalize() can each only be called once per process. Because CUDAEnsemble automatically cleans up and triggers MPI_Finalize(), which waits for all runners to also call it, a second MPI test case cannot be run.

Perhaps this is an argument for Pete's CMake test magic, which as I understand it runs the test suite once per individual test. The alternative would be to add a backdoor that tells the ensemble not to finalize when it detects tests (and add some internal finalize equivalent to ensure sync).

Requires discussion/agreement.

The simplest option would be to provide a CUDAEnsemble config to disable auto finalize, and to expose a finalize wrapper to users.

The only possible use case I can see for a distributed ensemble calling CUDAEnsemble::simulate() multiple times would be a large genetic algorithm. If we wish to support that, it will be affected by this too.
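For reference, the once-per-process constraint is why the standard guards (plain MPI, independent of the FLAMEGPU API; helper names below are hypothetical) look like this sketch:

    #include <mpi.h>

    // MPI_Init() and MPI_Finalize() may each be called at most once per process,
    // so a library that manages MPI on the user's behalf has to guard both calls.
    void ensureMPIInit() {
        int initialised = 0;
        MPI_Initialized(&initialised);
        if (!initialised)
            MPI_Init(nullptr, nullptr);
    }
    void ensureMPIFinalize() {
        int finalised = 0;
        MPI_Finalized(&finalised);
        if (!finalised)
            MPI_Finalize();  // blocks until every rank has reached its own MPI_Finalize()
    }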


Changes for err test

Add this to FLAMEGPU_STEP_FUNCTION(model_step)

    // 'counter' is assumed to be defined elsewhere in the test's step function
    if (FLAMEGPU->getStepCounter() == 1 && counter % 13 == 0) {
        throw flamegpu::exception::VersionMismatch("Counter - %d", counter);
    }

Add this to the actual test body, adjusting the error level through Off, Slow and Fast:

        ensemble.Config().error_level = CUDAEnsemble::EnsembleConfig::Fast;

@Robadob
Member Author

Robadob commented Jul 24, 2023

https://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/

Setting up MPI to run across mav+waimu seems a bit involved; it is probably better to try Bede. I would hope that the fact it works on a single node is evidence that it will work across nodes, though.

@Robadob

This comment was marked as resolved.

@Robadob Robadob force-pushed the distributed_ensemble branch 5 times, most recently from 0ad57e6 to 423af91 Compare July 26, 2023 12:50
@Robadob Robadob requested a review from ptheywood July 26, 2023 12:50
@Robadob
Member Author

Robadob commented Jul 26, 2023

Happy for this to be tested on Bede and merged whilst I'm on leave. Functionality should be complete; we may just want to test on Bede and refine how we wish to test it (e.g. make it ctest exclusive and include an error handling test).

@Robadob

This comment was marked as resolved.

@Robadob

This comment was marked as resolved.

@Robadob Robadob marked this pull request as ready for review July 26, 2023 15:09
@ptheywood
Member

I'll review this and test it on Bede while you're on leave, and try to figure out a decent way to test it (and maybe move MPI_Finalize to cleanup or similar, though again that would mean it can only be tested once).

@Robadob
Member Author

Robadob commented Jul 27, 2023

As discussed with @ptheywood (on Slack), I will move MPI_Finalize() to cleanup() and replace it with MPI_Barrier() (to ensure synchronisation before all workers leave the call to CUDAEnsemble::simulate()).

This will require adjustments to the documentation and tests.
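A simplified sketch of that shape, assuming the existing cleanup() teardown path (function names here are hypothetical; the real code will differ in detail):

    #include <mpi.h>

    // At the end of CUDAEnsemble::simulate(): synchronise instead of finalising,
    // so simulate() can be called again by the same process (e.g. repeated ensembles).
    void endOfSimulateSync() {
        MPI_Barrier(MPI_COMM_WORLD);
    }

    // In cleanup() (or an equivalent teardown): finalise at most once per process.
    void teardownMPI() {
        int finalised = 0;
        MPI_Finalized(&finalised);
        if (!finalised)
            MPI_Finalize();
    }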

mondus
mondus previously requested changes Jul 28, 2023
Member

@mondus mondus left a comment


Need to add MPI to the requirements list in the README if the option is enabled.

@ptheywood
Member

Also need to consider how this will behave with telemetry: a flag indicating MPI use, the number of ranks(?), how to assemble the list of devices from each node, etc.

(this is a note mostly for me when I review this in the near future)

@Robadob
Member Author

Robadob commented Jul 28, 2023

I can throw in a telemetry envelope at the final barrier if desired, so that rank 0 receives all of the GPU names.
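A rough sketch of that kind of exchange, using a hypothetical gatherGPUNames() helper rather than the actual implementation:

    #include <mpi.h>
    #include <string>

    // Each rank sends its local GPU name string to rank 0, which concatenates them
    // for the telemetry payload; ranks with no devices can send an empty string.
    std::string gatherGPUNames(const std::string &local_names, int rank, int world_size) {
        if (rank != 0) {
            MPI_Send(local_names.c_str(), static_cast<int>(local_names.size()),
                     MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            return "";
        }
        std::string all_names = local_names;
        for (int src = 1; src < world_size; ++src) {
            MPI_Status status;
            MPI_Probe(src, 0, MPI_COMM_WORLD, &status);
            int len = 0;
            MPI_Get_count(&status, MPI_CHAR, &len);
            std::string buffer(len, '\0');
            MPI_Recv(&buffer[0], len, MPI_CHAR, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (!buffer.empty())
                all_names += (all_names.empty() ? "" : ", ") + buffer;
        }
        return all_names;
    }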

@Robadob
Member Author

Robadob commented Jul 28, 2023

I've now added MPI to the README requirements and ensured all tests pass both with local MPI and without MPI.

Robadob and others added 25 commits December 15, 2023 11:42
…king further reported errors.

Current changes *should* now report late errors on rank 0's local runners, but more code is needed to handle non-rank-0 late errors.
Fixed 1 lint issue; another requires some restructuring (to reduce the LoC in the main ensemble fn).
Moves MPI commands into a separate file.

Still need to lint and split the .h into .h + .cpp.

May also move the local error processing into a util fn.
Fix a race condition, and the logger always being killed on error.

Tests now consistently pass with mpirun -n 2
A single simple Python test case (it calls no MPI commands externally to make it more precise)
Co-authored-by: Peter Heywood <p.heywood@sheffield.ac.uk>
…/Size (they aren't getters).

Also marks them as static methods, as they do not do anything to the instance.
…rations

MPI ensembles can use multiple MPI ranks per node, evenly(ish) distributing GPUs across the ranks within each shared-memory system.
If more MPI ranks are used on a node than there are GPUs, the additional ranks do nothing and a warning is reported.

I.e. any number of MPI ranks can be launched, but only the sensible number will be used.

If the user specifies device indices, those devices will be load balanced; otherwise all visible devices within the node will be balanced.

Only one rank per node sends the device string back for telemetry; the others send back an empty string (as the assembleGPUsString method expects a message from every rank in the world).

If no valid CUDA devices are provided, an exception is raised.

Device allocation is implemented in a static method so it can be tested programmatically, without launching the test N times with different MPI configurations.
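A simplified sketch of that allocation, using a shared-memory communicator split (standard MPI calls, but a hypothetical function name; the PR's actual static method will differ):

    #include <set>
    #include <mpi.h>
    #include <cuda_runtime.h>

    // Split MPI_COMM_WORLD per shared-memory node, then spread the node's visible
    // devices across its local ranks. An empty set means this rank is surplus on
    // its node and should sit idle (with a warning).
    std::set<int> allocateDevicesForThisRank() {
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
        int local_rank = 0, local_size = 1;
        MPI_Comm_rank(node_comm, &local_rank);
        MPI_Comm_size(node_comm, &local_size);
        MPI_Comm_free(&node_comm);

        int device_count = 0;
        cudaGetDeviceCount(&device_count);

        std::set<int> devices;
        for (int d = local_rank; d < device_count; d += local_size)
            devices.insert(d);  // e.g. 4 GPUs, 2 local ranks -> {0, 2} and {1, 3}
        return devices;
    }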
@ptheywood ptheywood merged commit 75c6a5b into master Dec 16, 2023
24 checks passed
@ptheywood ptheywood deleted the distributed_ensemble branch December 16, 2023 12:06
Successfully merging this pull request may close these issues.

  • MPI: CMake 3.18+ support
  • Distributed CUDAEnsemble