Introduce new VCE with ActorPool #1969
Conversation
Nice. I ask because I wonder if using actors will eventually allow specific GPUs to be used by specific actors, and prevent multiple actors from keeping memory in GPUs they are not using. This would probably solve the
I'm unsure. If specific actors could be mapped to specific GPUs (I guess here you are thinking of the setting where not all GPUs are the same, especially in terms of VRAM), that would be nice. Maybe something Placement Groups can achieve?
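As a rough sketch (not part of this PR) of what mapping an actor to a dedicated GPU via Ray Placement Groups could look like; the class and variable names below are illustrative, not anything in Flower:

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# Reserve a bundle with one CPU and one GPU for a specific actor.
pg = placement_group([{"CPU": 1, "GPU": 1}], strategy="STRICT_PACK")
ray.get(pg.ready())  # block until the resources have been reserved


@ray.remote(num_cpus=1, num_gpus=1)
class ClientActor:  # illustrative name
    def which_gpu(self):
        # The GPU ids Ray assigned to this actor.
        return ray.get_gpu_ids()


actor = ClientActor.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
).remote()
print(ray.get(actor.which_gpu.remote()))
```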
I personally think this implementation makes a lot of sense. Some comments on the limitations: the second note there might be the better practice, as you've said.
Regarding the limitations:
- I used
- Removing the requirement of
@pedropgusmao, yeah, ideally we keep the ClientProxy fully backwards compatible and under the same API as the standard client proxy dealing with gRPC devices. What I have done is use a "buffer" where results are cached if they don't belong to the
LGTM, merging it now 🚀
This is a draft implementation of the VCE (Virtual Client Engine) making use of ActorPools as a replacement for Ray Tasks.
Context
The reason we want to move away from using Ray Tasks is that something changed after ray `1.11.1` that made the context (here I'm using this term very loosely) no longer be shared between tasks, which quite evidently hurts tasks that make use of GPU resources. As observed by several users (#1942, #1376, #1383), this leads to GPU OOM.

A working solution to the above is using `max_calls=1` in the `@ray.remote` decorator throughout the `RayClientProxy`. However, this comes with big overheads when the Ray Tasks (in our case, executing a method in a Flower client) can be completed in a few seconds. Currently this is the most typical situation in FL: clients do not hold large volumes of data that require several minutes of training. In this scenario, using `max_calls=1` results in several seconds of task initialisation that represent a significant portion of the total FL simulation time. This gets worse the more Tasks are spawned, i.e., when more clients are being simulated.

To mitigate this, the advice has been to downgrade to ray `1.11.1`. However, that version of Ray is directly incompatible with the latest version of Flower because each requires a conflicting `grpcio` version. This can be fixed by building the wheels for Ray `1.11.1` after bumping the version of `grpcio`. You can find this in my Ray fork (just execute the docker command indicated in `python/README-building-wheels.md` and you'll get the wheels). This makes that older version of Ray compatible with the current Flower version and there is no OOM issue (and yes, no need to set `max_calls`).
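For reference, a minimal sketch of the `max_calls=1` workaround mentioned above might look like the following; the function name and body are illustrative stand-ins, not Flower's actual code:

```python
import ray


# Illustrative stand-in for a task that runs one Flower client method; with
# max_calls=1 the worker process exits after a single invocation, so the GPU
# memory it allocated is released (at the cost of re-initialisation overhead).
@ray.remote(num_gpus=0.2, max_calls=1)
def launch_and_fit(client_fn, cid, fit_ins):
    client = client_fn(cid)  # build the virtual client for this cid
    return client.fit(fit_ins)
```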
Using ActorPool
The approach implemented in this PR to solve the above issue instantiates an ActorPool when `start_simulation()` is executed. The pool contains as many actors (see Actor) as can fit in the system given the `client_resources` set by the user (the same behaviour as with the previous implementation using Tasks is preserved). Now, each virtual `ClientProxy` (for which I have drafted a new one I call `RayClientProxyForActorPool`), instead of launching a Task, submits a job to the actor pool. Each job is executed by an idle actor.
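As a rough illustration of this flow (the actor and method names below are placeholders, not the PR's actual classes), an `ActorPool` from `ray.util` can be sized once and reused for all jobs:

```python
import ray
from ray.util import ActorPool


# Placeholder actor: a long-lived worker that stands in for "create the
# Flower client for `cid` and run the requested method on it".
@ray.remote(num_cpus=1, num_gpus=0.2)
class VirtualClientActor:
    def run(self, cid: str):
        return cid, f"result-for-{cid}"


ray.init()
num_actors = 5  # e.g. how many actors fit when each requests num_gpus=0.2
pool = ActorPool([VirtualClientActor.remote() for _ in range(num_actors)])

# Each ClientProxy submits one job; any idle actor picks it up.
for cid in ["0", "1", "2"]:
    pool.submit(lambda actor, c: actor.run.remote(c), cid)

# Drain the pending results.
while pool.has_next():
    print(pool.get_next())
```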
How can I test this?
How does this perform compared to the other two alternatives when we set `num_gpus=0.2` (i.e. 5 clients can run concurrently on the GPU)?
- `ray==1.11.1` --> ~22s
- `ray==2.5.1` (i.e. the code in this PR) --> ~22s
- `max_calls=1` w/ `ray==2.5.1` --> ~110s

The above is tested on a machine with an AMD 5900X and an NVIDIA 3090Ti.
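For context, this is roughly how the `num_gpus=0.2` setting above is requested when launching a simulation; `client_fn` and the number of clients below are placeholders:

```python
import flwr as fl


def client_fn(cid: str):
    ...  # construct and return the virtual client for this cid


fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=100,  # placeholder
    config=fl.server.ServerConfig(num_rounds=3),
    # 0.2 GPUs per client -> up to 5 clients can run concurrently on one GPU
    client_resources={"num_cpus": 1, "num_gpus": 0.2},
)
```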
The difference was even more dramatic when setting `num_gpus=0.5`: 33s for ActorPool vs 197s for Tasks with `max_calls=1`.
As expected, using `max_calls` for these workloads that take just a couple of seconds is harmful in terms of wall-clock time. Given the change in the (low-level?) handling of Ray Tasks post-`1.11.1`, I think the best alternative is to use entities that persist throughout the FL experiment (and therefore don't need to release GPU/CPU resources): Ray Actors.

Limitations
It is fine to submit one job from each `ClientProxy`, but if we also want each proxy to return a result upon method completion (i.e. when calling `fit()`), things get a bit tricky if we want these to be in order. The current implementation gets a result (which might not be the one corresponding to the client). This can be mitigated easily by using `cid` as the identifier for job submission/retrieval. This is the reason why I'm passing the `is_simulation` flag around.
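A minimal sketch of that `cid`-keyed retrieval (names are illustrative, not the PR's actual implementation): results pulled from the pool that belong to another client are cached until that client's proxy asks for them.

```python
from typing import Any, Dict


class ResultBuffer:
    """Cache out-of-order pool results until the right proxy asks for them."""

    def __init__(self, pool) -> None:
        self._pool = pool  # an ActorPool whose jobs return (cid, result) pairs
        self._cache: Dict[str, Any] = {}

    def get_result(self, cid: str) -> Any:
        # Keep draining the pool until the result for `cid` shows up, caching
        # any results that belong to other clients along the way.
        while cid not in self._cache:
            other_cid, result = self._pool.get_next()
            self._cache[other_cid] = result
        return self._cache.pop(cid)
```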
I'm inclined to say that although the current implementation works, it would be better to submit all jobs in one go, using `map`, and then have a single blocking statement until all jobs are completed. Something like this in the `fit_clients()` and `evaluate_clients()` functions in `src/py/flwr/server/server.py`:
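The snippet that originally followed is not reproduced here, so this is only a sketch of the idea, under the assumptions that pool jobs return `(ClientProxy, FitRes)`-style pairs and that the pool is passed in explicitly (neither is the PR's actual signature):

```python
def fit_clients(client_instructions, pool):
    """Run fit() on all selected clients via the actor pool (sketch)."""
    # Submit every (client, ins) pair in one go; `map` hands each job to an
    # idle actor as one becomes available.
    results_iter = pool.map(
        lambda actor, args: actor.run.remote(*args),
        list(client_instructions),
    )
    # Single blocking statement: exhausting the iterator waits for all jobs.
    return list(results_iter)
```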