I want to scale the workflows through Docker Swarm. (I hope this is possible; if not, please tell me how it can be achieved. I know it is not supported directly through TorchServe yet, which is why I'm using Docker to scale the workflow.)
I have a few questions about using TorchServe as a Docker service in swarm mode, as I encountered a few issues.
Problem Statement:
We are using a TorchServe workflow because multiple models are required to complete the use case.
To rule out differences between nodes, I've set the number of workers to 2 on each node, so that memory consumption stays under 16 GB and every node has the same number of workers and the same memory.
While creating a Docker service, the manager node seems to work fine with the TorchServe config below and completes the task in the expected time, but when the manager assigns the task to any of the worker nodes, it takes ~3x longer.
The problem we are facing: while a TorchServe worker is executing on a worker node, it appears to run in intervals, i.e., it doesn't show continuous GPU utilization/processing, it stops printing logs, and the response is delayed. Meanwhile, if another request comes in, it stops executing the current request and starts executing the new one.
I did see something in the logs (unfortunately, I'm unable to share them here): when node m5 was executing and a new request came in, the current request simply stopped (at least that's how it looked in the logs, though no error was thrown) and the new one started. Correct me if I'm wrong, but the old request should keep executing in the background, right?
Now, the question is: does TorchServe support routing requests through Docker Swarm?
If so, what would be the correct configuration to achieve similar results on all the nodes in the swarm, not just the manager?
My Docker Swarm Config:
3 nodes: 1 manager, 2 workers
Manager: 4 × V100 SXM2, 32 GB each; each worker: 4 × V100 SXM2, 16 GB each
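For context, a minimal sketch of how such a service might be created (the image tag, published ports, and generic-resource name are illustrative assumptions, not my exact command; Swarm has no native `--gpus` flag, so GPUs have to be advertised as generic resources via `node-generic-resources` in each node's daemon.json):

```sh
# Hypothetical sketch: create the TorchServe service across the swarm.
# Assumes each node advertises its GPUs as the generic resource "gpu".
docker service create \
  --name torchserve-workflow \
  --replicas 3 \
  --publish 8080:8080 \
  --publish 8081:8081 \
  --generic-resource "gpu=4" \
  pytorch/torchserve:latest-gpu
```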
My project config:
(Please ignore the timeout; I've set it this high because a single inference request takes around 10 minutes, since it processes over 100 images in a batch.)
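Since I can't paste the full file here, a hedged sketch of the kind of config.properties I mean (the values are illustrative, not my actual deployment config):

```properties
# Illustrative config.properties sketch; values are assumptions.
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
default_workers_per_model=2
# Raised well above the 120 s default because one batched request
# (~100 images) takes around 10 minutes.
default_response_timeout=1200
```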
This is not something that we have tried. We do have Kubernetes and KServe support.
I would start with something simpler: just a single simple model served through Docker Swarm, and see whether you still hit these performance issues. (A minimal sketch of that experiment follows below.)
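For example, a minimal sketch of that experiment (the image tag, model archive URL, and sample image are placeholders; note that Swarm's routing mesh may send each request to a different replica, so the model would need to be registered on every replica or baked into the image):

```sh
# Hypothetical minimal test: one stock model through Docker Swarm, then
# compare response times when requests land on manager vs. worker nodes.
docker service create \
  --name ts-single-model \
  --replicas 3 \
  --publish 8080:8080 \
  --publish 8081:8081 \
  pytorch/torchserve:latest-gpu

# Register a stock model via the management API, then time an inference
# (kitten.jpg is a local sample image).
curl -X POST "http://localhost:8081/models?url=https://torchserve.pytorch.org/mar_files/resnet-18.mar&initial_workers=1"
curl -w "total: %{time_total}s\n" -o /dev/null -s \
  -X POST http://localhost:8080/predictions/resnet-18 -T kitten.jpg
```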
Unfortunately, we haven't been actively developing workflows, as we haven't come across specific asks recently. So there might be perf issues with workflows on a single-container deployment too. If this is something your organization is looking for, please send me a message and we can discuss.
That is exactly my plan of action right now: to test it out further across everything possible. I just wanted to see if anyone had tried this and run into issues. I'll let you know if I still see this problem.
Out of curiosity:
Is there any plan to add scaling functionality to workflows in the near future?
Regarding Kubernetes, have you tried it with multiple nodes in a cluster, or with just one?