
Docker swarm with TorchServe workflow #3206

Open · KD1994 opened this issue Jun 26, 2024 · 4 comments
Labels: triaged (Issue has been reviewed and triaged), workflowx (Issues related to workflow / ensemble models)

Comments


KD1994 commented Jun 26, 2024

I want to scale the workflows through Docker Swarm. (I hope this is possible; if not, please tell me how it can be achieved. I know scaling workflows is not supported by TorchServe directly yet, which is why I'm using Docker to scale them.)
I ran into a few issues while running TorchServe as a Docker service in swarm mode, and I have a few questions about that setup.

Problem Statement:

  • We are using a TorchServe workflow because the use case requires multiple models to complete.
  • To keep the nodes comparable, I've set the number of workers to 2 on each node, so that memory consumption doesn't exceed 16GB and every node has the same number of workers and the same memory budget.
  • When creating the Docker service, the manager node works fine with the TorchServe config below and completes the task in the expected time, but when the manager assigns the task to either of the worker nodes it takes roughly 3x longer.
  • The problem we are facing is that while a TorchServe worker is executing on a swarm worker node, it appears to run in bursts: GPU utilization/processing is not continuous, log output stops, and the response is delayed. If another request arrives in the meantime, it stops executing the current request and starts on the new one.
  • I did see something in the logs (unfortunately, I'm unable to provide the logs here): when m5 was executing and a new request came in, the current request simply stopped (at least that's how it looked in the logs, and no error was thrown) and the new one started. Correct me if I'm wrong, but the old request should keep executing in the background, right?
  • So the question is: does TorchServe support routing requests through Docker Swarm?
  • If so, what would be the correct configuration to achieve similar results on all the nodes in the swarm, not just the manager?

My Docker Swarm Config:

  • 3 nodes: 1 manager, 2 workers
  • Manager: 4 x V100 SXM2 (32GB each); workers: 4 x V100 SXM2 (16GB each)
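
For reference, a stack file for this kind of deployment could look roughly like the sketch below (the image tag, published ports, and GPU reservation values are placeholders rather than my exact setup, and the generic_resources reservation assumes node-generic-resources is advertised in each node's daemon.json):

docker-stack.yaml (illustrative sketch)
version: "3.8"

services:
  torchserve:
    image: pytorch/torchserve:0.10.0-gpu          # placeholder image tag
    ports:
      - "8080:8080"   # inference (published through swarm's ingress routing mesh)
      - "8081:8081"   # management
      - "8082:8082"   # metrics
    deploy:
      replicas: 3                                 # one task per node in a 3-node swarm
      placement:
        max_replicas_per_node: 1
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: "NVIDIA-GPU"                # requires node-generic-resources in daemon.json
                value: 4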

My project config:
(Please ignore the timeout values; they are set this high because a single inference request takes around 10 minutes, since it processes over 100 images in a batch.)

  • There are 5 models
  • model-config.yaml
maxBatchDelay: 10000000
responseTimeout: 10000000
  • workflow.yaml
models:
    min-workers: 1
    max-workers: 2
    max-batch-delay: 10000000
    retry-attempts: 1
    timeout-ms: 3000000

    m1:
      url: model-1.mar

    m2:
      url: model-2.mar

    m3:
      url: model-3.mar

    m4:
      url: model-4.mar

    m5:
      url: model-5.mar
  
dag:
  pre_processing: [m1]
  m1: [m2]
  m2: [m3]
  m3: [m4]
  m4: [m5]
  m5: [post_processing]
  • config.properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082

# management
default_response_timeout=10000000
default_workers_per_model=2

load_models=
model_store=model_store
workflow_store=wf_store

enable_envvars_config=true
job_queue_size=3
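
One more note on distributing these files: every node that can receive a task needs the same config.properties and the same .mar files for the model and workflow stores. A sketch of one way to wire that up with swarm configs and bind mounts (the image tag and the host/container paths below are placeholders, not my exact mounts):

stores-and-config.yaml (illustrative sketch)
version: "3.8"

configs:
  ts_config:
    file: ./config.properties            # the config.properties shown above

services:
  torchserve:
    image: pytorch/torchserve:0.10.0-gpu                    # placeholder image tag
    configs:
      - source: ts_config
        target: /home/model-server/config.properties        # assumed config path read by the image's entrypoint
    volumes:
      # the relative model_store/workflow_store entries in config.properties resolve against
      # the container's working directory (assumed to be /home/model-server in the official image);
      # the host paths must hold the same .mar files on every node (or point at shared storage)
      - /opt/ts/model_store:/home/model-server/model_store
      - /opt/ts/wf_store:/home/model-server/wf_store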

Python Packages:

torch==1.13.1+cu117
torchvision==0.14.1+cu117
torchaudio==0.13.1+cu117
torchserve==0.10.0
torch-model-archiver==0.10.0
torch-workflow-archiver==0.2.12
nvgpu==0.10.0
captum==0.7.0
@agunapal (Collaborator)

Hi @KD1994

This is not something that we have tried. We do have Kubernetes and KServe support.

I would start with something simpler: serve just a single model through Docker Swarm and check whether you still see these performance issues.
Unfortunately, we haven't been actively developing the workflow feature, as we haven't come across specific asks for it recently, so there might be performance issues with workflows even in a single-container deployment. If this is something your organization is looking for, please send me a message and we can discuss.
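
Something along these lines (an untested sketch; the image tag, store path, and placement constraint below are just illustrative) would let you compare a single model pinned to a worker node against the same model on the manager:

single-model-stack.yaml (illustrative sketch)
version: "3.8"

services:
  torchserve-single:
    image: pytorch/torchserve:0.10.0-gpu          # placeholder image tag
    ports:
      - "8080:8080"   # inference
      - "8081:8081"   # management (register the single .mar here once the service is up)
    volumes:
      # assumed default store path in the official image; put a single .mar in it on the target node
      - /opt/ts/model_store:/home/model-server/model-store
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == worker                   # pin the task to a worker node; drop this to test on the manager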

@agunapal added the triaged and workflowx labels Jun 26, 2024

KD1994 commented Jun 27, 2024

Thanks, @agunapal for the quick response.

That is exactly my plan of action right now: to test this out further in every way possible. I just wanted to see if anyone had already tried this and run into issues. I'll let you know if I still see this problem.

Out of curiosity,

  1. Is there any plan to add scaling functionality for workflows in the near future?
  2. Regarding Kubernetes, have you tried it with multiple nodes in a cluster, or only with one?

@agunapal (Collaborator)

Yes, I have. If you are using AWS, set up a cluster using https://github.com/aws-samples/aws-do-eks and then use this to launch TorchServe with a BERT model: aws-solutions-library-samples/guidance-for-machine-learning-inference-on-aws#15


KD1994 commented Jun 28, 2024

Ok, thanks for the info. I will look into this.
