question to model inference optimization #3134

Open · geraldstanje opened this issue May 4, 2024 · 8 comments
Labels: triaged (Issue has been reviewed and triaged)

Comments
@geraldstanje commented May 4, 2024

📚 The doc issue

There is a typo: "A larger batch size means a higher throughput at the cost of lower latency."
The correct version should be: "A larger batch size means a higher throughput at the cost of higher latency."

I have some more questions about model inference latency optimization.
I'm currently reading:
https://github.com/pytorch/serve/blob/master/docs/performance_guide.md#torchserve-on-cpu-
https://github.com/pytorch/serve/blob/master/docs/performance_guide.md#torchserve-on-gpu
https://github.com/pytorch/serve/blob/master/docs/configuration.md
https://huggingface.co/docs/transformers/en/perf_torch_compile

I'm currently running model inference for a SetFit model (https://huggingface.co/blog/setfit) on an ml.g4dn.xlarge instance on AWS (vCPUs: 4, memory: 16 GiB, memory per vCPU: 4 GiB, physical processor: Intel Xeon family, GPU: 1x NVIDIA T4 Tensor Core, GPU memory: 16 GiB).

One thing that helped was to use torch.compile with mode="reduce-overhead".
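A minimal sketch of that approach, using a stand-in module (the real target would be the SetFit model's transformer body; the module and shapes below are assumptions, not setfit API):

```python
import torch
import torch.nn as nn

# Stand-in for the SetFit model's transformer body; in practice you would
# compile the actual nn.Module you serve. Assumes a CUDA device (the T4 here).
body = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2)).eval().cuda()
compiled = torch.compile(body, mode="reduce-overhead")

x = torch.randn(8, 768, device="cuda")
with torch.inference_mode():
    # The first couple of calls compile the graph (and capture CUDA graphs);
    # later calls reuse the cached artifacts and are much faster.
    for _ in range(3):
        y = compiled(x)
```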

I'm not sure how to set all these parameters to tune for low latency and high throughput:

* `min_worker` - (optional) the minimum number of worker processes. TorchServe will try to maintain this minimum for the specified model. The default value is `1`.
* `max_worker` - (optional) the maximum number of worker processes. TorchServe will make no more than this number of workers for the specified model. The default is the same as the setting for `min_worker`.
I also saw other settings:
* number_of_netty_threads: the number of threads that accept incoming HTTP requests from your client container.
* job_queue_size: the size of a model's job queue, which stores incoming HTTP requests.
* default_workers_per_model: the number of workers that fetch HTTP requests from a model's job queue.
* netty_client_threads: the number of threads of a model's worker used to receive HTTP responses from the worker backend inside TorchServe.

I measured that a single model inference takes about 20 ms. I want a max latency of around 50 ms, so I set max_batch_delay to 30 ms and max_batch_size to 100 (which seems a bit high at the moment).

How should min_worker and max_worker be set - should they be set to the number of CPU cores?
Should I also increase default_workers_per_model?
Also, does BetterTransformer work with SetFit models as well?

I have not used a profiler yet - I'm just looking to understand all these settings first.
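For concreteness, here is a hedged sketch of how these knobs can be applied through TorchServe's management API when registering and scaling a model; the host/port, archive name, and worker counts below are placeholder assumptions, not recommendations:

```python
import requests

MANAGEMENT_API = "http://localhost:8081"   # default TorchServe management port
MAR_NAME = "setfit_model.mar"              # hypothetical archive name

# Register with batching enabled: up to 100 requests per batch, waiting at
# most 30 ms to fill a batch, and 2 workers started at registration time.
requests.post(
    f"{MANAGEMENT_API}/models",
    params={
        "url": MAR_NAME,
        "batch_size": 100,
        "max_batch_delay": 30,     # milliseconds
        "initial_workers": 2,
        "synchronous": "true",
    },
)

# Scale workers up or down later without re-registering the model.
requests.put(
    f"{MANAGEMENT_API}/models/setfit_model",
    params={"min_worker": 2, "max_worker": 4, "synchronous": "true"},
)
```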

Suggest a potential alternative/fix

No response

geraldstanje changed the title from "question to model inference tuning" to "question to model inference optimization" on May 4, 2024
@agunapal (Collaborator) commented May 7, 2024

Hi @geraldstanje, you will have to use the benchmarking tool as shown in this example:
https://github.com/pytorch/serve/tree/master/examples/benchmarking/resnet50
You can refer to the yaml file to see the various options it runs the experiments with.

agunapal self-assigned this on May 7, 2024
agunapal added the triaged (Issue has been reviewed and triaged) label on May 7, 2024
@geraldstanje (Author) commented Jun 13, 2024

@agunapal can you run torch compile in the init function for torchServe any problems with that? e.g. here: https://github.com/pytorch/serve/blob/master/examples/Huggingface_Transformers/Transformer_handler_generalized.py#L34

do you have an example somewhere?

@agunapal (Collaborator)
Hi @geraldstanje, you can download this mar file. Here we use torch.compile with BERT: https://github.com/pytorch/serve/blob/master/benchmarks/models_config/bert_torch_compile_gpu.yaml#L24
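For illustration, a minimal hedged sketch of doing the compile inside a custom handler's initialize() - BaseHandler and its initialize() hook are TorchServe APIs, while the class itself and the loading details are simplified assumptions:

```python
import torch
from ts.torch_handler.base_handler import BaseHandler


class CompiledSeqClassifierHandler(BaseHandler):
    """Hypothetical handler that wraps the loaded model with torch.compile."""

    def initialize(self, context):
        # Let the base handler load the eager model from the .mar contents.
        super().initialize(context)
        self.model.eval()
        # Compile once per worker process. Compilation itself is deferred
        # until the first forward pass (see the warmup discussion below).
        self.model = torch.compile(self.model, mode="reduce-overhead")
```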

@geraldstanje (Author) commented Jun 13, 2024

@agunapal ok - I want to look at what's inside the .mar file - will I need https://github.com/pytorch/serve/blob/master/model-archiver/README.md ?

@agunapal (Collaborator)
You can wget the mar file and then unzip it.
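Equivalently, since a .mar archive is zip-formatted, its contents can be listed and extracted from Python (the filename below is a placeholder):

```python
import zipfile

# A .mar file is a zip archive, so the standard library can inspect it.
with zipfile.ZipFile("bert_torch_compile.mar") as mar:       # placeholder name
    print(mar.namelist())        # handler code, weights, MAR-INF/MANIFEST.json, ...
    mar.extractall("mar_contents")
```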

@geraldstanje (Author) commented Jun 13, 2024

@agunapal
ok that worked - what does the self.model.eval() before the torch.compile in initialize of TransformersSeqClassifierHandler do?

what could be the reason that torch.compile doesn't complete immediately?

it seems torch.compile requires some warmup requests to run (not sure if that's specific to mode=reduce-overhead only) - can you run those in initialize as well? do you see any problems if the entire warmup takes longer than 30 sec?

@agunapal (Collaborator)
Eval may not be needed.

torch.compile's first iteration can take time, so usually you need to send a few (3-4) requests to warm up.

You can also check how we address this with AOT compile. You can find the example under the pt2 examples directory.
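A hedged sketch of such a warmup from the client side (the default inference port is shown, but the model name and payload shape are assumptions that depend on your handler):

```python
import requests

INFERENCE_API = "http://localhost:8080"   # default TorchServe inference port
MODEL_NAME = "setfit_model"               # hypothetical model name

# Pay torch.compile's first-iteration cost before real traffic arrives by
# sending a few representative requests right after the model comes up.
warmup_payload = {"text": "warmup request"}
for _ in range(4):
    requests.post(f"{INFERENCE_API}/predictions/{MODEL_NAME}", json=warmup_payload)
```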

@geraldstanje (Author) commented Jun 14, 2024

@agunapal the problem seems to be lazy execution - I call torch.compile and it seems to stop there; only when I send a predict request does torch.compile actually run ... how can I disable this lazy execution?

Or how can I check whether lazy execution is causing this behavior?
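One way around the laziness is to trigger compilation inside initialize() with a dummy forward pass, so the cost is paid at worker startup instead of on the first real request. A hedged sketch - the input names, shapes, and dtypes are assumptions that must match what your handler actually feeds the model, and a long compile may require raising TorchServe's startup/response timeouts:

```python
import torch
from ts.torch_handler.base_handler import BaseHandler


class CompiledSeqClassifierHandler(BaseHandler):
    """Hypothetical handler that compiles and warms up the model eagerly."""

    def initialize(self, context):
        super().initialize(context)
        self.model = torch.compile(self.model.eval(), mode="reduce-overhead")

        # torch.compile is lazy: the graph is traced and compiled only on the
        # first forward pass. Run a representative dummy batch here so the
        # first client request does not hit the compile latency.
        dummy = {
            "input_ids": torch.ones(1, 128, dtype=torch.long, device=self.device),
            "attention_mask": torch.ones(1, 128, dtype=torch.long, device=self.device),
        }
        with torch.inference_mode():
            for _ in range(3):   # a few passes also warm up CUDA graph capture
                self.model(**dummy)
```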
