TensorRT-LLM Engine integration #3228

Merged 11 commits into master on Jul 9, 2024

Conversation

@agunapal (Collaborator) commented on Jul 3, 2024

Description

This PR shows how to integrate the TensorRT-LLM Engine with TorchServe.

  • The example is shown working with Llama; a rough handler sketch follows this list
  • The example also uses TorchServe's async backend workers
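
A minimal sketch of what such a handler can look like, assuming a prebuilt engine and tokenizer live inside the model directory; the class name, paths, and generate() arguments are illustrative assumptions, not this PR's exact code:

```
# Hypothetical sketch, not the PR's exact handler: a TorchServe custom handler that
# wraps a prebuilt TensorRT-LLM engine. Paths, class name, and generate() arguments
# are illustrative assumptions.
import torch
from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer
from ts.torch_handler.base_handler import BaseHandler


class TRTLLMHandler(BaseHandler):
    def initialize(self, ctx):
        self.context = ctx
        model_dir = ctx.system_properties.get("model_dir")
        # Assumed layout: tokenizer files plus a compiled TRT-LLM engine under model_dir.
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.runner = ModelRunner.from_dir(engine_dir=f"{model_dir}/trt_engine")
        self.initialized = True

    def inference(self, prompt: str):
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids[0].int()
        # streaming=True with return_dict=True yields per-step dicts containing
        # "output_ids" and "sequence_lengths", which postprocessing consumes
        # (see the streaming loop discussed in the review threads below).
        return self.runner.generate(
            batch_input_ids=[input_ids],
            max_new_tokens=100,
            end_id=self.tokenizer.eos_token_id,
            pad_id=self.tokenizer.eos_token_id,
            streaming=True,
            return_dict=True,
        )
```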

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A
    Logs for Test A

  • Test B
    Logs for Test B

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@agunapal added this to the v0.12.0 milestone Jul 3, 2024
@agunapal marked this pull request as ready for review July 3, 2024 23:48
@agunapal requested review from mreso and lxning July 3, 2024 23:48
@mreso (Collaborator) left a comment

Solid overall; I left some comments. The postprocessing in particular needs some rework, I think, since the client side does not know how many beams to expect, so it will be hard to make sense of the returned streaming chunks.

examples/large_models/trt_llm/llama/README.md (two outdated comment threads, resolved)
maxBatchDelay: 100
responseTimeout: 1200
deviceType: "gpu"
asyncCommunication: true
Collaborator:

Can TRT-LLM handle multi-GPU inference easily? If so, we should demonstrate that it integrates easily with

parallelType: "custom"
parallelLevel: 4

Collaborator Author (@agunapal):

I tried standalone multi-GPU inference, but it didn't work for me. Although the model loaded on the 4 GPUs, inference hung.

streaming=True,
return_dict=True,
)
torch.cuda.synchronize()
Collaborator:

What is the synchronization for?

Collaborator Author (@agunapal):

It was copy-pasted from the example code, but it seems it's not needed. Works fine without it.

for beam in range(num_beams):
output_begin = input_lengths[batch_idx]
output_end = sequence_lengths[batch_idx][beam]
outputs = output_ids[batch_idx][beam][
Collaborator:

Are we sure we're doing the right thing here? output_begin is never used; output_end - 1 is used instead.

Collaborator Author (@agunapal):

Good catch, we don't need output_begin. The example code uses this.

output_end - 1 : output_end
].tolist()
output_text = self.tokenizer.decode(outputs)
send_intermediate_predict_response(
Collaborator:

If we send N=num_beams intermediate results back without any order information, can we assign each partial response to its beam sequence? Would it be better to send one response per batch entry (which will be 1) with updates for all beams included as a list?

Collaborator Author (@agunapal):

It seems this is not needed here, since num_beams > 1 is not working for Llama. Removed the inner for loop.
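
For reference, a minimal sketch of the single-beam streaming postprocessing once the inner beam loop is removed. The helper name is hypothetical, and the send_intermediate_predict_response import path and signature follow TorchServe's streaming examples, so treat the details as assumptions rather than this PR's exact code:

```
# Illustrative sketch: decode only the newest token of the single beam at each
# streaming step and send it back to the client as an intermediate response.
from ts.protocol.otf_message_handler import send_intermediate_predict_response


def stream_single_beam(step_outputs, tokenizer, context, batch_idx=0, beam=0):
    for step in step_outputs:  # generator of per-step dicts from generate(streaming=True)
        output_ids = step["output_ids"]
        sequence_lengths = step["sequence_lengths"]
        output_end = sequence_lengths[batch_idx][beam]
        # Slice just the most recent token: [output_end - 1, output_end)
        new_token = output_ids[batch_idx][beam][output_end - 1 : output_end].tolist()
        text = tokenizer.decode(new_token)
        send_intermediate_predict_response(
            [text], context.request_ids, "Intermediate response", 200, context
        )
```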

@agunapal requested a review from mreso July 8, 2024 23:29
@mreso (Collaborator) left a comment

Let's figure out the CUDA 12 situation, then this LGTM.

@@ -10,6 +10,7 @@ This will downgrade the versions of PyTorch & Triton but this doesn't cause any

```
pip install tensorrt_llm==0.10.0 --extra-index-url https://pypi.nvidia.com
pip install tensorrt-cu12==10.1.0
Collaborator:

Is this CUDA 12 exclusive? In that case we should inform people to install torch with CUDA 12 as well.

Collaborator Author (@agunapal):

It doesn't say so explicitly, but all their docs point to CUDA 12.x. Let me mention that it's tested with CUDA 12.1.
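
An illustrative sanity check (not part of the PR) for confirming that the installed PyTorch build targets CUDA 12.x before installing the TensorRT-LLM wheels:

```
# Illustrative sanity check, not part of the PR: confirm the installed PyTorch build
# targets CUDA 12.x before installing tensorrt_llm 0.10.0 / tensorrt-cu12.
import torch

cuda_version = torch.version.cuda  # e.g. "12.1" for a cu121 wheel
assert cuda_version is not None and cuda_version.startswith("12."), (
    f"Expected a CUDA 12.x build of PyTorch, found {cuda_version!r}; "
    "install a CUDA 12.x torch build first."
)
print(f"PyTorch CUDA build: {cuda_version}")
```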

@agunapal enabled auto-merge July 9, 2024 17:24
@agunapal added this pull request to the merge queue Jul 9, 2024
Merged via the queue into master with commit a1c8eb2 Jul 9, 2024
9 of 12 checks passed