Manual model warmup to resolve AOT model warmup performance degradation #126

Merged
7 commits merged into AI-Hypercomputer:main on Aug 14, 2024

Conversation

vivianrwu (Contributor)

Use manual model warmup instead of the AOT-implemented model warmup, since with AOT we observe performance degradation at higher batch sizes of the MaxText configuration, as reported in #92:

  1. OOM at higher batch sizes (after model warmup, during an active request)
  2. Exponentially slower detokenizing generate step time at higher batch sizes

We have verified that with manual warmup the detokenizing generate step time matches JetStream's optimal behavior for all batch sizes:

curl --request POST --header "Content-type: application/json" \
  -s localhost:8000/generate --data '{
    "prompt": "What are the top 5 programming languages",
    "max_tokens": 200
}'
{
    "response": " for data science in 2023?\n\n1. Python\n2. R\n3. SQL\n4. Java\n5. Scala\n\n**Note:** The order is based on popularity and demand in the data science industry in 2023."
}
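
For context, here is a minimal sketch of what a manual warmup loop can look like. This is not the code in this PR; the engine method names and signatures (load_params, init_decode_state, prefill, insert, generate) are assumptions modeled on the JetStream engine interface.

import jax.numpy as jnp


def manual_warmup(engine, prefill_lengths=(64, 256, 1024)):
    """Run dummy prefill and generate steps so XLA compiles the model eagerly.

    Unlike the AOT path, no standalone compiled executables are retained;
    the engine's normal JIT cache keeps the traced programs for later requests.
    """
    params = engine.load_params()
    decode_state = engine.init_decode_state()

    for length in prefill_lengths:
        # Dummy tokens at each padded prefill bucket trigger compilation of
        # the prefill program for that input shape.
        dummy_tokens = jnp.zeros((length,), dtype=jnp.int32)
        prefix, _ = engine.prefill(
            params=params, padded_tokens=dummy_tokens, true_length=length)
        decode_state = engine.insert(prefix, decode_state, slot=0)

    # A single generate step compiles the decode program at the configured
    # batch size.
    decode_state, _ = engine.generate(params, decode_state)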

@JoeZijunZhou (Collaborator) left a comment


Do we need to update unit tests?

QQ on the description:

  1. We set the max per-device batch size (pdbs) when we start the server; this value should be within the memory cap (based on a calculation with the devices used), so it would not OOM, right?
  2. Why would a higher actual batch size have very slow detokenization? Could you share some investigation or profiles?

jetstream/core/orchestrator.py (review thread resolved)
vivianrwu (Contributor, Author) commented Aug 7, 2024

Do we need to update unit tests?

Unit tests do not need to be updated because the new warmup path is gated on the engine.warm condition; a minimal sketch of that gating is below.
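
(Illustrative only: the loop below and the manual_warmup helper are hypothetical, based on the engine.warm flag mentioned above; the actual server startup code in this PR may differ.)

# Warmup is skipped for engines that are already warm, so unit tests that
# run against warm engines never hit the new manual warmup path.
for engine in engines:
    if not engine.warm:
        manual_warmup(engine)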

QQ on the description:

  1. We set the max per-device batch size (pdbs) when we start the server; this value should be within the memory cap (based on a calculation with the devices used), so it would not OOM, right?

Yes; I think storing the compiled graphs from AOT as executables and then executing them is what takes up the extra memory. We observe the OOM at the generate request.
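
As a generic illustration of that point (plain JAX, not JetStream code): the compiled executable returned by lower(...).compile() is held in the Python process for as long as it is referenced, on top of whatever is written to the compilation cache directory.

import jax
import jax.numpy as jnp


def decode_step(state, tokens):
    # Stand-in for a real generate step.
    return state + jnp.sum(tokens, axis=-1, keepdims=True).astype(state.dtype)


state = jnp.zeros((8, 1), dtype=jnp.float32)
tokens = jnp.zeros((8, 1024), dtype=jnp.int32)

lowered = jax.jit(decode_step).lower(state, tokens)
compiled = lowered.compile()   # AOT: the executable lives as a Python object
out = compiled(state, tokens)  # resident for as long as `compiled` is referenced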

  2. Why would a higher actual batch size have very slow detokenization? Could you share some investigation or profiles?

Yes, you can refer to #92 for some of the investigation. I also shared the doc internally.

@FanhaiLu1 (Collaborator)

We have verified that with manual warmup the detokenizing generate step time matches JetStream's optimal behavior for all batch sizes.

Did you figure out the root cause of the performance issue and OOM for AOT?

vivianrwu (Contributor, Author)

We have verified that with manual warmup the detokenizing generate step time matches JetStream's optimal behavior for all batch sizes.

Did you figure out the root cause of the performance issue and OOM for AOT?

We attempted an RCA; the root cause of the OOM is potentially the added space needed to store the compiled graphs as executables, alongside saving the cache in the compilation cache directory. The root cause of the performance issue has not been determined; it could be suboptimal AOT executables. I can share the investigation offline.

@JoeZijunZhou JoeZijunZhou merged commit 59538fc into AI-Hypercomputer:main Aug 14, 2024
3 checks passed