feat: Deploy GenAI in Helm #8727

Merged: 29 commits, Feb 7, 2024

tayritenour (Contributor) commented Jan 22, 2024

GAS-166

Description

Allows us to deploy GenAI as part of the Determined Helm chart. GenAI is turned off by default; if a GenAI version is provided, the chart creates the deployment and the proxies necessary for it to work.

The user will need to provide a shared drive in one of two ways:

  1. An existing PVC with the ReadWriteMany access mode that they already own on the cluster. They can provide it by name in .Values.sharedPVCName.
  2. A StorageClass that can provision a disk with the ReadWriteMany access mode. On GCP, this can be done with something like Filestore support in their cluster.
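
As a rough illustration of the first option, a pre-existing shared PVC might look like the manifest below. This is only a sketch: the claim name `genai-shared-pvc` is hypothetical, and the storage class and size are simply borrowed from the chart defaults shown in the test plan, on the assumption that the cluster has a StorageClass of that name able to provide ReadWriteMany volumes.

```yaml
# Sketch of a user-owned shared PVC (not part of this PR).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: genai-shared-pvc          # hypothetical name; pass it via sharedPVCName
spec:
  accessModes:
    - ReadWriteMany               # required so multiple pods can mount the volume
  storageClassName: standard-rwx  # assumption: an RWX-capable class exists under this name
  resources:
    requests:
      storage: 100Gi
```

The claim has to exist in the namespace where the chart is installed before GenAI is enabled.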

The user's cluster will also need access to A100 GPUs to run chat and fine-tuning.

Test Plan

Release Party:

  1. Add the following to the helm chart's values.yaml:
## Configure GenAI Deployment
genai:
  ## Version of GenAI to use. If unset, GenAI will not be deployed
  # version: "release"
  
  ## Port for GenAI to use
  port: 9011
  
  ## Secret to pull the GenAI image
  # imagePullSecretName:

  ## GenAI pod memory request
  memRequest: 1Gi

  ## GenAI pod cpu request
  cpuRequest: 100m

  ## GenAI pod memory limit
  # memLimit: 1Gi

  ## GenAI pod cpu limit
  # cpuLimit: 2

  ## PVC Name for the shared file system for GenAI.
  ## Note: Either `sharedPVCName` or `generatedPVC.storageSize` (to
  ## generate a new PVC) is required for GenAI deployment
  # sharedPVCName:

  ## Spec for the generated PVC for GenAI
  ## Note: In order to generate a shared PVC, you will need access to a
  ## StorageClass that can provide a ReadWriteMany volume
  generatedPVC:
    ## Storage class name for the generated PVC
    storageClassName: standard-rwx

    ## Size of the generated PVC
    storageSize: 100Gi
  2. Dry-run the latest helm chart from the helm/charts/determined directory like so: helm template test . --set maxSlotsPerPod=1 --dry-run --debug
  3. Confirm that genai-deployment is not present
  4. Uncomment a version for genai in the values.yaml file
  5. Confirm that the genai-deployment is present
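
Instead of editing values.yaml in place, the same check could probably be done with a small override file. The sketch below assumes a hypothetical file name, `genai-values.yaml`, and reuses the `"release"` version string from the commented-out default above; whether that tag is actually pullable depends on the user's registry access.

```yaml
# genai-values.yaml (hypothetical file name): minimal override to enable GenAI.
genai:
  # Setting a version is what switches the GenAI deployment on.
  version: "release"

  # Optional: point at an existing ReadWriteMany PVC instead of generating one.
  # The name is a placeholder for a claim the user already owns.
  # sharedPVCName: genai-shared-pvc
```

Passing it with `-f genai-values.yaml` to the `helm template` command from the dry-run step should render the chart with GenAI enabled, leaving the chart's own values.yaml untouched.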

Commentary (optional)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

cla-bot (bot) added the cla-signed label Jan 22, 2024

netlify bot commented Jan 22, 2024

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit 402a254
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/65c3fc71a3b7c0000744418e
😎 Deploy Preview https://deploy-preview-8727--determined-ui.netlify.app/

codecov bot commented Jan 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (1daf9d3) 47.72% compared to head (402a254) 53.03%.
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8727      +/-   ##
==========================================
+ Coverage   47.72%   53.03%   +5.31%     
==========================================
  Files        1049      633     -416     
  Lines      167293    72316   -94977     
  Branches     2241        0    -2241     
==========================================
- Hits        79842    38355   -41487     
+ Misses      87293    33961   -53332     
+ Partials      158        0     -158     
Flag Coverage Δ
backend 43.31% <ø> (+0.01%) ⬆️
harness 61.88% <ø> (-2.44%) ⬇️
web ?

Flags with carried forward coverage won't be shown.

see 443 files with indirect coverage changes

@@ -178,7 +178,7 @@ checkpointStorage:
   # storage beyond initial testing as most Kubernetes cluster nodes do not have a shared file
   # system.
   type: shared_fs
-  hostPath: /checkpoints
+  hostPath: /tmp/checkpoints
tayritenour (Contributor, Author) commented Jan 22, 2024

Note: I'm making this change because modern k8s won't let you create a hostPath directly under the root directory (/). This has been required since the switch to containerd as the default runtime: https://kubernetes.io/blog/2022/02/17/dockershim-faq/ and https://cloud.google.com/container-optimized-os/docs/concepts/disks-and-filesystem

This format is also a little more explicit about what's really happening here and why it's not recommended long term.
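
For readers unfamiliar with hostPath volumes, the sketch below shows roughly what mounting a node-local directory looks like in a plain pod spec. It is not the chart's actual template; the pod name, container, and image are hypothetical, and only the `/tmp/checkpoints` path comes from the diff above.

```yaml
# Illustrative pod using a hostPath volume (not taken from the Determined chart).
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-demo                # hypothetical name
spec:
  containers:
    - name: writer
      image: busybox:1.36            # assumption: any small image with a shell works
      command: ["sh", "-c", "touch /checkpoints/probe && sleep 3600"]
      volumeMounts:
        - name: checkpoints
          mountPath: /checkpoints
  volumes:
    - name: checkpoints
      hostPath:
        path: /tmp/checkpoints       # writable location on the node
        type: DirectoryOrCreate      # create the directory on the node if it is missing
```

Because the data lives on whichever node runs the pod, files written this way are not visible from other nodes, which is why the chart comments discourage shared_fs with hostPath beyond initial testing.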

tayritenour changed the title from "Deploy GenAI in Helm" to "feat: Deploy GenAI in Helm" on Jan 22, 2024
tayritenour merged commit 762fcef into main on Feb 7, 2024
73 of 86 checks passed
tayritenour deleted the helm-genai branch on February 7, 2024 23:05
maxrussell pushed a commit that referenced this pull request Mar 21, 2024
2 participants