Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Genai helm service fix #8885

Merged
merged 4 commits into from
Feb 27, 2024
Merged

fix: Genai helm service fix #8885

merged 4 commits into from
Feb 27, 2024

Conversation

tayritenour
Copy link
Contributor

@tayritenour tayritenour commented Feb 24, 2024

Description

We need to make sure that genai can be populated with the service names from k8s to generalize depending on how the network is set up. This allows us to send a particular host service name to the backend server directly.

We also would like to make sure to not have to make assumptions about the kubernetes cluster having load balancers set up. The genai services have been changed over to use ClusterIP

Test Plan

Validate that when checking the helm deploy that it does not have a LoadBalancer for anything other than the determined master

  • Add the following to your values.yaml:
## Configure GenAI Deployment
genai:
  ## Version of GenAI to use. If unset, GenAI will not be deployed
  version: "0.1.1"
  
  ## Port for GenAI to backend use
  port: 9011
  
  ## Port for GenAI message queue
  messageQueuePort: 9013
  
  ## Secret to pull the GenAI image
  # imagePullSecretName:

  ## GenAI pod memory request
  memRequest: 1Gi

  ## GenAI pod cpu request
  cpuRequest: 100m

  ## GenAI pod memory limit
  # memLimit: 1Gi

  ## GenAI pod cpu limit
  # cpuLimit: 2

  ## PVC Name for the shared file system for GenAI.
  ## Note: Either `sharedPVCName` or `generatedPVC.storageSize` (to
  ## generate a new PVC) is required for GenAI deployment
  # sharedPVCName:

  ## Spec for the generated PVC for GenAI
  ## Note: In order to generate a shared PVC, you will need access to a
  ## StorageClass that can provide a ReadWriteMany volume
  generatedPVC:
    ## Storage class name for the generated PVC
    # storageClassName: standard-rwx
    storageClassName: standard

    ## Size of the generated PVC
    storageSize: 10Gi

  ## Extra Resource Pool Metadata is hardcoded information about the
  ## GPUs available to the resource pools. This information
  ## is not provided in k8s so we provide it directly.
  ## Note: All resource pools defined here need to also be reflected in
  ## the .Values.resourcePools.
  extraResourcePoolMetadata:
    a100:
      gpu_type: A100
      max_agents: 3
    # v100:
    #   gpu_type: V100
    #   max_agents: 2
  • Deploy: helm install --generate-name . --set maxSlotsPerPod=1 --debug --dry-run

  • It should not contain LoadBalancer in the other services for genai

Commentary (optional)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

@cla-bot cla-bot bot added the cla-signed label Feb 24, 2024
Copy link

netlify bot commented Feb 24, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 49f2862
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/65dd33fd41b1a80008684e61

Copy link

codecov bot commented Feb 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 47.52%. Comparing base (a8ac657) to head (49f2862).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8885      +/-   ##
==========================================
- Coverage   47.53%   47.52%   -0.01%     
==========================================
  Files        1066     1066              
  Lines      170248   170248              
  Branches     2235     2235              
==========================================
- Hits        80919    80915       -4     
- Misses      89171    89175       +4     
  Partials      158      158              
Flag Coverage Δ
backend 43.34% <ø> (-0.01%) ⬇️
harness 63.77% <ø> (-0.01%) ⬇️
web 42.50% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

see 4 files with indirect coverage changes

release: {{ .Release.Name }}
spec:
ports:
- port: {{ required "A valid Values.genai.messageQueuePort entry required!" .Values.genai.messageQueuePort }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a new field, right? maybe values.yaml needs an update with a commented out example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, we actually removed the examples from the values.yaml when I was talking with @NicholasBlaskey a bit ago: https://github.com/determined-ai/determined/pull/8821/files

The idea being that the values would be confusing to new users and that genai isn't fully released yet.

Instead we have documentation here: https://hpe-ai-solutions-documentation.netlify.app/products/gen-ai/latest/admin/set-up/install-kubernetes/ which I will update to include that parameter

@tayritenour tayritenour merged commit ca96da1 into main Feb 27, 2024
74 of 88 checks passed
@tayritenour tayritenour deleted the genai-helm-service-fix branch February 27, 2024 17:50
maxrussell pushed a commit that referenced this pull request Mar 21, 2024
* wip trying new service for redis queue

* update the genai helm chart integration to enforce services for all hosts

* revert the changes from testing

* use clusterip instead
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants