Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: initial k8s rocm support [CM-367] #9794

Merged
merged 4 commits into from
Aug 14, 2024
Merged

chore: initial k8s rocm support [CM-367] #9794

merged 4 commits into from
Aug 14, 2024

Conversation

NicholasBlaskey
Copy link
Contributor

@NicholasBlaskey NicholasBlaskey commented Aug 5, 2024

Ticket

Description

Adds initial experimental support for amd gpus on k8s.

Test Plan

Unit tests cover this change

Also manually verified on amd hardware. Automated e2e tests were deemed not in scope due to hardware availability issues

blocked on determined-ai/environments#275. We just need the environment images to land in the same release as this. Some additional docs will be added in the environments pr detailing what our rocm environment does and does not support.

Checklist

  • Changes have been manually QA'd
  • New features have been approved by the corresponding PM
  • User-facing API changes have the "User-facing API Change" label
  • Release notes have been added as a separate file under docs/release-notes/
    See Release Note for details.
  • Licenses have been included for new code which was copied and/or modified from any external code

@cla-bot cla-bot bot added the cla-signed label Aug 5, 2024
Copy link

netlify bot commented Aug 5, 2024

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit 0a054f1
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/66bd0129a7e6cf0008b1056b
😎 Deploy Preview https://deploy-preview-9794--determined-ui.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Copy link

codecov bot commented Aug 5, 2024

Codecov Report

Attention: Patch coverage is 80.43478% with 9 lines in your changes missing coverage. Please review.

Project coverage is 54.01%. Comparing base (91d0b67) to head (0a054f1).
Report is 4 commits behind head on main.

Files Patch % Lines
master/internal/rm/kubernetesrm/jobs.go 84.84% 5 Missing ⚠️
master/internal/rm/kubernetesrm/spec.go 72.72% 3 Missing ⚠️
master/internal/config/resource_manager_config.go 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9794      +/-   ##
==========================================
- Coverage   54.38%   54.01%   -0.38%     
==========================================
  Files        1261     1261              
  Lines      155770   155795      +25     
  Branches     3540     3539       -1     
==========================================
- Hits        84711    84146     -565     
- Misses      70921    71511     +590     
  Partials      138      138              
Flag Coverage Δ
backend 45.26% <80.43%> (+0.02%) ⬆️
harness 70.60% <ø> (-2.02%) ⬇️
web 53.70% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
master/internal/config/resource_manager_config.go 66.66% <50.00%> (-0.55%) ⬇️
master/internal/rm/kubernetesrm/spec.go 84.64% <72.72%> (+2.31%) ⬆️
master/internal/rm/kubernetesrm/jobs.go 71.71% <84.84%> (+0.26%) ⬆️

... and 13 files with indirect coverage changes

@determined-ci determined-ci requested a review from a team August 5, 2024 15:27
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Aug 5, 2024
@NicholasBlaskey NicholasBlaskey changed the title chore: initial k8s rocm support chore: initial k8s rocm support [CM-367] Aug 5, 2024
@NicholasBlaskey NicholasBlaskey marked this pull request as ready for review August 5, 2024 15:34
@NicholasBlaskey NicholasBlaskey requested review from a team as code owners August 5, 2024 15:35
@@ -403,14 +403,21 @@ resource pool ``max_slots_per_pod``.
``slot_type``
-------------

Resource type used for compute tasks. Defaults to ``cuda``.
Resource type used for compute tasks. Valid options are ``gpu``, ``cuda``, ``cpu``, or ``rocm``.
Defaults to ``cuda``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it default to cuda or gpu? The helm config reference says gpu?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The master config defaults to cuda, helm defaults to gpu

**New Features**

- Kubernetes: Experimental support for AMD ROCM GPUs is now available for Kubernetes. To use set
``slotType=rocm``. See :ref:`helm-config-reference` for more details.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think the docs style guideline prefers visit over see.

Copy link
Contributor

@kkunapuli kkunapuli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I left a couple questions, but nothing blocking.

case device.CPU, device.CUDA:
case device.ROCM:
checkSlotType = errors.Errorf("rocm slot_type is not supported yet on k8s")
case device.CPU, device.CUDA, device.ROCM:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the code detect when the user inputs gpu and switches it to cuda? This makes it seem like gpu isn't actually accepted for slotType?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we switch it here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!!

@NicholasBlaskey NicholasBlaskey merged commit e5d4b7f into main Aug 14, 2024
88 of 100 checks passed
@NicholasBlaskey NicholasBlaskey deleted the k8s_rocm branch August 14, 2024 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants