Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: K8s allow proxying through Gateway #9469

Merged
merged 35 commits into from
Jun 18, 2024

Conversation

NicholasBlaskey
Copy link
Contributor

@NicholasBlaskey NicholasBlaskey commented Jun 4, 2024

Ticket

https://hpe-aiatscale.atlassian.net/browse/RM-137

Description

Add support for reaching K8s tasks in additional k8s clusters.

Test Plan

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@cla-bot cla-bot bot added the cla-signed label Jun 4, 2024
Copy link

netlify bot commented Jun 4, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 3a99b0c
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/6671b8c0bd17b90007caf3d9

Copy link

codecov bot commented Jun 4, 2024

Codecov Report

Attention: Patch coverage is 54.62795% with 250 lines in your changes missing coverage. Please review.

Project coverage is 49.33%. Comparing base (fec31a1) to head (131d0a6).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9469      +/-   ##
==========================================
+ Coverage   49.31%   49.33%   +0.01%     
==========================================
  Files        1241     1243       +2     
  Lines      161338   161827     +489     
  Branches     2868     2868              
==========================================
+ Hits        79566    79831     +265     
- Misses      81600    81824     +224     
  Partials      172      172              
Flag Coverage Δ
backend 44.02% <54.82%> (+0.10%) ⬆️
harness 63.80% <0.00%> (-0.02%) ⬇️
web 44.91% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
master/internal/rm/kubernetesrm/gateway_spec.go 100.00% <100.00%> (ø)
master/internal/task/allocation.go 76.42% <100.00%> (ø)
master/pkg/check/check_numbers.go 28.00% <100.00%> (+11.56%) ⬆️
master/pkg/check/check_string.go 78.94% <100.00%> (+78.94%) ⬆️
...nal/rm/kubernetesrm/kubernetes_resource_manager.go 30.81% <0.00%> (-0.11%) ⬇️
harness/determined/exec/prep_container.py 17.30% <0.00%> (-0.17%) ⬇️
master/internal/rm/kubernetesrm/spec.go 82.02% <97.18%> (+2.72%) ⬆️
master/internal/config/resource_manager_config.go 68.42% <84.21%> (+3.15%) ⬆️
master/internal/rm/kubernetesrm/request_queue.go 82.40% <50.00%> (-7.50%) ⬇️
master/internal/rm/kubernetesrm/gateway_service.go 66.66% <66.66%> (ø)
... and 3 more

... and 4 files with indirect coverage changes

@determined-ci determined-ci requested a review from a team June 10, 2024 18:13
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Jun 10, 2024
@NicholasBlaskey NicholasBlaskey force-pushed the notebook_proxy_feature_branch_pods_to_jobs branch from b9c204f to f43e47d Compare June 12, 2024 14:53
@hamidzr hamidzr changed the title Notebook proxy feature branch pods to jobs feat: Notebook proxy feature branch pods to jobs Jun 12, 2024
.circleci/real_config.yml Outdated Show resolved Hide resolved
provisioner provided by Project Contour.

On a local dev machine, you can use Minikube and Contour as the controller. We provide a script to
simplify the process. This can be found in `tools/k8s/launch-minikube-with-gateway.sh`.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might make sense to list out the steps for deploying the gateway and explaining each step so customers will have an easier time understanding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specifically about this script or in general for deploying a gateway?
There are some shared steps but the rest are controller dependent. We don't add a lot of custom steps on top, do we?
https://projectcontour.io/getting-started/ gives a good description for Contour would we do something differently

const defaultResourcePoolName = "default"
const (
defaultResourcePoolName = "default"
validGWPortRangeStart = 1025
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reasonable?

}

for i, proxyPort := range j.req.ProxyPorts {
sharedName := fmt.Sprintf("%s-%d", j.jobName, i)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what'd be a reasonable way to de-indent this but it's going too far right makes it hard to read on gh

@hamidzr hamidzr marked this pull request as ready for review June 14, 2024 18:19
@hamidzr hamidzr requested review from a team as code owners June 14, 2024 18:19
@hamidzr hamidzr requested review from kkunapuli, jgongd, molinamelendezj and carolinaecalderon and removed request for kkunapuli June 14, 2024 18:19
@NicholasBlaskey NicholasBlaskey changed the title feat: Notebook proxy feature branch pods to jobs feat: K8S allow proxying through Gateway Jun 17, 2024
Copy link
Contributor

@carolinaecalderon carolinaecalderon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- a few clarifying comments, but code looks good!

}{
{"valid-label", false},
{"Valid-Label_123", false},
{"a", false},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we add a case with special characters like "#$%"?

https://projectcontour.io/docs/1.29/guides/gateway-api/

***************
Envoy Gateway
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a minimum version we recommend?

Copy link
Contributor

@hamidzr hamidzr Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for k8s? or this particula gateway implementaiton?
for k8s we do mention 1.27 in the other page.

docs/setup-cluster/k8s/internal-task-gateway.rst Outdated Show resolved Hide resolved
Comment on lines 20 to 23
// I wanted to do this all in patches, but Gateways don't yet support strategic merge patch.
// Its pretty easy to use json patch to add the port, but removing the port is a lot harder
// as I don't think you can remove by value. Instead just serialize reading then submitting
// updates on the gateway.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add a TODO for this? in the rm backlog

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this comment since it is less relevant that we switched the design to use the gateway to decide which ports are free

@tara-det-ai tara-det-ai self-requested a review June 17, 2024 18:49
@NicholasBlaskey NicholasBlaskey changed the title feat: K8S allow proxying through Gateway feat: K8s allow proxying through Gateway Jun 17, 2024
- Kubernetes: Introduced the Internal Task Gateway feature which allows Determined tasks running on
remote K8s clusters to be exposed to the Determined master and proxies. This feature supports
multi-resource manager setups by configuring a Gateway controller in the external K8s cluster.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:orphan:

New Features

  • Kubernetes: The Internal Task Gateway feature enables Determined tasks running on remote Kubernetes clusters to be exposed to the Determined master and proxies. This feature facilitates multi-resource manager setups by configuring a Gateway controller in the external Kubernetes cluster.

Ensure that you have the necessary security measures in place to limit access to the exposed
tasks and secure communication between the external cluster and the main one. Recommended
measures include setting up a firewall, using a VPN, IP white-listing, K8s Network Policies,
or other security measures.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. important::

  Enabling this feature exposes Determined tasks to the outside world. It is crucial to implement appropriate security measures to restrict access to exposed tasks and secure communication between the external cluster and the main cluster. Recommended measures include:
  • Setting up a firewall
  • Using a VPN
  • Implementing IP whitelisting
  • Configuring Kubernetes Network Policies
  • Employing other security measures as needed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for a similar restructured text format, visit this release note: https://docs.determined.ai/latest/release-notes.html#version-0-30-0

********

https://docs.cilium.io/en/stable/network/servicemesh/gateway-api/gateway-api/#gs-gateway-api Based
on Envoy.
Copy link
Member

@tara-det-ai tara-det-ai Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For any https:// URL, use this format

Link text <https://url>_

e.g.,

Cilium Gateway API Support <https://docs.cilium.io/en/stable/network/servicemesh/gateway-api/gateway-api/#gs-gateway-api>_

If you'd like to enable north-south access to Determined proxied tasks in external-to-master
clusters you'd need to set up a gateway as described in the docs :doc:`Internal Task Gateway
<internal-task-gateway>`

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. note::

To enable north-south access to Determined proxied tasks in external-to-master
clusters, set up a gateway as described in the docs :doc:Internal Task Gateway <internal-task-gateway>

firewall, using a VPN, IP white-listing, K8s Network Policies, native cloud solutions, or other
security measures. Some Determined tasks including JupyterLab Notebooks and shells already have
secure transport built-in. Tensorboards do not use TLS currently so a tunneling solution should
be considered.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enabling this feature exposes Determined tasks to the outside world. It is crucial to implement appropriate security measures to restrict access to exposed tasks and secure communication between the external cluster and the main cluster. Recommended measures include:

  • Setting up a firewall
  • Using a VPN
  • Implementing IP whitelisting
  • Configuring Kubernetes Network Policies
  • Employing other security measures as needed

Some tasks including JupyterLab notebooks and shells already have secure transport built-in.
Since TensorBoards do not use TLS, you should consider a tunneling solution.

@NicholasBlaskey NicholasBlaskey force-pushed the notebook_proxy_feature_branch_pods_to_jobs branch from 92f55e7 to d8403ac Compare June 18, 2024 13:29
@@ -3304,10 +3304,12 @@ jobs:
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 && sudo install minikube-linux-amd64 /usr/local/bin/minikube
- run:
name: Start defaultrm minikube
command: minikube start --profile defaultrm
command: |
K8S_VERSION=1.29.5 source tools/k8s/launch-minikube-with-gateway.sh defaultrm

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance we can stamp this version at the top of file in case we need test other version later?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed it up to the job level. I'm thinking we can push it higher up if/when we use it elsewhere?
Also I wasn't sure how those top-level parameters exactly work

- run:
name: Start additionalrm minikube
command: minikube start --profile additionalrm
command: minikube start --profile additionalrm --kubernetes-version 1.29.5

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here so it's easy to update in one place

@hamidzr hamidzr merged commit 3641bfc into main Jun 18, 2024
84 of 98 checks passed
@hamidzr hamidzr deleted the notebook_proxy_feature_branch_pods_to_jobs branch June 18, 2024 17:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants