
fix: show non det pods in other namespaces than 'default' [RM-141] #9268

Merged
merged 6 commits into from
Apr 30, 2024

Conversation

Contributor

@NicholasBlaskey NicholasBlaskey commented Apr 30, 2024

Ticket

Description

Fix an issue where we only looked at the default namespace when listing pods.

Also cache the get agents response. This probably should have been done anyway, but listing pods across all namespaces makes it even more important.
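For reference, a minimal sketch of what listing pods across all namespaces looks like with client-go; this is illustrative only, not the PR's implementation, and the helper name is hypothetical:

package sketch

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// listAllPods lists pods in every namespace the credentials can see, rather
// than only the "default" namespace, by passing metav1.NamespaceAll ("").
func listAllPods(ctx context.Context, clientset kubernetes.Interface) error {
    pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
    if err != nil {
        return err
    }
    for _, p := range pods.Items {
        fmt.Printf("%s/%s\n", p.Namespace, p.Name)
    }
    return nil
}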

Test Plan

unit test + manual

Create a namespace with a non-default name that the Determined master has access to, and define a pod spec like the following:

apiVersion: v1
kind: Pod
metadata:
  name: sleep-pod
spec:
  containers:
  - name: sleep
    image: busybox
    command: ["sleep", "100"]
    resources:
      requests:
        cpu: "2000m" # Or request gpus if this is on a gpu cluster

Then apply the file with kubectl apply -n <namespace> -f file.yaml (creating the namespace first with kubectl create namespace <namespace> if it does not already exist).

det slot ls should show

 Agent ID   | Resource Pool   |   Slot ID | Enabled   | Draining   | Allocation ID   | Task Name                      | Type   | Device
------------+-----------------+-----------+-----------+------------+-----------------+--------------------------------+--------+----------
 minikube   | a, default      |         0 | True      | False      | OCCUPIED        | Non-Determined Task: sleep-pod | cpu    |
 minikube   | a, default      |         1 | True      | False      | OCCUPIED        | Non-Determined Task: sleep-pod | cpu    |

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@cla-bot cla-bot bot added the cla-signed label Apr 30, 2024
@determined-ci determined-ci requested a review from a team April 30, 2024 15:01
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Apr 30, 2024

netlify bot commented Apr 30, 2024

Deploy Preview for determined-ui canceled.

🔨 Latest commit: d3ad16a
🔍 Latest deploy log: https://app.netlify.com/sites/determined-ui/deploys/66313c656dfacf000892c73c


codecov bot commented Apr 30, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 3 lines in your changes missing coverage. Please review.

Project coverage is 44.60%. Comparing base (55b7fd9) to head (d3ad16a).
Report is 3 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9268   +/-   ##
=======================================
  Coverage   44.59%   44.60%           
=======================================
  Files        1273     1273           
  Lines      155832   155839    +7     
  Branches     2439     2439           
=======================================
+ Hits        69501    69508    +7     
  Misses      86092    86092           
  Partials      239      239           
Flag Coverage Δ
backend 41.75% <85.71%> (+<0.01%) ⬆️
harness 64.05% <ø> (ø)
web 35.04% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files Coverage Δ
master/internal/rm/kubernetesrm/pods.go 21.49% <85.71%> (+0.26%) ⬆️

... and 3 files with indirect coverage changes

@NicholasBlaskey NicholasBlaskey marked this pull request as ready for review April 30, 2024 15:03
@NicholasBlaskey NicholasBlaskey requested a review from a team as a code owner April 30, 2024 15:03
Contributor

@kkunapuli kkunapuli left a comment


Nice work! I like how clear the test plan is; it looks like someone with minimal context could run it with confidence.

I'm curious about the end-to-end performance here. How often does the UI update "used slots" / how frequently would we expect this API to be called? Is there a timeout in place for APIs? (I'm assuming yes?)

The case I'm worried about is if the call to listPodsInAllNamespaces hangs or takes a long time (an unresponsive k8s cluster or a massive number of resources). Could a second API call be issued before the first one finishes? Can we smooth that out by adjusting the cache timeout?

@@ -102,9 +102,13 @@ type pods struct {
podInterfaces map[string]typedV1.PodInterface
configMapInterfaces map[string]typedV1.ConfigMapInterface

// TODO(RM-236) make one cache and make this code more straightforward.
Contributor


Nice callout. Out of curiosity, why is it too much work to include in this PR / what makes it non-trivial? To be clear, I'm not suggesting this PR should consolidate caches. I'm trying to understand the technical complexities here.

Contributor Author


The code is just really confusing. The summarize-by-resource-pool and get-agents paths look very similar but sometimes act differently. I think the code could benefit from a significant refactor that I don't want to do here.

Contributor


Makes sense. For what it's worth, since the cache is just a single data structure, I kinda like that they're separate ... but I can understand how it would benefit from a refactor either way.

Contributor Author


To me it's more like we should have one way to summarize and get the cluster state, and that cluster-state summarization should be behind a cache.

Then, for get agents, which needs the data in a certain shape, we should process the cluster state from the cache and return the needed response.

For get resource pools we should process that same cluster state and return that response.

We should decouple the cluster state from the APIs that consume it.
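A rough sketch of that shape, with hypothetical names (clusterState, stateCache, and the fetch hook are illustrative, not code from this PR): one cached snapshot of cluster state, refreshed when stale, that each API then projects into its own response.

package sketch

import (
    "sync"
    "time"

    corev1 "k8s.io/api/core/v1"
)

// clusterState is one snapshot of what we last learned from Kubernetes.
type clusterState struct {
    podsByNamespace map[string][]corev1.Pod
    fetchedAt       time.Time
}

// stateCache hides the snapshot behind a TTL so concurrent callers share a
// single Kubernetes round trip instead of each issuing their own.
type stateCache struct {
    mu    sync.Mutex
    ttl   time.Duration
    state clusterState
    fetch func() (clusterState, error) // e.g. would wrap listPodsInAllNamespaces
}

// snapshot returns the cached cluster state, refreshing it when it is stale.
// Callers that arrive during a refresh block on the mutex rather than hitting
// Kubernetes again.
func (c *stateCache) snapshot() (clusterState, error) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if time.Since(c.state.fetchedAt) > c.ttl {
        s, err := c.fetch()
        if err != nil {
            return clusterState{}, err
        }
        c.state = s
    }
    return c.state, nil
}

Get agents and get resource pools would then each be a pure projection of the same snapshot (e.g. hypothetical buildGetAgentsResponse and buildGetResourcePoolsResponse helpers), which decouples the cluster state from the APIs that consume it.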

p.getAgentsCacheLock.Lock()
defer p.getAgentsCacheLock.Unlock()

if time.Since(p.getAgentsCacheTime) > 5*time.Second {
Contributor


Why 5s for a cache timeout? That seems pretty short.

Nit: I would prefer that the cache timeout be called out as a constant instead of hard-coded.

Contributor Author


5 seconds was chosen because it is the same as the summarize cache. That cache was added to resolve a performance issue, so I think we have pretty good evidence that 5 seconds is a reasonable value for preventing performance issues. The longer the cache lives, the less responsive the UI / CLI is. I made it a constant.

What time would you choose?
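For reference, a self-contained sketch of the pattern under discussion with the TTL pulled out into a named constant; the type and field names here are placeholders, not the PR's code:

package sketch

import (
    "sync"
    "time"
)

// getAgentsCacheDuration names the TTL instead of hard-coding 5*time.Second at
// the point of use; the value intentionally matches the summarize cache.
const getAgentsCacheDuration = 5 * time.Second

// agentsResponse stands in for *apiv1.GetAgentsResponse.
type agentsResponse struct{ agents []string }

type agentsCache struct {
    mu       sync.Mutex
    cached   agentsResponse
    cachedAt time.Time
    refresh  func() agentsResponse // stands in for the uncached get agents work
}

// get refreshes the cached response only when it is older than the constant;
// callers that arrive mid-refresh wait on the mutex instead of issuing their
// own requests.
func (c *agentsCache) get() agentsResponse {
    c.mu.Lock()
    defer c.mu.Unlock()
    if time.Since(c.cachedAt) > getAgentsCacheDuration {
        c.cached = c.refresh()
        c.cachedAt = time.Now()
    }
    return c.cached
}

One consequence, relevant to the concurrency question below: while one caller is refreshing, later callers block on the mutex rather than sending duplicate requests to Kubernetes.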

Contributor


If 5 seconds works in practice already, that works for me! Does the API have a timeout?

I'm concerned about a case where (picking numbers out of the air) there's an API call every 1s with a cache timeout of 5s, but the total time to update the cache is 15s. The cache is locked for reads and writes while it's being updated, right? That could result in a lot of stalled API calls, and by the time the cache is updated, it's out of date again.

I'm assuming it's unlikely that the call to listPodsInAllNamespaces would take a long time ... but I think it's possible since it involves a call to k8s (right?).

summarizeCacheLock sync.RWMutex
summarizeCache summarizeResult
summarizeCacheTime time.Time
getAgentsCacheLock sync.Mutex
getAgentsCache *apiv1.GetAgentsResponse
Contributor


This is just an in-memory cache, i.e. a data structure saved in memory as opposed to an external cache, right?

Contributor Author


It is just in memory.

@NicholasBlaskey
Contributor Author

I'm curious about the end-to-end performance here. How often does the UI update "used slots" / how frequently would we expect this API to be called? Is there a timeout in place for APIs? (I'm assuming yes?)

I think the webui polls roughly every 5 seconds when a user is on the cluster page. I think the webui has a timeout.

The case I'm worried about is if the call to listPodsInAllNamespaces hangs or takes a long time (an unresponsive k8s cluster or a massive number of resources). Could a second API call be issued before the first one finishes? Can we smooth that out by adjusting the cache timeout?

This is the reason we added the cache for summarize resources initially. We had a Kubernetes cluster that was getting hit by many concurrent requests, which made the cluster respond even more slowly, causing a spiral. I think the last thing we should do when a Kubernetes cluster is slow to respond is send more requests at it.

The cache is there to prevent this case. If get agents is consistently taking more than 5 seconds on a cluster, then I think the issue would lie with getAgents and not this cache.

@kkunapuli
Contributor

This is the reason we added the cache for summarize resources initially. We had a Kubernetes cluster that was getting hit by many concurrent requests, which made the cluster respond even more slowly, causing a spiral. I think the last thing we should do when a Kubernetes cluster is slow to respond is send more requests at it.

Nice!

The cache is there to prevent this case. If get agents is consistently taking more than 5 seconds on a cluster, then I think the issue would lie with getAgents and not this cache.

A cache timeout being too short can absolutely exacerbate the problem; bigger backend calls (e.g., listing pods for all namespaces instead of one namespace) require larger timeouts. If the p50 of a backend call is 8s, but the cache timeout is 5s, then the timeout will cause more requests to be issued. We can argue that no function call should take more than 5s ... but in the meantime there's the possibility of all kinds of production issues.

There's a timeout for pod.List(...) that gets all pods for a namespace, but it's specified in whole seconds, which means it's a minimum of 1s if it's set at all.
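For concreteness, a sketch (hypothetical helper, not Determined's code) of the two timeout knobs on a client-go list call: metav1.ListOptions.TimeoutSeconds is an *int64, so its finest granularity is one second, while a context deadline bounds the whole round trip from the client side.

package sketch

import (
    "context"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// listPodsWithTimeout bounds the list call in two ways: TimeoutSeconds limits
// how long the list request may run (whole seconds only), and the context
// deadline caps the total time the client waits for a response.
func listPodsWithTimeout(clientset kubernetes.Interface, namespace string) (*corev1.PodList, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    serverTimeout := int64(5) // minimum expressible value is 1, since the field is in seconds
    return clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
        TimeoutSeconds: &serverTimeout,
    })
}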

@NicholasBlaskey NicholasBlaskey merged commit a3f0fcf into main Apr 30, 2024
87 of 104 checks passed
@NicholasBlaskey NicholasBlaskey deleted the fix_get_non_det_pods branch April 30, 2024 20:04