
fix: show non det pods in other namespaces than 'default' [RM-141] #9268

Merged
merged 6 commits into from
Apr 30, 2024

Conversation

Contributor

@NicholasBlaskey NicholasBlaskey commented Apr 30, 2024

Ticket

Description

Fix an issue where we only looked at the default namespace when listing pods.

Also cache the get agents response. This probably should have been done anyway, but listing pods across all namespaces makes it even more important.
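For reference, a minimal sketch of what listing pods across all namespaces looks like with client-go; this is illustrative only, not the PR's implementation, and the helper name is hypothetical:

package sketch

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// listAllPods lists pods in every namespace the credentials can see, rather
// than only the "default" namespace, by passing metav1.NamespaceAll ("").
func listAllPods(ctx context.Context, clientset kubernetes.Interface) error {
    pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
    if err != nil {
        return err
    }
    for _, p := range pods.Items {
        fmt.Printf("%s/%s\n", p.Namespace, p.Name)
    }
    return nil
}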

Test Plan

unit test + manual

Create a namespace with a non-default name that the Determined master has access to, and define a pod spec like the following:

apiVersion: v1
kind: Pod
metadata:
  name: sleep-pod
spec:
  containers:
  - name: sleep
    image: busybox
    command: ["sleep", "100"]
    resources:
      requests:
        cpu: "2000m" # Or request gpus if this is on a gpu cluster

Then apply the file with kubectl apply -n <namespace> -f file.yaml (creating the namespace first with kubectl create namespace <namespace> if it does not already exist).

det slot ls should show

 Agent ID   | Resource Pool   |   Slot ID | Enabled   | Draining   | Allocation ID   | Task Name                      | Type   | Device
------------+-----------------+-----------+-----------+------------+-----------------+--------------------------------+--------+----------
 minikube   | a, default      |         0 | True      | False      | OCCUPIED        | Non-Determined Task: sleep-pod | cpu    |
 minikube   | a, default      |         1 | True      | False      | OCCUPIED        | Non-Determined Task: sleep-pod | cpu    |

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@cla-bot cla-bot bot added the cla-signed label Apr 30, 2024
@determined-ci determined-ci requested a review from a team April 30, 2024 15:01
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Apr 30, 2024

netlify bot commented Apr 30, 2024

Deploy Preview for determined-ui canceled.

🔨 Latest commit: d3ad16a
🔍 Latest deploy log: https://app.netlify.com/sites/determined-ui/deploys/66313c656dfacf000892c73c


codecov bot commented Apr 30, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 3 lines in your changes missing coverage. Please review.

Project coverage is 44.60%. Comparing base (55b7fd9) to head (d3ad16a).
Report is 3 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9268   +/-   ##
=======================================
  Coverage   44.59%   44.60%           
=======================================
  Files        1273     1273           
  Lines      155832   155839    +7     
  Branches     2439     2439           
=======================================
+ Hits        69501    69508    +7     
  Misses      86092    86092           
  Partials      239      239           
Flag Coverage Δ
backend 41.75% <85.71%> (+<0.01%) ⬆️
harness 64.05% <ø> (ø)
web 35.04% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files Coverage Δ
master/internal/rm/kubernetesrm/pods.go 21.49% <85.71%> (+0.26%) ⬆️

... and 3 files with indirect coverage changes

@NicholasBlaskey NicholasBlaskey marked this pull request as ready for review April 30, 2024 15:03
@NicholasBlaskey NicholasBlaskey requested a review from a team as a code owner April 30, 2024 15:03
Contributor

@kkunapuli kkunapuli left a comment


Nice work! I like how clear the test plan is; it looks like someone with minimal context could run it with confidence.

I'm curious about the end-to-end performance here. How often does the UI update "used slots" / how frequently would we expect this API to be called? Is there a timeout in place for APIs? (I'm assuming yes?)

The case I'm worried about is if the call to listPodsInAllNamespaces hangs or takes a long time (an unresponsive k8s cluster or a massive number of resources). Could a second API call be issued before the first one finishes? Can we smooth that out by adjusting the cache timeout?

@@ -102,9 +102,13 @@ type pods struct {
podInterfaces map[string]typedV1.PodInterface
configMapInterfaces map[string]typedV1.ConfigMapInterface

// TODO(RM-236) make one cache and make this code more straightforward.
Contributor


Nice callout. Out of curiosity, why is it too much work to include in this PR / what makes it non-trivial? To be clear, I'm not suggesting this PR should consolidate caches. I'm trying to understand the technical complexities here.

Contributor Author


The code is just really confusing. The summarize-by-resource-pool and get-agents paths look very similar but sometimes act differently. I think the code could benefit from a significant refactor that I don't want to do here.

Contributor


Makes sense. For what it's worth, since the cache is just a single data structure, I kinda like that they're separate ... but I can understand how it would benefit from a refactor either way.

Contributor Author


To me it's more like we should have one way to summarize and get the cluster state, and that cluster-state summarization should be behind a cache.

Then, for get agents, which needs the data in a certain shape, we should process the cluster state from the cache and return the needed response.

For get resource pools we should process that same cluster state and return that response.

We should decouple the cluster state from the APIs that consume it.
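A rough sketch of that shape, with hypothetical names (clusterState, stateCache, and the fetch hook are illustrative, not code from this PR): one cached snapshot of cluster state, refreshed when stale, that each API then projects into its own response.

package sketch

import (
    "sync"
    "time"

    corev1 "k8s.io/api/core/v1"
)

// clusterState is one snapshot of what we last learned from Kubernetes.
type clusterState struct {
    podsByNamespace map[string][]corev1.Pod
    fetchedAt       time.Time
}

// stateCache hides the snapshot behind a TTL so concurrent callers share a
// single Kubernetes round trip instead of each issuing their own.
type stateCache struct {
    mu    sync.Mutex
    ttl   time.Duration
    state clusterState
    fetch func() (clusterState, error) // e.g. would wrap listPodsInAllNamespaces
}

// snapshot returns the cached cluster state, refreshing it when it is stale.
// Callers that arrive during a refresh block on the mutex rather than hitting
// Kubernetes again.
func (c *stateCache) snapshot() (clusterState, error) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if time.Since(c.state.fetchedAt) > c.ttl {
        s, err := c.fetch()
        if err != nil {
            return clusterState{}, err
        }
        c.state = s
    }
    return c.state, nil
}

Get agents and get resource pools would then each be a pure projection of the same snapshot (e.g. hypothetical buildGetAgentsResponse and buildGetResourcePoolsResponse helpers), which decouples the cluster state from the APIs that consume it.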

p.getAgentsCacheLock.Lock()
defer p.getAgentsCacheLock.Unlock()

if time.Since(p.getAgentsCacheTime) > 5*time.Second {
Contributor


Why 5s for a cache timeout? That seems pretty short.

Nit: I would prefer that the cache timeout be called out as a constant instead of hard-coded.

Contributor Author


5 seconds was chosen because it is the same as the summarize cache. That cache was added to resolve a performance issue, so I think we have pretty good evidence that 5 seconds is a reasonable value for preventing performance issues. The longer the cache lives, the less responsive the UI / CLI is. I made it a constant.

What time would you choose?
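For reference, a self-contained sketch of the pattern under discussion with the TTL pulled out into a named constant; the type and field names here are placeholders, not the PR's code:

package sketch

import (
    "sync"
    "time"
)

// getAgentsCacheDuration names the TTL instead of hard-coding 5*time.Second at
// the point of use; the value intentionally matches the summarize cache.
const getAgentsCacheDuration = 5 * time.Second

// agentsResponse stands in for *apiv1.GetAgentsResponse.
type agentsResponse struct{ agents []string }

type agentsCache struct {
    mu       sync.Mutex
    cached   agentsResponse
    cachedAt time.Time
    refresh  func() agentsResponse // stands in for the uncached get agents work
}

// get refreshes the cached response only when it is older than the constant;
// callers that arrive mid-refresh wait on the mutex instead of issuing their
// own requests.
func (c *agentsCache) get() agentsResponse {
    c.mu.Lock()
    defer c.mu.Unlock()
    if time.Since(c.cachedAt) > getAgentsCacheDuration {
        c.cached = c.refresh()
        c.cachedAt = time.Now()
    }
    return c.cached
}

One consequence, relevant to the concurrency question below: while one caller is refreshing, later callers block on the mutex rather than sending duplicate requests to Kubernetes.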

Contributor


If 5 seconds works in practice already, that works for me! Does the API have a timeout?

I'm concerned about a case where (picking numbers out of the air) there's an API call every 1s with a cache timeout of 5s, but the total time to update the cache is 15s. The cache is locked for reads and writes while it's being updated, right? That could result in a lot of stalled API calls, and by the time the cache is updated, it's out of date again.

I'm assuming it's unlikely that the call to listPodsInAllNamespaces would take a long time ... but I think it's possible since it involves a call to k8s (right?).

summarizeCacheLock sync.RWMutex
summarizeCache summarizeResult
summarizeCacheTime time.Time
getAgentsCacheLock sync.Mutex
getAgentsCache *apiv1.GetAgentsResponse
Contributor


This is just an in-memory cache, i.e. a data structure saved in memory as opposed to an external cache, right?

Contributor Author


It is just in memory.

@NicholasBlaskey
Contributor Author

I'm curious about the end-to-end performance here. How often does the UI update "used slots" / how frequently would we expect this API to be called? Is there a timeout in place for APIs? (I'm assuming yes?)

I think the webui polls roughly every 5 seconds when a user is on the cluster page. I think the webui has a timeout.

The case I'm worried about is if the call to listPodsInAllNamespaces hangs or takes a long time (an unresponsive k8s cluster or a massive number of resources). Could a second API call be issued before the first one finishes? Can we smooth that out by adjusting the cache timeout?

This is the reason we added the cache for summarize resources initially. We had a Kubernetes cluster that was getting hit by many concurrent requests, which made the cluster respond even more slowly, causing a spiral. I think the last thing we should do when a Kubernetes cluster is slow to respond is send more requests at it.

The cache is there to prevent this case. If get agents is consistently taking more than 5 seconds on a cluster, then I think the issue would lie with getAgents and not this cache.

@kkunapuli
Contributor

This is the reason we added the cache for summarize resources initially. We had a Kubernetes cluster that was getting hit by many concurrent requests, which made the cluster respond even more slowly, causing a spiral. I think the last thing we should do when a Kubernetes cluster is slow to respond is send more requests at it.

Nice!

The cache is there to prevent this case. If get agents is consistently taking more than 5 seconds on a cluster, then I think the issue would lie with getAgents and not this cache.

A cache timeout being too short can absolutely exacerbate the problem; bigger backend calls (e.g., listing pods for all namespaces instead of one namespace) require larger timeouts. If the p50 of a backend call is 8s, but the cache timeout is 5s, then the timeout will cause more requests to be issued. We can argue that no function call should take more than 5s ... but in the meantime there's the possibility of all kinds of production issues.

There's a timeout for pod.List(...) that gets all pods for a namespace, but it's specified in whole seconds, which means it's a minimum of 1s if it's set at all.
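For concreteness, a sketch (hypothetical helper, not Determined's code) of the two timeout knobs on a client-go list call: metav1.ListOptions.TimeoutSeconds is an *int64, so its finest granularity is one second, while a context deadline bounds the whole round trip from the client side.

package sketch

import (
    "context"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// listPodsWithTimeout bounds the list call in two ways: TimeoutSeconds limits
// how long the list request may run (whole seconds only), and the context
// deadline caps the total time the client waits for a response.
func listPodsWithTimeout(clientset kubernetes.Interface, namespace string) (*corev1.PodList, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    serverTimeout := int64(5) // minimum expressible value is 1, since the field is in seconds
    return clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
        TimeoutSeconds: &serverTimeout,
    })
}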

@NicholasBlaskey NicholasBlaskey merged commit a3f0fcf into main Apr 30, 2024
87 of 104 checks passed
@NicholasBlaskey NicholasBlaskey deleted the fix_get_non_det_pods branch April 30, 2024 20:04