fix: show non det pods in other namespaces than 'default' [RM-141] #9268
Conversation
✅ Deploy Preview for determined-ui canceled.
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

@@           Coverage Diff           @@
##             main    #9268    +/-  ##
========================================
  Coverage   44.59%   44.60%
========================================
  Files        1273     1273
  Lines      155832   155839      +7
  Branches     2439     2439
========================================
+ Hits        69501    69508      +7
  Misses      86092    86092
  Partials      239      239

Flags with carried forward coverage won't be shown.
Nice work! I like how clear the test plan is; it looks like someone with minimal context could run it with confidence.
I'm curious about the end-to-end performance here. How often does the UI update "used slots" / how frequently would we expect this API to be called? Is there a timeout in place for APIs? (I'm assuming yes?)
The case I'm worried about is if the call to listPodsInAllNamespaces hangs or takes a long time (an unresponsive k8s cluster or a massive number of resources). Could a second API call be issued before the first one finishes? Can we smooth that out by adjusting the cache timeout?
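For context, the caching pattern under discussion looks roughly like this (a minimal sketch: the cache fields match the diff below, but fetchAgents is a hypothetical stand-in for the real refresh logic):

func (p *pods) getAgents() (*apiv1.GetAgentsResponse, error) {
	p.getAgentsCacheLock.Lock()
	defer p.getAgentsCacheLock.Unlock()

	// A second caller blocks here until the first caller's refresh
	// finishes, so at most one refresh is in flight at a time, but
	// callers can stall for the full duration of a slow refresh.
	if time.Since(p.getAgentsCacheTime) > 5*time.Second {
		resp, err := p.fetchAgents() // hypothetical refresh helper
		if err != nil {
			return nil, err
		}
		p.getAgentsCache = resp
		p.getAgentsCacheTime = time.Now()
	}
	return p.getAgentsCache, nil
}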
@@ -102,9 +102,13 @@ type pods struct {
	podInterfaces       map[string]typedV1.PodInterface
	configMapInterfaces map[string]typedV1.ConfigMapInterface

	// TODO(RM-236) make one cache and make this code more straightforward.
Nice callout. Out of curiosity, why is it too much work to include in this PR / what makes it non-trivial? To be clear, I'm not suggesting this PR should consolidate caches. I'm trying to understand the technical complexities here.
The code is just really confusing. Summarize by resource pool and get agents look very similar but sometimes behave differently. I think the code could benefit from a significant refactor that I don't want to do here.
Makes sense. For what it's worth, since the cache is just a single data structure, I kinda like that they're separate ... but I can understand how it would benefit from a refactor either way.
To me it's more that we should have one way to summarize and get the cluster state, and that cluster-state summarization should live behind a cache.

Then get agents, which needs the data in a certain shape, would process the cluster state from the cache and return the needed response.

For get resource pools we would process that same cluster state and return that response.

We should decouple the cluster state from the APIs that consume it.
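A rough sketch of that shape (hypothetical names throughout: clusterState, fetchClusterState, the stateCache fields, stateCacheTTL, and the build helpers are illustrative, not existing code):

// One cached snapshot of cluster state, refreshed behind a single lock.
func (p *pods) getClusterState() (*clusterState, error) {
	p.stateCacheLock.Lock()
	defer p.stateCacheLock.Unlock()
	if time.Since(p.stateCacheTime) > stateCacheTTL {
		s, err := p.fetchClusterState() // the only code path that queries k8s
		if err != nil {
			return nil, err
		}
		p.stateCache, p.stateCacheTime = s, time.Now()
	}
	return p.stateCache, nil
}

// Each API shapes the same snapshot into its own response.
func (p *pods) getAgents() (*apiv1.GetAgentsResponse, error) {
	s, err := p.getClusterState()
	if err != nil {
		return nil, err
	}
	return buildAgentsResponse(s), nil
}

func (p *pods) getResourcePools() (*apiv1.GetResourcePoolsResponse, error) {
	s, err := p.getClusterState()
	if err != nil {
		return nil, err
	}
	return buildResourcePoolsResponse(s), nil
}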
p.getAgentsCacheLock.Lock()
defer p.getAgentsCacheLock.Unlock()

if time.Since(p.getAgentsCacheTime) > 5*time.Second {
Why 5s for a cache timeout? That seems pretty short.
Nit: I would prefer that the cache timeout be called out as a constant instead of hard-coded.
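Something like this, for example (getAgentsCacheDuration is an illustrative name, not necessarily what the PR uses):

const getAgentsCacheDuration = 5 * time.Second

if time.Since(p.getAgentsCacheTime) > getAgentsCacheDuration {
	// refresh the cache
}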
5 seconds was chosen because it is the same as the summarize cache. That cache was added to resolve a performance issue, so I think we have pretty good evidence that 5 seconds is reasonable for preventing performance issues. The longer the cache lives, the less responsive the UI/CLI is. I made it a constant.

What time would you choose?
If 5 seconds works in practice already, that works for me! Does the API have a timeout?

I'm concerned about a case where (picking numbers out of the air) there's an API call every 1s and a cache timeout of 5s, but the total time to update the cache is 15s. The cache is locked for reads and writes while it's being updated, right? That could result in a lot of stalled API calls, and by the time the cache is updated, it's out of date again.

I'm assuming it's unlikely that the call to listPodsInAllNamespaces would take a long time ... but I think it's possible, since it involves a call to k8s (right?).
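One common way to bound that kind of stall (a sketch of a general mitigation, not something this PR does; p.clientSet, the alias names, and the 30s value are assumptions) is to put a context timeout on the k8s call, so a hung API server fails fast instead of holding the cache lock indefinitely:

import (
	"context"
	"time"

	k8sV1 "k8s.io/api/core/v1"
	metaV1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func (p *pods) listPodsInAllNamespacesWithTimeout() (*k8sV1.PodList, error) {
	// Bound the k8s list call; 30s is an arbitrary illustrative value.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	// metaV1.NamespaceAll (the empty string) lists pods across all namespaces.
	return p.clientSet.CoreV1().Pods(metaV1.NamespaceAll).List(ctx, metaV1.ListOptions{})
}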
summarizeCacheLock sync.RWMutex
summarizeCache     summarizeResult
summarizeCacheTime time.Time
getAgentsCacheLock sync.Mutex
getAgentsCache     *apiv1.GetAgentsResponse
This is just an in-memory cache, i.e., a data structure saved in memory as opposed to an external cache, right?
It is just in memory.

I think the webui polls roughly every 5 seconds when a user is on the cluster page, and I think the webui has a timeout.

This is the reason we added the cache for summarize resources initially: we had a Kubernetes cluster that was getting hit by many concurrent requests, which made it respond even more slowly, causing a spiral. The last thing we should do when a Kubernetes cluster is taking long to respond is send more requests at it; the cache is there to prevent that case. If get agents is consistently taking more than 5 seconds on a cluster, then I think the issue would lie with getAgents and not this cache.
Nice!
A cache timeout that is too short can absolutely exacerbate the problem: bigger backend calls (e.g., listing pods in all namespaces instead of one namespace) warrant larger timeouts. If the p50 of a backend call is 8s but the cache timeout is 5s, the expiry will cause more requests to be issued. We can argue that no function call should take more than 5s ... but in the meantime there's the possibility of all kinds of production issues. There's a timeout for …
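To make that arithmetic concrete (illustrative numbers only): with a 1s poll interval, a 5s cache TTL, and an 8s refresh, the steady state looks roughly like this:

t=0s      TTL expired; the first poll takes the lock and starts an 8s refresh
t=1s-7s   subsequent polls block on the lock (stalled API calls)
t=8s      the refresh completes and every blocked poll returns at once
t=13s     the TTL expires again and the next poll starts another 8s refresh

So callers spend roughly 8 out of every 13 seconds stalled, and the cache is due for another refresh almost as soon as the backlog clears.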
Ticket
Description
Fix an issue where we only looked at the default namespace when getting pods.

Also cache get agents, which probably should have been done anyway, but looking at all namespaces makes it even more important.
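The core of the namespace change is roughly this shape (a sketch, not the exact diff; clientSet and the alias names are assumptions):

// Before: only the default namespace was inspected.
pods, err := clientSet.CoreV1().Pods("default").List(ctx, metaV1.ListOptions{})

// After: metaV1.NamespaceAll (the empty string) selects pods in all namespaces.
pods, err := clientSet.CoreV1().Pods(metaV1.NamespaceAll).List(ctx, metaV1.ListOptions{})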
Test Plan
Unit tests plus the following manual test:

Make a namespace with a non-default name that the Determined master has access to, then apply a pod spec file with:

kubectl apply -n namespace -f file.yaml
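For reference, a minimal file.yaml could look like the following (hypothetical; any pod that consumes a slot works, and the pod name, image, and GPU request are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: non-det-test-pod
spec:
  containers:
    - name: sleeper
      image: ubuntu:22.04
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1  # occupies one slot on a GPU cluster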
det slot ls should show the slots used by the pod.

Checklist
Release note added in docs/release-notes/. See Release Note for details.