Add local storage (scratch space) allocatable support #46456
Conversation
First pass. I'll look into what changes to machineInfo it would take to be able to determine rootfs capacity.
func (cm *containerManagerImpl) GetNodeAllocatableReservation() v1.ResourceList {
	evictionReservation := hardEvictionReservation(cm.HardEvictionThresholds, cm.capacity)
	if !hasStorageInfoInCapacity(cm.capacity) {
you don't need a helper function here, as it is just one line:
if _, ok := cm.capacity[v1.ResourceStorage]; !ok {
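For illustration, a minimal sketch of how the inlined check could sit inside the reservation logic; the function and parameter names below are stand-ins for the container manager's fields, not the PR's exact code:

package containermanager

import v1 "k8s.io/api/core/v1"

// allocatableReservation is illustrative only: it shows the one-line
// existence check written inline rather than behind a helper.
func allocatableReservation(capacity, evictionReservation v1.ResourceList) v1.ResourceList {
	result := v1.ResourceList{}
	for name, quantity := range evictionReservation {
		// Inlined check from the review suggestion: is storage capacity known yet?
		if _, ok := capacity[v1.ResourceStorage]; !ok && name == v1.ResourceStorage {
			continue // storage capacity unknown, so do not reserve storage
		}
		result[name] = quantity
	}
	return result
}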
fixed
pkg/kubelet/eviction/api/types.go
Outdated
@@ -38,6 +38,8 @@ const (
	SignalImageFsInodesFree Signal = "imagefs.inodesFree"
	// SignalAllocatableMemoryAvailable is amount of memory available for pod allocation (i.e. allocatable - workingSet (of pods), in bytes.
	SignalAllocatableMemoryAvailable Signal = "allocatableMemory.available"
	// SignalAllocatableStorageScratchAvailable is amount of storage available for pod allocation
	SignalAllocatableStorageScratchAvailable Signal = "allocatableStorageScratch.available"
While we have begun calling rootfs "Scratch" space, it would probably be easier to read the code if it was named similarly to the node-level signal. Something like SignalAllocatableNodeFsAvailable.
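For reference, roughly how the suggested constant might read; the string value and the locally redeclared Signal type below are illustrative, not necessarily what was merged:

package evictionapi

// Signal mirrors the eviction API's signal type; it is redeclared here only
// so the sketch stands alone.
type Signal string

// SignalAllocatableNodeFsAvailable is the reviewer's suggested name for the
// allocatable scratch-space signal. The string value is illustrative.
const SignalAllocatableNodeFsAvailable Signal = "allocatableNodeFs.available"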
	// build the ranking functions (if not yet known)
	// TODO: have a function in cadvisor that lets us know if global housekeeping has completed
	if len(m.resourceToRankFunc) == 0 || len(m.resourceToNodeReclaimFuncs) == 0 {
		// this may error if cadvisor has yet to complete housekeeping, so we will just try again in next pass.
		hasDedicatedImageFs, err := diskInfoProvider.HasDedicatedImageFs()
		if err != nil {
		if errImage != nil {
should we move the error check next to the call? Why only check for errors inside the if statement?
From the existing code, I think the logic is that if resourceToRankFunc and resourceToNodeReclaimFuncs are already defined, it does not matter whether errImage is nil or not.
I modified the code here a little more. Instead of calling HasDedicatedImageFs in every sync call, I put it in a map. PTAL
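A rough sketch of the pattern described here, querying cAdvisor once and reusing the answer on later passes; the type, field, and method names are illustrative, not the PR's exact code:

package eviction

// diskInfoProvider mirrors the interface referenced in the diff; it is
// redeclared here so the sketch is self-contained.
type diskInfoProvider interface {
	HasDedicatedImageFs() (bool, error)
}

// rankFuncBuilder remembers whether the node has a dedicated image filesystem
// so later synchronize() passes do not have to ask cAdvisor again.
type rankFuncBuilder struct {
	dedicatedImageFs *bool // nil until cAdvisor housekeeping has completed
}

func (b *rankFuncBuilder) hasDedicatedImageFs(provider diskInfoProvider) (bool, error) {
	if b.dedicatedImageFs != nil {
		return *b.dedicatedImageFs, nil
	}
	// This can fail if cAdvisor has not finished housekeeping yet; the
	// caller simply retries on the next pass.
	hasImageFs, err := provider.HasDedicatedImageFs()
	if err != nil {
		return false, err
	}
	b.dedicatedImageFs = &hasImageFs
	return hasImageFs, nil
}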
@@ -391,10 +391,13 @@ func (m *managerImpl) reclaimNodeLevelResources(resourceToReclaim v1.ResourceNam
			glog.Errorf("eviction manager: unable to find value associated with signal %v", signal)
			continue
		}
		value.available.Add(*reclaimed)
		if reclaimed != nil {
We discussed this offline. reclaimed cannot be nil here.
fixed
	// evaluate all current thresholds to see if with adjusted observations, we think we have met min reclaim goals
	if len(thresholdsMet(m.thresholdsMet, observations, true)) == 0 {
		glog.V(3).Infof("eviction manager: reclaim goal reached for resource %v: ", resourceToReclaim)
We already log something in synchronize on line 311 if we are able to reclaim enough resources. This is not needed.
fixed
pkg/kubelet/eviction/helpers.go
Outdated
@@ -83,6 +86,7 @@ func init() {
	signalToResource[evictionapi.SignalImageFsInodesFree] = resourceImageFsInodes
	signalToResource[evictionapi.SignalNodeFsAvailable] = resourceNodeFs
	signalToResource[evictionapi.SignalNodeFsInodesFree] = resourceNodeFsInodes
	signalToResource[evictionapi.SignalAllocatableStorageScratchAvailable] = resourceNodeFs
	resourceToSignal = map[v1.ResourceName]evictionapi.Signal{}
are you still planning to remove resourceToSignal?
Yes, I will do it in a separate PR: instead of the 1-1 mapping from resource to signal, we will use a 1-to-many map with a list of signals.
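As a sketch of that follow-up, the mapping might become one-to-many along these lines; the entries and the locally declared Signal type are illustrative:

package eviction

import v1 "k8s.io/api/core/v1"

// Signal stands in for the eviction API's signal type so the sketch compiles
// on its own.
type Signal string

// resourceToSignals sketches a 1-to-many mapping: one resource can be tied to
// several eviction signals (for example, memory has both a node-level and an
// allocatable-level signal). The entries shown are illustrative.
var resourceToSignals = map[v1.ResourceName][]Signal{
	v1.ResourceMemory: {
		"memory.available",
		"allocatableMemory.available",
	},
}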
}

// Pass podTestSpecsP as references so that it could be set up in the first BeforeEach clause
func runLocalStorageEvictionTest(f *framework.Framework, conditionType v1.NodeConditionType, testCondition string, podTestSpecsP *[]podTestSpec, evictionTestTimeout time.Duration,
can you reuse the version in the inode eviction test suite? If not, can we refactor them to avoid duplication?
I tried a few things; only passing the reference works for me. I plan to refactor this code in a separate PR, is that ok?
sure, that can happen after code freeze, if necessary.
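For context, a small sketch of the pointer-to-slice pattern discussed above: the specs slice is only filled in by the suite's BeforeEach, so the helper dereferences it at run time instead of capturing an empty copy. Types and names are placeholders, not the e2e framework's real API:

package e2eeviction

// podTestSpec stands in for the eviction test's per-pod expectation struct;
// the fields are placeholders.
type podTestSpec struct {
	evictionPriority int
	podName          string
}

// runEvictionCheck receives a pointer to the spec slice because the slice is
// populated later (in BeforeEach), after node capacity is known. Dereferencing
// at call time sees the populated contents; a value parameter captured at
// registration time would still be empty.
func runEvictionCheck(podTestSpecsP *[]podTestSpec, check func(specs []podTestSpec)) {
	check(*podTestSpecsP)
}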
@@ -100,3 +100,15 @@ func (kl *Kubelet) GetCachedMachineInfo() (*cadvisorapi.MachineInfo, error) {
	}
	return kl.machineInfo, nil
}

// GetCachedRootFsInfo assumes that the rootfs info can't change without a reboot
func (kl *Kubelet) GetCachedRootFsInfo() (cadvisorapiv2.FsInfo, error) {
My thinking right now is that it would make more sense to change cAdvisor to include the information we need in MachineInfo. IIRC, the only piece we are missing is the mountpoint for the disks, which would let us determine which filesystem is the RootFs. I am happy to look into this in the next couple days to see if it is feasible. That would greatly simplify this PR.
thanks! I will check that too, but we might also need to add this after code freeze.
Yep, that is fine.
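For illustration, a minimal sketch of the caching idea behind GetCachedRootFsInfo from the diff above: query cAdvisor once, keep the answer, and retry only if the first lookup failed. The wrapper type and names are not the PR's exact code:

package kubelet

import (
	"sync"

	cadvisorapiv2 "github.com/google/cadvisor/info/v2"
)

// rootFsInfoCache caches the rootfs filesystem info on the assumption that it
// cannot change without a reboot. lookup stands in for the cAdvisor call.
type rootFsInfoCache struct {
	mu     sync.Mutex
	have   bool
	info   cadvisorapiv2.FsInfo
	lookup func() (cadvisorapiv2.FsInfo, error)
}

func (c *rootFsInfoCache) get() (cadvisorapiv2.FsInfo, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.have {
		return c.info, nil
	}
	info, err := c.lookup()
	if err != nil {
		// Nothing is cached on failure, so a later call can retry.
		return cadvisorapiv2.FsInfo{}, err
	}
	c.info, c.have = info, true
	return c.info, nil
}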
requested a few changes.
	//Account for storage requested by emptydir volumes
	for _, vol := range pod.Spec.Volumes {
		if vol.EmptyDir != nil {
			result.StorageScratch += vol.EmptyDir.SizeLimit.Value()
this needs to exclude empty dir that is backed by memory.
fixed
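A sketch of what the fixed accounting might look like, assuming the v1 API of the time where EmptyDir.SizeLimit is a plain resource.Quantity; the helper name is illustrative, not the PR's code:

package predicates

import v1 "k8s.io/api/core/v1"

// emptyDirScratchRequest sums the emptyDir size limits that land on the
// node's scratch filesystem, skipping memory-backed (tmpfs) volumes, which
// consume RAM rather than disk.
func emptyDirScratchRequest(pod *v1.Pod) int64 {
	var total int64
	for _, vol := range pod.Spec.Volumes {
		if vol.EmptyDir == nil {
			continue
		}
		if vol.EmptyDir.Medium == v1.StorageMediumMemory {
			continue // backed by memory, not scratch space
		}
		total += vol.EmptyDir.SizeLimit.Value()
	}
	return total
}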
Name: "emptyDirVolumeName", | ||
VolumeSource: v1.VolumeSource{ | ||
EmptyDir: &v1.EmptyDirVolumeSource{ | ||
SizeLimit: *resource.NewQuantity(scratch, resource.BinarySI), |
set medium to not memory
fixed
	reasons []algorithm.PredicateFailureReason
}{
	{
		pod: newResourcePod(schedulercache.Resource{MilliCPU: 1, Memory: 1, StorageOverlay: 1}),
add a test for memory backed empty dir.
added a test for memory medium
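The added case might look roughly like this: a pod whose only emptyDir is memory-backed, so its size limit should not count as a scratch request. This again assumes the period's value-typed SizeLimit; names and values are placeholders:

package predicates

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// memoryBackedEmptyDirPod builds a pod whose emptyDir uses the memory medium;
// a predicate accounting for scratch space should ignore its SizeLimit.
func memoryBackedEmptyDirPod(sizeBytes int64) *v1.Pod {
	return &v1.Pod{
		Spec: v1.PodSpec{
			Volumes: []v1.Volume{{
				Name: "memory-backed-emptydir",
				VolumeSource: v1.VolumeSource{
					EmptyDir: &v1.EmptyDirVolumeSource{
						Medium:    v1.StorageMediumMemory,
						SizeLimit: *resource.NewQuantity(sizeBytes, resource.BinarySI),
					},
				},
			}},
		},
	}
}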
	//Account for storage requested by emptydir volumes
	for _, vol := range pod.Spec.Volumes {
		if vol.EmptyDir != nil {
			res.StorageScratch += vol.EmptyDir.SizeLimit.Value()
it would be good if we had a utility function for this, as this loop is repeated in multiple places.
It also needs to ignore memory-medium emptyDirs.
fixed. This loop only appeared in predicates.go and node_info.go; could we consider adding a utility if more places need it?
I'd imagine tests also using this utility function.
Force-pushed from f0e5ded to 0235873
addressed the comments. @derekwaynecarr @dashpole PTAL
eviction manager changes lgtm
@@ -0,0 +1,322 @@
/*
Copyright 2016 The Kubernetes Authors.
This is an odd file name. I think Go recommends using _ instead of camelCase.
I changed the file name
pkg/kubelet/kubelet_node_status.go
Outdated
	if hasDedicatedImageFs, err := kl.HasDedicatedImageFs(); err != nil && hasDedicatedImageFs {
		imagesfs, err := kl.ImagesFsInfo()
		if err != nil {
			node.Status.Capacity[v1.ResourceStorageOverlay] = resource.MustParse("0Gi")
Shouldn't we fail kubelet in this case? Why would kubelet lie that imagefs doesn't exist when it fails to collect stats about it?
I followed the logic in GetCachedMachineInfo. Should we use different behavior here?
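One way the reviewer's point could be addressed is to surface the error instead of publishing a fake zero capacity. This is a hedged sketch, not necessarily the behavior that was merged, and it reuses the PR-era v1.ResourceStorageOverlay resource name:

package kubelet

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// setImageFsCapacity records the imagefs capacity on the node status, or
// returns an error when stats collection failed instead of reporting "0Gi".
func setImageFsCapacity(node *v1.Node, capacityBytes int64, statsErr error) error {
	if statsErr != nil {
		// Do not report a capacity we do not actually know.
		return fmt.Errorf("unable to determine imagefs capacity: %v", statsErr)
	}
	if node.Status.Capacity == nil {
		node.Status.Capacity = v1.ResourceList{}
	}
	node.Status.Capacity[v1.ResourceStorageOverlay] = *resource.NewQuantity(capacityBytes, resource.BinarySI)
	return nil
}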
Force-pushed from 491253e to 19c5b3c
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: davidopp, jingxu97, thockin, vishh. The full list of commands accepted by this bot can be found here.
@k8s-bot pull-kubernetes-federation-e2e-gce test this
@k8s-bot pull-kubernetes-unit test this
This PR adds support for allocatable local storage (scratch space). The feature only covers the root filesystem, which is shared by Kubernetes components, users' containers, and/or images. Users can use the --kube-reserved flag to reserve storage for kube system components. If the allocatable storage for users' pods is used up, some pods will be evicted to free the storage resource.
This PR also adds a check for local storage requests when admitting pods. If the local storage request exceeds the available resource, the pod will be rejected.
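Conceptually, the admission-time check reduces to a comparison like the following hedged sketch; the function name and signature are illustrative, and quantities are in bytes:

package kubelet

import "fmt"

// admitScratchRequest rejects a pod whose scratch-space request does not fit
// into what remains allocatable on the node.
func admitScratchRequest(requestedBytes, allocatableBytes, inUseBytes int64) error {
	free := allocatableBytes - inUseBytes
	if requestedBytes > free {
		return fmt.Errorf("insufficient scratch space: requested %d bytes, only %d bytes available", requestedBytes, free)
	}
	return nil
}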
@k8s-bot pull-kubernetes-unit test this
@k8s-bot pull-kubernetes-e2e-kops-aws test this
Automatic merge from submit-queue
This feature is part of local storage capacity isolation and described in the proposal kubernetes/community#306
Release note: