
create network: use locks and reservations to solve race condition #10858

Merged
merged 3 commits
Mar 25, 2021


prezha
Contributor

@prezha prezha commented Mar 17, 2021

fixes #10833

as described in the example in the issue's description, and as @medyagh suggested, we can use locks to resolve the race condition that occurs when concurrent callers compete for a free network segment

this pr implements reservations for free network subnets; each reservation expires after the reservation period (one minute by default). it also adds a retry mechanism in case network creation fails. reservations are kept in a sync.Map, which is safe for concurrent use by multiple goroutines without additional locking or coordination

also, to further reduce overlaps between free network scans (kvm starts its scan at 192.168.39.0 and containers at 192.168.49.0; with an increment step of 10 for both, the kvm scan's second candidate was the containers' first), the increment step is now 11 for kvm and 9 for containers
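A quick way to check the effect of the step change, assuming /24 subnets of the form 192.168.X.0 and 20 candidates per scan (the attempt limit seen in the existing code quoted below); `candidates` and `firstShared` are illustrative helpers, not minikube functions:

```go
package main

import "fmt"

// candidates returns up to n subnets 192.168.X.0, starting at third octet
// start and advancing by step, stopping before X would exceed 255.
func candidates(start, step, n int) []string {
	var out []string
	for x := start; x <= 255 && len(out) < n; x += step {
		out = append(out, fmt.Sprintf("192.168.%d.0", x))
	}
	return out
}

// firstShared returns the first subnet of a (in a's order) that also appears
// in b, plus the 0-based position of that subnet in each scan.
func firstShared(a, b []string) (subnet string, i, j int, ok bool) {
	pos := make(map[string]int, len(b))
	for idx, s := range b {
		pos[s] = idx
	}
	for idx, s := range a {
		if jdx, found := pos[s]; found {
			return s, idx, jdx, true
		}
	}
	return "", 0, 0, false
}

func main() {
	kvm, ctr := candidates(39, 10, 20), candidates(49, 10, 20)
	s, i, j, _ := firstShared(kvm, ctr)
	fmt.Println("step 10/10:", s, i, j) // step 10/10: 192.168.49.0 1 0

	kvm, ctr = candidates(39, 11, 20), candidates(49, 9, 20)
	s, i, j, _ = firstShared(kvm, ctr)
	fmt.Println("step 11/9:", s, i, j) // step 11/9: 192.168.94.0 5 5
}
```

With 11 and 9 the first candidate both scans share is 192.168.94.0, five steps into each, so early collisions become less likely but are still possible; the reservations remain necessary.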

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 17, 2021
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 17, 2021
@prezha
Contributor Author

prezha commented Mar 18, 2021

thought: shouldn't we fail fast(er) instead of trying to use a network that we weren't able to create?

i haven't changed that in this pr, but i think we should (ie, not try to use a nonexistent network, or one that was already taken by something else). existing code:

	if err != nil {
		return info.gateway, fmt.Errorf("failed to create network after 20 attempts")
	}

	if gateway, err := oci.CreateNetwork(d.OCIBinary, networkName); err != nil {
		out.WarningT("Unable to create dedicated network, this might result in cluster IP change after restart: {{.error}}", out.V{"error": err})
		// ...
	}

@prezha prezha requested a review from ilya-zuyev March 18, 2021 18:49
@medyagh
Member

medyagh commented Mar 22, 2021

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Mar 22, 2021
@minikube-pr-bot

kvm2 Driver
error collecting results for kvm2 driver: timing run 0 with Minikube (PR 10858): timing cmd: [/home/performance-monitor/.minikube/minikube-binaries/10858/minikube start --driver=kvm2]: starting cmd: fork/exec /home/performance-monitor/.minikube/minikube-binaries/10858/minikube: exec format error
docker Driver
error collecting results for docker driver: timing run 0 with Minikube (PR 10858): timing cmd: [/home/performance-monitor/.minikube/minikube-binaries/10858/minikube start --driver=docker]: starting cmd: fork/exec /home/performance-monitor/.minikube/minikube-binaries/10858/minikube: exec format error

	var (
		reservedSubnets = sync.Map{}
		// ...
	)
Member


how about using our existing lock package?

func PathMutexSpec(path string) mutex.Spec {

https://github.com/medyagh/minikube/blob/f95d43a2c78070ab10a4e7d134ff2d0952a5ca86/pkg/util/lock/lock.go#L48

for better maintainability of the lock code: what is special about locking a network name, compared with the other locks in minikube?

Contributor Author


here we need to lock a specific, currently free network segment (not a network name) for one caller, for a short period of time:

when network.FreeSubnet() is called, it checks whether a subnet is taken at the system level and skips it if so;
furthermore, with concurrent calls to network.FreeSubnet(), it also needs to skip any subnet that was previously returned as 'free' to another caller which has not yet allocated it at the system level, as long as that reservation is still within a configurable grace period

the existing locking mechanism with lock.PathMutexSpec() handles concurrent requests that want to write to the same file by making them wait their turn

on the other hand, for free subnet allocation we need a mechanism that instantly knows whether a subnet is already taken or 'reserved' (and not yet expired) and, if so, skips it and moves on to the next free subnet

therefore i used sync.Map, which is safe for concurrent access, to store 'reservations' with the subnet as key and its creation time (createdAt) as value, so that we can check whether a reservation has expired (and the subnet is therefore free to reuse)

does that make sense?

@medyagh
Member

medyagh commented Mar 23, 2021

is there a reason to implement new lock logic just for the network? how about using our existing lock package: what functionality is missing? could we instead improve the lock package and keep all the lock-related code there?

@medyagh
Member

medyagh commented Mar 24, 2021

/retest-this-please

@medyagh medyagh merged commit 894ca12 into kubernetes:master Mar 25, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: medyagh, prezha


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 25, 2021
Development

Successfully merging this pull request may close these issues.

race condition while trying to create network: subnet is taken
5 participants