Topology aware scale-from-zero running of NSEs in response to NSC demand #892

Closed
denis-tingaikin opened this issue Apr 29, 2021 · 7 comments
Labels
Planning The issue that related to current SOW


@denis-tingaikin
Member

Overview

Currently, in the Network Service Resource we can provide 'matches' for selecting candidates:

apiVersion: networkservicemesh.io/v1alpha1
kind: NetworkService
metadata:
  name: secure-intranet-connectivity
spec:
  payload: IP
  matches:
    - match:
      sourceSelector:
        app: firewall
      route:
        - destination:
          destinationSelector:
            app: vpn-gw
      ...

Here we match the 'sourceSelector' against the source labels on the request, and then match the 'destinationSelector' against the destination labels of the registered Network Service.

This is extremely simple, powerful, and flexible for cases where we want to compose together Network Services.

There are certain kinds of composition, though, for which it is insufficient. Take the case where a Client on a Node requests a Network Service and we'd prefer that it be connected to the Network Service on the same Node.

We could have each Client and Network Service add a label 'nodeName=${nodeName}'. But short of a complete enumeration in the Network Service Resource... we have no way to address that with our matches.

This proposal is to expand the power of our matches with Go-style templates, to allow for 'dynamic' selectors.

Example:

apiVersion: networkservicemesh.io/v1alpha1
kind: NetworkService
metadata:
  name: secure-intranet-connectivity
spec:
  payload: IP
  matches:
    - match:
      route:
        - destination:
          destinationSelector:
            nodeName: {{index .Src "nodeName"}}
      ...

This would select a Destination whose nodeName label has the same value as the Source's nodeName label. This lets us cheaply and easily handle topological cases like "on same node", "in same K8s cluster", and "in same region" using labels.
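
To make the proposed template expansion concrete, here is a minimal sketch using Go's standard text/template package. The names selectorData and renderSelector are hypothetical and are not part of any NSM API; the sketch only assumes that the template is executed against a value exposing the source labels as a map called Src, as in the example above.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// selectorData is a hypothetical template context: Src holds the labels
// sent by the requesting client.
type selectorData struct {
	Src map[string]string
}

// renderSelector (hypothetical helper) expands every value of a
// destinationSelector as a Go template evaluated against the source labels.
func renderSelector(selector, srcLabels map[string]string) (map[string]string, error) {
	out := make(map[string]string, len(selector))
	for key, raw := range selector {
		tmpl, err := template.New(key).Parse(raw)
		if err != nil {
			return nil, fmt.Errorf("parse selector %q: %w", key, err)
		}
		var buf bytes.Buffer
		if err := tmpl.Execute(&buf, selectorData{Src: srcLabels}); err != nil {
			return nil, fmt.Errorf("render selector %q: %w", key, err)
		}
		out[key] = buf.String()
	}
	return out, nil
}

func main() {
	selector := map[string]string{"nodeName": `{{index .Src "nodeName"}}`}
	src := map[string]string{"nodeName": "node-1"}

	rendered, err := renderSelector(selector, src)
	if err != nil {
		panic(err)
	}
	fmt.Println(rendered["nodeName"]) // prints "node-1"
}
```

Once rendered, the selector is an ordinary label map and can be compared against endpoint labels exactly as static selectors are today.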

It also allows for some very clean handling of "create Proxy NSMgr" (pNSMgr) cases:

apiVersion: networkservicemesh.io/v1alpha1
kind: NetworkService
metadata:
  name: secure-intranet-connectivity
spec:
  payload: IP
  matches:
    - match:
      route:
        - destination:
          destinationSelector:
            nodeName: {{index .Src "nodeName"}}
    - match:
      route:
        - destination:
          destinationSelector:
            createProxyNSMgr: true
      ...

This would route you to an existing NSE of the Network Service on the same Node, or to a "create pNSMgr" if no such NSE is available on your Node (which could then create an NSE on your Node).
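
The fallback behaviour can be sketched as an ordered walk over already-rendered destination selectors: the first selector that some registered endpoint satisfies wins. The endpoint type and the selectEndpoint function are hypothetical illustrations under that assumption, not the actual NSM matching code.

```go
package main

import "fmt"

// endpoint is a hypothetical, simplified view of a registered NSE.
type endpoint struct {
	Name   string
	Labels map[string]string
}

// matches reports whether every selector label is present on the endpoint
// with the same value.
func matches(selector, labels map[string]string) bool {
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// selectEndpoint tries the selectors in order and returns the first endpoint
// that satisfies the earliest selector with any match.
func selectEndpoint(selectors []map[string]string, endpoints []endpoint) (endpoint, bool) {
	for _, sel := range selectors {
		for _, ep := range endpoints {
			if matches(sel, ep.Labels) {
				return ep, true
			}
		}
	}
	return endpoint{}, false
}

func main() {
	endpoints := []endpoint{
		{Name: "nse-on-node-2", Labels: map[string]string{"nodeName": "node-2"}},
		{Name: "create-pnsmgr", Labels: map[string]string{"createProxyNSMgr": "true"}},
	}
	// Selectors for a client on node-1, after template rendering.
	selectors := []map[string]string{
		{"nodeName": "node-1"},
		{"createProxyNSMgr": "true"},
	}
	ep, ok := selectEndpoint(selectors, endpoints)
	fmt.Println(ep.Name, ok) // prints "create-pnsmgr true"
}
```

With no NSE on node-1, the first selector finds nothing and the request falls through to the second match.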

References

  1. Decompose 'Topology aware scale-from-zero running of NSEs in response to NSC demand' and provide estimation time for each task #821
  2. Allow for dynamic template selectors for matching in Network Services networkservicemesh#1824

@d-uzlov
Contributor

d-uzlov commented May 12, 2021

In the related "decompose" issue we determined that we need to test, in some cmd- repository, the element that shuts down the endpoint when there have been no active connections for some time.

What we need to do:

Basically, we need an equivalent of cmd-nse-icmp-responder that automatically shuts down after being idle for a specified time.

Possible solution 1:

We can modify cmd-nse-icmp-responder to include the new onidle element.

responderEndpoint := endpoint.NewServer(ctx,
	spiffejwt.TokenGeneratorFunc(source, config.MaxTokenLifetime),
	endpoint.WithName(config.Name),
	endpoint.WithAuthorizeServer(authorize.NewServer()),
	endpoint.WithAdditionalFunctionality(
		onidle.NewServer(ctx, shutdownCallback, onidle.WithTimeout(timeout)), // <-- added element
		point2pointipam.NewServer(ipnet),
		recvfd.NewServer(),
		mechanisms.NewServer(map[string]networkservice.NetworkServiceServer{
			kernelmech.MECHANISM: kernel.NewServer(),
			noop.MECHANISM:       null.NewServer(),
		}),
		dnscontext.NewServer(config.DNSConfigs...),
		sendfd.NewServer(),
	),
)

Possible solution 2:

We can create a new repository (named cmd-nse-auto-timeout or something similar) that will mostly duplicate cmd-nse-icmp-responder, but will use the onidle element for the timeout.

Question:

I think the first solution is better. Are there any issues I don't see?

@denis-tingaikin
Member Author

/cc @edwarnicke ^^

@edwarnicke
Member

Question... is there a timeout value we can pass to onidle that causes it to never time out (for example 0)? If so, we could simply add it to the chain, set by an env variable that defaults to that never-timeout value. That would allow any NSE to be set to self-destruct if idle for a period. Thoughts?

@d-uzlov
Contributor

d-uzlov commented May 13, 2021

Currently the onidle element can't be disabled via a parameter.

We can add a condition in every place where onidle is used, to use the null server instead if the timeout is set to some special value.
The disadvantage would be that we need to add such a check in every place where we want to be able to disable the element, and we could potentially end up with different special values for disabling onidle in different places.

The question is: how often do we need to disable the onidle element?
I don't think it's good to mindlessly include onidle in a chain and disable it by default, because from my understanding using onidle only makes sense if the NSE is created automatically on demand, or if we create an NSE just for testing (so we can create it on demand manually).
And I feel like disabling onidle for an NSE that was created automatically is a bad idea: we can accidentally end up with a lot of NSEs that we don't need, which stay around until all of them are deleted manually.

So, if we modify onidle to disable itself when it receives some special value, we get the benefit of one universal way to disable it, but it will only be useful in testing.
That's still a lot, if we can use onidle in many tests. But it's hard for me to judge whether this is the case.

@edwarnicke
Member

@d-uzlov So the real question is: how often do we have an NSE that we think we may want to use with Scale From Zero? I suspect often. With timeouts, '0' is often interpreted as 'no timeout' ... perhaps onidle could take that interpretation. Then an NSM_ONIDLE_TIMEOUT env variable that defaults to 0 could be used pretty simply. Thoughts?

@d-uzlov
Contributor

d-uzlov commented May 17, 2021

Yeah, makes sense.

@denis-tingaikin
Member Author

@edwarnicke All subtasks for this are completed. All PRs have been merged.
Also, we added an example for the scenario: networkservicemesh/deployments-k8s#1427 + networkservicemesh/deployments-k8s#1427 (fix for using templates)

So I'm closing this...

@edwarnicke Feel free to reopen it or ping us if we missed something.


4 participants