Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add definitions for RDMA controller/cgroup of Linux kernel 4.11 #942

Merged
merged 1 commit into from
Mar 1, 2018

Conversation

paravmellanox
Copy link
Contributor

This request is for the inclusion of RDMA controller/cgroup which is added in Linux kernel 4.11.

Background:
RDMA Networking adapters provide low latency, direct access to hardware (HCA) to user space applications to send/receive messages and data in a clustered systems.
With latest HCAs, application to application level messages delivery latency can be as low as < 1usec.
RDMA kernel subsystem achieves this using RDMA resources such as QPs, MRs CQs objects.

RDMA cgroup allows limiting resources of the such RDMA HCA networking adapters.

RDMA resource creations such as hca_handle and hca_objects must be controlled so that certain set of processes doesn't take away all hardware resources.

RDMA controller allows limiting such HCA resource on per HCA basis.

Example implementation of this controller is already available below.
paravmellanox/cgroups@469ac1c

Copy link
Contributor

@wking wking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once this settles down, it would be good to get JSON Schema entries too.

config-linux.md Outdated
@@ -169,7 +169,7 @@ In addition to any devices configured with this setting, the runtime MUST also s
## <a name="configLinuxControlGroups" />Control groups

Also known as cgroups, they are used to restrict resource usage for a container and handle device access.
cgroups provide controls (through controllers) to restrict cpu, memory, IO, pids and network for the container.
cgroups provide controls (through controllers) to restrict cpu, memory, IO, pids, network and rdma resources for the container.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The kernel docs aren't particularly consistent about this, but since RDMA is an acronym for remote direct memory access, I think we want to use RDMA here (and in other places, except for the literal rdma).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. yes, I will change rdma to RDMA.

config-linux.md Outdated

* **`hca_device`** *(string, REQUIRED)* - specifies the device name whose resources limit to be configured
* **`hca_handles`** *(uint32, REQUIRED)* - specifies the maximum number of hca_objects in the cgroup for a specified device
* **`hca_objects`** *(uint32, REQUIRED)* - specifies the maximum number of hca_handles in the cgroup for a specified device
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have “Default is "no limit".” in the Go comments, but REQUIRED here. Reading the kernel docs, I think hca_handles and hca_objects should both be OPTIONAL. If you like, you can add a requirement like this

You MUST specify at least one of hca_handles or hca_objects in a given entry, and MAY specify both.

The kernel docs also make a distinction between leaving those unset (no change from the previous limit) or setting them to max (clearing a possible previous limit). Since the spec supports joining and tweaking existing cgroups, I think we need a way to distinguish between the two cases. The current policy seems to be using signed integers and using -1 for “explicitly remove any limits”.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I initially had it optional but than made it required to simplify the -1. But I think its fine as it already exist.
I will make both of them as optional.
So when user doesn't want to configure any one of the value, it should set to -1. Correct?
Because at kernel level they are just optional KV pair, but at cgroups spec level it is a structure. So no_limit should be initialized as -1. Right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So when user doesn't want to configure any one of the value, it should set to -1. Correct?

No, leave it unset to not change the limit (nothing written to the control file for this parameter). Set to -1 to clear the limit (max written to the cgroup control file for this parameter).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so when cgroups.New() is invokes, there are 4 valid values for hca_objects/hca_handles.

  1. 0 - no allocation possible
  2. +ve - valid positive value
  3. -1 to clear the limits
  4. do not touch the limits

If we make these two parameters optional, what should user set in specs.LinuxRdma.HcaObjects to convey - 'do not touch the limits'.
Setting 0 will disable it (which is valid value).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... what should user set in specs.LinuxRdma.HcaObjects to convey - 'do not touch the limits'.

nil, which is why we want pointers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. Sounds good. I will revise the spec.

// Maximum number of HCA handles that can be opened. Default is "no limit".
HcaHandles uint32 `json:"hca_handles"`
// Maximum number of HCA objects that can be created. Default is "no limit".
HcaObjects uint32 `json:"hca_objects"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are zero values allowed (e.g. for “you can't use any HCA handles”)? If not, we should say so in the Markdown spec. If so, we should make these omitempty pointers so we can distinguish between “set to zero” and “unset”.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. setting zero is allowed.

Copy link
Contributor

@wking wking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also on the “don't allow no-op noise” front would be requiring at least one of hca_handles or hca_objects to be set. If you add that contraint, no-op configs like "rdma": [{"hca_device": "m1x5_1"}] become illegal. I'm not strongly against no-op noise, but I am weakly against it to maintain consistency with the existing block I/O pattern.

config-linux.md Outdated

The following parameters can be specified to set up the controller:

* **`hca_device`** *(string, OPTIONAL)* - specifies the device name whose resources limit to be configured
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather keep this REQUIRED, because there's no point in supporting "rdma": [{}]. It also lets you drop the later line about “You must specify a valid hca_device…”.

Copy link
Contributor Author

@paravmellanox paravmellanox Dec 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. so if I understand correctly, "rdma": [{}] can be treated as illegal config.
"rdma": [{"hca_device": "m1x5_1"}] is valid. And other two parameters are optional anyway.

So you are saying if end user passes hca_device only then we create the rdma controller?
Won't Docker or other engines create all empty cgroup controllers as "rdma": [{}]?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if I understand correctly, "rdma": [{}] can be treated as illegal config

That's what you'd get if you REQUIRED hca_device (which I think is a good idea for consistency with rlimits[].type ).

"rdma": [{"hca_device": "m1x5_1"}] is valid…

As you have it now. But if you requiring at least one of hca_handles or hca_objects to be set, then that would be invalid too (and we'd be consistent with the current block I/O approach).

So you are saying if end user passes hca_device only then we create the rdma controller? Won't Docker or other engines create all empty cgroup controllers as "rdma": [{}]?

If you have rdma unset or set to [{}] (or [{"hca_device": "m1x5_1"}], etc.), you're not adding new limits. The spec position on that is:

Runtimes MAY attach the container process to additional cgroup controllers beyond those necessary to fulfill the resources settings.

which means that, for a config that does not adjust limits via linux.resources.rdma, the runtime can attach the container process to the rdma controller at cgroupsPath or not as it sees fit. See #493, #834, and opencontainers/runc#861 for more background.

In case it helps clarify, I think this section of text should look like:

Each entry has the following structure:

* **`hca_device`** *(string, REQUIRED)* - specifies the device name whose resources limit to be configured
* **`hca_handles`** *(uint32, OPTIONAL)* - specifies the maximum number of hca_objects in the cgroup for a specified device
* **`hca_objects`** *(uint32, OPTIONAL)* - specifies the maximum number of hca_handles in the cgroup for a specified device

You MUST specify at least one of `hca_handles` or `hca_objects` in a given entry, and MAY specify both.

To close another no-op loophole we could follow this and require rdma to, when set, contain at least one entry. That would be new wording though, since the spec doesn't currently do that for any OPTIONAL property.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

section of text that you described looks good to me. If we follow the text you described, it closes the no-op loophole too, isn't it for rdma controller?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we follow the text you described, it closes the no-op loophole too, isn't it for rdma controller?

You could still have "rdma": [] unless you REQUIRE the array to have at least one entry. We don't close that loophole for any other OTIONAL properties at the moment (e.g. "mounts": [] is ok), so we'd need maintainer input on whether it was worth setting a new precedent.

Another issue is the possibility of duplicate hca_device entries. My preferred approach to tgat would be to use an object:

"rdma": {
  "m1x5_1": {
    "hca_handles": 3
  }
}

My attempt to do that for rlimits was rejected (#583), but I'm not clear on why. It's possible that maintainers would support an object for this new property.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is case (1), where runtimes MAY create the cgroup (if necessary) and attach the container process to the cgroup. Apparently containerd/cgroups decides to do that. But runtimes don't have to. And not creating the cgroup (and not attaching the container process to the non-existent cgroup) is the lazy-cgroups approach discussed in opencontainers/runc#861. If containerd/cgroups decided to implement the lazy-cgroups approach on nil, their Create would instead look like:

I agree to what you described.
However current scheme (1) of runc-spec and its implementation sounds reasonble to me and fits the requirements of rdma too without #861.
When opencontainers/runc#861 scheme is adopted, rdma could possibly follow that.

In order to make progress, I propose to move forward with currently defined scheme and implementation of (1), (2) and (3).
(2) and (3) are already matches generic runc-spec definition.

(1) is implemented by all cgroups (not just pids) in https://github.com/containerd/cgroups.
If (1) needs change in runc-spec definition or wording, it is going to be different PR which will affect other resources too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If (1) needs change in runc-spec definition or wording, it is going to be different PR which will affect other resources too.

That makes sense to me. I had been under the impression (based on this comment), that you intended to start requiring cgroup creation (if necessary) and joining in the "rmda": {} case, which is not a requirement that the current spec has for other controllers. Currently the spec, as I read it, allows runtimes to decide whether they want to create/join or not for both the “unset” (e.g. "resources": {}) and “set empty” cases (e.g. "resources": {"memory": {}}) because of this clause. If you're ok using that same logic (for now) for rmda, then I'm back to preferring the wording I floated here.

Copy link
Contributor Author

@paravmellanox paravmellanox Dec 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense to me. I had been under the impression (based on this comment), that you intended to start requiring cgroup creation (if necessary) and joining in the "rmda": {} case, which is not a requirement that the current spec has for other controllers. Currently the spec, as I read it, allows runtimes to decide whether they want to create/join or not for both the “unset” (e.g. "resources": {}) and “set empty” cases (e.g. "resources": {"memory": {}}) because of this clause. If you're ok using that same logic (for now) for rmda, then I'm back to preferring the wording I floated here.

Based on the clause you mention https://github.com/opencontainers/runtime-spec/blame/v1.0.1/config-linux.md#L197 and based on implementation at https://github.com/containerd/cgroups, which follows the MAY part, -> "rdma": [{}] and "rdma": [{"hca_device": "m1x5_1"}] is also valid similar to priorities array of objects in Network cgroup as better example than Pids. For rdma instead of array of objects, as you suggested, we will have map as

"rdma": {
  "m1x5_1": {
    "hca_handles": 3
  }
}

Are you ok with that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For rdma instead of array of objects, as you suggested, we will have map as

"rdma": {
  "m1x5_1": {
    "hca_handles": 3
  }
}

Are you ok with that?

Ah, right, that would change the wording. I'm in favor of the object/map approach, yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. Great. So let me revise the spec to have treat them as optional and as map instead of array.

// LinuxRdma for Linux cgroup 'rdma' resource management (Linux 4.11)
type LinuxRdma struct {
// Hca device name whose resources to be restricted
HcaDevice string `json:"hca_device,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you go back to REQUIRing hca_device, you'll want to drop omitempty here.

Copy link
Contributor Author

@paravmellanox paravmellanox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review the updated changes. Reference implementation also updated to match these updates.
Once spec is approved, will send PR for https://github.com/paravmellanox/cgroups/blob/master/rdma.go

config-linux.md Outdated
```json
"rdma": {
"limits": [
"mlx5_1": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not valid JSON. I think you want to drop limits and just make rdma an object with device-name keys and handles/objects-limit values.

type LinuxRdma struct {
// Limits are a set of key value pairs that define RDMA resource limits,
// where the key is device name and value is resource limits.
Limits map[string]LinuxRdmaLimit `json:"limits,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is whatI think you want for rdma; no need for a Limits. Just use:

type LinuxRdma map[string]LinuxRdmaLimit

Or skip LinuxRdma entirely and just inline map[string]LinuxRdmaLimitinLinuxResources`.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or skip LinuxRdma entirely and just inlinemap[string]LinuxRdmaLimitinLinuxResources`.
Simplification looks fine to me. Sending updates.

Copy link
Contributor Author

@paravmellanox paravmellanox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please approve the changes if they look good now, we need to get this going as users are requesting it.
Hi @wking , any more changes to be done or we are good to go. I believe we addressed all the comments.
We like to get this small part integrated so that we can make this available to users.

@paravmellanox
Copy link
Contributor Author

@wking any updates? Do you need any more information for this PR?

@wking
Copy link
Contributor

wking commented Jan 24, 2018 via email

@wking
Copy link
Contributor

wking commented Jan 24, 2018

With paravmellanox#1 merged, the new content looks good to me. You might want to squash this PR down to a single commit. And passing tests are blocked on #945. Any maintainers want to look this over?

@paravmellanox
Copy link
Contributor Author

I am not sure that if I squash and force push the new squashed commit, will the discussions happened here will be lost or not.
I was hoping that If maintainer does squash merge, than discussion is preserved and final merged code also has single commit.
I little new in this area, so please guild me.

@wking
Copy link
Contributor

wking commented Jan 24, 2018 via email

@paravmellanox
Copy link
Contributor Author

Hi @wking, @mrunalp, and other reviewers,
I squashed all the commits and now have single commit for this PR.
Can you please review and merge it?
Code will follow soon sometime next week once this is merged.

Copy link
Contributor

@wking wking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

846209f looks good to me. We'll see what the maintainers think ;).

@paravmellanox
Copy link
Contributor Author

@wking , thanks a lot for the inputs, fixup and additional JSON definitions.

@paravmellanox
Copy link
Contributor Author

Hi,

@vishh @vbatts @philips @mrunalp @hqhq @dqminh @tianon @crosbymichael @caniszczyk

Can you please review this PR? @wking and I have went through several iterations. Feel that this is good start now and like to merge this PR at earliest to provide this feature to users waiting for it.

config-linux.md Outdated
The name of the device to limit is the entry key.
Entry values are objects with the following properties:

* **`hca_handles`** *(uint32, OPTIONAL)* - specifies the maximum number of hca_objects in the cgroup
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be no _ in the json properties

@crosbymichael
Copy link
Member

Overall, this looks good to me once the _ are removed from the json.

@paravmellanox
Copy link
Contributor Author

Removed the underscore and updated the PR.

@crosbymichael
Copy link
Member

Why did you close?

@paravmellanox paravmellanox reopened this Feb 12, 2018
@paravmellanox
Copy link
Contributor Author

paravmellanox commented Feb 12, 2018

Closed by mistake. I am sorry.

Hi @crosbymichael, would you please marked as reviewed/approved?
Also who else from the group should complete the review?

@paravmellanox
Copy link
Contributor Author

Got closed during rebase process, must be something wrong I did from the GUI interface. Reopening it again.

@paravmellanox paravmellanox reopened this Feb 16, 2018
@paravmellanox
Copy link
Contributor Author

After the rebase, travis-ci still seems to fail while executing:
make -C schema test

ERROR: invalid document-file path: stat test/config/good/invalid-json.json: no such file or directory

testing test/config/bad/linux-rdma.json
The document is valid
received unexpected validation success
make: *** [test] Error 1

@wking or @crosbymichael, how to check which out of the 2 check failed the build and where to see what caused the failure if it is due to error in linux-rdma.json

@wking
Copy link
Contributor

wking commented Feb 16, 2018

After the rebase…

Looks like you haven't pushed the cherry-pick:

$ git fetch origin
$ git log -1 --oneline --decorate origin/pr/942
5d9bbad (origin/pr/942, origin/pr/936) config: Dedent root paragraphs, since they aren't a list entry

@paravmellanox
Copy link
Contributor Author

I followed below steps, likely not optimal.

  1. Reset the head of paravmellanox/runtime-spec master branch to an older version.
  2. Rebase & merge from opencontainers/runtime-spec.
  3. git am <rdma-cgroup.patch>
  4. git push origin master

This updated the PR with 4 patches (patches after rebase and new patch).
I am not sure how to avoid these patches which came as rebase to not show up in old PR.

As a result of this push command, travis-ci build triggered.
It looked at test/config/good/invalid-json.json

schema/test/config/good directory contains only 3 files. minimal*.json and spec-example.json.
invalid-json.json file is not found. It is present in test/config/bad directory.

@wking
Copy link
Contributor

wking commented Feb 16, 2018 via email

@paravmellanox
Copy link
Contributor Author

paravmellanox commented Feb 20, 2018

Hi @wking, I did follow the steps as described. However I continue to see the error:

ERROR: invalid document-file path: stat test/config/good/invalid-json.json: no such file or directory

As I mentioned, this file doesn't exist in https://github.com/opencontainers/runtime-spec/tree/master/schema/test/config/good directory where Travis-CI is looking for a file.
It is instead located in https://github.com/opencontainers/runtime-spec/tree/master/schema/test/config/bad

Or may be the error is actually below one?

testing test/config/bad/linux-rdma.json
The document is valid
received unexpected validation success

@wking wking mentioned this pull request Feb 20, 2018
@wking
Copy link
Contributor

wking commented Feb 20, 2018 via email

@paravmellanox
Copy link
Contributor Author

Hi @wking, @crosbymichael, Is there anything I can do to remove the bump of #947? I do not have much know-how of Travis-CI though, but I can attempt.

@wking
Copy link
Contributor

wking commented Feb 22, 2018 via email

@paravmellanox
Copy link
Contributor Author

Sure. @wking lets wait for another day to resolve #947.

@paravmellanox
Copy link
Contributor Author

In order to make progress to make use of this functionality, I updated PR to drop schema/test/config/bad/linux-rdma.json for now. @crosbymichael and others please continue to review and merge the PR.

config-linux.md Outdated
#### Example

```json
"rdmaLimits": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason why you named this rdmaLimits instead of just rdma?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason why you named this rdmaLimits instead of just rdma?

I like rdma better, although I don't care much. Previously, @paravmellanox cited hugepageLimits for precedent, although the spec obviously also contains limiting properties without a Limits suffix.

@wking
Copy link
Contributor

wking commented Feb 27, 2018

In order to make progress to make use of this functionality, I updated PR to drop schema/test/config/bad/linux-rdma.json for now.

#947 landed today, so you can restore the bad example.

Linux kernel 4.11 adds support for RDMA cgroup resource controller.
This allows limiting maximum number of open hca_handle and maximum number of
hca_objects which can be created by processes.

config-linux: Add documentation for Linux RDMA cgroup

Add documentation, example and link to kernel documentation for
Linux RDMA cgroup.

additionalProperties is defined for the JSON Schema draft-04 in [1]
with clearer documentation in draft-07 [2]. It is supportd by
gojsonschema since xeipuuv/gojsonschema@0572d9d (added
additionalProperties with inner schema, 2013-06-21).

[1]: https://tools.ietf.org/html/draft-fge-json-schema-validation-00#section-5.4.4
[2]: https://tools.ietf.org/html/draft-handrews-json-schema-validation-00#section-6.5.6

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: W. Trevor King <wking@tremily.us>
@paravmellanox
Copy link
Contributor Author

both changes done. @wking , @crosbymichael
rdmaLimits -> rdma.
Added back bad/linux-rdma.json file.
Travis-CI build is passing now.

@crosbymichael
Copy link
Member

crosbymichael commented Feb 27, 2018

LGTM

Approved with PullApprove

@paravmellanox
Copy link
Contributor Author

@crosbymichael, who else from the group should be reviewing these changes?

@crosbymichael
Copy link
Member

ping @opencontainers/runtime-spec-maintainers

@mrunalp
Copy link
Contributor

mrunalp commented Mar 1, 2018

LGTM

Approved with PullApprove

@mrunalp mrunalp merged commit fa4b36a into opencontainers:master Mar 1, 2018
@vbatts vbatts mentioned this pull request Jan 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants