
feat: Improve ResourceManager UX #9338

Merged: 14 commits into master on Nov 10, 2022
Conversation


@ajnavarro ajnavarro commented Oct 7, 2022

This PR adds several new features to make ResourceManager easier to use:

  • Resource manager logs emitted when resources are exceeded are now at ERROR level instead of WARNING.
  • The resources-exceeded error now shows what kind of limit was reached and in which scope.
  • When limits are no longer being exceeded, we print a message telling the user so.
  • Added a `swarm limit all` command to show all configured limits in the same format as `swarm stats all`.
  • Added a `min-used-limit-perc` option to `swarm stats all` to show only stats that are above a specific usage percentage (see the example invocations below).
  • Greatly simplified default values.
  • Enabled ResourceManager by default.
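For example (illustrative invocations of the new commands; the exact flag placement may differ slightly):

ipfs swarm limit all                            # print every configured limit, same layout as `swarm stats all`
ipfs swarm stats --min-used-limit-perc=90 all   # show only scopes at or above 90% of their limit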

Output example:

2022-11-09T10:51:40.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:51:50.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 483095 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:51:50.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:52:00.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 455294 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:00.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:52:10.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 471384 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:10.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:52:20.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 8 times with error "peer:12D3KooWKqcaBtcmZKLKCCoDPBuA6AXGJMNrLQUPPMsA5Q6D1eG6: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:20.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 192 times with error "peer:12D3KooWPjetWPGQUih9LZTGHdyAM9fKaXtUxDyBhA93E3JAWCXj: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:20.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 469746 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:20.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:52:30.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 484137 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:30.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 29 times with error "peer:12D3KooWPjetWPGQUih9LZTGHdyAM9fKaXtUxDyBhA93E3JAWCXj: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:30.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:52:40.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 468843 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:40.566+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:52:50.566+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 366638 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:50.566+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:53:00.566+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 405526 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:53:00.566+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 107 times with error "peer:12D3KooWQZQCwevTDGhkE9iGYk5sBzWRDUSX68oyrcfM9tXyrs2Q: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:53:00.566+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:53:10.566+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 336923 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:53:10.566+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:53:20.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 71 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:53:20.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:53:30.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:64      Resource limits are no longer being exceeded.
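The 10-second cadence in the timestamps above comes from aggregating limit errors and periodically flushing a summary, rather than logging every individual failure. A minimal sketch of that pattern (illustrative Go, not the actual rcmgr_logging.go code; all names are invented):

package main

import (
	"errors"
	"log"
	"sync"
	"time"
)

// errorTally aggregates resource-limit errors and logs one summary line
// per distinct error message on a fixed interval, instead of logging
// every occurrence individually.
type errorTally struct {
	mu     sync.Mutex
	counts map[string]int // error message -> occurrences since last flush
}

func newErrorTally(interval time.Duration) *errorTally {
	t := &errorTally{counts: make(map[string]int)}
	go func() {
		for range time.Tick(interval) {
			t.flush()
		}
	}()
	return t
}

// Record counts one occurrence of err's message.
func (t *errorTally) Record(err error) {
	t.mu.Lock()
	t.counts[err.Error()]++
	t.mu.Unlock()
}

// flush logs a summary for each distinct error and resets the counts.
func (t *errorTally) flush() {
	t.mu.Lock()
	defer t.mu.Unlock()
	for msg, n := range t.counts {
		log.Printf("Resource limits were exceeded %d times with error %q.", n, msg)
		delete(t.counts, msg)
	}
}

func main() {
	tally := newErrorTally(10 * time.Second)
	tally.Record(errors.New("transient: cannot reserve inbound stream: resource limit exceeded"))
	time.Sleep(11 * time.Second) // wait for one flush
}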

Validation tests

  • The accelerated DHT client runs with no errors when ResourceManager is active; no problems were observed.
  • Ran an attack with 200 connections and 1M streams over the yamux protocol. The node remained usable during the attack. With ResourceManager deactivated, the node was killed by the OS because of the amount of memory consumed.
    • Actions performed while the attack was active:
      • Add files
      • Force a reprovide
      • Use the gateway to resolve an IPNS address

It closes #9001
It closes #9351
It closes #9322

@ajnavarro ajnavarro changed the title from "Improve ErrorManager UX" to "feat: Improve ErrorManager UX" on Oct 7, 2022
@ajnavarro ajnavarro changed the title from "feat: Improve ErrorManager UX" to "feat: Improve ResourceManager UX" on Oct 10, 2022
@ajnavarro ajnavarro marked this pull request as ready for review October 11, 2022 09:27
@ajnavarro ajnavarro force-pushed the feat/improve-resource-manager-ux branch from d5e3765 to e362445 on October 19, 2022 10:42
@ajnavarro ajnavarro requested a review from BigLep October 19, 2022 10:43
@ajnavarro ajnavarro force-pushed the feat/improve-resource-manager-ux branch from e362445 to ad81a06 on October 19, 2022 11:00

@BigLep BigLep left a comment


It looks good to me from a log-message standpoint. I'll let @guseggert give the approval from a code standpoint.

@BigLep BigLep mentioned this pull request Nov 8, 2022
@ajnavarro ajnavarro force-pushed the feat/improve-resource-manager-ux branch 4 times, most recently from 082af2c to ab2187c on November 9, 2022 12:50

BigLep commented Nov 9, 2022

@ajnavarro and I had a verbal discussion on 2022-11-09. I'm now going to work on adding some comments and improvements to rcmgr_defaults.go.


@BigLep BigLep left a comment


We also need to update https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr:

  1. Remove the "experimental" notes.
  2. Link to the updated resource manager location (now that the resource manager is in the go-libp2p monorepo).
  3. Discuss being able to set incremental limits? Although I guess if someone is in expert mode, specifying all their limits, we won't be doing scaling based on machine resources; they need to do that on their own. We should call this out (see the config sketch below).
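For example, an expert-mode override might pin only a few values and leave the rest to the defaults, along the lines of the sketch below (hypothetical values; the field names should be double-checked against the go-libp2p limit config for this release). The caveat to call out: once a user specifies a limit explicitly, that value is no longer scaled to the machine's resources.

{
  "Swarm": {
    "ResourceMgr": {
      "Enabled": true,
      "Limits": {
        "System": {
          "ConnsInbound": 1024,
          "Memory": 1073741824
        }
      }
    }
  }
}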

We also need a changelog update, but I'm fine if that's a separate PR.

I would like to review again once feedback is incorporated. I'm also happy to connect verbally to close this out today.


@BigLep BigLep left a comment


This is great, @ajnavarro! Thanks for your persistence here!

The default limits look good to me, and my approval is based on them.

I assume you'll incorporate the relevant bit of my feedback below, but don't block on me needing to see another round. If you address the comments, you should ship from my perspective.

Please make sure from a code perspective that you get a signoff from someone like @guseggert. I see he reviewed, and I know you two spoke verbally, but I don't know if there's anything else he wants to see before approving.

The other thing I'd love to see is PR comments describing what testing you have done. I'm hoping this passes when you use the accelerated DHT client.

I think it would also be good to specify the "attack script" configuration you used. (We obviously won't paste in the attack script itself.) It would be great to show that even with lots of peers, the node doesn't fall over.

For example, a comment like this would be great:

With default resource manager configuration, the node spun up and built a routing table when the accelerated client was used.  There were no errors in the logs.

These "attacks" were tried against the default configuration but the node stayed responsive.

I ran:
./attack-script.sh --numPeers=5 --numConnectionsPeer=100 --numStreamsPerConnection=100
./attack-script.sh --numPeers=100 --numConnectionsPeer=100 --numStreamsPerConnection=100

In both cases, I was still able to do the following while the node was "under attack":

ipfs add file
ipfs get file
curl URL_FOR_GATEWAY_RETRIEVAL

The above is just an idea. You don't have to follow it exactly.


@guseggert guseggert left a comment


The comments I left are not blockers; feel free to incorporate them in this PR or in a subsequent one. We don't need to block the RC release train for this. Other than Steve's comments, LGTM!

@ajnavarro ajnavarro force-pushed the feat/improve-resource-manager-ux branch from 03614f4 to f9b7a79 on November 10, 2022 10:32
@ajnavarro ajnavarro merged commit 254d81a into master Nov 10, 2022
@ajnavarro ajnavarro deleted the feat/improve-resource-manager-ux branch November 10, 2022 11:26

BigLep commented Nov 10, 2022

@ajnavarro: thanks for adding the "Validation tests" section to the PR. The dimension I think we need to call out is "number of peers". I assume your validation test right now used a small number of peers. That is good/fine, since it ensures we are protected from unintentional DoS by misbehaving nodes, but a malicious attacker will just increase the number of peer IDs it uses to then hit system limits. I want to make sure that the system-scope limits in place still keep the node responsive.

In the comment above, I gave an example of what I think we want to see:

./attack-script.sh --numPeers=5 --numConnectionsPeer=100 --numStreamsPerConnection=100
./attack-script.sh --numPeers=100 --numConnectionsPeer=100 --numStreamsPerConnection=100

I'm not saying those are the right numbers, but we want to do something like this, where the goal is to test the "system scope" limits rather than just the "peer scope" limits. Thanks!


ajnavarro commented Nov 11, 2022

I modified the attack script to be able to create several hosts at the same time.

I did a couple more tests:

  • 100 hosts, 100 conns per host, 100000 streams per host
  • 1000 hosts, 10 conns per host, 10000 streams per host

My computer cannot keep up with 1000 hosts simultaneously (only ~500 were alive at any point in time).

Kubo and the attackers were running on the same PC.

With this number of hosts, we are hitting the system and transient scopes:

2022-11-11T13:32:31.918+0100	ERROR	resourcemanager	libp2p/rcmgr_logging.go:53	Resource limits were exceeded 201763 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-11T13:32:31.918+0100	ERROR	resourcemanager	libp2p/rcmgr_logging.go:53	Resource limits were exceeded 274 times with error "system: cannot reserve inbound connection: resource limit exceeded".

The Kubo node is still able to process requests and perform operations, but you can feel the choppiness.

[Screenshot: Kubo WebUI at 127.0.0.1:5001 during the attack, 2022-11-11 13:22]


BigLep commented Nov 11, 2022

@ajnavarro: thanks for the update - this is useful/great. A couple of things:

  1. Were your attacker node and your Kubo node on the same physical host, or separate hosts? (Let's describe the setup a little more.)
  2. Please upstream your attack-script change if possible, because others should ideally be testing with large numbers of peers (not just large numbers of connections per peer and streams per connection).

Thanks!

@ajnavarro ajnavarro (Member, Author) commented:

@BigLep Kubo and the attackers were running on the same PC. I updated the comment.

I'll upstream my changes.


BigLep commented Nov 11, 2022

Thanks @ajnavarro. I'm trying to think through the ramifications of the attack node and the Kubo node being on the same host. They're inevitably competing with each other for resources, and I wonder how much that affects the validity of the test; I'm not sure how to reason about it. I think it would be cleanest if we had two separate host setups (e.g., run the test on a cloud provider).

@ajnavarro ajnavarro (Member, Author) commented:

@BigLep even if they are competing for resources, the tests are clear: in all cases, if the resource manager is off, the node is killed by the OS for OOM. When the resource manager is on, the node works as described before.

The objective of these manual tests was not to cover all corner cases (there was no time for that; we wanted to have this in 0.17) but to check that ResourceManager was doing something and effectively protecting the node.

We need more in-depth tests, maybe using our CI, or maybe using Testground, to check several attack types.
