New RT management policy #520
Conversation
I had one main comment. Otherwise looks good.
A good pass, left a bunch of comments/questions.
There seem to be a bunch of broken tests now, or tests that you've changed. Some of these tests might be failing because they use randomized or otherwise improper routing tables and expect things to work. For those tests we can probably manually clean up the routing tables (e.g. call Refresh). I'd try to keep the changes here minimal so that @petar can take a look at them without needing to grok all the complexities of this codebase.
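As a rough illustration of the "manually clean up the routing tables (e.g. call Refresh)" idea, a test could force a refresh on every node before asserting on table contents. This is only a hedged sketch; the helper name and the RefreshRoutingTable API/signature used here are assumptions, not necessarily what this branch exposes.

```go
// Hypothetical test helper (the refresh API shown is an assumption): force a
// routing-table refresh on every DHT so the test doesn't depend on randomized
// or stale tables happening to be correct.
func refreshTables(t *testing.T, ctx context.Context, dhts []*IpfsDHT) {
	t.Helper()
	for _, d := range dhts {
		select {
		case err := <-d.RefreshRoutingTable(): // wait for the refresh to finish
			if err != nil {
				t.Fatal(err)
			}
		case <-ctx.Done():
			t.Fatal(ctx.Err())
		}
	}
}
```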
@@ -788,7 +788,7 @@ func TestRefresh(t *testing.T) {
	}
}()

waitForWellFormedTables(t, dhts, 7, 10, 20*time.Second)
Why this change? I think it might be making CI fail
Hey. I think the test was timing out (> 10 minutes) because of heavy lock contention on the routing table lock: waitForWellFormedTables polls the table size very frequently (the polling interval is just 5ms) while the bootstrap query is also accessing it.
I've refactored this test to make it work and things look good now.
aarshshah@Aarshs-MacBook-Pro go-libp2p-kad-dht % GO111MODULE=on go test -race -count=50 -run TestRefresh
PASS
ok github.com/libp2p/go-libp2p-kad-dht 296.337s
I guess this is fine, but it's a little strange. Also, this test must now take 5 seconds since we have to wait for the context to time out. It would probably be better to refactor waitForWellFormedTables into a second checkWellFormedTables function that we can call here every 50ms. Could do this in a followup PR though.
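For what it's worth, a minimal sketch of that refactor might look like the following. The bodies here are my assumption of what waitForWellFormedTables does (checking each routing table's size against bounds), not the actual code in this PR:

```go
// checkWellFormedTables reports whether every DHT's routing table currently
// holds between minPeers and maxPeers entries (a single, cheap check that
// takes the routing table lock only once per table).
func checkWellFormedTables(dhts []*IpfsDHT, minPeers, maxPeers int) bool {
	for _, d := range dhts {
		rtlen := d.routingTable.Size()
		if rtlen < minPeers || (maxPeers >= 0 && rtlen > maxPeers) {
			return false
		}
	}
	return true
}

// waitForWellFormedTables then just polls the check on a coarser interval
// (e.g. every 50ms) until it passes or the timeout expires.
func waitForWellFormedTables(t *testing.T, dhts []*IpfsDHT, minPeers, maxPeers int, timeout time.Duration) {
	t.Helper()
	deadline := time.After(timeout)
	for {
		if checkWellFormedTables(dhts, minPeers, maxPeers) {
			return
		}
		select {
		case <-deadline:
			t.Fatal("routing tables did not become well formed in time")
		case <-time.After(50 * time.Millisecond):
		}
	}
}
```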
// TODO Debug test failures due to timing issue on windows
// Tests are timing dependent as can be seen in the 2 seconds timed context that we use in "tu.WaitFor".
// While the tests work fine on OSX and complete in under a second,
// they repeatedly fail to complete in the stipulated time on Windows.
// However, increasing the timeout makes them pass on Windows.

func TestRTEvictionOnFailedQuery(t *testing.T) {
Were these tests moved from somewhere else? If so, why move them?
These are new tests. We are now testing the change in the RT state based on query responses as per the new logic we've added in this PR.
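To make the shape of these tests concrete, here's a hedged sketch (helper names like setupDHT/connectNoSync and the exact assertions are assumptions, not the PR's actual tests): after a successful query the responder should end up in the querier's routing table, and the companion eviction test would assert the opposite when the peer fails to respond.

```go
// Hypothetical shape of an "RT state changes on query response" test; the
// helper names and APIs used here are assumptions, not code from this PR.
func TestRTAdditionOnSuccessfulQuery(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	dhtA := setupDHT(ctx, t, false)   // assumed helper: spin up a DHT node
	dhtB := setupDHT(ctx, t, false)
	connectNoSync(t, ctx, dhtA, dhtB) // assumed helper: connect without waiting on RTs

	// Query from A; B responds, so B should end up in A's routing table.
	_, _ = dhtA.FindPeer(ctx, dhtB.PeerID())

	// Poll A's routing table until B shows up or the 2-second context expires.
	for dhtA.routingTable.Find(dhtB.PeerID()) == "" {
		select {
		case <-ctx.Done():
			t.Fatal("B never appeared in A's routing table")
		case <-time.After(50 * time.Millisecond):
		}
	}
}
```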
ah ok, I noticed the OS X vs Windows thing and figured it was just copy-paste from our other issue like that.
We got rid of the old tests as they are not needed anymore/don't prove anything in light of the new changes.
However, I am not sure whether these tests will work on OSX/Windows (the methodology of the tests and some of the code paths we hit are similar to before), so I kept the comment as is.
Hey @petar, let me know if you would like me to break this down into smaller PRs for easier reviewing.
Force-pushed from 1eb7008 to 6f0002c
@aschmahmann @petar The huge number of files is because I rebased this on the peer filtering work to make sure there are no integration problems. Please can we get the
@aarshkshah1992 Left a few comments. There was a force push onto the petar/async branch so you'll need to drop those commits and rebase again. I'll try and get that branch merged into cypress by the time you're up, but it may have to wait until tomorrow.
Force-pushed from 6f0002c to 145903c
@aschmahmann Please can I get a final set of fine eyes on this PR?
Force-pushed from 145903c to 66a406a
@aschmahmann I have no idea why the test is failing on the CI:
@aschmahmann I'll fix this tomorrow. I have created an issue and assigned it to myself. I need to get this in right now so I can get the periodic pinging on top of it for an easy merge.
Problem
Cypress moves away from Balsa's strategy of only keeping peers we have connections to in our routing tables: nodes we are no longer connected to are now allowed to remain. Cypress keeps its routing tables fresh by polling every CleanupInterval to check whether the entries in our routing tables are alive and, if not, querying them to check liveness. Additionally, Cypress prefers nodes we've known about for longer over peers we've known about for less time, both because we assume that long-lived nodes will continue to be alive and because we have security concerns about a flood of new peers attacking our routing tables.
Since every peer in the network initially connects to our bootstrap nodes, the bootstrap nodes will end up in their routing tables. When the bootstrap nodes' connection managers prune those connections, the peers will end up requerying the bootstrap nodes after the CleanupInterval. This means it's likely that every node in the network will end up querying at least one of the bootstrap nodes every CleanupInterval. Additionally, since bootstrap nodes will be the longest-lived nodes in the routing tables, they will never get removed.
Solution
- Each contact peer in the table maintains a "last valuable query" timestamp.
- When a contact peer is responsible for a successful lookup, its "last valuable query" timestamp is updated.
- When a contact peer fails to respond during a lookup, it is removed from the routing table.
- When another peer contacts us, we add it to the table if it meets the "addition criteria".
- When the table size is below a threshold, every peer known to the host is added to the routing table if it meets the "addition criteria".
- The "addition criteria" stipulates that a new peer can be added to a bucket only if the bucket is not full or it contains a peer with a stale "last valuable query" timestamp.
- After any peer is added to a bucket, its "last valuable query" timestamp is updated.
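To make these rules concrete, here is a minimal, illustrative sketch in Go. This is not the go-libp2p-kbucket implementation (see the linked PR below for the real one); the types, field names, and thresholds are assumptions chosen for readability.

```go
package rtpolicy

import "time"

type peerID string

// contact is a routing-table entry carrying the "last valuable query" timestamp.
type contact struct {
	id                peerID
	lastValuableQuery time.Time
}

// bucket is a single k-bucket with a fixed capacity and a staleness threshold.
type bucket struct {
	contacts []contact
	capacity int           // e.g. the usual bucket size of 20
	maxStale time.Duration // how old "last valuable query" may be before a peer is replaceable
}

// tryAdd applies the "addition criteria": a new peer is accepted only if the
// bucket is not full, or if it contains a peer whose "last valuable query"
// timestamp is stale (that peer is then replaced). The new entry's timestamp
// is set to now, per the last rule in the list above.
func (b *bucket) tryAdd(p peerID, now time.Time) bool {
	if len(b.contacts) < b.capacity {
		b.contacts = append(b.contacts, contact{id: p, lastValuableQuery: now})
		return true
	}
	for i, c := range b.contacts {
		if now.Sub(c.lastValuableQuery) > b.maxStale {
			b.contacts[i] = contact{id: p, lastValuableQuery: now}
			return true
		}
	}
	return false
}

// markValuable is called when a contact peer was responsible for a successful
// lookup: its "last valuable query" timestamp is refreshed.
func (b *bucket) markValuable(p peerID, now time.Time) {
	for i := range b.contacts {
		if b.contacts[i].id == p {
			b.contacts[i].lastValuableQuery = now
		}
	}
}

// remove evicts a contact peer that failed to respond during a lookup.
func (b *bucket) remove(p peerID) {
	for i, c := range b.contacts {
		if c.id == p {
			b.contacts = append(b.contacts[:i], b.contacts[i+1:]...)
			return
		}
	}
}
```

For example, with capacity 2 and maxStale of one hour, a full bucket that holds one peer last seen two hours ago would still accept a new peer (the stale entry is replaced), while a full bucket of recently useful peers would reject it. This is what keeps long-idle bootstrap-node entries from occupying bucket slots forever.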
RT PR at: libp2p/go-libp2p-kbucket#66