New RT management policy #520

Merged
merged 12 commits into cypress on Apr 2, 2020

Conversation


@aarshkshah1992 commented Mar 31, 2020

Problem
Cypress moves away from Balsa's strategy: instead of routing tables only containing nodes that have active connections to us, we now allow nodes without connections to remain in our tables. Cypress keeps its routing tables fresh by polling every CleanupInterval; if we don't know whether an entry is still alive, we query it to check its liveness. Additionally, Cypress prefers nodes we've known about for longer over peers we've known about for less time, both because we assume that long-lived nodes will continue to be alive and because of security concerns about a flood of new peers attacking our routing tables.

Since every peer in the network initially connects to our bootstrap nodes, the bootstrap nodes will end up in their routing tables. When the bootstrap nodes' connection manager prunes those peers, they will end up re-querying a bootstrap node after the CleanupInterval. This means it is likely that every node in the network will query at least one of the bootstrap nodes every CleanupInterval. Additionally, since bootstrap nodes will be the longest-lived nodes in the routing tables, they will never get removed.

Solution
Each peer in the routing table maintains a "last valuable query" timestamp.
When a peer is responsible for a successful lookup, its "last valuable query" timestamp is updated.
When a peer fails to respond during a lookup, it is removed from the routing table.

When another peer contacts us, we add it to the table if it meets the "addition criteria".

When the table size is below a threshold, every peer known to the host is added to the routing table if it meets the "addition criteria".

The "addition criteria" stipulates that a new peer can be added to a bucket only if the bucket is not full, or if the bucket contains a peer with a stale "last valuable query" timestamp.

After any peer is added to a bucket, its "last valuable query" timestamp is updated.
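
To make this concrete, here is a minimal, self-contained sketch of the bookkeeping described above. All names here are hypothetical and this is not the actual implementation, which lives in the go-libp2p-kbucket PR linked below:

```go
package main

import (
	"fmt"
	"time"
)

// peerInfo tracks the "last valuable query" timestamp for a single routing-table peer.
type peerInfo struct {
	id           string // stand-in for a libp2p peer ID
	lastUsefulAt time.Time
}

// bucket is one k-bucket with the addition/eviction rules described above.
type bucket struct {
	peers     []peerInfo
	maxSize   int
	staleness time.Duration // how old a timestamp may be before the peer becomes replaceable
}

// tryAdd applies the "addition criteria": a peer enters the bucket only if the
// bucket is not full, or if it contains a peer with a stale "last valuable query"
// timestamp that can be evicted. On success, the new peer's timestamp is set to now.
func (b *bucket) tryAdd(id string, now time.Time) bool {
	if len(b.peers) < b.maxSize {
		b.peers = append(b.peers, peerInfo{id: id, lastUsefulAt: now})
		return true
	}
	for i, p := range b.peers {
		if now.Sub(p.lastUsefulAt) > b.staleness {
			b.peers[i] = peerInfo{id: id, lastUsefulAt: now} // evict the stale peer
			return true
		}
	}
	return false
}

// markValuable is called when a peer was responsible for a successful lookup.
func (b *bucket) markValuable(id string, now time.Time) {
	for i := range b.peers {
		if b.peers[i].id == id {
			b.peers[i].lastUsefulAt = now
			return
		}
	}
}

// removeFailed is called when a peer fails to respond during a lookup.
func (b *bucket) removeFailed(id string) {
	for i := range b.peers {
		if b.peers[i].id == id {
			b.peers = append(b.peers[:i], b.peers[i+1:]...)
			return
		}
	}
}

func main() {
	b := &bucket{maxSize: 2, staleness: time.Hour}
	now := time.Now()
	fmt.Println(b.tryAdd("A", now), b.tryAdd("B", now)) // true true: the bucket has room
	fmt.Println(b.tryAdd("C", now))                      // false: full and nobody is stale
	b.removeFailed("A")                                  // A failed to respond during a lookup
	fmt.Println(b.tryAdd("C", now))                      // true: there is room again
}
```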

RT PR at: libp2p/go-libp2p-kbucket#66

@aarshkshah1992 changed the base branch from master to petar/async on Mar 31, 2020 at 13:47
@aarshkshah1992 requested review from Stebalien and aschmahmann and removed the review request for Stebalien on Mar 31, 2020 at 14:38

@petar left a comment


I had one main comment. Otherwise looks good.

dht.go (resolved)
dht.go (resolved)
dht.go (outdated, resolved)

@aschmahmann left a comment


A good pass, left a bunch of comments/questions.

There seem to be a bunch of broken tests now, or tests that you've changed. Some of these tests might be failing because they use randomized or otherwise improper routing tables and expect things to work. For those tests we can probably manually clean up the routing tables (e.g. call Refresh). I'd try to keep the changes here minimal so that @petar can take a look at them without needing to grok all the complexities of this codebase.

```
@@ -788,7 +788,7 @@ func TestRefresh(t *testing.T) {
	}
	}()

	waitForWellFormedTables(t, dhts, 7, 10, 20*time.Second)
```
Contributor


Why this change? I think it might be making CI fail


@aarshkshah1992 commented Apr 1, 2020


Hey. I think the test was timing out (> 10 minutes) because of heavy lock contention on the routing table lock, caused by frequently polling the table size in waitForWellFormedTables (the polling interval is just 5ms) while also accessing it as part of the bootstrap query.

I've refactored this test to make it work and things look good now.

```
aarshshah@Aarshs-MacBook-Pro go-libp2p-kad-dht % GO111MODULE=on go test -race -count=50 -run TestRefresh
PASS
ok  	github.com/libp2p/go-libp2p-kad-dht	296.337s
```

Contributor


I guess this is fine, but it's a little strange. Also, this test now must take 5 seconds since we have to wait for the context to time out. It would probably be better to refactor waitForWellFormedTables into a second checkWellFormedTables function that we can call here every 50ms. We could do this in a followup PR though.
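
As a rough sketch of that suggestion (hypothetical code, not taken from this PR; it assumes the test package's IpfsDHT type and its routingTable field), the single-pass check could look like:

```go
// checkWellFormedTables does one inspection pass over the routing tables,
// leaving the polling cadence up to the caller.
func checkWellFormedTables(dhts []*IpfsDHT, minPeers, avgPeers int) bool {
	total := 0
	for _, d := range dhts {
		rtlen := d.routingTable.Size()
		total += rtlen
		if minPeers > 0 && rtlen < minPeers {
			return false
		}
	}
	return avgPeers <= 0 || total/len(dhts) >= avgPeers
}
```

The call site in TestRefresh could then poll at a coarser interval, for example:

```go
// poll every 50ms instead of the 5ms loop that hammers the routing table lock
for start := time.Now(); time.Since(start) < 20*time.Second; time.Sleep(50 * time.Millisecond) {
	if checkWellFormedTables(dhts, 7, 10) {
		break
	}
}
```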

dht_test.go (resolved)
dht_test.go (resolved)
ext_test.go (outdated, resolved)
query.go (outdated, resolved)
subscriber_notifee.go (outdated, resolved)
Comment on lines +14 to +20
```go
// TODO Debug test failures due to timing issue on windows
// Tests are timing dependent as can be seen in the 2 seconds timed context that we use in "tu.WaitFor".
// While the tests work fine on OSX and complete in under a second,
// they repeatedly fail to complete in the stipulated time on Windows.
// However, increasing the timeout makes them pass on Windows.

func TestRTEvictionOnFailedQuery(t *testing.T) {
```
Contributor


Were these tests moved from somewhere else? If so, why move them?

Contributor Author


These are new tests. We are now testing the change in the RT state based on query responses as per the new logic we've added in this PR.
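
As a rough illustration of the shape of these tests (the helpers setupDHT and connect, and the fields used here, are assumptions about the test package rather than a quote of the PR; it also assumes github.com/stretchr/testify/require), the eviction-on-failure case looks roughly like:

```go
func TestRTEvictionOnFailedQuery(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// spin up two DHTs and connect them so each ends up in the other's routing table
	d1 := setupDHT(ctx, t, false)
	d2 := setupDHT(ctx, t, false)
	connect(t, ctx, d1, d2)
	require.Contains(t, d1.routingTable.ListPeers(), d2.self)

	// shut d2 down so that the next lookup through it fails
	require.NoError(t, d2.Close())

	// run a lookup that contacts d2 and gets no response; per the new policy,
	// the failed peer must be evicted from d1's routing table
	_, _ = d1.GetClosestPeers(ctx, "some-key")

	// the real tests wait for this with tu.WaitFor, since the eviction is timing dependent
	require.NotContains(t, d1.routingTable.ListPeers(), d2.self)
}
```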

Contributor


ah ok, I noticed the OS X vs Windows thing and figured it was just copy-paste from our other issue like that.

Contributor Author


We got rid of the old tests as they are not needed anymore/don't prove anything in light of the new changes.

However, I am not sure if these tests will work on OSX/Windows (because the methodology of the tests and some of the code paths we hit are similar), so I kept the comment as is.

dht.go (outdated, resolved)
dht.go (outdated, resolved)
@aarshkshah1992

Hey @petar

Let me know if you would like me to break this down into small PRs for easy reviewing.

@aarshkshah1992 commented Apr 1, 2020

@aschmahmann @petar The huge number of files is because I rebased this on the peer filtering work to make sure there are no integration problems.

Can we please get the petar/async branch in first? Then I can rebase this PR on top of that too.

@aschmahmann left a comment


@aarshkshah1992 Left a few comments. There was a force push onto the petar/async branch so you'll need to drop those commits and rebase again. I'll try and get that branch merged into cypress by the time you're up, but it may have to wait until tomorrow.

subscriber_notifee.go (outdated, resolved)
@aarshkshah1992 commented Apr 2, 2020

@aschmahmann Can I please get a final set of eyes on this PR?

@aschmahmann changed the base branch from petar/async to cypress on Apr 2, 2020 at 16:53
@aarshkshah1992 changed the title from "[WIP] New RT management policy" to "New RT management policy" on Apr 2, 2020
dht.go (resolved)
query.go (outdated, resolved)
query.go (outdated, resolved)
@aarshkshah1992 commented Apr 2, 2020

@aschmahmann I have no idea why the test is failing on the CI:

```
aarshshah@Aarshs-MacBook-Pro go-libp2p-kad-dht % GO111MODULE=on go test -count=5 -race -run TestNotFound ./...
ok  	github.com/libp2p/go-libp2p-kad-dht	1.746s
```

@aarshkshah1992 commented Apr 2, 2020

@aschmahmann I'll fix this tomorrow. I have created an issue and assigned it to myself. I need to get this in right now so I can get the periodic pinging on top of this for an easy merge.

@aarshkshah1992 merged commit f79ad79 into cypress on Apr 2, 2020
Stebalien pushed a commit that referenced this pull request Apr 3, 2020
* new RT management policy