New RT management policy #520
Conversation
I had one main comment. Otherwise looks good.
A good pass, left a bunch of comments/questions.
There seem to be a bunch of broken tests now, or tests that you've changed. Some of these tests might be failing because they use randomized or otherwise improper routing tables and expect things to work. For those tests we can probably manually clean up the routing tables (e.g. call Refresh). I'd try to keep the changes here minimal so that @petar can take a look at them without needing to grok all the complexities of this codebase.
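As a rough illustration of the "manually clean up the routing tables (e.g. call Refresh)" idea, a test could force a refresh on every node before asserting on table contents. This is only a hedged sketch; the helper name and the RefreshRoutingTable API/signature used here are assumptions, not necessarily what this branch exposes.

```go
// Hypothetical test helper (the refresh API shown is an assumption): force a
// routing-table refresh on every DHT so the test doesn't depend on randomized
// or stale tables happening to be correct.
func refreshTables(t *testing.T, ctx context.Context, dhts []*IpfsDHT) {
	t.Helper()
	for _, d := range dhts {
		select {
		case err := <-d.RefreshRoutingTable(): // wait for the refresh to finish
			if err != nil {
				t.Fatal(err)
			}
		case <-ctx.Done():
			t.Fatal(ctx.Err())
		}
	}
}
```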
@@ -788,7 +788,7 @@ func TestRefresh(t *testing.T) {
	}
}()

waitForWellFormedTables(t, dhts, 7, 10, 20*time.Second)
Why this change? I think it might be making CI fail
Hey. I think the test was timing out (> 10 minutes) because of heavy lock contention on the routing table lock: waitForWellFormedTables polls the table size very frequently (the polling interval is just 5ms) while the bootstrap query is also accessing it.
I've refactored this test to make it work and things look good now.
aarshshah@Aarshs-MacBook-Pro go-libp2p-kad-dht % GO111MODULE=on go test -race -count=50 -run TestRefresh
PASS
ok github.com/libp2p/go-libp2p-kad-dht 296.337s
I guess this is fine, but it's a little strange. Also, this test must now take 5 seconds since we have to wait for the context to time out. It would probably be better to refactor waitForWellFormedTables into a second checkWellFormedTables function that we can call here every 50ms. Could do this in a followup PR though.
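For what it's worth, a minimal sketch of that refactor might look like the following. The bodies here are my assumption of what waitForWellFormedTables does (checking each routing table's size against bounds), not the actual code in this PR:

```go
// checkWellFormedTables reports whether every DHT's routing table currently
// holds between minPeers and maxPeers entries (a single, cheap check that
// takes the routing table lock only once per table).
func checkWellFormedTables(dhts []*IpfsDHT, minPeers, maxPeers int) bool {
	for _, d := range dhts {
		rtlen := d.routingTable.Size()
		if rtlen < minPeers || (maxPeers >= 0 && rtlen > maxPeers) {
			return false
		}
	}
	return true
}

// waitForWellFormedTables then just polls the check on a coarser interval
// (e.g. every 50ms) until it passes or the timeout expires.
func waitForWellFormedTables(t *testing.T, dhts []*IpfsDHT, minPeers, maxPeers int, timeout time.Duration) {
	t.Helper()
	deadline := time.After(timeout)
	for {
		if checkWellFormedTables(dhts, minPeers, maxPeers) {
			return
		}
		select {
		case <-deadline:
			t.Fatal("routing tables did not become well formed in time")
		case <-time.After(50 * time.Millisecond):
		}
	}
}
```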
// TODO Debug test failures due to timing issue on windows
// Tests are timing dependent as can be seen in the 2 seconds timed context that we use in "tu.WaitFor".
// While the tests work fine on OSX and complete in under a second,
// they repeatedly fail to complete in the stipulated time on Windows.
// However, increasing the timeout makes them pass on Windows.

func TestRTEvictionOnFailedQuery(t *testing.T) {
Were these tests moved from somewhere else? If so, why move them?
These are new tests. We are now testing the change in the RT state based on query responses as per the new logic we've added in this PR.
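To make the shape of these tests concrete, here's a hedged sketch (helper names like setupDHT/connectNoSync and the exact assertions are assumptions, not the PR's actual tests): after a successful query the responder should end up in the querier's routing table, and the companion eviction test would assert the opposite when the peer fails to respond.

```go
// Hypothetical shape of an "RT state changes on query response" test; the
// helper names and APIs used here are assumptions, not code from this PR.
func TestRTAdditionOnSuccessfulQuery(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	dhtA := setupDHT(ctx, t, false)   // assumed helper: spin up a DHT node
	dhtB := setupDHT(ctx, t, false)
	connectNoSync(t, ctx, dhtA, dhtB) // assumed helper: connect without waiting on RTs

	// Query from A; B responds, so B should end up in A's routing table.
	_, _ = dhtA.FindPeer(ctx, dhtB.PeerID())

	// Poll A's routing table until B shows up or the 2-second context expires.
	for dhtA.routingTable.Find(dhtB.PeerID()) == "" {
		select {
		case <-ctx.Done():
			t.Fatal("B never appeared in A's routing table")
		case <-time.After(50 * time.Millisecond):
		}
	}
}
```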
ah ok, I noticed the OS X vs Windows thing and figured it was just copy-paste from our other issue like that.
We got rid of the old tests as they are not needed anymore/don't prove anything in light of the new changes.
However, I am not sure whether these tests will work on OSX/Windows (the methodology of the tests and some of the code paths we hit are similar to before), so I kept the comment as is.
Hey @petar, let me know if you would like me to break this down into smaller PRs for easier reviewing.
Force-pushed from 1eb7008 to 6f0002c
@aschmahmann @petar The huge number of files is because I rebased this on the peer filtering work to make sure there are no integration problems. Please can we get the
@aarshkshah1992 Left a few comments. There was a force push onto the petar/async branch so you'll need to drop those commits and rebase again. I'll try and get that branch merged into cypress by the time you're up, but it may have to wait until tomorrow.
Force-pushed from 6f0002c to 145903c
@aschmahmann Please can I get a final set of fine eyes on this PR?
Force-pushed from 145903c to 66a406a
@aschmahmann I have no idea why the test is failing on the CI:
@aschmahmann I'll fix this tomorrow. I have created an issue and assigned it to myself. I need to get this in right now so I can get the periodic pinging on top of it for an easy merge.
Problem
Cypress moves away from Balsa's strategy of only keeping peers we have connections to in our routing tables: nodes we are no longer connected to are now allowed to remain. Cypress keeps its routing tables fresh by polling every CleanupInterval to check whether the entries in our routing tables are alive and, if not, querying them to check liveness. Additionally, Cypress prefers nodes we've known about for longer over peers we've known about for less time, both because we assume that long-lived nodes will continue to be alive and because we have security concerns about a flood of new peers attacking our routing tables.
Since every peer in the network initially connects to our bootstrap nodes, the bootstrap nodes will end up in their routing tables. When the bootstrap nodes' connection managers prune those connections, the peers will end up requerying the bootstrap nodes after the CleanupInterval. This means it's likely that every node in the network will end up querying at least one of the bootstrap nodes every CleanupInterval. Additionally, since bootstrap nodes will be the longest-lived nodes in the routing tables, they will never get removed.
Solution
- Each contact peer in the table maintains a "last valuable query" timestamp.
- When a contact peer is responsible for a successful lookup, its "last valuable query" timestamp is updated.
- When a contact peer fails to respond during a lookup, it is removed from the routing table.
- When another peer contacts us, we add it to the table if it meets the "addition criteria".
- When the table size is below a threshold, every peer known to the host is added to the routing table if it meets the "addition criteria".
- The "addition criteria" stipulates that a new peer can be added to a bucket only if the bucket is not full or it contains a peer with a stale "last valuable query" timestamp.
- After any peer is added to a bucket, its "last valuable query" timestamp is updated.
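To make these rules concrete, here is a minimal, illustrative sketch in Go. This is not the go-libp2p-kbucket implementation (see the linked PR below for the real one); the types, field names, and thresholds are assumptions chosen for readability.

```go
package rtpolicy

import "time"

type peerID string

// contact is a routing-table entry carrying the "last valuable query" timestamp.
type contact struct {
	id                peerID
	lastValuableQuery time.Time
}

// bucket is a single k-bucket with a fixed capacity and a staleness threshold.
type bucket struct {
	contacts []contact
	capacity int           // e.g. the usual bucket size of 20
	maxStale time.Duration // how old "last valuable query" may be before a peer is replaceable
}

// tryAdd applies the "addition criteria": a new peer is accepted only if the
// bucket is not full, or if it contains a peer whose "last valuable query"
// timestamp is stale (that peer is then replaced). The new entry's timestamp
// is set to now, per the last rule in the list above.
func (b *bucket) tryAdd(p peerID, now time.Time) bool {
	if len(b.contacts) < b.capacity {
		b.contacts = append(b.contacts, contact{id: p, lastValuableQuery: now})
		return true
	}
	for i, c := range b.contacts {
		if now.Sub(c.lastValuableQuery) > b.maxStale {
			b.contacts[i] = contact{id: p, lastValuableQuery: now}
			return true
		}
	}
	return false
}

// markValuable is called when a contact peer was responsible for a successful
// lookup: its "last valuable query" timestamp is refreshed.
func (b *bucket) markValuable(p peerID, now time.Time) {
	for i := range b.contacts {
		if b.contacts[i].id == p {
			b.contacts[i].lastValuableQuery = now
		}
	}
}

// remove evicts a contact peer that failed to respond during a lookup.
func (b *bucket) remove(p peerID) {
	for i, c := range b.contacts {
		if c.id == p {
			b.contacts = append(b.contacts[:i], b.contacts[i+1:]...)
			return
		}
	}
}
```

For example, with capacity 2 and maxStale of one hour, a full bucket that holds one peer last seen two hours ago would still accept a new peer (the stale entry is replaced), while a full bucket of recently useful peers would reject it. This is what keeps long-idle bootstrap-node entries from occupying bucket slots forever.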
RT PR at: libp2p/go-libp2p-kbucket#66