Analysis: k-bucket underutilisation and excessive thrashing #194
Comments
Calling out a few people to get the discussion started here: @jhiesey @bigs @Stebalien @whyrusleeping @mgoelzer @carospiegly @diasdavid @jbenet.
would be interested in dropping in a trie or an ordered map (supporting
As a side note (we totally might have talked about this yesterday): do we choose not to have a preference for established nodes because of security? I thought that Kademlia keeps established nodes around over new nodes?
@carospiegly I think our current protocol doesn’t use ping and pong messages – I vaguely recall hearing that (afk at the moment so I can’t check). Without that ability, we can’t test for node liveness (unless we issue a query). According to the Kademlia spec, when a bucket is full you pick an existing node and test if it’s still alive. If it is, you reject the incoming one, thus giving preference to elder nodes. In Ethereum this is problematic because queries are only answered if the requester is in your table, so new nodes find it relatively hard to join the network, IIRC.
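For reference, the textbook insertion rule looks roughly like this (a minimal Go sketch, not the current go-libp2p-kbucket behaviour; `Peer`, `Bucket` and `alive()` are hypothetical stand-ins):

```go
package main

import "fmt"

// Peer is a hypothetical stand-in for a DHT peer record.
type Peer struct{ ID string }

// Bucket keeps peers ordered from least to most recently seen.
type Bucket struct {
	peers    []Peer
	capacity int
}

// alive is a stand-in for a real liveness probe (e.g. a ping RPC).
func alive(p Peer) bool { return true }

// Insert applies the textbook rule: prefer elder peers, admitting a
// newcomer only when the least recently seen resident is dead.
func (b *Bucket) Insert(p Peer) {
	for i, q := range b.peers {
		if q.ID == p.ID {
			// Already known: move to the tail (most recently seen).
			b.peers = append(append(b.peers[:i], b.peers[i+1:]...), q)
			return
		}
	}
	if len(b.peers) < b.capacity {
		b.peers = append(b.peers, p)
		return
	}
	// Bucket full: probe the least recently seen peer (the head).
	if oldest := b.peers[0]; alive(oldest) {
		// The elder responds: keep it, reject the newcomer.
		b.peers = append(b.peers[1:], oldest)
		return
	}
	// The elder is unresponsive: evict it and admit the newcomer.
	b.peers = append(b.peers[1:], p)
}

func main() {
	b := &Bucket{capacity: 2}
	b.Insert(Peer{"a"})
	b.Insert(Peer{"b"})
	b.Insert(Peer{"c"}) // rejected: "a" is still alive
	fmt.Println(b.peers) // [{b} {a}]
}
```

Note how a live elder always survives; that is the bias toward long-lived nodes described above.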
@carospiegly Another thought I’m having is that the Kademlia RPC protocol never refers to distances. So as long as we can honour the behaviour of returning nodes close to a target, I think we’re good. In other words, the arrangement by distance is purely local! Additional data structures to consider include: skip lists, trees. We’ll probably need an invertible bloom filter for fast membership tests.
it would have IDs as keys and could provide O(log n) lookups for closest peers
@raulk exactly. All we need is to find the nearest IDs!
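To illustrate the point: the FIND_NODE contract only requires returning the k entries nearest to a target, which any flat local store can satisfy by sorting on XOR distance. A minimal Go sketch (a real implementation would avoid the full sort):

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// xor returns the XOR distance between two equal-length keys.
func xor(a, b []byte) []byte {
	d := make([]byte, len(a))
	for i := range a {
		d[i] = a[i] ^ b[i]
	}
	return d
}

// closest returns the k keys nearest to target by XOR distance,
// regardless of how the peers are arranged locally.
func closest(peers [][]byte, target []byte, k int) [][]byte {
	sorted := append([][]byte(nil), peers...)
	sort.Slice(sorted, func(i, j int) bool {
		return bytes.Compare(xor(sorted[i], target), xor(sorted[j], target)) < 0
	})
	if len(sorted) > k {
		sorted = sorted[:k]
	}
	return sorted
}

func main() {
	peers := [][]byte{{0x0f}, {0xf0}, {0x03}, {0x80}}
	fmt.Printf("%x\n", closest(peers, []byte{0x00}, 2)) // [03 0f]
}
```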
The paper "Sub-Second Lookups on a Large-Scale Kademlia-Based Overlay" mentions a technique to counteract the logarithmic distribution of buckets: give the buckets covering the largest distances more entries (they changed the bucket sizes of the 4 farthest buckets from 8 to 64, 32, 16, 8). This resulted in a significant decrease in lookup latency in their tests on the Kademlia-based Mainline DHT, and might be an idea worth looking at.
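If I read the paper correctly, the change amounts to something like this (a Go sketch under that assumption; here index 0 denotes the farthest bucket, i.e. the one covering half the keyspace):

```go
// bucketCapacity gives the farthest buckets (which cover most of the
// keyspace) larger capacities, per the paper's 64/32/16/8 scheme,
// and keeps the flat k=8 everywhere else.
func bucketCapacity(index int) int {
	switch index {
	case 0:
		return 64 // covers 1/2 of the keyspace
	case 1:
		return 32 // covers 1/4
	case 2:
		return 16 // covers 1/8
	default:
		return 8 // standard k
	}
}
```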
@bennofs thanks a lot for the reference! Their observations are consistent with my findings. Personally I'm thinking of a complete redesign of the underlying data structure towards a segmented heap of some kind. I am not entirely convinced the concept of "distance" is as helpful as it claims to be. I'm thinking of a data structure that:
This is just off the top of my head, happy to hear thoughts.
@raulk I cannot comment on your proposal (I don't fully grok Kademlia yet :) but I found another explanation for the bucket thrashing: if we look at the implementation of
The only path in which the new peer is not added to the bucket is if it is already in the bucket or its latency is too high. From this code, it appears to me that we are not implementing Kademlia's "prefer older peers" rule at all: every new node is added to a bucket (we remove the least active node, but least active != no longer connected). So if you are flooded with nodes that come online and then go offline again quickly (which is likely for new nodes; consider someone just trying IPFS for a short while), then the bucket is thrashed.
@bennofs Indeed. This is what I explained – albeit maybe not clearly – in my point above:
Looking to get involved with libp2p? Would be happy to collaborate on this issue, or on others ;-)
My instinct is precisely what @bennofs described: thrashing occurs because evictions are too easy. An existing implementation avoids this by only evicting nodes known to be bad when the bucket is full. See https://github.com/anacrolix/dht/blob/b09db78595aaba14cd992537fbe992a4c5e0a141/server.go#L441-L450 and https://github.com/anacrolix/dht/blob/master/server.go#L457-L478 for how this is done there.
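The gist of that policy, as I read it, is something like the following (a hedged Go sketch, not the actual anacrolix/dht code; `Node` and the failure threshold are made up):

```go
// Node is a hypothetical routing table entry tracking query failures.
type Node struct {
	ID          string
	FailedPings int
}

// bad reports whether a node has failed enough probes to be evictable.
// The threshold of 2 is arbitrary, for illustration only.
func (n Node) bad() bool { return n.FailedPings >= 2 }

// tryInsert admits a newcomer only into free space or over a resident
// already observed to be bad; a bucket full of good nodes rejects it.
func tryInsert(bucket []Node, capacity int, newcomer Node) ([]Node, bool) {
	if len(bucket) < capacity {
		return append(bucket, newcomer), true
	}
	for i, n := range bucket {
		if n.bad() {
			bucket[i] = newcomer // replace a known-bad resident
			return bucket, true
		}
	}
	return bucket, false // full of good nodes: reject the newcomer
}
```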
I've been experimenting with this. The biggest contributor is disconnects causing evictions from the routing table, followed by a bad bootstrapping process; in particular the random walks, which mainly flood the first bucket.
@aschmahmann - you might be interested in this, particularly #194 (comment) |
Fixed. |
This ticket summarises my findings around bucket utilisation and peer thrashing in the IPFS DHT, as driven by go-libp2p-kad-dht.
Problem statement
As peers are discovered in the network, Kademlia stores them in a routing table logically composed of as many buckets as bits in the peer ID (or rather, its hash). I say logically because implementations can choose to physically store the table differently (e.g. unfolding lazily, like libp2p Kad DHT).
Each bucket represents a distance from our node, calculated by XORing the hash of the peer's ID with ours (key from now on) and counting the leading zero bits, i.e. it boils down to how many prefix bits our keys share. This is called the distance (or rather, closeness): the higher the value, the closer the peer is to us.
As a result, bucket coverage follows a log2 distribution: the furthest bucket represents 1/2 of the addressable network, and each step closer halves that.
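Concretely, the bucket index for a peer is just the number of leading zero bits of the XOR of the two keys, i.e. their common prefix length. A minimal Go sketch:

```go
package main

import (
	"fmt"
	"math/bits"
)

// commonPrefixLen counts the shared leading bits of two equal-length
// keys; in Kademlia this is the bucket index for the peer.
func commonPrefixLen(a, b []byte) int {
	for i := range a {
		if d := a[i] ^ b[i]; d != 0 {
			return i*8 + bits.LeadingZeros8(d)
		}
	}
	return len(a) * 8
}

func main() {
	us := []byte{0b10110000}
	peer := []byte{0b10100000}
	// XOR = 0b00010000 -> 3 leading zeros -> bucket 3.
	fmt.Println(commonPrefixLen(us, peer)) // 3
}
```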
Soon enough, the subtree represented by a bucket becomes so small that it is highly improbable to find a peer there. In fact, the Ethereum network bonks out pretty soon: at bucket 8 (if bucket 0 is the furthest), or bucket 248 (if bucket 0 is the closest, i.e. us).
This phenomenon is discussed in section II(d) of the "Low-Resource Eclipse Attacks on Ethereum's Peer-to-Peer Network" paper [1].
The chart below is extracted from that paper. It shows a 27-day run in which the first 6-7 buckets were filled very quickly, the next buckets took hours to fill, and the remaining buckets stayed permanently vacant. Only ~4% of the routing table was ever utilised.
Assessing libp2p Kad DHT
I set out to analyse whether this behaviour is present in the IPFS DHT. I instrumented go-libp2p-kad-dht and go-libp2p-kbucket with two types of log statements, and performed two runs of the IPFS daemon.
Observations:
The libp2p Kad DHT implementation does a good job of unfolding buckets lazily as they are utilised, versus eagerly initialising all buckets at startup – although I'd argue the memory impact of one vs. the other is negligible.
Impact
It's clear to me that the linear k-bucket routing table is an inefficient data structure in this context. On the bright side, it does provide a heuristic for arranging the network (based on distance), albeit arguably a mediocre one.
The problem emerges when you combine these factors:
Now you have a routing table imposing the same k limitation on the buckets holding 50% of the network and on the buckets holding 0.0001% of the network. This makes little sense.
Furthermore, every entry beyond k is an opportunity to evict a peer from the bucket. Kademlia proposes testing the liveness of the least recently seen peer, but libp2p Kad DHT takes a different approach (replacing the least recently seen peer immediately). This has several negative trade-offs:
Wasted discoveries
During DHT lookups and recurring table-filling operations, we encounter lots of peers that could be useful, yet we discard them.
Peer entries probably occupy less than 100 bytes, so storing the 8,000 peers we discarded would've cost only ~800 KB. A worthwhile investment if it gives us fast-track direct access (teleporting) to arbitrary regions of the DHT.
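One hypothetical shape for keeping those discards around (a sketch only; `OverflowCache` and its bounds are invented for illustration):

```go
// Peer is a hypothetical, compact peer record.
type Peer struct{ ID string }

// OverflowCache keeps peers we would otherwise discard, keyed by
// bucket index (common prefix length), bounded per bucket.
type OverflowCache struct {
	perBucket int
	entries   map[int][]Peer
}

func NewOverflowCache(perBucket int) *OverflowCache {
	return &OverflowCache{perBucket: perBucket, entries: make(map[int][]Peer)}
}

// Remember stores a discarded peer if the bucket's quota allows,
// preserving a cheap entry point into that region of the keyspace.
func (c *OverflowCache) Remember(cpl int, p Peer) {
	if len(c.entries[cpl]) < c.perBucket {
		c.entries[cpl] = append(c.entries[cpl], p)
	}
}
```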
Actually, it is worth noting that 7 buckets cover over 99% of the network. That's where the meat of the matter is. We should avoid on-demand iterative searches as much as possible, as they are inefficient and synchronous. Rather, we should populate our picture of the DHT in the background, in a way that allows us to teleport when lookups are demanded.
Some ideas:
To me, (1) and (3) are good paths forward here.
References
[1] https://eprint.iacr.org/2018/236.pdf
[2] raulk/go-libp2p-kbucket@4acf71e
[3] raulk@f5c1563
[4] https://gist.github.com/raulk/95b269072e75a71f84a1bcf2fa237247
[5] https://gist.github.com/raulk/17aba7f263bba54c83cda1269e6bf5a7