Analysis: k-bucket underutilisation and excessive thrashing #194
Comments
Calling out a few people to get the discussion started here: @jhiesey @bigs @Stebalien @whyrusleeping @mgoelzer @carospiegly @diasdavid @jbenet.
would be interested in dropping in a trie or an ordered map (supporting
As a side note (we totally might have talked about this yesterday): do we choose not to have a preference for established nodes because of security? I thought that Kademlia keeps established nodes around over new nodes?
@carospiegly I think our current protocol doesn’t use ping and pong messages – I vaguely recall hearing that (afk at the moment so I can’t check). Without that ability, we can’t test for node liveness (unless we issue a query). According to the Kademlia spec, when a bucket is full you pick an existing node and test if it’s still alive. If it is, you reject the incoming one, thus giving preference to elder nodes. In Ethereum this is problematic because queries are only answered if the requester is in your table, so new nodes find it relatively hard to join the network, IIRC.
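For reference, the textbook insertion rule looks roughly like this (a minimal Go sketch, not the current go-libp2p-kbucket behaviour; `Peer`, `Bucket` and `alive()` are hypothetical stand-ins):

```go
package main

import "fmt"

// Peer is a hypothetical stand-in for a DHT peer record.
type Peer struct{ ID string }

// Bucket keeps peers ordered from least to most recently seen.
type Bucket struct {
	peers    []Peer
	capacity int
}

// alive is a stand-in for a real liveness probe (e.g. a ping RPC).
func alive(p Peer) bool { return true }

// Insert applies the textbook rule: prefer elder peers, admitting a
// newcomer only when the least recently seen resident is dead.
func (b *Bucket) Insert(p Peer) {
	for i, q := range b.peers {
		if q.ID == p.ID {
			// Already known: move to the tail (most recently seen).
			b.peers = append(append(b.peers[:i], b.peers[i+1:]...), q)
			return
		}
	}
	if len(b.peers) < b.capacity {
		b.peers = append(b.peers, p)
		return
	}
	// Bucket full: probe the least recently seen peer (the head).
	if oldest := b.peers[0]; alive(oldest) {
		// The elder responds: keep it, reject the newcomer.
		b.peers = append(b.peers[1:], oldest)
		return
	}
	// The elder is unresponsive: evict it and admit the newcomer.
	b.peers = append(b.peers[1:], p)
}

func main() {
	b := &Bucket{capacity: 2}
	b.Insert(Peer{"a"})
	b.Insert(Peer{"b"})
	b.Insert(Peer{"c"}) // rejected: "a" is still alive
	fmt.Println(b.peers) // [{b} {a}]
}
```

Note how a live elder always survives; that is the bias toward long-lived nodes described above.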
@carospiegly Another thought I’m having is that the Kademlia RPC protocol never refers to distances. So as long as we can honour the behaviour of returning nodes close to a target, I think we’re good. In other words, the arrangement by distance is purely local! Additional data structures to consider include: skip lists, trees. We’ll probably need an invertible bloom filter for fast membership tests.
it would have IDs as keys and could provide O(log n) lookups for closest peers
@raulk exactly. All we need is to find the nearest IDs!
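To illustrate the point: the FIND_NODE contract only requires returning the k entries nearest to a target, which any flat local store can satisfy by sorting on XOR distance. A minimal Go sketch (a real implementation would avoid the full sort):

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// xor returns the XOR distance between two equal-length keys.
func xor(a, b []byte) []byte {
	d := make([]byte, len(a))
	for i := range a {
		d[i] = a[i] ^ b[i]
	}
	return d
}

// closest returns the k keys nearest to target by XOR distance,
// regardless of how the peers are arranged locally.
func closest(peers [][]byte, target []byte, k int) [][]byte {
	sorted := append([][]byte(nil), peers...)
	sort.Slice(sorted, func(i, j int) bool {
		return bytes.Compare(xor(sorted[i], target), xor(sorted[j], target)) < 0
	})
	if len(sorted) > k {
		sorted = sorted[:k]
	}
	return sorted
}

func main() {
	peers := [][]byte{{0x0f}, {0xf0}, {0x03}, {0x80}}
	fmt.Printf("%x\n", closest(peers, []byte{0x00}, 2)) // [03 0f]
}
```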
The paper "Sub-Second Lookups on a Large-Scale Kademlia-Based Overlay" mentions a technique to counteract the logarithmic distribution of buckets: give the buckets covering the largest distances more entries (they changed the bucket sizes of the 4 farthest buckets from 8 to 64, 32, 16, 8). This resulted in a significant decrease in lookup latency in their tests on the Kademlia-based Mainline DHT, and might be an idea worth looking at.
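If I read the paper correctly, the change amounts to something like this (a Go sketch under that assumption; here index 0 denotes the farthest bucket, i.e. the one covering half the keyspace):

```go
// bucketCapacity gives the farthest buckets (which cover most of the
// keyspace) larger capacities, per the paper's 64/32/16/8 scheme,
// and keeps the flat k=8 everywhere else.
func bucketCapacity(index int) int {
	switch index {
	case 0:
		return 64 // covers 1/2 of the keyspace
	case 1:
		return 32 // covers 1/4
	case 2:
		return 16 // covers 1/8
	default:
		return 8 // standard k
	}
}
```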
@bennofs thanks a lot for the reference! Their observations are consistent with my findings. Personally I'm thinking of a complete redesign of the underlying data structure towards a segmented heap of some kind. I am not entirely convinced the concept of "distance" is as helpful as it claims to be. I'm thinking of a data structure that:
This is just off the top of my head, happy to hear thoughts.
@raulk I cannot comment on your proposal (I don't fully grok Kademlia yet :) but I found another explanation for the bucket thrashing: if we look at the implementation of
The only path in which the new peer is not added to the bucket is if it is already in the bucket or its latency is too high. From this code, it appears to me that we are not implementing Kademlia's "prefer older peers" rule at all: every new node is added to a bucket (we remove the least active node, but least active != no longer connected). So if you are flooded with nodes that come online and then go offline again quickly (which is likely for new nodes; consider someone just trying IPFS for a short while), then the bucket is thrashed.
@bennofs Indeed. This is what I explained – albeit maybe not clearly – in my point above:
Looking to get involved with libp2p? Would be happy to collaborate on this issue, or on others ;-)
My instinct is precisely what @bennofs described: thrashing occurs because evictions are too easy. An existing implementation avoids this by only evicting nodes known to be bad when the bucket is full. See https://github.com/anacrolix/dht/blob/b09db78595aaba14cd992537fbe992a4c5e0a141/server.go#L441-L450 and https://github.com/anacrolix/dht/blob/master/server.go#L457-L478 for how this is done there.
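The gist of that policy, as I read it, is something like the following (a hedged Go sketch, not the actual anacrolix/dht code; `Node` and the failure threshold are made up):

```go
// Node is a hypothetical routing table entry tracking query failures.
type Node struct {
	ID          string
	FailedPings int
}

// bad reports whether a node has failed enough probes to be evictable.
// The threshold of 2 is arbitrary, for illustration only.
func (n Node) bad() bool { return n.FailedPings >= 2 }

// tryInsert admits a newcomer only into free space or over a resident
// already observed to be bad; a bucket full of good nodes rejects it.
func tryInsert(bucket []Node, capacity int, newcomer Node) ([]Node, bool) {
	if len(bucket) < capacity {
		return append(bucket, newcomer), true
	}
	for i, n := range bucket {
		if n.bad() {
			bucket[i] = newcomer // replace a known-bad resident
			return bucket, true
		}
	}
	return bucket, false // full of good nodes: reject the newcomer
}
```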
I've been experimenting with this. The biggest contributor is disconnects causing evictions from the routing table, followed by a bad bootstrapping process; in particular the random walks, which mainly flood the first bucket.
@aschmahmann - you might be interested in this, particularly #194 (comment) |
Fixed. |
This ticket summarises my findings around bucket utilisation and peer thrashing in the IPFS DHT, as driven by go-libp2p-kad-dht.
Problem statement
As peers are discovered in the network, Kademlia stores them in a routing table logically composed of as many buckets as bits in the peer ID (or rather, its hash). I say logically because implementations can choose to physically store the table differently (e.g. unfolding lazily, like libp2p Kad DHT).
Each bucket represents a distance from our node, calculated by XORing the hash of the peer's ID with ours (key from now on) and counting the leading zero bits, i.e. it boils down to how many prefix bits our keys share. This is called the distance (or rather, closeness): the higher the value, the closer the peer is to us.
As a result, bucket coverage follows a log2 distribution: the furthest bucket represents 1/2 of the addressable network, and each step closer halves that.
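Concretely, the bucket index for a peer is just the number of leading zero bits of the XOR of the two keys, i.e. their common prefix length. A minimal Go sketch:

```go
package main

import (
	"fmt"
	"math/bits"
)

// commonPrefixLen counts the shared leading bits of two equal-length
// keys; in Kademlia this is the bucket index for the peer.
func commonPrefixLen(a, b []byte) int {
	for i := range a {
		if d := a[i] ^ b[i]; d != 0 {
			return i*8 + bits.LeadingZeros8(d)
		}
	}
	return len(a) * 8
}

func main() {
	us := []byte{0b10110000}
	peer := []byte{0b10100000}
	// XOR = 0b00010000 -> 3 leading zeros -> bucket 3.
	fmt.Println(commonPrefixLen(us, peer)) // 3
}
```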
Soon enough, the subtree represented by a bucket becomes so small that it is highly improbable to find a peer there. In fact, the Ethereum network bonks out pretty soon: at bucket 8 (if bucket 0 is the furthest), or bucket 248 (if bucket 0 is the closest, i.e. us).
This phenomenon is discussed in section II(d) of the "Low-Resource Eclipse Attacks on Ethereum's Peer-to-Peer Network" paper [1].
The chart below is extracted from that paper. It shows a 27-day run in which the first 6-7 buckets were filled very quickly, the next buckets took hours to fill, and the remaining buckets stayed permanently vacant. Only ~4% of the routing table was ever utilised.
Assessing libp2p Kad DHT
I set out to analyse whether this behaviour is present in the IPFS DHT. I instrumented go-libp2p-kad-dht and go-libp2p-kbucket with two types of log statements, and performed two runs of the IPFS daemon.
Observations:
The libp2p Kad DHT implementation does a good job of unfolding buckets lazily as they are utilised, versus eagerly initialising all buckets at startup – although I'd argue the memory impact of one vs. the other is negligible.
Impact
It's clear to me that the linear k-bucket routing table is an inefficient data structure in this context. On the bright side, it does provide a heuristic for arranging the network (based on distance), albeit arguably a mediocre one.
The problem emerges when you combine these factors:
Now you have a routing table imposing the same k limitation on the buckets holding 50% of the network and on the buckets holding 0.0001% of the network. This makes little sense.
Furthermore, every entry beyond k is an opportunity to evict a peer from the bucket. Kademlia proposes testing the liveness of the least recently seen peer, but libp2p Kad DHT takes a different approach (replacing the least recently seen peer immediately). This has several negative trade-offs:
Wasted discoveries
During DHT lookups and recurring table-filling operations, we encounter lots of peers that could be useful, yet we discard them.
Peer entries probably occupy less than 100 bytes, so storing the 8,000 peers we discarded would've cost only ~800 KB. A worthwhile investment if it gives us fast-track direct access (teleporting) to arbitrary regions of the DHT.
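One hypothetical shape for keeping those discards around (a sketch only; `OverflowCache` and its bounds are invented for illustration):

```go
// Peer is a hypothetical, compact peer record.
type Peer struct{ ID string }

// OverflowCache keeps peers we would otherwise discard, keyed by
// bucket index (common prefix length), bounded per bucket.
type OverflowCache struct {
	perBucket int
	entries   map[int][]Peer
}

func NewOverflowCache(perBucket int) *OverflowCache {
	return &OverflowCache{perBucket: perBucket, entries: make(map[int][]Peer)}
}

// Remember stores a discarded peer if the bucket's quota allows,
// preserving a cheap entry point into that region of the keyspace.
func (c *OverflowCache) Remember(cpl int, p Peer) {
	if len(c.entries[cpl]) < c.perBucket {
		c.entries[cpl] = append(c.entries[cpl], p)
	}
}
```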
Actually, it is worth noting that 7 buckets cover over 99% of the network. That's where the meat of the matter is. We should avoid on-demand iterative searches as much as possible, as they are inefficient and synchronous. Rather, we should populate our picture of the DHT in the background, in a way that allows us to teleport when lookups are demanded.
Some ideas:
To me, (1) and (3) are good paths forward here.
References
[1] https://eprint.iacr.org/2018/236.pdf
[2] raulk/go-libp2p-kbucket@4acf71e
[3] raulk@f5c1563
[4] https://gist.github.com/raulk/95b269072e75a71f84a1bcf2fa237247
[5] https://gist.github.com/raulk/17aba7f263bba54c83cda1269e6bf5a7