Question regarding gossipsub #256

Closed
iulianpascalau opened this issue Jan 21, 2020 · 18 comments
Labels
kind/support A question or request for support

Comments

@iulianpascalau
Contributor

iulianpascalau commented Jan 21, 2020

Hello

I am wondering why, in a setup of 7 nodes connected in a complete-graph fashion, it sometimes happens that a peer gets a message delivered about 1 second later (sometimes even more) than the other peers. The setup we are using has multiple topics, uses message signing, and uses gossipsub as the pubsub router. The message payload is around 2 kB. We are also using topic validators, but those functions return in under 65 ms (per our latest measurements). The setup has been deployed on Digital Ocean and afterwards on AWS VPSs, and both runs yielded the same results.
I'm asking this because I have tried the floodsub router, and there a message is broadcast to all peers within 150 ms.
I'm trying to figure out a way to reproduce this, but I have failed even when using 100 hosts connected through the localhost interface.
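For reference, a minimal sketch of how the validator timing could be measured per message (assuming the pubsub.Validator callback signature shown later in this thread; the wrapper name is illustrative):

package p2p

import (
    "context"
    "log"
    "time"

    peer "github.com/libp2p/go-libp2p-core/peer"
    pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// timedValidator wraps a topic validator and logs its execution time,
// so the per-message validation cost can be checked independently.
func timedValidator(topic string, inner pubsub.Validator) pubsub.Validator {
    return func(ctx context.Context, pid peer.ID, msg *pubsub.Message) bool {
        start := time.Now()
        ok := inner(ctx, pid, msg)
        log.Printf("validator on %s took %s (accepted=%v)", topic, time.Since(start), ok)
        return ok
    }
}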

@raulk
Member

raulk commented Jan 30, 2020

@iulianpascalau thanks for the report! Would be useful if you could point us to the code where you set up gossipsub and the validators. What's the message throughput you're subjecting the system to?

@iulianpascalau
Contributor Author

OK, we are working on a lightweight wrapper over the libp2p libraries. The above observation was made on a system in which only one of the 7 peers broadcast 2 messages (one about 2 KB, one under 1 KB), about 100 µs apart, and the others, upon receiving those 2 messages, each broadcast one message under 1 KB in size (in other words, the first peer broadcast a block header + block body and the others sent their signature shares after processing the header and body).
Besides this, we conducted some other experiments employing 384 DO VPSs split across 3 geographic zones. All peers had a connection-trimming mechanism that kept the connection count at around 150. All peers joined the same topic (the only topic), and one peer broadcast a message every 3 seconds. With the newest libp2p versions:

    github.com/libp2p/go-libp2p v0.5.1
    github.com/libp2p/go-libp2p-core v0.3.0
    github.com/libp2p/go-libp2p-discovery v0.2.0
    github.com/libp2p/go-libp2p-kad-dht v0.5.0
    github.com/libp2p/go-libp2p-kbucket v0.2.3
    github.com/libp2p/go-libp2p-pubsub v0.2.5

it happened that some peers did not get the message at all; in some runs as few as 200-something peers out of 384 received the broadcast message. Re-running the same app on the same network topology, but with the older versions:

    github.com/libp2p/go-libp2p v0.3.1
    github.com/libp2p/go-libp2p-core v0.2.2
    github.com/libp2p/go-libp2p-discovery v0.1.0
    github.com/libp2p/go-libp2p-kad-dht v0.2.1
    github.com/libp2p/go-libp2p-kbucket v0.2.1
    github.com/libp2p/go-libp2p-pubsub v0.1.1

yielded a delivery rate of 100% (all peers got the message).

@iulianpascalau
Contributor Author

Further investigating the older versions, we still found the same problem described in the first comment: some peers got the message a long time (on the order of seconds) after the broadcast event.
We modified the pubsub implementation a little so that each message carries a trace area in which every traversing peer that handled the message includes its pid and a timestamp. This showed us some surprising results: from receiving a message until re-sending it, the time spent is in the tens of milliseconds, so we can more or less rule out the pubsub implementation, our wrapper, and the high-level processing on top of our wrapper. (How this was done can be seen here: https://github.com/ElrondNetwork/go-libp2p-pubsub/tree/v0.2.5-traverse )
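For illustration only (the real change lives in the fork linked above and adds the trace to the pubsub message itself), the idea is roughly the following, sketched here at the application-payload level with hypothetical types and a JSON envelope:

package tracing

import (
    "encoding/json"
    "time"
)

// TraceEntry records one hop: the handling peer's id and when it handled the message.
type TraceEntry struct {
    PeerID    string `json:"pid"`
    Timestamp int64  `json:"ts_ns"`
}

// TracedPayload wraps the real payload together with the accumulated hop trace.
type TracedPayload struct {
    Data  []byte       `json:"data"`
    Trace []TraceEntry `json:"trace"`
}

// appendHop adds the current peer's entry before the message is forwarded,
// so the time spent at each hop can be reconstructed afterwards.
func appendHop(raw []byte, selfID string) ([]byte, error) {
    var tp TracedPayload
    if err := json.Unmarshal(raw, &tp); err != nil {
        return nil, err
    }
    tp.Trace = append(tp.Trace, TraceEntry{PeerID: selfID, Timestamp: time.Now().UnixNano()})
    return json.Marshal(&tp)
}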

@iulianpascalau
Contributor Author

This is how we used the old version
instantiation:

const pubsubTimeCacheDuration = 10 * time.Minute
//......
optsPS := []pubsub.Option{
    pubsub.WithMessageSigning(withSigning),
}

pubsub.TimeCacheDuration = pubsubTimeCacheDuration

ps, err := pubsub.NewGossipSub(ctxProvider.Context(), ctxProvider.Host(), optsPS...)

set up validators:

//......
//topic creation
subscrRequest, err := netMes.pb.Subscribe(name)
netMes.mutTopics.Unlock()
if err != nil {
    return err
}

if createChannelForTopic {
    err = netMes.outgoingPLB.AddChannel(name)
}

//just a dummy func to consume messages received by the newly created topic
go func() {
    for {
        _, _ = subscrRequest.Next(ctx)
    }
}()

//...........
//assigning validators
err := netMes.pb.RegisterTopicValidator(topic, func(ctx context.Context, pid peer.ID, message *pubsub.Message) bool {
    err := handler.ProcessReceivedMessage(NewMessage(message), broadcastHandler)
    if err != nil {
        log.Trace("p2p validator", "error", err.Error(), "topics", message.TopicIDs)
    }

    return err == nil
})

@iulianpascalau
Contributor Author

And how we used the new version
instantiation:

const pubsubTimeCacheDuration = 10 * time.Minute
//......
optsPS := []pubsub.Option{
    pubsub.WithMessageSigning(withSigning),
}

pubsub.TimeCacheDuration = pubsubTimeCacheDuration

ps, err := pubsub.NewGossipSub(ctxProvider.Context(), ctxProvider.Host(), optsPS...)

set up validators:

//......
//topic creation
netMes.topics[name] = nil
topic, err := netMes.pb.Join(name)
if err != nil {
    netMes.mutTopics.Unlock()
    return err
}
subscrRequest, err := topic.Subscribe()
if err != nil {
    netMes.mutTopics.Unlock()
    return err
}

netMes.topics[name] = topic
netMes.mutTopics.Unlock()

if createChannelForTopic {
    err = netMes.outgoingPLB.AddChannel(name)
}

//just a dummy func to consume messages received by the newly created topic
go func() {
    for {
        _, _ = subscrRequest.Next(ctx)
    }
}()

//...........
//assigning validators
err := netMes.pb.RegisterTopicValidator(topic, func(ctx context.Context, pid peer.ID, message *pubsub.Message) bool {
    err := handler.ProcessReceivedMessage(NewMessage(message), broadcastHandler)
    if err != nil {
        log.Trace("p2p validator", "error", err.Error(), "topics", message.TopicIDs)
    }

    return err == nil
})

@iulianpascalau
Contributor Author

Sorry for the long messages.

@vyzo
Collaborator

vyzo commented Jan 30, 2020

Can you try with the PX PR? There might be some topology issues causing messages to propagate only by gossip; see #234

@iulianpascalau
Contributor Author

Will try. Thanks! 👍

@iulianpascalau
Contributor Author

iulianpascalau commented Jan 31, 2020

We have tested the PX PR and found that it actually performed worse than release tag v0.2.5, worse here meaning higher latencies when sending payloads of 300000-650000 bytes across the 384 nodes. The messages were sent from the same peer, with about 1 second between sends (100 sends). Average latencies were between 462 ms and 2.49 s, with highs between 934 ms and 4 s. v0.2.5 on the same setup and the same test yielded 295 ms - 1.65 s for averages and 596 ms - 2.84 s for highs.
Our libp2p wrapper can be found here: https://github.com/ElrondNetwork/elrond-go-p2p (the master branch uses the "old libs" described in my previous comment, and the other branches use the "new libs" - the current releases - plus some branches in which we changed something). This is the smallest wrapper we can produce that still mimics the "production wrapper"; essentially it aggregates 3 main components: the host, the pubsub instance, and a peer discoverer implemented with kad-dht. We have a pseudo-sharding feature that trims the connection count once a certain "desired" number of connections is reached. In these tests this feature was disabled: the "netMessenger" constructor was called with value 0, so no connection trimming was performed.

@iulianpascalau
Contributor Author

iulianpascalau commented Jan 31, 2020

We have conducted some more tests employing the aforementioned wrapper. In one of them, 100 (random) peers out of the 384 were each supposed to broadcast a small message (around 300-600 bytes). We expected that, after a while, the messages would have reached all other peers, but we soon found out the test had failed: some messages reached as few as 3 peers, and on average a message reached only around half of the peers. This happened with the "old libs", the "new libs" and the PX PR version, with both our "lightweight" wrapper and the production wrapper. None seemed to cope with such a large flood of messages. We even tested the "new libs" with the option pubsub.WithPeerOutboundQueueSize(1000) but got the same results.
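For completeness, this is roughly how that option was passed in (a sketch; the remaining options mirror the instantiation shown earlier in this thread):

optsPS := []pubsub.Option{
    pubsub.WithMessageSigning(withSigning),
    // enlarge the per-peer outbound message queue
    pubsub.WithPeerOutboundQueueSize(1000),
}

ps, err := pubsub.NewGossipSub(ctxProvider.Context(), ctxProvider.Host(), optsPS...)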

@iulianpascalau
Contributor Author

iulianpascalau commented Jan 31, 2020

And now comes the "not-so-funny" part. We tested a setup in an integration test (run on a single machine, a plain old Go test) that created 200 wrapper instances (production wrappers) and connected them to each other using the kad-dht discovery mechanism. After that, the first 100 nodes each broadcast a single message. After a couple of seconds, every peer (out of the 200) had received all 100 messages. We kind of expected that, since all of the nodes were connected through the localhost interface, so we started changing things in the pubsub implementation: making the outbound peer channels buffered channels of size 1 instead of 32, or adding 150 ms sleeps in comms.go, in the handleSendingMessages function, just before the writeMsg call. The test continued to show the messages reaching all peers with more or less latency, but in the end all messages reached all peers. The same happened when testing with the mocknet testing infrastructure and playing with the link latency. We were simply unable to reproduce, with locally created hosts, the findings from our medium-sized network of 384 nodes.

@vyzo
Collaborator

vyzo commented Feb 1, 2020

I am a little confused by all this.

@vyzo
Collaborator

vyzo commented Feb 1, 2020

Also, a single message is not necessarily representative, as there may be convergence delays.

@vyzo
Collaborator

vyzo commented Feb 1, 2020

As a first step towards diagnosing, can you turn off the connection manager? Let them have as many connections as they want, just to make sure there is no interference.
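For a host built directly with go-libp2p (rather than through your custom trimming logic), a minimal sketch of what "turning it off" could look like, assuming go-libp2p v0.5.x and go-libp2p-connmgr; names and limits are illustrative:

package main

import (
    "context"
    "time"

    libp2p "github.com/libp2p/go-libp2p"
    connmgr "github.com/libp2p/go-libp2p-connmgr"
)

func newUntrimmedHosts(ctx context.Context) error {
    // Option 1: no connection manager at all, so connections are never trimmed.
    h1, err := libp2p.New(ctx)
    if err != nil {
        return err
    }
    defer h1.Close()

    // Option 2: keep a connection manager, but with limits well above the
    // network size (384 peers here) so it never interferes with the mesh.
    cm := connmgr.NewConnManager(1000, 2000, time.Minute)
    h2, err := libp2p.New(ctx, libp2p.ConnectionManager(cm))
    if err != nil {
        return err
    }
    defer h2.Close()

    return nil
}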

@raulk
Member

raulk commented Feb 1, 2020

@vyzo you started working on a testground test for peer exchange. Not sure what the status of that is. Do you have any reference numbers you can share from actual experiments?

@iulianpascalau
Contributor Author

OK, a short update: it turns out that the problem we reported about messages being dropped randomly is not an issue after all (it was the application using our wrappers that caused the false reports). The messages are broadcast to all peers even in the presence of high latencies or a misconfigured pubsub implementation (small outbound message queue size).
The only thing remaining is the problem of gossipsub, in some cases, delaying the message broadcast to a certain peer (the first question in this thread).
Will retry with the PX PR version.
After some digging into the gossipsub implementation, I came up with the following possible answer to my first question (please correct me if I'm wrong): I think the message somehow does not reach the affected peer directly. Then, through the IHAVE/IWANT control messages, the original message gets re-requested and finally reaches the affected peer, hence the delay of around 1 second.
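If that is the case, the roughly 1 second delay lines up with the gossip heartbeat. As an experiment (not a production recommendation), the gossip cadence and mesh size can be tightened before constructing the router; this sketch assumes go-libp2p-pubsub v0.2.x, where these knobs are package-level variables, and the values here are illustrative:

// Set before pubsub.NewGossipSub is called; these are process-wide globals.
pubsub.GossipSubHeartbeatInterval = 700 * time.Millisecond // default is 1 s
pubsub.GossipSubHistoryGossip = 3                           // heartbeats of history advertised via IHAVE
pubsub.GossipSubD = 8                                       // target mesh degree
pubsub.GossipSubDlo = 6
pubsub.GossipSubDhi = 12

ps, err := pubsub.NewGossipSub(ctxProvider.Context(), ctxProvider.Host(), optsPS...)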

@vyzo
Collaborator

vyzo commented Feb 4, 2020

Yeah, that 1s delay is indicative of gossip transmission.

@iulianpascalau
Contributor Author

iulianpascalau commented Feb 4, 2020

Ok great, will focus on our protocol then. :)

@daviddias added the kind/support A question or request for support label on Mar 25, 2020