Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: record message/request event even in case of error #464

Merged
merged 3 commits into from
Mar 3, 2020

Conversation

MichaelMure
Copy link
Contributor

Stumbled upon this while working on #453

The metric for message/request sent/received was not incremented when an error occurred but the error count was. This would lead to a meaningless error ratio.

dht_net.go Outdated Show resolved Hide resolved
dht_net.go Show resolved Hide resolved
dht_net.go Outdated Show resolved Hide resolved
@MichaelMure
Copy link
Contributor Author

MichaelMure commented Feb 28, 2020

@aarshkshah1992 the point of this PR is that at the moment those metrics only record successful events. This makes it at least weird to compute an error ratio.

One could argue that instead of doing err_count / event_count you can still get that ratio by doing err_count / (valid_count+err_count) but I would say that it's not what the name nor the description of the metrics imply.

dht_net.go Outdated
@@ -107,6 +109,8 @@ func (dht *IpfsDHT) handleNewMessage(s network.Stream) bool {
ctx,
[]tag.Mutator{tag.Upsert(metrics.KeyMessageType, "UNKNOWN")},
metrics.ReceivedMessageErrors.M(1),
metrics.ReceivedMessages.M(1),
metrics.ReceivedBytes.M(int64(req.Size())),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be len(msgbytes) but we need to record that before we release the message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it could always be len(msgbytes), right ? Does it ever make sense to use req.Size() instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that, let me know if that doesn't make sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it ever make sense to use

For requests, no.

dht_net.go Outdated Show resolved Hide resolved
dht_net.go Outdated
@@ -96,6 +96,8 @@ func (dht *IpfsDHT) handleNewMessage(s network.Stream) bool {
ctx,
[]tag.Mutator{tag.Upsert(metrics.KeyMessageType, "UNKNOWN")},
metrics.ReceivedMessageErrors.M(1),
metrics.ReceivedMessages.M(1),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, we probably shouldn't record either a received message or a message error unless we receive a non-empty message. If the stream's broken for some reason, that's not something the DHT is really concerned about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be fair to not record anything at all ? My understanding is that if the stream is broken we don't get io.EOF but we don't have a message either.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I read your message, the more I get confused ;-). It seems to me that there is three thing we can do:

  • valid message: record metrics.ReceivedMessages.M(1)
  • broken message: record both metrics.ReceivedMessages.M(1) and metrics.ReceivedMessageErrors.M(1)
  • no message: don't record anything

So what about:

  • io.EOF --> no message
  • len(msgbytes)==0 --> no message
  • len(msgbytes)!=0 --> broken message

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that, let me know if that doesn't make sense.

Copy link
Contributor Author

@MichaelMure MichaelMure Mar 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might also return early on err.Error() == "stream reset" but that's quite brittle.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

dht_net.go Outdated Show resolved Hide resolved
@Stebalien
Copy link
Member

I guess it depends on your perspective. From the standpoint of the query logic, these messages are "sent". From the standpoint of the network, these messages haven't been sent.

I'm inclined to merge this as long as we leave a comment explaining it. Especially because outbound request counting is now strictly more accurate (before, we weren't counting requests that successfully sent but didn't receive a response, now we're counting all attempts).

@MichaelMure
Copy link
Contributor Author

Alright, I addressed all your comments, it definitely looks more correct now.

@Stebalien Stebalien merged commit c2631d9 into libp2p:master Mar 3, 2020
@MichaelMure
Copy link
Contributor Author

FYI, with this PR applied, the reported error ratio for received message/requests fall down to ~0.3% which seems to be what is actually happening:
image
The breakdown of the remaining is:

  • unknown: 55%
  • find_node: 33%
  • put_value: 11%
  • get_providers: 1.3%
  • get_value: 0.2%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants