
Add support to message lost event #192

Merged
merged 3 commits into from
Jun 17, 2020
Merged

Conversation

ivanpauno
Member

Implements ros2/rmw#226.
See ros2/rmw#232 for related changes in RMW.

Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
@ivanpauno ivanpauno added the enhancement New feature or request label Jun 5, 2020
@ivanpauno ivanpauno self-assigned this Jun 5, 2020
@ivanpauno
Member Author

@eboasson could you take a look at the implementation?

I'm trying to add a reproducible test case in rcl: a test where both the subscription and the publisher have a history depth of 1, and the last message is overwritten before being read (for more details, here is the code).

Do you know if I have to do something else to get the sample lost/rejected status working (I always read a count of 0 from both)?
Thanks in advance!

@eboasson
Collaborator

eboasson commented Jun 9, 2020

@ivanpauno there’s nothing wrong with your implementation: I haven’t gotten around to bubbling up the event to the user yet, that’s all. Part of the reason I haven’t done that is procrastination, because I don’t want to do something I believe to be wrong — a personal judgement, I know what the spec says, I just disagree with it.

The reason I think the spec is utterly wrong is that it conflates the case that really matters (those where the contract was inadvertently broken) with the ones that are in principle expected events:

  1. Sample loss caused by the reader momentarily losing liveliness according to the writer, but not the other way around. If the writer happens to publish data in this interval, it doesn’t require an acknowledgement from the reader and hence the data may be lost. Such loss always warrants an event. (It is particularly nasty for a reliable, keep-all reader/writer pair. One could argue that implementations should somehow prevent it, but that appears to bring its own problems.)
  2. Sample loss because of overwriting a keep-last history: well, that is why you asked for keep-last, right? In the world I come from, this was rather a feature, because it decouples the writers from the readers, and by carefully designing the data that gets published, this case becomes completely irrelevant.
  3. Sample loss because of packet loss when using best-effort mode: I’m always very much tempted to randomly drop data in best-effort mode even if everything arrives, simply because the application claimed to be able to deal with it 😈. The only good counter-argument I know of are those systems that use best-effort because there is no time for a retransmit and everything is carefully dimensioned to handle the traffic under all foreseeable circumstances.
  4. Sample loss because of content/time-based filtering: the specs are being extended because it has been recognized that this is even more of a non-event.

Currently only (4) is implemented in Cyclone (it was already there when I got my hands on it). For the fact that (1) isn’t there yet, I can only ask for forgiveness ... I probably should not make things so complicated anyway, one doesn’t have to monitor the events 🙂

Combining “lost” and “rejected” is probably not such a good idea: a rejected one really hasn’t been lost. It will arrive eventually, unless it gets dropped by the writer and then you eventually end up with a “sample lost” event.

P.S. One thing to keep in mind is that there is no guarantee that you can observe every single event. So, e.g., the “reason” in the sample rejected state is merely that of the most recent event, it doesn’t say anything about what came before that.

@ivanpauno
Member Author

Sample loss because of overwriting a keep-last history: well, that is why you asked for keep-last, right? In the world I come from, this was rather a feature, because it decouples the writers from the readers, and by carefully designing the data that gets published, this case becomes completely irrelevant.

I see, I misunderstood some DDS documentation and thought that case was covered.
I also consider the queue being overwritten to be normal behavior, but in some cases knowing about it might be useful for debugging.

Combining “lost” and “rejected” is probably not such a good idea: a rejected one really hasn’t been lost. It will arrive eventually, unless it gets dropped by the writer and then you eventually end up with a “sample lost” event.

Thanks for clarifying that.
If a "rejected" message never arrives, will it trigger a "message lost" event in that case?

@jacobperron
Member

jacobperron commented Jun 9, 2020

I see how the conflation of expected events (like overwriting history) and broken contracts is not great, but I do agree with @ivanpauno that having an indicator if samples are lost would be useful for diagnosing communication issues in any case.

@eboasson
Collaborator

@ivanpauno the DDS family is a typical family, with a few feuds and long-running misunderstandings, but I don't think you really misunderstood anything. In this case, all the DDS specification (that is, DCPS 1.4) states about the "sample lost" status (and associated events, listeners, blah blah) is literally:

(page 2-130)

SAMPLE_LOST A sample has been lost (never received)

(page 2-132)

total_count Total cumulative count of all samples lost across of instances of data published under the Topic.
total_count_change The incremental number of samples lost since the last time the listener was called or the status was read.

Unfortunately, the DDSI-RTPS spec (which came a fair bit later) then barges in and tries to force an interpretation of it, so that you can have an implementation that arguably meets the DCPS specification and yet doesn’t meet the DDSI-RTPS one. (Weird, for a spec that purports to be no more than an interoperable protocol for the DCPS spec …) That the DDSI-RTPS spec got it wrong is clear, because one of the things currently under revision is precisely this matter. Of course, there is also the nice thing of having specs that are imprecise, incorrect or contradictory: one gains a lot of freedom 🙂

So what does "lost" mean? What does "never received" mean? By the reader history cache, or, as I forgot yesterday, by the application? Which means there are a few more cases that could be considered sample loss:

  • (2a) a sample has been received by the protocol stack and has been available in the reader history for some time, but pushed out of it by later samples.
  • (5) a sample never makes it to the reader because it was initially lost, and lifespan caused it to be dropped from the writer's history cache before a retransmit could succeed.
  • (5a) lifespan causes the sample to have been dropped from the RHC before the application read it.
  • (6a) no writers with registrations left (so the instance is in the "not-alive-no-writers" state) and the "auto purge" settings cause it to be removed from the reader history cache before the application read it.
  • (6b) the instance was disposed, and again, the "auto purge" settings cause it to be removed automatically before the application read it.
  • (7) samples that arrive but are dropped because they are too old (“by source timestamp destination order”) or of equal timestamp but lose out because of the fallback order on writer GUID.

My case (1) above is a breach of contract and without a doubt worthy of a notification. Still, there is a complication: event ordering. The only guarantees I am (as of now) prepared to give are that when samples have been lost and it cannot be proved that they were never meant to be received:

  • that increasing the "lost" count and setting status, triggering waitsets and invoking listeners will all occur eventually;
  • that the "lost" count will have been increased (and "lost" status set on the reader) at least once prior to the delivery (to the application) of the first possibly-ordered sample following the event, where possibly-ordered means only those samples/lost samples that cannot be proved to be unordered.

Specifically, that means I'm not prepared to guarantee:

  • that there will be a single increase for a single loss event (I may split it over multiple events);
  • that it will not overcount, in the sense that samples counted as lost may actually have been ones you would never have received, e.g.,
    • filtered out,
    • QoS changes in the intervening period that would have caused an unmatch/rematch as a consequence of which you were never meant to receive those samples;
    • ...
  • that listener invocation will occur prior to delivery (to the application) of the first sample following the loss that is possibly-ordered with respect to any of the lost samples;
  • that any part of the notification only ever occurs after delivery (to the application) of the last sample preceding the loss that is possibly-ordered with respect to any of the lost samples (i.e., you may receive a notification of a loss event in the future).

In short, you will know very little. But if you were to diligently read the "sample lost" status after each read/take and it says nothing has been lost, you can be sure nothing was lost up to and including the samples you just read that was also ordered with respect to ones you just read.

That covers the clear-cut case reasonably well, I think 🙂

I agree with @jacobperron that the various other cases can be interesting information when monitoring system performance. My view is that none of them should be counted as lost samples (except perhaps rejecting a best-effort sample because of resource limits), but I do think it would be good to have an interface for monitoring them, and it would probably be wise to make that a separate monitoring interface.

For volatile data, Cyclone can, in principle, distinguish between cases (2) and (2a) and case (1) with a keep-last writer — it’ll require some extension to the network protocol and I need to check whether the changes currently being done to DDSI-RTPS will cover it. How that works out if the data is transient-local needs some more thought. Do note that, in DDS semantics cases (2) and (2a) are the same, but in system performance monitoring, they are different …

For best-effort, it can be packet loss (my case (3)), but really also overwriting in the history or rejecting a sample because of the resource limits. @ivanpauno, “rejected” for best-effort means something totally different from “rejected” for reliable. For the best-effort case you should combine them. Sorry, my bad.

Data that has been filtered out by definition isn’t lost, so (4) should never be counted as lost. From a monitoring perspective, it would make sense to count the number of samples that were filtered out by the reader as it might provide interesting information on the effectiveness of filtering at the writer.

Case (5) and (5a) are similar, and it seems to me they should be treated the same as (2) and (2a): cases (2) and (5) are where the writer can’t push its data out quickly enough, and (2a) and (5a) where the reader can’t keep up with the writer.

Cases (6a) and (6b) I think are best ignored: one should never set autopurge unless one means it. Case (7) is interesting, but it is either expected or flawed design/implementation of the data model or the writer.

--

If one goes down this path, it is going to be tough to test it using the ROS2 interfaces. Cyclone has an interface to make it deaf or mute, which I use to test for handling of those "asymmetric" disconnections that my case (1) builds on. But that is not a standard feature and I don't think you'd want to expose it in ROS2!

Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
@ivanpauno
Member Author

For best-effort, it can be packet loss (my case (3)), but really also overwriting in the history or rejecting a sample because of the resource limits. @ivanpauno, “rejected” for best-effort means something totally different from “rejected” for reliable. For the best-effort case you should combine them. Sorry, my bad.

This sounds like combining lost/rejected statuses isn't a good idea, so I reverted that in 30909a4.
IIUC, with a reliable subscriber, combining sample lost and rejected will make the status overcount (even) more.
I think that in the case of best-effort subscriptions the usefulness of these status callbacks is more limited, so it doesn't seem very important if we're under-counting in that case (which, AFAIU, will happen if the rejected samples aren't counted too).

For the moment, I'm planning to only add to rmw the message (sample) lost status callbacks.

@ivanpauno
Member Author

Thanks for your super detailed answer @eboasson !!!

@ivanpauno ivanpauno marked this pull request as ready for review June 10, 2020 19:53
Member

@jacobperron jacobperron left a comment


LGTM besides one comment.

Review comment on rmw_cyclonedds_cpp/src/rmw_node.cpp (outdated, resolved)
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>

Co-authored-by: Jacob Perron <jacob@openrobotics.org>
@ivanpauno ivanpauno merged commit 5a52d29 into master Jun 17, 2020
@delete-merged-branch delete-merged-branch bot deleted the ivanpauno/add-message-lost-event branch June 17, 2020 21:55
ahcorde pushed a commit that referenced this pull request Oct 28, 2020
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
Co-authored-by: Jacob Perron <jacob@openrobotics.org>