Discovery: too slow and high network usage #281

Closed
alsora opened this issue May 23, 2019 · 14 comments · Fixed by eProsima/Fast-DDS#541 or #283
Labels
bug Something isn't working

Comments

@alsora
Contributor

alsora commented May 23, 2019

Hi,

after updating to Fast-RTPS 1.8.0 I have again issues during discovery when running applications with approximately 20 nodes.

The behavior is the same as the one I found in Fast-RTPS 1.7.0 (#249).

I can tell that all the nodes discover each other, however the Endpoint Discovery Phase hangs forever.

Moreover, during discovery, I can see upload network usage of approximately 50 Kb per second.

The application I'm trying to run has 20 nodes, 23 publishers and 35 subscriptions.
https://github.com/irobot-ros/ros2-performance/tree/master/performances/benchmark

Could it be related to this change?
eProsima/Fast-DDS@af648ac

@dirk-thomas added the bug label on May 23, 2019
@MiguelCompany
Collaborator

Hi @alsora

I can tell that all the nodes discover each other, however the Endpoint Discovery Phase hangs forever.

How have you checked that the nodes have discovered each other? If they haven't, discovery may depend on the participant's announcement period.

Moreover, during discovery, I can see upload network usage of approximately 50 Kb per second.

Would you be so kind as to send Wireshark captures of the working and non-working cases, so we can better understand where the problem may be?

Could it be related to this change?
eProsima/Fast-RTPS@af648ac

I don't think so, since that change restored the timings to be the same as in v1.7.2. The constructor of Duration_t changed to receive nanoseconds instead of a fraction, so that commit made the timings use the correct nanosecond values.
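
To make the timing point concrete, here is a minimal sketch (illustrative only, not the actual Fast-RTPS Duration_t): if a constructor that used to take an RTPS fraction (1 s == 2^32 fractions) starts taking nanoseconds, a literal written for the old signature yields a different duration until it is converted, which is what the commit above did.

```cpp
#include <cstdint>
#include <iostream>

// Old-style duration: RTPS wire representation, 1 second == 2^32 fractions.
struct OldDuration {
  int32_t seconds;
  uint32_t fraction;
  OldDuration(int32_t s, uint32_t frac) : seconds(s), fraction(frac) {}
  double to_seconds() const { return seconds + fraction / 4294967296.0; }
};

// New-style duration: plain nanoseconds, 1 second == 1'000'000'000 ns.
struct NewDuration {
  int32_t seconds;
  uint32_t nanosec;
  NewDuration(int32_t s, uint32_t ns) : seconds(s), nanosec(ns) {}
  double to_seconds() const { return seconds + nanosec / 1e9; }
};

int main() {
  // A call site written for the old signature: "250 ms" expressed as a fraction.
  const uint32_t quarter_second_as_fraction = 1073741824u;  // 2^32 / 4
  std::cout << OldDuration(0, quarter_second_as_fraction).to_seconds() << " s\n";  // 0.25 s
  // Reinterpreted by the nanosecond-based constructor, the same literal becomes
  // ~1.07 s, so discovery timings drift unless the literals are converted.
  std::cout << NewDuration(0, quarter_second_as_fraction).to_seconds() << " s\n";  // ~1.07 s
  return 0;
}
```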

We will reproduce the issue with the simple example shown in #249 and look for the commit that introduced the discovery regression.

@alsora
Contributor Author

alsora commented May 24, 2019

@MiguelCompany Thank you for replying

Here you can see the functions that I'm using for checking PDP and EDP
https://github.com/irobot-ros/ros2-performance/blob/master/performances/performance_test/src/ros2/system.cpp#L110

I use the following APIs: Node::get_node_names() for PDP and Node::count_subscribers(topic_name) for EDP.
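
For readers following along, a minimal sketch of this kind of polling, assuming a plain rclcpp node (the helper names below are illustrative, not the actual ros2-performance implementation):

```cpp
#include <chrono>
#include <cstddef>
#include <memory>
#include <string>
#include <thread>
#include <rclcpp/rclcpp.hpp>

using namespace std::chrono_literals;

// PDP check: poll until the node sees at least `expected` participant names.
bool wait_pdp(const rclcpp::Node::SharedPtr & node, size_t expected,
              std::chrono::seconds timeout)
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    if (node->get_node_names().size() >= expected) {
      return true;
    }
    std::this_thread::sleep_for(50ms);
  }
  return false;
}

// EDP check: poll until at least `expected` subscriptions are counted on `topic`.
bool wait_edp(const rclcpp::Node::SharedPtr & node, const std::string & topic,
              size_t expected, std::chrono::seconds timeout)
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    if (node->count_subscribers(topic) >= expected) {
      return true;
    }
    std::this_thread::sleep_for(50ms);
  }
  return false;
}
```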

Regarding reproducing the issue: note that I am testing 2 applications:

  • 10 nodes, 10 pub, 13 sub
  • 20 nodes, 23 pub, 35 sub

In the first one I don't see any issues.

At the moment, the workaround I'm using to run the second one is to wait 1 second between the creation of each node.
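
A sketch of that workaround (illustrative only, not the benchmark code; create_nodes_staggered is a hypothetical helper):

```cpp
#include <chrono>
#include <memory>
#include <string>
#include <thread>
#include <vector>
#include <rclcpp/rclcpp.hpp>

// Create the nodes one by one, pausing 1 second after each so the discovery
// traffic of the new participant settles before the next one joins.
std::vector<rclcpp::Node::SharedPtr> create_nodes_staggered(
  const std::vector<std::string> & names)
{
  std::vector<rclcpp::Node::SharedPtr> nodes;
  for (const auto & name : names) {
    nodes.push_back(std::make_shared<rclcpp::Node>(name));
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
  return nodes;
}
```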

I will get you some Wireshark data as soon as possible.

@MiguelCompany
Collaborator

I use the following APIs: Node::get_node_names() for PDP and Node::count_subscribers(topic_name) for EDP.

The nightly sanitizer jobs have found a data race in custom_participant_info.hpp (discovered_names and discovered_namespaces are not properly protected). This means Node::get_node_names() could be returning wrong results.
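
For illustration, guarding those members could look roughly like the sketch below (hedged and simplified; the class and method names are hypothetical, and the actual fix is the one proposed later in #283):

```cpp
#include <mutex>
#include <set>
#include <string>
#include <vector>

// Illustrative container for discovered node names; the real structure lives in
// rmw_fastrtps' custom_participant_info.hpp and is shaped differently.
class DiscoveredNames
{
public:
  // Called from the discovery listener thread.
  void add(const std::string & name)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    names_.insert(name);
  }

  // Called from the user thread, e.g. when servicing Node::get_node_names().
  std::vector<std::string> snapshot() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return {names_.begin(), names_.end()};
  }

private:
  mutable std::mutex mutex_;
  std::set<std::string> names_;
};
```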

At the moment, the workaround I'm using to run the second one is to wait 1 second between the creation of each node.

So, let's summarize ...

  1. Creating 10 nodes, 10 pub, 13 sub: everything works
  2. Creating 20 nodes, 23 pub, 35 sub, waiting 1 second between the creation of each node: everything works
  3. Creating 20 nodes, 23 pub, 35 sub, without waiting:
    a. wait_pdp_discovery returns
    b. wait_edp_discovery waits for more than 30 seconds

Is this what happens? If so, how much time does it take for wait_pdp_discovery to return?

Thank you for helping us understand the issue.

@MiguelCompany
Collaborator

By the way, we are trying to reproduce the problem with the example you provided on #249. We added the example to the ros2 demos repo here and haven't been able to reproduce the problem.

We are also adding a blackbox test for a similar situation: 30 participants, each creating one publisher and one subscriber to the same topic, here (still WIP).

@alsora
Contributor Author

alsora commented May 24, 2019

Yes that's exactly what happens.

wait_pdp_discovery returns almost immediately (less than 100 milliseconds).

Adding some logs here and there, I always see a small number of subscriptions left unmatched (fewer than 3).

I think that adding tests like this will be really useful for the future in any case!

However, keep in mind that each ROS 2 node also creates a parameter server, i.e. 6 RTPSReaders and 6 RTPSWriters.
For example, if I disable the parameter server in my application, discovery works.
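(A rough count based on the numbers in this thread: 20 nodes × (6 + 6) = 240 parameter-service endpoints, in addition to the 23 publishers and 35 subscriptions, so with the parameter server enabled EDP has to match on the order of 300 endpoints rather than roughly 60.)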

@alsora
Contributor Author

alsora commented May 24, 2019

I tried the old "stress test" again: I start seeing problems with 1 publisher and 50 subscribers.
However, I think it's more useful to wait for discovery rather than checking whether messages from all the publishers are received.

@alsora
Contributor Author

alsora commented May 24, 2019

@MiguelCompany Here is the Wireshark data.
For these tests I set the discovery timeout to 50 seconds.

TEST 1: no wait between nodes creation

PDP time: 50ms
EDP time: timeout

[screenshot: test_1]

TEST 2: 1 sec wait between nodes creation

PDP time: 0
EDP time: 0

[screenshot: test_2]

In the second test, node creation takes 20 seconds (1 second per node). During this time Wireshark shows network usage of about 10 Kb per second.
In the same situation, the Linux System Monitor shows the 50 Kb per second I was referring to before.

wireshark_captures.tar.gz

@MiguelCompany
Collaborator

We found the issue. It was related to a change necessary for the implementation of the lifespan QoS. A fix is on the way in eProsima/Fast-DDS#541, a new blackbox test is being added in eProsima/Fast-DDS#542, and a new unit test is under development.

@dirk-thomas
Member

@MiguelCompany great news!

@alsora can you please retest with the latest code including the fix?

@alsora
Contributor Author

alsora commented May 28, 2019

I tested again with the latest updates.

The situation is definitely improved, but it's not fixed.

Considering 10 runs:

  • 8/10: discovery completes within 1 second
  • 2/10: PDP hangs (timeout of 50 seconds)

This is different from what I saw last week, where PDP was working and EDP was not.

@MiguelCompany
Collaborator

PDP hangs

This may be related to the data race I mentioned in my previous comment.

The data race affects the data structures consulted by Node::get_node_names(), so it may be the reason for the hang.

I created #283 with a proposed fix.
@alsora Would you be so kind as to test against it?

@alsora
Contributor Author

alsora commented May 28, 2019

@MiguelCompany The problem persists even after fixing the data race.

@MiguelCompany
Collaborator

@alsora I don't know if this issue is still relevant or not. Do you think it can be closed?

@alsora
Contributor Author

alsora commented Sep 7, 2020

Yes, I think it can be closed.
At least in Foxy I haven't seen this issue anymore.

@alsora closed this as completed on Sep 7, 2020