Service clients freeze on multi-client cases. #372

Open
kjjpc opened this issue Jul 8, 2024 · 6 comments

@kjjpc

kjjpc commented Jul 8, 2024

Bug report

  • Operating System:
    • Ubuntu 22.04
  • Installation type:
    • binaries
  • Version or commit hash:
    • ros-jazzy-ros-base=0.11.0-1 (apt package)
  • DDS implementation:
    • Fast DDS, Cyclone DDS
  • Client library (if applicable):
    • rclcpp, rclpy

Steps to reproduce issue

I run 1 service server and 3 service clients that call the service at high frequency as a stress test.
The attached code is Python, but the same result occurs with C++.

Service Server

import rclpy
from rclpy.node import Node
from example_interfaces.srv import AddTwoInts

class MinimalService(Node):
    def __init__(self):
        super().__init__('minimal_service')
        self.srv = self.create_service(AddTwoInts, 'add_two_ints', self.add_two_ints_callback)

    def add_two_ints_callback(self, request, response):
        response.sum = request.a + request.b
        return response

def main(args=None):
    rclpy.init(args=args)
    minimal_service = MinimalService()
    rclpy.spin(minimal_service)
    rclpy.shutdown()

if __name__ == '__main__':
    main()

Service Client (run on 3 terminals)

import rclpy
from rclpy.node import Node
from example_interfaces.srv import AddTwoInts

class MinimalClientAsync(Node):
    def __init__(self):
        super().__init__('minimal_client_async')
        self.cli = self.create_client(AddTwoInts, 'add_two_ints')
        while not self.cli.wait_for_service(timeout_sec=1.0):
            self.get_logger().info('service not available, waiting again...')

    def send_request(self, a, b):
        self.req = AddTwoInts.Request()
        self.req.a = a
        self.req.b = b
        self.future = self.cli.call_async(self.req)
        rclpy.spin_until_future_complete(self, self.future)
        return self.future.result()

def main(args=None):
    rclpy.init(args=args)
    minimal_client = MinimalClientAsync()
    for i in range(1000000000):
        response = minimal_client.send_request(1, 2)
        if i%10000 == 0:
            minimal_client.get_logger().info(
                'Result of add_two_ints: for %d + %d = %d' %
                (1, 2, response.sum))

    minimal_client.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()

Expected behavior

The clients continue to display results from the service server repeatedly.

Actual behavior

The service client sometimes freezes in spin_until_future_complete and then does not progress at all. Even after 2 of the 3 clients are terminated, the frozen client does not resume its loop.
The freeze occurs not only in high-frequency cases but also in low-frequency multi-client cases.

Additional information

This issue was also reported in ros2/rcl#1163.
The freeze is resolved by changing the QoS parameters as follows.

        # requires: from rclpy.qos import qos_profile_services_default, HistoryPolicy
        qos = qos_profile_services_default
        qos.history = HistoryPolicy.KEEP_ALL
        self.cli = self.create_client(AddTwoInts, 'add_two_ints', qos_profile=qos)

The default service QoS is KEEP_LAST with a depth of 10, which is not enough for multi-client cases.
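
Note that the snippet above modifies the shared qos_profile_services_default object in place, which would also affect any other clients created later in the same process. A variant that builds a fresh profile instead might look like the following sketch (assuming the rclpy.qos.QoSProfile keyword arguments; not taken from the original report):

from rclpy.qos import QoSProfile, HistoryPolicy, ReliabilityPolicy

# Dedicated profile for this client: reliable, with KEEP_ALL history so that
# responses are not evicted when the reader history fills up.
qos = QoSProfile(
    reliability=ReliabilityPolicy.RELIABLE,
    history=HistoryPolicy.KEEP_ALL,
)
self.cli = self.create_client(AddTwoInts, 'add_two_ints', qos_profile=qos)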

In ROS 2, every service response from the service server is published to all clients, which means a client's buffer can be filled up by responses intended for other callers, and the response for that caller can be lost.

The default service client QoS is not suitable for multi-client cases, which are common in ROS 2. The default history policy should be changed from KEEP_LAST to KEEP_ALL.
The parameter is defined in qos_profiles.h, where rmw_qos_profile_services_default is initialized with:

RMW_QOS_POLICY_HISTORY_KEEP_LAST,
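
For reference, the same default can be checked directly from rclpy (a quick sketch; attribute names as exposed by rclpy.qos):

from rclpy.qos import qos_profile_services_default

print(qos_profile_services_default.history)  # KEEP_LAST (as described above)
print(qos_profile_services_default.depth)    # 10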

@kjjpc
Author

kjjpc commented Jul 10, 2024

Additional information (Thanks @wuisky)

The issue of the server sending responses to all clients is tracked in ros2/rclcpp#2397.

@fujitatomoya
ros2/rclcpp#2397 causes the freeze of the service client.
How should we fix this issue?
The proper solution is to resolve ros2/rclcpp#2397.

An easy solution is to change the default history policy from KEEP_LAST to KEEP_ALL, or to increase the default depth from 10 to 100 or 1000.

@fujitatomoya
Collaborator

I tried your reproducible examples for both the service server and the client, and no process freeze was observed in my environment.

# service server
root@tomoyafujita:~/ros2_ws/colcon_ws# ros2 run prover_rclpy rmw_372_server --ros-args --log-level debug
... <keeps logging>

# service client (print once in a thousand)
root@tomoyafujita:~/ros2_ws/colcon_ws# ros2 run prover_rclpy rmw_372_client
[INFO] [1725490218.437069421] [minimal_client_async]: Result of add_two_ints: for 1 + 2 = 3
[INFO] [1725490244.581709735] [minimal_client_async]: Result of add_two_ints: for 1 + 2 = 3
[INFO] [1725490324.383530299] [minimal_client_async]: Result of add_two_ints: for 1 + 2 = 3
[INFO] [1725490461.426458776] [minimal_client_async]: Result of add_two_ints: for 1 + 2 = 3
[INFO] [1725490658.204459264] [minimal_client_async]: Result of add_two_ints: for 1 + 2 = 3
[INFO] [1725490919.450157012] [minimal_client_async]: Result of add_two_ints: for 1 + 2 = 3

The above data tells me that the response time on the service server increases significantly over time.
I am not sure why this is happening; CPU utilization for the service server is not even at 100%, so I think this is an rclpy executor efficiency problem.

The service client sometimes freezes in spin_until_future_complete

Can you rephrase "freeze" here? Does this mean the process gets hung up?

The default service QoS is KEEP_LAST with a depth of 10, which is not enough for multi-client cases.

I do not think so.

Because your sample client calls spin_until_future_complete after every request it sends, it waits on that single request until it receives the response (no asynchronous send/receive pattern is implemented in the sample).
That also means that, from the service server's perspective, with 3 clients running at the same time it can expect at most 3 requests in flight at once.
IMO the default QoS of KEEP_LAST with depth 10 should really be enough for this test case. I am not sure why changing the QoS depth affects the behavior of this test...
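
For contrast, a client that actually keeps several requests in flight at once might look roughly like the sketch below (illustrative only; the class name OverlappingClient and the pending-count bookkeeping are invented here, not taken from the thread):

import rclpy
from rclpy.node import Node
from example_interfaces.srv import AddTwoInts

class OverlappingClient(Node):
    def __init__(self):
        super().__init__('overlapping_client')
        self.cli = self.create_client(AddTwoInts, 'add_two_ints')
        while not self.cli.wait_for_service(timeout_sec=1.0):
            self.get_logger().info('service not available, waiting again...')
        self.pending = 0

    def send_request(self, a, b):
        req = AddTwoInts.Request()
        req.a = a
        req.b = b
        future = self.cli.call_async(req)   # do not block on the future here
        self.pending += 1
        future.add_done_callback(self.on_response)

    def on_response(self, future):
        # With several requests in flight, this client's responses now compete
        # for space in the reader history together with everyone else's.
        self.pending -= 1
        self.get_logger().info('Result: %d' % future.result().sum)

def main(args=None):
    rclpy.init(args=args)
    node = OverlappingClient()
    for _ in range(20):          # keep 20 requests in flight at once
        node.send_request(1, 2)
    while node.pending > 0:      # spin until every response has been received
        rclpy.spin_once(node)
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()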

ros2/rclcpp#2397 causes the freeze of the service client.

Could you tell us why you think ros2/rclcpp#2397 is the root cause of this?

As mentioned above, the server deals with at most 3 requests.
The server sends each response to all clients over the hidden topic, but every client checks the identification that tells whether the message is for itself or not.
If the identification does not match, the client drops that message right away. This is not efficient, but I do not see why it would be the reason for this issue.
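
Conceptually, the filtering described above amounts to something like the following (a rough sketch only, not actual rmw code; the reader API and field names are invented for illustration):

# Every client subscribed to the response topic receives every response sample.
# Each sample carries an identification of the client that sent the request.
def take_response(reader, my_client_id):
    for sample in reader.take_all():           # drain the reader history
        if sample.client_id == my_client_id:   # addressed to this client
            return sample.response
        # otherwise the sample is dropped by this client right away
    return None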

As far as I can see, this is an rclpy executor performance problem.

fujitatomoya added a commit to fujitatomoya/ros2_test_prover that referenced this issue Sep 4, 2024
  Ref: ros2/rmw#372

Signed-off-by: Tomoya Fujita <Tomoya.Fujita@sony.com>
@kjjpc
Author

kjjpc commented Sep 10, 2024

@fujitatomoya
Thanks for your response.
The attached video shows the test behavior in my environment (I changed the print frequency).
The main loop of the clients stops in spin_until_future_complete.
https://github.com/user-attachments/assets/dc8087a9-74b3-46e6-9cf1-973748b5ec86

I agree that the current QoS depth is enough if the clients run their loops at equal rates.
One possible cause of the freeze is that the loops do not run equally, even with this simple test script.
I will investigate the behavior further.

The increase in response time may be a separate issue.
I will look into that as well.

@kjjpc
Author

kjjpc commented Sep 11, 2024

I investigated the loss of service responses.
The service server correctly writes responses with the response writer at this line:
https://github.com/ros2/rmw_fastrtps/blob/jazzy/rmw_fastrtps_shared_cpp/src/rmw_response.cpp#L162
However, the client cannot take the response from the response reader at this line:
https://github.com/ros2/rmw_fastrtps/blob/jazzy/rmw_fastrtps_shared_cpp/src/rmw_response.cpp#L69
This means some messages are lost inside Fast DDS.

The increase in loop period may cause the loss of service responses.
If the other clients each loop 10 times during one loop of a given client, the response buffer overflows and that client's response is lost, as sketched below.
This actually happens in the video attached to my previous comment.
Although it is not common for a service to be called this frequently, there are likely to be situations where service calls fail due to differences in loop speed.
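
The eviction can be illustrated with a toy model of a KEEP_LAST(depth=10) reader history (a sketch only; a plain Python deque stands in for the DDS history cache):

from collections import deque

# Toy stand-in for a KEEP_LAST(depth=10) reader history shared by all responses.
history = deque(maxlen=10)

# This client's response arrives, but before the client takes it, the faster
# clients' responses keep arriving on the same response topic.
history.append('response for this client')
for i in range(10):
    history.append('response for other client %d' % i)

print('response for this client' in history)   # False -> the response was evicted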

I will also look into the cause of the increasing loop period.

@fujitatomoya
Collaborator

@kjjpc I really appreciate your effort on this issue.

I am now interested in the fact that this issue is losing some response messages on the client side... If that is an RMW implementation problem, would you try https://github.com/ros2/rmw_cyclonedds (or https://github.com/ros2/rmw_connextdds) to see if they have the same problem as https://github.com/ros2/rmw_fastrtps?

If Cyclone DDS and Connext DDS can keep the high frequency rate between the clients and the service server, we could say this is an issue with https://github.com/ros2/rmw_fastrtps. That would not find the root cause yet, but I think it is worth taking a shot.
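
For example, assuming the alternative RMW packages are installed, the same test can be switched over with the RMW_IMPLEMENTATION environment variable (package and executable names reuse the ones from the transcript above):

# run both server and client with the same RMW implementation
RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 run prover_rclpy rmw_372_server
RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 run prover_rclpy rmw_372_client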

@kjjpc
Author

kjjpc commented Sep 12, 2024

@fujitatomoya Thanks for your reply.
I checked the behavior of Cyclone DDS using the scripts from the first comment.
The result is the same as with Fast DDS:
the loop period gradually becomes longer, and responses are lost in the multi-client case.

I think the slowdown of the spin loop is an issue in rmw or rcl, and the loss of service responses is an issue with the current architecture and QoS parameters.
