Replaced p2p_connections_mutex with fine-grained locking #197

etremel · 2021-04-21T19:15:31Z

As discussed in the issue report for #195, the "one big lock" on P2PConnectionsManager within RPCManager was vulnerable to deadlocks. One solution is to make the P2P connections a fixed-size array with a separate lock for each P2P connection, so that unrelated add and remove operations don't block each other. I've implemented that solution and tested it to ensure P2P requests still work in normal operation. When merged, this should fix #195.
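To illustrate the approach described above, here is a minimal sketch of a fixed-size table with one mutex per slot. It is not the code from this branch: `ConnectionTable`, `Slot`, and `Connection` are hypothetical names, and the real P2PConnectionManager may differ in both layout and interface.

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <mutex>

// Hypothetical stand-in for a real P2P connection object.
struct Connection {
    std::size_t peer_id;
};

// Fixed-size table: slot i holds the connection to node i and has its own
// mutex, so operations on different node IDs never contend on the same lock.
template <std::size_t MaxNodes>
class ConnectionTable {
    struct Slot {
        std::mutex lock;
        std::unique_ptr<Connection> conn;
    };
    std::array<Slot, MaxNodes> slots;

public:
    // Adds a connection for node_id; returns false if one already exists.
    bool add(std::size_t node_id) {
        Slot& slot = slots.at(node_id);
        std::lock_guard<std::mutex> guard(slot.lock);
        if(slot.conn) {
            return false;
        }
        slot.conn = std::make_unique<Connection>(Connection{node_id});
        return true;
    }

    // Removes the connection for node_id; returns false if there was none.
    bool remove(std::size_t node_id) {
        Slot& slot = slots.at(node_id);
        std::lock_guard<std::mutex> guard(slot.lock);
        if(!slot.conn) {
            return false;
        }
        slot.conn.reset();
        return true;
    }

    // Reports whether a connection to node_id currently exists.
    bool contains(std::size_t node_id) {
        Slot& slot = slots.at(node_id);
        std::lock_guard<std::mutex> guard(slot.lock);
        return static_cast<bool>(slot.conn);
    }
};
```

Because each operation takes only the lock for its own slot, adding node 1 and removing node 2 can proceed concurrently, which is the property the single shared mutex lacked.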

Working with Weijia, I sketched out a new way to manage the P2P connections in RPCManager using fine-grained locking on a fixed-size array of P2PConnection pointers. This will probably work, but one concern is that iterating over all of the connections for probe_all and check_failures_loop will now be too expensive because of all of the extra lock operations. We might need to import a concurrent hash map library after all.

After doing some performance testing with Weijia, we determined that the cost of accessing a std::vector<bool> on every loop iteration (i.e. in probe_all) is much higher than accessing a plain C array of bool.
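For context, the two layouts being compared can be sketched as follows. `std::vector<bool>` is a bit-packed specialization, so each read goes through a proxy object and a bit mask, while a plain array stores one byte per flag. The function names and `kMaxNodes` are hypothetical, and the sketch omits any synchronization that flags shared between threads would need.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kMaxNodes = 8;  // hypothetical capacity for illustration

// Bit-packed flags: operator[] computes a word index and a bit mask per read.
std::size_t count_active_packed(const std::vector<bool>& flags) {
    std::size_t active = 0;
    for(std::size_t i = 0; i < flags.size(); ++i) {
        if(flags[i]) {
            ++active;
        }
    }
    return active;
}

// Plain array: each flag occupies its own byte and is read directly.
std::size_t count_active_plain(const bool (&flags)[kMaxNodes]) {
    std::size_t active = 0;
    for(std::size_t i = 0; i < kMaxNodes; ++i) {
        if(flags[i]) {
            ++active;
        }
    }
    return active;
}
```

Both functions return the same result; only the per-element access path differs, which is where the cost measured above would come from in a loop that runs continuously.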

The special case of the P2P connection to "myself" (the local node) is set up in the constructor of P2PConnectionManager, not via add_connections(). The constructor therefore also needs to set active_p2p_connections[my_node_id] to true; otherwise the local P2P connection will never be checked in probe_all().
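A minimal sketch of why the constructor matters, assuming a probe loop that skips every slot whose flag is false. `ManagerSketch` and its members are hypothetical and only mirror the names mentioned above; they are not the actual P2PConnectionManager code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class ManagerSketch {
    static constexpr std::size_t kMaxNodes = 8;  // hypothetical capacity
    bool active_p2p_connections[kMaxNodes] = {};
    std::uint32_t my_node_id;

public:
    // The local connection never passes through add_connections(), so its
    // flag must be set here or probe_all() will skip it forever.
    explicit ManagerSketch(std::uint32_t my_id) : my_node_id(my_id) {
        active_p2p_connections[my_node_id] = true;
    }

    // Marks each remote node in node_ids as active, ignoring out-of-range IDs.
    void add_connections(const std::vector<std::uint32_t>& node_ids) {
        for(std::uint32_t id : node_ids) {
            if(id < kMaxNodes) {
                active_p2p_connections[id] = true;
            }
        }
    }

    // Returns the IDs a probe pass would visit: only slots flagged active.
    std::vector<std::uint32_t> probe_all() const {
        std::vector<std::uint32_t> visited;
        for(std::uint32_t id = 0; id < kMaxNodes; ++id) {
            if(active_p2p_connections[id]) {
                visited.push_back(id);
            }
        }
        return visited;
    }
};
```

Without the assignment in the constructor, a freshly built manager would visit no slots at all, including its own.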

Since ExternalGroup copies and pastes parts of RPCManager, it must be manually synchronized every time RPCManager changes. While I'm updating it to take into account the new P2PConnectionManager, I might as well fix these variable names which have gotten out of sync with RPCManager.

songweijia · 2021-04-22T20:12:37Z

Thank you, Edward! The code looks like it matches our discussion. I will test it with the experiment that found the bug.

songweijia · 2021-04-25T19:40:36Z

@etremel Edward, I found that I need at least the following change to make the fix_p2p_mutex branch compile. Could you make sure you have all your local commits pushed?

diff --git a/include/derecho/core/detail/external_group_impl.hpp b/include/derecho/core/detail/external_group_impl.hpp
index 7baf45fe..509bdff4 100644
--- a/include/derecho/core/detail/external_group_impl.hpp
+++ b/include/derecho/core/detail/external_group_impl.hpp
@@ -383,7 +383,7 @@ void ExternalGroup<ReplicatedTypes...>::p2p_receive_loop() {
 
     uint64_t max_payload_size = getConfUInt64(CONF_SUBGROUP_DEFAULT_MAX_PAYLOAD_SIZE);
 
-    request_worker_thread = std::thread(&ExternalGroup<ReplicatedTypes...>::fifo_worker, this);
+    request_worker_thread = std::thread(&ExternalGroup<ReplicatedTypes...>::p2p_request_worker, this);
 
     struct timespec last_time, cur_time;
     clock_gettime(CLOCK_REALTIME, &last_time);

etremel · 2021-04-25T22:06:30Z

Oh, you're right, I missed that instance of fifo_worker when I renamed it to p2p_request_worker in ExternalCaller. I thought for sure I checked to ensure everything compiled before pushing that last commit, but I guess I didn't. I'll fix it.

songweijia · 2021-04-26T04:08:38Z

I tested it with the original tests a couple of times. It works well without a deadlock.

etremel added 4 commits April 14, 2021 18:23

Minor fixes to the fine-grained locks

2ef2fc2

etremel requested a review from songweijia April 21, 2021 19:15

Fixed a typo in the previous commit

c83bf7a

I missed one instance of fifo_worker when renaming it to p2p_request_worker.

songweijia approved these changes Apr 26, 2021

etremel merged commit ecd565f into master Apr 26, 2021

etremel deleted the fix_p2p_mutex branch June 8, 2021 18:37
