Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replaced p2p_connections_mutex with fine-grained locking #197

Merged
merged 5 commits into from
Apr 26, 2021

Conversation

etremel
Copy link
Contributor

@etremel etremel commented Apr 21, 2021

As discussed in the issue report for #195, the "one big lock" on P2PConnectionsManager within RPCManager was vulnerable to deadlocks. One solution is to make the P2P connections a fixed-size array with a separate lock for each P2P connection, so that unrelated add and remove operations don't block each other. I've implemented that solution and tested it to ensure P2P requests still work in normal operation. When merged, this should fix #195.

Working with Weijia, I sketched out a new way to manage the P2P
connections in RPCManager using fine-grained locking on a fixed-size
array of P2PConnection pointers. This will probably work, but one
concern is that iterating over all of the connections for probe_all and
check_failures_loop will now be too expensive because of all of the
extra lock operations. We might need to import a concurrent hash map
library after all.
After doing some performance testing with Weijia, we determined that the
cost of accessing a std::vector<bool> on every loop iteration (i.e. in
probe_all) is much higher than accessing a plain C array of bool.
The special case of the P2P connection to "myself" (the local node) is
set up in the constructor of P2PConnectionManager, not via
add_connections(). The constructor also needs to set
active_p2p_connections[my_node_id] to true, otherwise the local P2P
connection will never be checked in probe_all().
Since ExternalGroup copies and pastes parts of RPCManager, it must be
manually synchronized every time RPCManager changes. While I'm updating
it to take into account the new P2PConnectionManager, I might as well
fix these variable names which have gotten out of sync with RPCManager.
@etremel etremel requested a review from songweijia April 21, 2021 19:15
@songweijia
Copy link
Contributor

Thank you, Edward! The code looks matching our discussion. I will test it with the experiment which found the bug.

@songweijia
Copy link
Contributor

@etremel Edward, I found that I need at least the following change to make the fix_p2p_mutex branch compile. Could you make sure you have all your local commits pushed?

diff --git a/include/derecho/core/detail/external_group_impl.hpp b/include/derecho/core/detail/external_group_impl.hpp
index 7baf45fe..509bdff4 100644
--- a/include/derecho/core/detail/external_group_impl.hpp
+++ b/include/derecho/core/detail/external_group_impl.hpp
@@ -383,7 +383,7 @@ void ExternalGroup<ReplicatedTypes...>::p2p_receive_loop() {
 
     uint64_t max_payload_size = getConfUInt64(CONF_SUBGROUP_DEFAULT_MAX_PAYLOAD_SIZE);
 
-    request_worker_thread = std::thread(&ExternalGroup<ReplicatedTypes...>::fifo_worker, this);
+    request_worker_thread = std::thread(&ExternalGroup<ReplicatedTypes...>::p2p_request_worker, this);
 
     struct timespec last_time, cur_time;
     clock_gettime(CLOCK_REALTIME, &last_time);

I missed one instance of fifo_worker when renaming it to
p2p_request_worker.
@etremel
Copy link
Contributor Author

etremel commented Apr 25, 2021

Oh, you're right, I missed that instance of fifo_worker when I renamed it to p2p_request_worker in ExternalCaller. I thought for sure I checked to ensure everything compiled before pushing that last commit, but I guess I didn't. I'll fix it.

@songweijia
Copy link
Contributor

I tested it with the original tests a couple of times. It works well without a deadlock.

Copy link
Contributor

@songweijia songweijia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested it with the original tests a couple of times. It works well without a deadlock.

@etremel etremel merged commit ecd565f into master Apr 26, 2021
@etremel etremel deleted the fix_p2p_mutex branch June 8, 2021 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Non-deterministic deadlock on handling new external clients
2 participants