Segmentation fault in async_client only with SSL support #856
Could you post a Minimal, Reproducible Example?
Hi redboltz, thank you for your quick response. In the meantime, here is the stack trace from the crash; you can see that it is quite similar to the one reported in xtreemfs/xtreemfs#281. It occurs after a disconnection from the broker.
I'm quoting the interesting part of the xtreemfs ticket for you: maybe it will give you some hints.
I just looked at the issue and its solution. Unfortunately, I didn't get any useful information from it. I will start some analysis once you post a Minimal, Reproducible Example.
Also, I recommend applying valgrind and/or AddressSanitizer to your application. You might get some useful information.
@redboltz, yes, sure, I'll give it a try.
@ropieur
Additional questions: you said that you are using async_client. Please tell me about the disconnection situation.
Hi @redboltz,
Thank you. The current master contains some fixes to the shutdown process. It still doesn't cleanly shut down the TLS layer, but on the TCP layer it now calls not only socket close() but also socket shutdown(). It might solve your problem.
OS: Ubuntu 20.04
Thank you.
Due to #78, versions 11.1.0 and 8.0.0 only close the TCP socket. That has worked well so far, but I guess it could cause the problem you reported. I have now implemented graceful shutdown for all layers in #860. It is not merged yet; while implementing #860, I ran into a lot of blocking problems. I will merge it after testing #860 against my proprietary project's production code. Could you test the following two versions?
@redboltz, we are working on it. I delivered a package internally with the branch fix_856. We need to resolve issues on our platform, and then we will retest the disconnection from the broker. It may take several days before we get there. Anyway, I'll keep you posted with the results in this ticket.
@ropieur thank you! I will wait for your feedback :)
We managed to reproduce the crash with the branch fix_856, and I get the exact same backtrace from the coredump.
In the error handler, my app typically performs the following operations:
Question:
set_error_handler's callback is called from on_error() as follows (mqtt_cpp/include/mqtt/endpoint.hpp, lines 5045 to 5062 at f2ee0b8).
In line 5051, shutdown() is called internally, so you don't need to call async_force_disconnect(). I recommend the following code:

```cpp
this->tls_async_client->set_error_handler([this](const auto& error) { // triggered error = connection reset by peer
    // async_force_disconnect() removed
    this->steady_timer.expires_after(2s);
    this->steady_timer.async_wait([this](const auto& error) { // capture a weak_ptr to this to avoid firing after cancel
        if (error) return; // timer cancelled
        this->tls_async_client->async_connect();
    });
});
```

The timer could fire with a success error code even after being cancelled (destroyed). The typical approach is for the async_wait handler to capture a weak_ptr; see mqtt_cpp/include/mqtt/endpoint.hpp, lines 2265 to 2277 at f2ee0b8.
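As a standard-library-only illustration of this weak_ptr-capture pattern (the `fake_timer_queue` and `client` types below are hypothetical stand-ins for the asio timer and the mqtt_cpp client, not real mqtt_cpp code): the handler locks a weak_ptr before touching its owner, so a timer that fires after the owner's destruction becomes a harmless no-op instead of a use-after-free.

```cpp
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in for an asio timer: it just stores handlers and
// fires them later, possibly after the owning client has been destroyed.
struct fake_timer_queue {
    std::vector<std::function<void()>> pending;
    void async_wait(std::function<void()> h) { pending.push_back(std::move(h)); }
    void fire_all() {
        for (auto& h : pending) h();
        pending.clear();
    }
};

// Hypothetical stand-in for the MQTT client.
struct client {
    int reconnect_count = 0;
    void async_connect() { ++reconnect_count; }
};

// The handler captures a weak_ptr instead of a raw `this`; if the client
// is destroyed before the timer fires, lock() fails and nothing happens.
void schedule_reconnect(fake_timer_queue& timers, std::shared_ptr<client> const& c) {
    std::weak_ptr<client> wp = c;
    timers.async_wait([wp] {
        if (auto sp = wp.lock()) sp->async_connect(); // owner still alive
        // otherwise: owner already destroyed, safe no-op
    });
}
```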
FYI: socket_->post() is defined in mqtt_cpp/include/mqtt/tcp_endpoint.hpp, lines 72 to 77 at f2ee0b8.
So it is protected by the strand and can safely be called after close. Calling force_disconnect() in the timer handler is dangerous in a multi-threaded environment (mqtt_cpp/include/mqtt/endpoint.hpp, lines 1078 to 1084 at f2ee0b8), because the timer handler's code can run on any thread and is not protected by the strand, so users previously had to post through the strand themselves. Now async_force_disconnect() does that for you. shutdown() has a guard to avoid being called twice: mqtt_cpp/include/mqtt/endpoint.hpp, lines 5263 to 5273 at f2ee0b8.
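Such a call-twice guard is simple to illustrate. Here is a minimal sketch (not mqtt_cpp's actual implementation; the `connection` type and `teardown_count` member are hypothetical) using std::atomic, where exactly one caller performs the teardown:

```cpp
#include <atomic>

// Hypothetical sketch of a shutdown guard: exchange() atomically sets the
// flag and returns its previous value, so only the first caller proceeds.
struct connection {
    std::atomic<bool> shutdown_started{false};
    int teardown_count = 0; // stands in for the real teardown work

    void shutdown() {
        if (shutdown_started.exchange(true)) return; // already shutting down
        // real code would shut down the TLS layer, then the TCP socket
        ++teardown_count;
    }
};
```

This is useful because shutdown() may be reached both from the error path and from an explicit disconnect; the guard makes the second arrival a no-op.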
This is a TLS client and server example. This is an async client example. This is a simple connect/subscribe/publish/disconnect example. See whether it reproduces the crash in your environment.
I suspect that the memory corruption happens somewhere in your application. Did valgrind and/or AddressSanitizer report nothing?
Hi @redboltz ,
I'm aware of the various traps with asio, such as the lifetime of the objects passed to the handlers, and the fact that cancelling a socket or timer operation may not necessarily lead to the error operation_aborted in its handler. I will continue to assume that the corruption resides in my application and keep investigating on my side. My apologies if I took up too much of your time.
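That cancellation race can be modelled without asio. The following is a hypothetical toy model (the `toy_io`/`toy_timer` types are stand-ins, not asio code) of the behaviour mentioned above: if the timer has already expired and its completion handler was posted with success, a later cancel() is too late to change that handler's error code.

```cpp
#include <deque>
#include <functional>
#include <utility>

// error codes, mimicking asio's success / operation_aborted
enum class ec { success, operation_aborted };

// Toy completion queue: posted handlers run with the error code they
// were posted with, regardless of what happens afterwards.
struct toy_io {
    std::deque<std::pair<std::function<void(ec)>, ec>> ready;
    void run() {
        while (!ready.empty()) {
            auto [h, e] = std::move(ready.front());
            ready.pop_front();
            h(e);
        }
    }
};

struct toy_timer {
    toy_io& io;
    std::function<void(ec)> waiting; // handler not yet posted

    void async_wait(std::function<void(ec)> h) { waiting = std::move(h); }

    // Timer fired: post the handler with ec::success.
    void expire() {
        if (waiting) {
            io.ready.push_back({std::move(waiting), ec::success});
            waiting = nullptr;
        }
    }

    // cancel() only reaches a handler that is still waiting; a handler
    // already posted to io.ready keeps its ec::success.
    void cancel() {
        if (waiting) {
            io.ready.push_back({std::move(waiting), ec::operation_aborted});
            waiting = nullptr;
        }
    }
};
```

So a handler must not assume that `!error` means "not cancelled": the owner may already be gone, which is exactly why the weak_ptr capture discussed earlier matters.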
OK, if you find the root of the issue on your application side, then please close this issue. NOTE: #860 has been merged. It is worth having for clean (graceful) shutdown even if it is not the cause of this issue.
Hi @redboltz ,
Based on my observations, my feeling is that the sequence of [async_]force_disconnect followed by async_connect could potentially trigger the bug. I observe that the crash now appears almost immediately with async_force_disconnect. Note: the timer and the async_client both run on the same io_context, which itself runs in a single thread (active object).
It is very difficult to understand. My first impression is that it is tricky code. It seems to be a different situation from
Anyway, if
I noticed one more thing: all async APIs have a completion handler whose type is async_handler_t.
I noticed that after #860 was merged, async_force_disconnect()'s handler is called before all shutdown processing has completed.
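To illustrate what "handler called before shutdown completes" means, here is a hypothetical sketch (not mqtt_cpp code; `step` and `run_chain` are invented names) of chaining asynchronous teardown steps so that the user's completion handler only runs after the last step has finished:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// A step receives a continuation and must call it when the step is done.
using step = std::function<void(std::function<void()>)>;

// Runs the steps in order; the user's `done` handler is invoked only
// after every step has completed, never before.
void run_chain(std::vector<step> steps, std::function<void()> done, std::size_t i = 0) {
    if (i == steps.size()) {
        done();
        return;
    }
    steps[i]([steps, done, i]() mutable {
        run_chain(std::move(steps), std::move(done), i + 1);
    });
}
```

Under this scheme, invoking `done` early (the pre-fix behaviour) would let user code reconnect while the TLS/TCP teardown is still in flight, which is the kind of overlap being discussed here.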
I created #868, but I guess it is not directly related to this issue. I considered what your code wants to do, and I assume that you want to handle the case where the broker doesn't respond:

```cpp
void mqtt_active_object::connect()
{
    // 1. call async_connect()
    this->tls_async_client->async_connect(
        [this](mqtt::error_code ec) {
            if (ec) {
                // 4. The ec is set by the async_connect() process, or
                //    by the result of an async_force_disconnect() call.
                //    Reconnect 3s later.
                this->retry_timer.expires_after(3s);
                this->retry_timer.async_wait(
                    [this](const auto& error) {
                        if (error) { // cancelled
                            return;
                        }
                        // 5. retry (go to step 1)
                        this->connect();
                    }
                );
            }
            else {
                // 3. If async_connect() finished successfully,
                //    cancel connect_timer.
                this->connect_timer.cancel();
            }
        }
    );
    // 2. set connect_timer to retry if the broker doesn't respond
    this->connect_timer.expires_after(5s);
    this->connect_timer.async_wait(
        [this](const auto& error) {
            if (error) { // cancelled
                return;
            }
            // 3. If connect_timer fires, call async_force_disconnect().
            //    The callback handler is not used here.
            //    During or after async_force_disconnect processing,
            //    the async_connect() callback handler (step 4) is called with a failure.
            this->tls_async_client->async_force_disconnect();
        }
    );
}
```

The important point is avoiding multiple concurrent async_connect() calls. In order to avoid that, call async_force_disconnect() and let the resulting async_connect() failure drive the retry.
Hi @redboltz, async_connect succeeded (no error), but the connack_handler is never invoked (I suppose because the broker didn't ack). This leaves my client stuck, because the connect_timer has been cancelled and there will be no retry connection attempt. Thank you.
This is my observation.
First, you need to check step 3. You can set a breakpoint or debug output at the following lines: mqtt_cpp/include/mqtt/endpoint.hpp line 10456 at 6d26414, and mqtt_cpp/include/mqtt/endpoint.hpp line 11472 at 6d26414.
The first line is where the CONNECT packet is sent. If they are executed, the problem is on the broker side. So you can move the connect_timer cancellation into the CONNACK handler:

```cpp
void mqtt_active_object::connect()
{
    // 1. call async_connect()
    this->tls_async_client->async_connect(
        [this](mqtt::error_code ec) {
            if (ec) {
                // 4. The ec is set by the async_connect() process, or
                //    by the result of an async_force_disconnect() call.
                //    Reconnect 3s later.
                this->retry_timer.expires_after(3s);
                this->retry_timer.async_wait(
                    [this](const auto& error) {
                        if (error) { // cancelled
                            return;
                        }
                        // 5. retry (go to step 1)
                        this->connect();
                    }
                );
            }
        }
    );
    this->tls_async_client->set_connack_handler( // if you use v5, then set_v5_connack_handler
        [this](bool sp, MQTT_NS::connect_return_code connack_return_code) {
            // 3. All underlying connections (e.g. TCP, TLS) connected, then MQTT connected
            this->connect_timer.cancel();
            return true;
        }
    );
    // 2. set connect_timer to retry if the broker doesn't respond
    this->connect_timer.expires_after(5s);
    this->connect_timer.async_wait(
        [this](const auto& error) {
            if (error) { // cancelled
                return;
            }
            // If connect_timer fires, call async_force_disconnect().
            // The callback handler is not used here.
            // During or after async_force_disconnect processing,
            // the async_connect() callback handler (step 4) is called with a failure.
            this->tls_async_client->async_force_disconnect();
        }
    );
}
```
See https://github.com/redboltz/mqtt_cpp/wiki/Sync-vs-Async#connect-error-handling-1. In your case, steps 1 to 4 finished successfully, but step 5 was never reached, and no error happened. If my updated code solves your issue, then I will add it to the wiki documentation; it could be one of the best practices for the connection sequence.
I created #869 (merged). The test broker now supports a no-CONNACK-response emulation mode, and I added a test using that mode. So you can write the reconnect code in the error handler; I guess you have already written it to reconnect after the MQTT connection is established. Also, I merged #870.
Hi @redboltz,
Now v12.0.0 has been released. |
Hi,
My application crashes with a segmentation fault when I use async_client with SSL support.
We managed to reproduce the issue in our environment by performing a sequence of steps (which I can't explain here), and the issue does not occur without SSL. Possibly this bug is in ASIO and not in the MQTT library.
During my investigations, I discovered a very similar issue that I wanted to share: xtreemfs/xtreemfs#281. My understanding is that the issue is related to the lifetime of the boost asio socket, so I cannot conclude whether it is really an ASIO bug or a misuse by the MQTT library.