
Segmentation fault in async_client only with SSL support #856

Closed
ropieur opened this issue Sep 9, 2021 · 33 comments

ropieur commented Sep 9, 2021

Hi,
My application crashes with a segmentation fault when I use async_client with SSL support.
We managed to reproduce the issue in our environment by performing a sequence of steps (which I can't detail here), and the issue does not occur without SSL. Possibly this bug is in ASIO rather than in the MQTT library.
During my investigations I came across a very similar issue that I wanted to share: xtreemfs/xtreemfs#281. My understanding is that the issue is related to the lifetime of the Boost.Asio socket, so I cannot conclude whether it is really an ASIO bug or a misuse by the MQTT library.

redboltz (Owner) commented Sep 9, 2021

Could you post a Minimal, Reproducible Example?
https://stackoverflow.com/help/minimal-reproducible-example


ropieur commented Sep 9, 2021

Hi redboltz,

Thank you for your quick reaction.
I was expecting such a request ;-) I'll need to discuss internally whether I can get the budget ...

In the meantime, here is the stack trace from the crash; you can see that it is quite similar to the one reported in xtreemfs/xtreemfs#281. It occurs after a disconnection from the broker.

#0  0x00007f649001cee0 in ?? ()
#1  0x00007f64cfb79a4a in BIO_write () from /home/cbo/debug-dmm-on-diasimul/20210909T111110/usr/local/dmm/lib/libcrypto.so.1.1
#2  0x0000000000e9d34f in boost::asio::ssl::detail::engine::put_input(boost::asio::const_buffer const&) [clone .isra.124] ()
#3  0x0000000000f0ea39 in boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor>, boost::asio::ssl::detail::read_op<boost::asio::mutable_buffer>, boost::asio::detail::read_op<boost::asio::ssl::stream<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor> >, boost::asio::mutable_buffer, boost::asio::mutable_buffer const*, boost::asio::detail::transfer_all_t, boost::asio::executor_binder<std::function<void (boost::system::error_code, unsigned long)>, boost::asio::io_context::strand> > >::operator()(boost::system::error_code, unsigned long, int) ()
#4  0x0000000000ea0cba in void boost::asio::io_context::strand::dispatch<boost::asio::detail::binder2<boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor>, boost::asio::ssl::detail::read_op<boost::asio::mutable_buffer>, boost::asio::detail::read_op<boost::asio::ssl::stream<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor> >, boost::asio::mutable_buffer, boost::asio::mutable_buffer const*, boost::asio::detail::transfer_all_t, boost::asio::executor_binder<std::function<void (boost::system::error_code, unsigned long)>, boost::asio::io_context::strand> > >, boost::system::error_code, unsigned long>, std::allocator<void> >(boost::asio::detail::binder2<boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor>, boost::asio::ssl::detail::read_op<boost::asio::mutable_buffer>, boost::asio::detail::read_op<boost::asio::ssl::stream<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor> >, boost::asio::mutable_buffer, boost::asio::mutable_buffer const*, boost::asio::detail::transfer_all_t, boost::asio::executor_binder<std::function<void (boost::system::error_code, unsigned long)>, boost::asio::io_context::strand> > >, boost::system::error_code, unsigned long>&&, std::allocator<void> const&) const [clone .isra.6872] ()
#5  0x0000000000f1048d in boost::asio::detail::completion_handler<boost::asio::detail::binder2<boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor>, boost::asio::ssl::detail::read_op<boost::asio::mutable_buffer>, boost::asio::detail::read_op<boost::asio::ssl::stream<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor> >, boost::asio::mutable_buffer, boost::asio::mutable_buffer const*, boost::asio::detail::transfer_all_t, boost::asio::executor_binder<std::function<void (boost::system::error_code, unsigned long)>, boost::asio::io_context::strand> > >, boost::system::error_code, unsigned long>, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0u> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) ()
#6  0x0000000000eae349 in boost::asio::detail::strand_service::do_dispatch(boost::asio::detail::strand_service::strand_impl*&, boost::asio::detail::scheduler_operation*) ()
#7  0x0000000000ea0dbc in void boost::asio::io_context::strand::dispatch<boost::asio::detail::binder2<boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor>, boost::asio::ssl::detail::read_op<boost::asio::mutable_buffer>, boost::asio::detail::read_op<boost::asio::ssl::stream<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor> >, boost::asio::mutable_buffer, boost::asio::mutable_buffer const*, boost::asio::detail::transfer_all_t, boost::asio::executor_binder<std::function<void (boost::system::error_code, unsigned long)>, boost::asio::io_context::strand> > >, boost::system::error_code, unsigned long>, std::allocator<void> >(boost::asio::detail::binder2<boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor>, boost::asio::ssl::detail::read_op<boost::asio::mutable_buffer>, boost::asio::detail::read_op<boost::asio::ssl::stream<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor> >, boost::asio::mutable_buffer, boost::asio::mutable_buffer const*, boost::asio::detail::transfer_all_t, boost::asio::executor_binder<std::function<void (boost::system::error_code, unsigned long)>, boost::asio::io_context::strand> > >, boost::system::error_code, unsigned long>&&, std::allocator<void> const&) const [clone .isra.6872] ()
#8  0x0000000000f1020f in boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffer, boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor>, boost::asio::ssl::detail::read_op<boost::asio::mutable_buffer>, boost::asio::detail::read_op<boost::asio::ssl::stream<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor> >, boost::asio::mutable_buffer, boost::asio::mutable_buffer const*, boost::asio::detail::transfer_all_t, boost::asio::executor_binder<std::function<void (boost::system::error_code, unsigned long)>, boost::asio::io_context::strand> > >, boost::asio::any_io_executor>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) ()
#9  0x0000000000c13ab5 in boost::asio::detail::scheduler::run(boost::system::error_code&) ()
#10 0x0000000000ea31bf in Mobility::DMM::Mqtt::Publisher::run() ()
#11 0x00000000013dd1eb in thread_proxy ()
#12 0x00007f64cf6a1f94 in start_thread (arg=0x7f6499ffb700) at pthread_create.c:309
#13 0x00007f64ce9393cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


ropieur commented Sep 9, 2021

Here is the interesting part of the xtreemfs ticket:

It was caused by the fact that newer boost (?) versions try to write to the socket, even after the SSL stream was shutdown. However, the socket was already deleted when the connection was closed. Usually, programs using boost::asio use a shared_ptr for the socket and are therefore not affected by such obstacles. However, we once agreed to avoid using shared_ptr and therefore do not delete the socket object before the connection is closed. Evidently, this isn't enough.

Maybe it can give you some hints.


redboltz commented Sep 9, 2021

I just looked at the issue and its solution. Unfortunately, I didn't get any useful information from it.
I use the TLS and async_client combination in many applications, but no errors have been reported. That combination is also part of the CI tests.

I will start some analysis once you post a Minimal, Reproducible Example.


redboltz commented Sep 9, 2021

I also recommend running valgrind and/or AddressSanitizer on your application. You would get some information from them.


redboltz commented Sep 9, 2021

I noticed that #78 might be causing the issue.
I have an idea to improve the #78 fix, related to the async_shutdown call mentioned there.
When I send a PR for that, could you check it on your reproducible project?


ropieur commented Sep 9, 2021

@redboltz, yes, sure, I'll give it a try.


redboltz commented Sep 9, 2021

@ropieur
Could you tell me your environment information?
OS version, OpenSSL version, Boost version, and mqtt_cpp version (if it is a released version; otherwise the commit hash).


redboltz commented Sep 9, 2021

Additional questions:

You said that you are using async_client. Please tell me about the disconnecting situation.

  1. Which factory function is used to create the async_client?
  2. In the problem situation, is the disconnect initiated by the broker or by the client?
     If by the client, which API are you using?
  3. Is your application multi-threaded?

redboltz added commits that referenced this issue (Sep 10–11, 2021): Implemented clean shutdown.

ropieur commented Sep 11, 2021

Hi redboltz,

  1. mqtt::make_tls_async_client
  2. The disconnect is initiated by the broker.
  3. The client resides inside an active object (a single thread running an io_context). All calls from other threads are posted to the io_context; there is NO direct call to the client from another thread. A minimal sketch of this pattern is shown below.
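As a rough illustration, here is a minimal sketch of that active-object pattern (the class name matches my later snippets, but the body here is illustrative only, not our actual code):

    #include <boost/asio.hpp>
    #include <thread>
    #include <utility>

    // Active object: one io_context running on one dedicated thread; every call
    // coming from another thread is posted onto it, so the MQTT client is only
    // ever touched from that single thread.
    class mqtt_active_object {
    public:
        mqtt_active_object()
            : work_guard_(boost::asio::make_work_guard(ioc_))
            , thread_([this] { ioc_.run(); })
        {}

        ~mqtt_active_object() {
            work_guard_.reset();
            ioc_.stop();
            thread_.join();
        }

        // May be called from any thread; the work itself runs on the io_context thread.
        template <typename F>
        void post(F&& f) {
            boost::asio::post(ioc_, std::forward<F>(f));
        }

    private:
        boost::asio::io_context ioc_;
        boost::asio::executor_work_guard<boost::asio::io_context::executor_type> work_guard_;
        std::thread thread_;
    };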

redboltz commented:

Thank you.
Please answer the question #856 (comment) too.

The current master contains some fixes to the shutdown process.
https://github.com/redboltz/mqtt_cpp/tree/11d26fae1a6c87466a1eab02a494c89e18f84a1d

It still doesn't cleanly shut down the TLS layer, but on the TCP layer it now performs not only socket close() but also socket shutdown(). It might solve your problem.

redboltz added a commit that referenced this issue Sep 12, 2021: Implemented clean shutdown.

ropieur commented Sep 12, 2021

OS: Ubuntu 20.04
Boost: 1.76 and 1.77
mqtt_cpp: 8.0.0 and 11.1.0
OpenSSL: 1.1.1


redboltz commented Sep 13, 2021

Thank you.

| layer | operation | implemented mqtt_cpp version |
|---|---|---|
| WebSocket | send close packet, then read until closed or error | #860 |
| TLS | (async_)shutdown | #860 |
| TCP | shutdown (no async version exists) | #860, 11d26fa |
| TCP | close (no async version exists) | #860, 11d26fa, 11.1.0, 8.0.0 |

Due to #78, versions 11.1.0 and 8.0.0 only close the TCP socket. That has worked well so far, but I guess it could cause the problem you reported.
Because of the missing TCP shutdown, #848 was found.
11d26fa (the current master HEAD) adds the TCP shutdown. It might solve your problem.

Finally, I implemented graceful shutdown for all layers in #860. It is not merged yet. While implementing #860, I ran into a lot of blocking problems. I will test #860 on my proprietary project's production code before merging it.

Could you test the following two versions?

  1. 11d26fa
  2. #860 ("Fixed #856")

redboltz commented:

The PR #860 works fine on my project. I'm going to merge it.
But I haven't experienced your problem on my project, so although I think #860 should be merged, I'm not sure it will solve your problem.
@ropieur, is your problem solved?


ropieur commented Sep 17, 2021

@redboltz, we are working on it. I delivered a package internally built from the branch fix_856. We need to resolve issues on our platform and then we will retest the disconnection from the broker. It may take multiple days before we get there. Anyway, I'll keep you posted with the result in this ticket.

redboltz commented:

@ropieur thank you! I will wait for your feedback :)


ropieur commented Sep 17, 2021

We managed to reproduce the crash with the branch fix_856, and I get the exact same backtrace from the core dump.
FYI, it is quite difficult to reproduce: the tester needs to perform a sequence of manipulations multiple times, because the issue does not necessarily show up immediately.


ropieur commented Sep 17, 2021

In the error handler, my app typically performs the following operations:

this->tls_async_client->set_error_handler([this](const auto &error) { // triggered error = connection reset by peer
    this->tls_async_client->async_force_disconnect();
    this->steady_timer.expires_after(2s);
    this->steady_timer.async_wait([this](const auto &error) {
        if (error) return; // timer canceled
        this->tls_async_client->async_connect();
    });
});

Question:
May I invoke async_force_disconnect from the error handler? (Originally, the code invoked force_disconnect)


redboltz commented Sep 17, 2021

set_error_handler's callback is called from on_error() as follows:

bool handle_close_or_error(error_code ec) {
    if (connected_) {
        if (!ec) return false;
        MQTT_LOG("mqtt_impl", trace)
            << MQTT_ADD_VALUE(address, this)
            << "handle_close_or_error call shutdown";
        shutdown(socket());
    }
    connect_requested_ = false;
    clean_sub_unsub_inflight();
    if (disconnect_requested_) {
        on_close();
        disconnect_requested_ = false;
    }
    else {
        if (!ec) ec = boost::system::errc::make_error_code(boost::system::errc::not_connected);
        on_error(ec);

At line 5051, shutdown() is called internally, so you don't need to call async_force_disconnect().

So I recommend the following code:

this->tls_async_client->set_error_handler([this](const auto &error) { // triggered error = connection reset by peer

    // remove async_force_disconnect()

    this->steady_timer.expires_after(2s);
    this->steady_timer.async_wait([this](const auto &error) { // capture a weak_ptr to this to avoid firing after cancellation
        if (error) return; // timer canceled
        this->tls_async_client->async_connect();
    });
});

The timer can fire with a success error code even after it has been cancelled (destroyed):
https://stackoverflow.com/questions/43045192/how-to-avoid-firing-already-destroyed-boostasiodeadline-timer
It happens very rarely.
So you need to make sure of the objects' lifetimes in the handler.

The typical approach is for the async_wait handler to capture a weak_ptr to this (when this is owned by a shared_ptr).
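For example, a minimal sketch of that pattern (assuming the owning class is held by a shared_ptr; the class and member names are placeholders, not mqtt_cpp API):

    #include <boost/asio.hpp>
    #include <chrono>
    #include <memory>

    using namespace std::chrono_literals;

    // Placeholder class, only to illustrate the weak_ptr capture in async_wait.
    struct reconnector : std::enable_shared_from_this<reconnector> {
        explicit reconnector(boost::asio::io_context& ioc) : timer(ioc) {}

        void schedule_reconnect() {
            timer.expires_after(2s);
            // Capture a weak_ptr so the handler is a no-op if the owner has
            // already been destroyed by the time the timer fires.
            timer.async_wait(
                [wp = std::weak_ptr<reconnector>(shared_from_this())]
                (boost::system::error_code const& ec) {
                    auto self = wp.lock();
                    if (!self) return; // owner already gone
                    if (ec) return;    // timer cancelled
                    self->do_connect();
                }
            );
        }

        void do_connect() { /* e.g. tls_async_client->async_connect() */ }

        boost::asio::steady_timer timer;
    };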
For comparison, this is how mqtt_cpp's own async_force_disconnect() keeps the endpoint alive:

void async_force_disconnect(
    async_handler_t func = {}
) {
    MQTT_LOG("mqtt_api", info)
        << MQTT_ADD_VALUE(address, this)
        << "async_force_disconnect";
    socket_->post(
        [this, self = this->shared_from_this(), func = force_move(func)] {
            shutdown(socket());
            if (func) func(boost::system::errc::make_error_code(boost::system::errc::success));
        }
    );
}

FYI:

socket_->post() is:

MQTT_ALWAYS_INLINE void post(std::function<void()> handler) override final {
    as::post(
        strand_,
        force_move(handler)
    );
}

So it is protected by the strand, and it can be called after the socket has been closed.

Calling force_disconnect() in the timer handler is dangerous in a multi-threaded environment.

void force_disconnect() {
    MQTT_LOG("mqtt_api", info)
        << MQTT_ADD_VALUE(address, this)
        << "force_disconnect";
    shutdown(socket());
}

Because the timer handler's code can run on any thread, it is not protected by the strand, so users previously needed to post through the strand themselves. Now, async_force_disconnect() does that for you.
If you want the async_client to disconnect forcibly, then call async_disconnect().
Users don't need to call async_force_disconnect() in the close and/or error handler.

shutdown() has a guard to avoid being called twice:

void shutdown(MQTT_NS::socket& s) {
    MQTT_LOG("mqtt_impl", trace)
        << MQTT_ADD_VALUE(address, this)
        << "shutdown";
    if (shutdown_requested_) {
        MQTT_LOG("mqtt_impl", trace)
            << MQTT_ADD_VALUE(address, this)
            << "already shutdowned";
        return;
    }
    shutdown_requested_ = true;

redboltz commented:

FYI, it is quite difficult to reproduce it, because the tester needs to perform a sequence of manipulations multiple times to reproduce it (because the issue does not necessarily reproduce immediately).

This is a TLS client and server example:
https://github.com/redboltz/mqtt_cpp/blob/master/example/tls_both.cpp

This is an async client example:
https://github.com/redboltz/mqtt_cpp/blob/master/example/no_tls_async_client.cpp

This is a simple connect/subscribe/publish/disconnect example, but you can create a small reproducer as follows:

  1. Async client: async_connect()
  2. Server: force_disconnect()
  3. Async client: set a 2s timer and async_wait()
  4. On timer fired: async_connect(), go to step 2

It might reproduce the crash in your environment.
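As a rough sketch, the client side of those steps could look like this (based on the no_tls_async_client.cpp example above; host, port, and client id are placeholders, and the TLS case would use make_tls_async_client plus the TLS setup from tls_both.cpp instead):

    #include <mqtt_client_cpp.hpp>
    #include <boost/asio.hpp>
    #include <chrono>

    int main() {
        using namespace std::chrono_literals;
        boost::asio::io_context ioc;
        boost::asio::steady_timer timer(ioc);

        // Step 1: create the async client and connect (placeholder host/port).
        auto c = MQTT_NS::make_async_client(ioc, "localhost", "1883");
        c->set_client_id("repro");
        c->set_clean_session(true);

        // Steps 3/4: when the broker force-disconnects (step 2), wait 2s,
        // then reconnect and let the broker disconnect again.
        auto reconnect = [&] {
            timer.expires_after(2s);
            timer.async_wait([&](boost::system::error_code ec) {
                if (ec) return;     // timer cancelled
                c->async_connect(); // back to step 2
            });
        };
        c->set_close_handler(reconnect);
        c->set_error_handler([&](MQTT_NS::error_code) { reconnect(); });

        c->async_connect();
        ioc.run();
    }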

redboltz commented:

We managed to reproduce the crash with the branch fix_856 and I get the exact same backtrace from the coredump.

I suspect that the memory corruption happens somewhere in your application.
The backtrace doesn't contain any mqtt_cpp code.

#856 (comment)

Did valgrind and/or address sanitizer report nothing?

redboltz added a commit that referenced this issue Sep 20, 2021
redboltz reopened this Sep 20, 2021

ropieur commented Sep 21, 2021

Hi @redboltz,

The timer can fire with a success error code even after it has been cancelled (destroyed):
https://stackoverflow.com/questions/43045192/how-to-avoid-firing-already-destroyed-boostasiodeadline-timer
It happens very rarely.
So you need to make sure of the objects' lifetimes in the handler.

I'm aware of the various traps with asio, like the lifetime of the objects passed to the handlers, and the fact that cancelling socket and timer operations does not necessarily lead to an operation_aborted error in their handlers. I will assume for now that the corruption resides in my application and continue investigating on my side. My apologies if I took up too much of your time.
Thank you for your responsiveness and support, really appreciated!

redboltz commented:

OK, if you find the root cause of the issue on your application side, then please close this issue.

NOTE: #860 has been merged. It was worth merging for the clean (graceful) shutdown, even if it is not the cause of the issue.


ropieur commented Sep 23, 2021

Hi @redboltz,
Before invoking async_connect, my app starts a timer (handling the connection timeout) that will drop the current connection attempt and initiate a new one:

    void mqtt_active_object::connect()
    {
        this->timer.expires_after(5s);
        this->timer.async_wait([this](const auto &error) {
            if (error) { // canceled
                return;
            }
            if (!this->tls_async_client->connected()) { // skip if client connected meanwhile
                // this->tls_async_client->force_disconnect(); // with mqtt_cpp v8.0
                this->tls_async_client->async_force_disconnect(); // with mqtt_cpp v11.1 or later
                this->connect();
            }
        });
        this->tls_async_client->async_connect();
    }

Based on my observations, my feeling is that the sequence of [async_]force_disconnect followed by async_connect could potentially lead to the bug. With async_force_disconnect, the crash now appears almost immediately.
Could you tell me whether this usage is formally correct?

Note: timer and async_client both run on the same io_context which itself runs in a single thread (active object)

redboltz commented:

It is very difficult to understand. My first impression is that this is tricky code.

It seems to be a different situation from

Disconnect is initiated by the broker

#856 (comment)

Anyway, if this->tls_async_client->connected() returns false, then this->tls_async_client->async_force_disconnect() and this->tls_async_client->async_connect() run concurrently. Even if they share the same io_context and run on one thread, that is bad, because mqtt_cpp may run multiple async operations inside those APIs; this depends on the implementation and could change.
I'm not sure why async_force_disconnect() is required, but you can know when async_force_disconnect() has finished: simply pass a completion handler, async_force_disconnect(your_completion_handler), and call async_connect() in that handler.
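For example, a minimal sketch of that suggestion, dropped into the timer handler from the snippet above (assuming this stays alive for the duration of the operation):

    // Chain the reconnect onto the completion of async_force_disconnect()
    // so the two async operations never overlap (sketch only).
    this->tls_async_client->async_force_disconnect(
        [this](mqtt::error_code /*ec*/) {
            this->tls_async_client->async_connect();
        }
    );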

redboltz commented:

I noticed one more thing.
For async_connect(), you should add a completion handler.
See
https://github.com/redboltz/mqtt_cpp/blob/master/example/no_tls_async_client.cpp

All async APIs take a completion handler whose type is async_handler_t.
It should be added so that you know when the async operation completes.
That can eliminate query functions like connected().
It is the async coding style.

redboltz commented:

I noticed that after #860 was merged, the async_force_disconnect() handler is called before all shutdown processing has completed.
That could be wrong; I will take a look at it.
But it seems that it is not related to your issue.

redboltz commented:

I created #868, but I guess it is not directly related to this issue.

I considered what your code wants to do, and I assume you want to handle the case where the broker doesn't respond.
I updated your code as follows:

    void mqtt_active_object::connect()
    {
        // 1. call async_connect()
        this->tls_async_client->async_connect(
            [this](mqtt::error_code ec) {
                if (ec) {
                    // 4. The ec is set by async_connect() process, or
                    //    by the result of async_force_disconnect() call
                    //    Reconnect 3s later
                    this->retry_timer.expires_after(3s);
                    this->retry_timer.async_wait(
                        [this](const auto &error) {
                            if (error) { // canceled
                                return;
                            }
                            // 5. retry (goto step1)
                            this->connect();
                        }
                    );
                }
                else {
                    // 3. If async_connect() is succesfully finished,
                    //    cancel connect_timer.
                    this->connect_timer.cancel();
                }
            }
        );
        // 2. set connect_timer to retry if broker won't response
        this->connect_timer.expires_after(5s);
        this->connect_timer.async_wait(
            [this](const auto &error) {
                if (error) { // canceled
                    return;
                }
                // 3. if connect_timer is fired, call async_force_disconnect().
                //    Callback handler is not used.
                //    During or after async_force_disconnect processing,
                //    1. async_connect() callback handler is called with failure
                this->tls_async_client->async_force_disconnect();
            }
        );
    }

The important point is to avoid multiple async_connect() calls; that could cause UB.
So the retry code should be in the async_connect() handler.
If something goes wrong, the async_connect() handler is called with an error code.
But that could take a long time (e.g. the broker is frozen and the OS uses a long TCP timeout).
NOTE: I couldn't reproduce the scenario.

To avoid that, call async_force_disconnect() from the client side.
But you shouldn't call async_connect() for the retry in the async_force_disconnect() handler:
calling async_force_disconnect() indirectly causes the async_connect() handler to be invoked.
So the long no-response case is handled.
Finally, the async_connect() handler is called with an error code and then retries.
I recommend setting a retry timer.


ropieur commented Sep 24, 2021

Hi @redboltz,
Thank you for the code excerpt. I'm now starting to get a better understanding of how to use the async client interface.
I implemented your proposed solution, but I've hit another issue.

async_connect succeeded (no error), but the connack_handler is never invoked (I suppose because the broker didn't ack). This leaves my client stuck, because the connect_timer has been canceled and there will be no retry connection attempt.
I decided to move the connect_timer.cancel inside the connack_handler instead, in order to prolong the timeout. The connect_timer is then triggered, which invokes async_force_disconnect. Unfortunately, this results neither in the invocation of the connect handler with an error (because it was already invoked without error, I guess), nor in the invocation of the error_handler (which is more surprising to me).
Do you have a link to documentation describing in which circumstances the different handlers are invoked (handlers passed to async_* calls and handlers registered via set_error_handler, set_connack_handler, ...)? With such documentation, I could design a robust solution for the different error scenarios.

Thank you


redboltz commented Sep 24, 2021

This is my observation.

  1. TCP connection is successfully established.
  2. TLS handshaking is successfully finished.
  3. mqtt_cpp client sends CONNECT packet (Need checking)
  4. broker doesn't send CONNACK packet

First, you need to check step 3.

You can set a breakpoint or debug output at the following lines:

switch (version_) {

socket_->async_write(

The first line sends the CONNECT packet.
The second line is the last phase of sending any packet.
Both lines should be executed.

If they are executed, the problem is on the broker side.
But even if the broker has a problem, the client should behave better.

So you can move connect_timer.cancel(); to the connack handler as follows:

    void mqtt_active_object::connect()
    {
        // 1. call async_connect()
        this->tls_async_client->async_connect(
            [this](mqtt::error_code ec) {
                if (ec) {
                    // 4. The ec is set by async_connect() process, or
                    //    by the result of async_force_disconnect() call
                    //    Reconnect 3s later
                    this->retry_timer.expires_after(3s);
                    this->retry_timer.async_wait(
                        [this](const auto &error) {
                            if (error) { // canceled
                                return;
                            }
                            // 5. retry (goto step1)
                            this->connect();
                        }
                    );
                }
            }
        );

        this->tls_async_client->set_connack_handler( // if you use v5 then set_v5_connack_handler
            [this]
            (bool sp, MQTT_NS::connect_return_code connack_return_code) {
                // 3. All underlying connections (e.g. TCP, TLS) connected, and then MQTT connected
                this->connect_timer.cancel();
                return true; // connack handlers return bool; true keeps the session going
            }
        );

        // 2. set connect_timer to retry if broker won't response
        this->connect_timer.expires_after(5s);
        this->connect_timer.async_wait(
            [this](const auto &error) {
                if (error) { // canceled
                    return;
                }
                // 3. if connect_timer is fired, call async_force_disconnect().
                //    Callback handler is not used.
                //    During or after async_force_disconnect processing,
                //    1. async_connect() callback handler is called with failure
                this->tls_async_client->async_force_disconnect();
            }
        );
    }

Do you have a link to documentation describing in which circumstances the different handlers are invoked (handlers passed to async_* calls and handlers registered via set_error_handler, set_connack_handler, ...)? With such documentation, I could design a robust solution for the different error scenarios.

See https://github.com/redboltz/mqtt_cpp/wiki/Sync-vs-Async#connect-error-handling-1

In this case, steps 1 to 4 finish successfully, but step 5 isn't reached and no error occurs.
So the connect_timer should cover steps 1 to 5.

If my updated code solves your issue, I will add documentation to the wiki. It could be one of the best practices for the connection sequence.

redboltz commented:

I created #869 (merged). The test broker now supports a no-CONNACK-response emulation mode, and I added a test using that mode.
The PR also fixed broker-side code (an invalid connack call).
The broker-side fix is not related to this issue, because you are not using the mqtt_cpp example broker.
The PR doesn't contain any client-side modification.

If async_connect()'s callback is called with a success error code but the broker doesn't send a CONNACK packet, nothing happens.
So the client side needs to set a timer. When the timer fires, call async_force_disconnect().
That performs the shutdown process on the client side. As a result of that process, the error handler (set by set_error_handler()) will be called.

So you can write the reconnect code in the error handler. I guess it has already been written, to reconnect after the MQTT connection is established.

I also merged #870.
https://github.com/redboltz/mqtt_cpp/blob/master/example/no_tls_async_client_reconnect.cpp demonstrates how to implement practical reconnecting logic.


ropieur commented Sep 27, 2021

Hi @redboltz,
I finally fixed it fully, based on your suggestions/modifications. To be clear, I didn't modify other parts of my app to get the fix, so the problem was certainly due to a misuse of the async client. It is difficult for me to tell what exactly led to the crash, but it might have been the consecutive calls to async_connect. Anyway, I'm happy that it is now completely fixed on my side.
In order to deploy my fix, I would need an official tag (v12? v11.2?) of mqtt_cpp with your latest modifications. Would that be possible?
Thank you

redboltz commented:

Now v12.0.0 has been released.
If your problem is solved, please close the issue.

ropieur closed this as completed Sep 27, 2021