System test broker with crash #754

redboltz · 2020-12-09T03:53:12Z

This PR is for testing. st_broker is unfinished, which is expected.

redboltz · 2020-12-09T03:59:40Z

I fixed the issue.
The main fix is to close the socket from the broker side when a connection is overwritten.

I was working on shared subscriptions (see #752).
That PR separates broker components by responsibility.
It was easier for me to apply the assertion failure fix to #752, so I did that (and merged it).

After fixing the assertion failure, I ran into further issues:

- a boost::bad_get exception was thrown
- a thread-safety error in the logger on the client side
- an attempt to write a CONNACK message to a closed socket

I fixed them as well.

redboltz · 2020-12-09T04:23:05Z

1c31f90
This is the fix.

redboltz · 2020-12-09T04:24:46Z

It seems that the assertion failure still happens on CI:
https://github.com/redboltz/mqtt_cpp/pull/754/checks?check_run_id=1521705485#step:6:1738

So far I have not been able to reproduce it in my local environment.
Before the fix, it happened frequently.

More investigation is needed...

redboltz · 2020-12-09T05:13:23Z

I successfully reproduced the assertion failure in my local environment.
And (maybe) I understand why the error happens.

Removing the post() call from the packet parse process avoids event handler calls after close. It requires a somewhat larger stack: if a packet contains N properties, the property parsing function is called recursively N times, so the call stack depth is O(N). I'm not sure whether that is acceptable.

redboltz · 2020-12-09T05:39:22Z

I removed the post() call from parsing.
The post() call was introduced in #337 (the zero-copy receiving mechanism).

Design choice:

Removing post()

pros
- simple
- fast
cons
- increased stack consumption
  - when async_read is called during packet parsing, stack consumption is reset
  - tail-call optimization might apply, but I'm not sure

Keeping post() and checking the connection status in the post handler

pros
- small stack consumption
cons
- complicated
  - need to make sure the error handler is never called more than once

Anyway, please check the PR in your environment.

codecov · 2020-12-09T05:46:10Z

Codecov Report

Merging #754 (77fe3fc) into master (5c2052b) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #754      +/-   ##
==========================================
- Coverage   85.52%   85.50%   -0.02%     
==========================================
  Files          60       60              
  Lines        8557     8521      -36     
==========================================
- Hits         7318     7286      -32     
+ Misses       1239     1235       -4

redboltz · 2020-12-09T05:55:50Z

CI reported expected results.

kleunen · 2020-12-09T07:26:38Z

Yes, it is stable now.

I guess my test for the broker does not shut down properly.
finish() is not called on the broker. But you will fix this?

redboltz · 2020-12-09T07:34:17Z

Thank you for checking. I will fix the test too.
I will go with the approach of removing post(). If a stack-overflow-related issue comes up in the future, I will consider the other approach. I think the current approach works well on general platforms, including embedded ones.

kleunen · 2020-12-09T07:40:26Z

I still get an assertion failure with the Python script:
Assertion failed: it != idx.end(), file C:\data\projecten\mqtt_cpp\include\mqtt/broker/broker.hpp, line 1081

The C++ test runs fine now. But I guess there is still a race condition somewhere.

Renamed test.

redboltz · 2020-12-09T08:49:35Z

I fixed and renamed the test.

Now we can run the test repeatedly until it fails, using the following command:

while ./st_issue_749; do done

Note: I'm using zsh. bash rejects an empty loop body, so there use "do :; done" instead.

I'm not sure whether I can do the same thing under gdb, but I would like to.

But I figured out what is happening in the failure case.

I inserted a debug print just after the following line:

mqtt_cpp/include/mqtt/endpoint.hpp

Line 6504 in cf6e816

    if (!check_error_and_transferred_length(ec, bytes_transferred, remaining_length_)) return;

The debug print:

    std::cout << connected_ << std::endl;

In the success case it outputs 1 (true), but in the failure case it outputs 0 (false).
I will consider how to fix the problem. But the most difficult part (understanding what is happening) is done :)

kleunen · 2020-12-09T08:52:23Z

Well, it is difficult to validate all the different cases in which the race condition occurs. You can not really create a test, because it depends on the timing of events. Hopefully it fails when running multiple times.

Even if async_read()'s error_code indicates no error, treat the result as an error when the connection has already been disconnected.

kleunen · 2020-12-09T10:16:21Z

I think I may have another test case:

terminate called after throwing an instance of 'mqtt::packet_id_exhausted_error'
what(): packet_id exhausted error
Aborted

redboltz · 2020-12-09T10:18:33Z

I added a commit to fix the issue.
The assertion failure has not occurred in 20 minutes of running.

kleunen · 2020-12-09T10:19:31Z

> I added a commit to fix the issue.
> The assertion failure has not occurred in 20 minutes of running.

I will try it too.

redboltz · 2020-12-09T10:19:39Z

> I think I may have another test case:
>
> terminate called after throwing an instance of 'mqtt::packet_id_exhausted_error'
> what(): packet_id exhausted error
> Aborted

That seems to be a different topic. Please create a new issue.
If the PR works well in your environment, I will merge the PR and close the issue.

redboltz · 2020-12-09T10:25:37Z

> Well, it is difficult to validate all the different cases in which the race condition occurs. You can not really create a test, because it depends on the timing of events. Hopefully it fails when running multiple times.

To produce subtle timing, my proprietary broker has a mocking mechanism and tons of tests.
I manage those tests with Perl scripts; e.g. a breaking change to the publish interface requires updating most of the tests.
Our team shares that know-how.
But I think the same approach is not practical for mqtt_cpp because it requires a lot of work.

The mqtt_cpp broker was originally test_broker, and it has a delay mechanism for tests. I think adding more delay mechanisms is an easier way to produce subtle timing.

It is low priority for me; I still need to implement a lot of functionality.

kleunen · 2020-12-09T10:40:24Z

Yes, it seems stable now, with the Python script as well. I will do some more testing.

redboltz mentioned this pull request Dec 9, 2020

System test broker with crash #753

Closed

System test broker with crash

ae6a066

redboltz force-pushed the kleunen-system_test_broker branch from daf0c61 to ae6a066 on December 9, 2020 03:59

Removed post() call from packet parse process.

f8395a4

Removing the post() call avoids event handler calls after close. It requires a somewhat larger stack: if a packet contains N properties, the property parsing function is called recursively N times, so the call stack depth is O(N). I'm not sure whether that is acceptable.

redboltz force-pushed the kleunen-system_test_broker branch from 4e4ab20 to f8395a4 on December 9, 2020 05:27

Fixed test shutdown code.

cf6e816

Renamed test.

Fixed close or error detection logic.

77fe3fc

Even if async_read()'s error_code indicates no error, treat the result as an error when the connection has already been disconnected.

redboltz merged commit 7ef56d1 into master Dec 9, 2020

redboltz deleted the kleunen-system_test_broker branch December 9, 2020 14:08
