Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System test broker with crash #754

Merged
merged 4 commits into from
Dec 9, 2020
Merged

Conversation

redboltz
Copy link
Owner

@redboltz redboltz commented Dec 9, 2020

It is for test. st_broker is unfinished. It is expected.

@redboltz
Copy link
Owner Author

redboltz commented Dec 9, 2020

I fixed the issue.
The main fix is close the socket from the broker side when connection is overwritten.

I was working on shared subscriptions. See #752
It separates broker components by their responsibility.
It is easier for me that applying the assertion fail fix to #752, so I did it. (and merged)

After I fixed the assertion fail, I experienced another issues.

  1. exception boost::bad_get is thrown.
  2. thread unsafe error at logger on the client side.
  3. try to write connack message to closed socket

Also I fixed them.

@redboltz
Copy link
Owner Author

redboltz commented Dec 9, 2020

1c31f90
This is the fix.

@redboltz
Copy link
Owner Author

redboltz commented Dec 9, 2020

It seems that the assertion fail still happens on CI
https://github.com/redboltz/mqtt_cpp/pull/754/checks?check_run_id=1521705485#step:6:1738

I couldn't reproduce it on my local environment, so far.
Before the fix, it is frequently happened.

Need more investigation...

@redboltz
Copy link
Owner Author

redboltz commented Dec 9, 2020

I successfully reproduced the assertion fail on my local environment.
And (maybe) I understand why the error happens.

It can avoid event handler call after close.
It requires a little big stack.
If the packet contains N properties, then call stack complexity is O(N).
N times recursive call of property parsing function.
I'm not sure it is acceptable.
@redboltz
Copy link
Owner Author

redboltz commented Dec 9, 2020

I removed the post() on parse.
The post() is introduced at #337 . It is zero copy receiving mechanism.

Design choice:

Removing post

  • pros
    • simple
    • fast
  • cons
    • increase stack consumption
      • when async_read is called during packet parsing, then stack consumption is reset.
      • Tail recursion optimization could happen? I'm not sure.

Remain post and check the connection status in the post handler

  • pros
    • small stack consumption
  • cons
    • complicated
      • need to make sure the error handler never call twice or more

Anyway, please check the PR on your environemnt.

@codecov
Copy link

codecov bot commented Dec 9, 2020

Codecov Report

Merging #754 (77fe3fc) into master (5c2052b) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #754      +/-   ##
==========================================
- Coverage   85.52%   85.50%   -0.02%     
==========================================
  Files          60       60              
  Lines        8557     8521      -36     
==========================================
- Hits         7318     7286      -32     
+ Misses       1239     1235       -4     

@redboltz
Copy link
Owner Author

redboltz commented Dec 9, 2020

CI reported expected results.

@kleunen
Copy link
Contributor

kleunen commented Dec 9, 2020

Yes, it is stable now.

I guess my test for the broker does not properly shutdown.
finish() on broker is not called. But you will fix this ?

@redboltz
Copy link
Owner Author

redboltz commented Dec 9, 2020

Thank you for checking. I will fix the test too.
I will use remove post() approach. If there is some stack overflow related issue would happen in the future, I will consider the other approach. I guess the current approach works well on general platform including embedded one.

@kleunen
Copy link
Contributor

kleunen commented Dec 9, 2020

I still get an assertion with the python script:
Assertion failed: it != idx.end(), file C:\data\projecten\mqtt_cpp\include\mqtt/broker/broker.hpp, line 1081

The C++ test runs fine now. But I guess there is still some race condition somewhere

@redboltz
Copy link
Owner Author

redboltz commented Dec 9, 2020

I fixed and renamed the test.

Now, we can do test again and again until the test fails using the following command.

while ./st_issue_749; do done  

Note: I'm using zsh.

I'm not sure I can do the same thing on the gdb. I want to do that.

But I figured out what is happening at the failure case.

I inserted the following debug print code just after

if (!check_error_and_transferred_length(ec, bytes_transferred, remaining_length_)) return;

                    std::cout << connected_ << std::endl;

In success case, it outputs 1 (true), but failure case, it outputs 0 (false).
I will consider how to fix the problem. But the most difficult part (understand what is happening) is finished :)

@kleunen
Copy link
Contributor

kleunen commented Dec 9, 2020

Well, it is difficult to validate all the different cases in which the race condition occurs. You can not really create a test, because it depends on the timing of events. Hopefully it fails when running multiple times.

Even if async_read()'s error_code is not error, the connection has
already been disconnected then regard to error.
@kleunen
Copy link
Contributor

kleunen commented Dec 9, 2020

I think I may have another test case:

terminate called after throwing an instance of 'mqtt::packet_id_exhausted_error'
what(): packet_id exhausted error
Aborted

@redboltz
Copy link
Owner Author

redboltz commented Dec 9, 2020

I added the commit to fix the issue.
The assertion fail doesn't happen in 20 minutes.

@kleunen
Copy link
Contributor

kleunen commented Dec 9, 2020

I added the commit to fix the issue.
The assertion fail doesn't happen in 20 minutes.

i will try also

@redboltz
Copy link
Owner Author

redboltz commented Dec 9, 2020

I think I may have another test case:

terminate called after throwing an instance of 'mqtt::packet_id_exhausted_error'
what(): packet_id exhausted error
Aborted

It seems that a different topic. Please create new issue.
If the PR works well on your environment, I will merge the PR and close the issue.

@redboltz
Copy link
Owner Author

redboltz commented Dec 9, 2020

Well, it is difficult to validate all the different cases in which the race condition occurs. You can not really create a test, because it depends on the timing of events. Hopefully it fails when running multiple times.

In order to make a subtle timing, my proprietary broker has mocking mechanism. And tons of tests.
I can manage tons of tests using perl script. e.g. Breaking change for publish interface requires most of test updates.
Our team share the know how.
But I think that the same approach is not practical for mqtt_cpp because it requires a lot of work.

mqtt_cpp broker is originally test_broker. And it has delay mechanism for tests. I think that adding more delay mechanism is easier way to make subtle timing.

It is low priority for me. I need to implement a lot of functionalities.

@kleunen
Copy link
Contributor

kleunen commented Dec 9, 2020

Yes, it seems stable now, with python script also. I will do some more testing

@redboltz redboltz merged commit 7ef56d1 into master Dec 9, 2020
@redboltz redboltz deleted the kleunen-system_test_broker branch December 9, 2020 14:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants