Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix flakyness by replacing HashSet and HashMap with LinkedHashSet and LinkedHashMap #11941

Merged

Conversation

Nemesis123925
Copy link
Contributor

As Stated in the issue/11939

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not change ProjectPlanNode or AggregationFunctionUtils because it involves performance overhead to the production code. Can you try handling the order issue within the test?

Copy link
Contributor

@walterddr walterddr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on @Jackie-Jiang 's comment. best solution is to use a parsed out explanation on project node and do a lexi-sorting on the test side on both the expected and actual value

@Nemesis123925
Copy link
Contributor Author

Nemesis123925 commented Nov 4, 2023

Sure, let me try to do that. However, how can I handle the case which there could be random order for the nested sqls like line 1447 in ExplainPlanQueriesTest.java

@Jackie-Jiang
Copy link
Contributor

I can see a simpler way is to create a HashSet in the test and use toString() to generate the PROJECT clause since for the same java version the order should be deterministic

@Nemesis123925
Copy link
Contributor Author

I try the way that Jackie Jiang proposed to use toString() of HashSet in the test to generate the PROJECT clause. It doesn't seems to work because the NonDex tool, which we use to check the falkyness, use differet and random seed for all the HashSet. So almost in every run that the order is still not the same.

I check the way that walteraddr proposed, it seems like it needs to change the validateRows functions in QueriesTestUtils.java which is also used by other functions. Is there other better options that we can do this way?

@codecov-commenter
Copy link

codecov-commenter commented Nov 7, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (fee11d6) 61.41% compared to head (27755b8) 61.62%.
Report is 52 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #11941      +/-   ##
============================================
+ Coverage     61.41%   61.62%   +0.21%     
- Complexity      207     1150     +943     
============================================
  Files          2378     2385       +7     
  Lines        128865   129275     +410     
  Branches      19927    20016      +89     
============================================
+ Hits          79143    79671     +528     
+ Misses        44001    43808     -193     
- Partials       5721     5796      +75     
Flag Coverage Δ
custom-integration1 <0.01% <ø> (-0.01%) ⬇️
integration <0.01% <ø> (-0.01%) ⬇️
integration1 <0.01% <ø> (-0.01%) ⬇️
integration2 0.00% <ø> (ø)
java-11 61.57% <ø> (+0.19%) ⬆️
java-21 61.50% <ø> (+0.23%) ⬆️
skip-bytebuffers-false 61.59% <ø> (+0.19%) ⬆️
skip-bytebuffers-true 61.48% <ø> (+0.22%) ⬆️
temurin 61.62% <ø> (+0.21%) ⬆️
unittests 61.62% <ø> (+0.21%) ⬆️
unittests1 46.97% <ø> (+0.35%) ⬆️
unittests2 27.57% <ø> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Nemesis123925
Copy link
Contributor Author

I took a look at the failed test and don't have too much clue on how to solve that. Can I get a hint?

@Jackie-Jiang
Copy link
Contributor

I guess we need to modify the test to not test on the string value. @walterddr any suggestion?

@walterddr
Copy link
Contributor

walterddr commented Nov 8, 2023

i would suggest instead of using QueriesTestUtils.testInterSegmentsResult, we should create a test validator specific for explained results. coupling the data result and explain result in the same assert logic is the problem here

@Nemesis123925
Copy link
Contributor Author

@walterddr Do you mean that I can create a different validateRow function for ExplainPlanQueriesTest, so we can solve the nondeternimism without changing the HashSet to LinkedHashSet?

@walterddr
Copy link
Contributor

walterddr commented Nov 11, 2023

correct. explain query itself is not technically a "query" thus the response doesn't necessarily need to be interpretd/validated via the same method.

see https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN:

If you want to feed EXPLAIN's output to a program for further analysis, you should use one of its machine-readable output formats (XML, JSON, or YAML) instead.

we don't support format in explain, but i think parsing each operator and compare in unordered fashion should be suffice now. e.g. the format is always:

format:
<operator_name> ['(' <attribute> [, <attribute>] ')' ]

operator_name: 
    string

attribute:
    string | string:<attribute> | string=<attribute>

@walterddr
Copy link
Contributor

walterddr commented Nov 11, 2023

IMO at this stage we can

  • add a special handling for the operators with indeterministic attribute list (sort by lex order)
  • the rest can be just string-match

@Nemesis123925
Copy link
Contributor Author

Got it, let me try doing it.
I also have a question such that If I created a function like ValidateExplained, do I need to write the test to test ValidateExplained function?

@Nemesis123925
Copy link
Contributor Author

Also I am aware of the code coverage change of my change. Can I ask the tool we are using to detect code coverage here? so I can validate the coverage myself before updating the change

@walterddr
Copy link
Contributor

  • IMO you dont need to write test to validate test
  • i usually use intellij's Run with coverage to visualize inside the IDE, but you can also use mvn plugin to report such if that's more preferrable. there's no restriction here.

Thank you for considering the test validity and coverage! this is superb!

@Nemesis123925
Copy link
Contributor Author

I tried to change the LinkedHashSet and LinkedHashMap back to HashSet and HashMap, and found out that there is also nonderterminism in the PLAN_START node, which the numSegmentsForThisPlan also got randomized. While I am debuging this part, do you guys have any clue what is causing this nondeterminism?

@Nemesis123925
Copy link
Contributor Author

I found that a part of the non-determinism of numSegmentsForThisPlan happens insde of reduceOnDataTable, which in my understanding it compare and select the best query generated from the two servers, and add up the numSegments of them if the two queries are identical, which happens in extractUniqueExplainPlansAcrossServers. Due to the non-determinism inside of the PROJECT clause created by NonDex, the test which suppose to have the two generates queries identical will differ from the PROJECT field, thus be understood as different queries by extractUniqueExplainPlansAcrossServers. Thus the function will select the best one instead of adding their numSegments together, and create the nondeternism inside of the numSegmentsForThisPlan part.

For this part, I think the solution would be sorting the PROJECT clause before the two generated queries get into the extractUniqueExplainPlansAcrossServers function, so the function will add the numSegment up instead of choising the better one.

However this is not the only cause of the non-deternism of the numSegmentsForThisPlan part. I found that applying NonDex will change the numSegments part even before the two generqte queries get into the extractUniqueExplainPlansAcrossServers function. So I will be keep exploring that part.

@walterddr
Copy link
Contributor

i think we can separate the 2 issues, IIUC the project ordering issue is independent from the numSegmentsForThisPlan

  1. the ordering issue will occur even if only 1 server is servering, thus making sure the PROJECT node compare method in test is order invariant is the main thing.
  2. the numSegmentsForThisPlan IMO is really a "judgement call", b/c
    • technically speaking we should return ALL plans if there are multiple returns from different servers.
    • however pinot doesn't really have a "logical" vs. "physical" plan concept, where "logical" is invariant of the data layout; and "physical" should be reported on every execution path.
    • the result we are comparing here is neither truly logical or physical, that it only return a single plan ("as logical") but it consolidates from multiple server return ("physical").

hence my suggestion is we address the PROJECT ordering issue first in this PR and address the plan issue separately

CC @somandal who improved upon this part of the logic to chime in :-)

@Nemesis123925
Copy link
Contributor Author

Roger that

Here is the solution I suggest

In QueriesTestUtils.java, I write two new functions:

public static void testExplainSegmentsResult(BrokerResponseNative brokerResponse, ResultTable expectedResultTable) {
    assertEquals(brokerResponse.getResultTable().getDataSchema(), expectedResultTable.getDataSchema());
    validateExplainedRows(brokerResponse.getResultTable().getRows(), expectedResultTable.getRows());
  }

private static void validateExplainedRows(List<Object[]> actual, List<Object[]> expected) {
    assertEquals(actual.size(), expected.size());
    for (int j = 0; j < actual.size(); j++) {
      Object[] act = actual.get(j);
      Object[] exp = expected.get(j);
      String attributeToSort = "PROJECT";
      assertEquals(act.length, exp.length);
      if (act[0].toString().startsWith(attributeToSort)) {
        char[] act0 = act[0].toString().toCharArray();
        char[] exp0 = exp[0].toString().toCharArray();
        Arrays.sort(act0);
        Arrays.sort(exp0);
        act[0] = new String(act0);
        exp[0] = new String(exp0);
      }
      assertEquals(act, exp);
    }
  }

In ExplainPlanQueriesTest.java, I redirect the checkWithQueryExecutor to call testExplainSegmentsResult

If apply this fix, instead of fixing all 27 flaky tests in ExplainPlanQueriesTest.java, this will fix 14 of them, which are

org.apache.pinot.queries.ExplainPlanQueriesTest#testSelect
org.apache.pinot.queries.ExplainPlanQueriesTest#testSelectAggregate
org.apache.pinot.queries.ExplainPlanQueriesTest#testSelectAggregateUsingFilterGroupBy
org.apache.pinot.queries.ExplainPlanQueriesTest#testSelectAggregateUsingFilterGroupByVerbose
org.apache.pinot.queries.ExplainPlanQueriesTest#testSelectAggregateUsingFilterIndex
org.apache.pinot.queries.ExplainPlanQueriesTest#testSelectAggregateUsingFilterIndexGroupBy
org.apache.pinot.queries.ExplainPlanQueriesTest#testSelectAggregateUsingFilterIndexGroupByHaving
org.apache.pinot.queries.ExplainPlanQueriesTest#testSelectAggregateUsingFilterIndexGroupByOrderBy
org.apache.pinot.queries.ExplainPlanQueriesTest#testSelectAggregateUsingFilterOnTextIndexColumn
org.apache.pinot.queries.ExplainPlanQueriesTest#testSelectColumnUsingFilterOnRangeIndexColumn
org.apache.pinot.queries.ExplainPlanQueriesTest#testSelectColumnsUsingFilter
org.apache.pinot.queries.ExplainPlanQueriesTest#testSelectColumnsUsingFilterOnInvertedIndexColumn
org.apache.pinot.queries.ExplainPlanQueriesTest#testSelectColumnsVariationsOfAndOperators
org.apache.pinot.queries.ExplainPlanQueriesTest#testSelectColumnsVariationsOfOrOperators

If you guys think this fix is good, I will push the new commit.

@walterddr walterddr merged commit 2c88c4f into apache:master Nov 20, 2023
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants