Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add integration test for rebalance in upsert tables #11568

Merged
merged 9 commits into from
Oct 9, 2023

Conversation

KKcorps
Copy link
Contributor

@KKcorps KKcorps commented Sep 11, 2023

Mostly tests that partitions move as a whole unit to new servers

@codecov-commenter
Copy link

codecov-commenter commented Sep 12, 2023

Codecov Report

Merging #11568 (e0c8532) into master (24dba83) will decrease coverage by 0.02%.
Report is 1 commits behind head on master.
The diff coverage is 0.00%.

@@             Coverage Diff              @@
##             master   #11568      +/-   ##
============================================
- Coverage     63.10%   63.08%   -0.02%     
  Complexity     1118     1118              
============================================
  Files          2342     2342              
  Lines        125916   125917       +1     
  Branches      19370    19370              
============================================
- Hits          79454    79440      -14     
- Misses        40813    40825      +12     
- Partials       5649     5652       +3     
Flag Coverage Δ
integration <0.01% <0.00%> (ø)
integration1 <0.01% <0.00%> (ø)
integration2 0.00% <0.00%> (ø)
java-11 63.06% <0.00%> (-0.01%) ⬇️
java-17 62.95% <0.00%> (+<0.01%) ⬆️
java-20 62.94% <0.00%> (-0.02%) ⬇️
temurin 63.08% <0.00%> (-0.02%) ⬇️
unittests 63.08% <0.00%> (-0.02%) ⬇️
unittests1 67.23% <0.00%> (-0.01%) ⬇️
unittests2 14.42% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...spi/utils/builder/ControllerRequestURLBuilder.java 0.00% <0.00%> (ø)
...inot/controller/helix/ControllerRequestClient.java 37.34% <0.00%> (ø)

... and 11 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@KKcorps KKcorps changed the title Add integration test for rebalance in upsert tables WIP: Add integration test for rebalance in upsert tables Sep 12, 2023
@KKcorps KKcorps force-pushed the upsert_rebalance_tes branch 3 times, most recently from 088a2b2 to eba73c6 Compare September 13, 2023 12:03
@KKcorps KKcorps changed the title WIP: Add integration test for rebalance in upsert tables Add integration test for rebalance in upsert tables Sep 14, 2023
@Override
protected void waitForAllDocsLoaded(long timeoutMs)
throws Exception {
Thread.sleep(5000L); //TODO: Wait for some time for rebalance to kickoff
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to avoid sleeping a fixed amount of time. For fast machine, it can waste time; for slow machine, it can cause flakiness. Any specific reason why waiting for condition itself is not enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, waiting on condition itself triggers an early termination since the rebalance simply hasn't started and all segments are still available.

@npawar
Copy link
Contributor

npawar commented Sep 19, 2023

Mostly tests that partitions move as a whole unit to new servers

is there an jira that prompted this item? did this not work in some env / scenario?

@KKcorps
Copy link
Contributor Author

KKcorps commented Sep 20, 2023

Mostly tests that partitions move as a whole unit to new servers

is there an jira that prompted this item? did this not work in some env / scenario?

Yes, there are cases when ideal state diverges from instance partition map and it created prod issues.

@klsince is working on the fix for the issue.

#11628

@Jackie-Jiang
Copy link
Contributor

@klsince Can you help review this test?

return 600;
}

protected long getCurrentCountStarResultWithoutNulls(String tableName, Schema schema) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious why need to check on NULL rows?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aah because we have observed an issue in partial upsert tables in the past where some columns always contain null even though the correct data is coming in the stream.

// Check the number of replicas after rebalancing
finalReplicas = _resourceManager.getServerInstancesForTable(getTableName(), TableType.REALTIME).size();

// Check that a replica has been added
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you meant a server has been added? as from the comment below, rf=1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

waitForAllDocsLoaded(600_000L);

// number of instances assigned can't be more than number of partitions for rf = 1
verifySegmentAssignment(rebalanceResult.getSegmentAssignment(), 5, getNumKafkaPartitions());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iiuc, I assume the number of kafka partitions is 3? then, it may be interesting to test by adding one more server (4 servers in total) to check that one server is left idle w/o taking any table partition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No the default is 2 partitons and that scenario is already covered.


import static org.testng.Assert.assertEquals;

public class PartialUpsertTableRebalanceIntegrationTest extends BaseClusterIntegrationTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like the test cases are not specific to partial upsert? or I must have missed some details 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it covers both upsert and partial upsert. I have added the table config to be partial upsert cause we observed weird scenarios there sometimes where some columns are always null.

Copy link
Contributor

@klsince klsince left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
just two non-blocking comments

@KKcorps KKcorps merged commit 52d16f7 into apache:master Oct 9, 2023
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants