integration test for coordinator and overlord leadership client #10680
Conversation
…ervers is_leader column
Nice idea! And I have made a PR #10669 which runs integration tests for Druid on K8s. Maybe it can be helpful here :)
@@ -586,7 +638,8 @@ public TableType getJdbcTableType()
         StringUtils.toLowerCase(discoveryDruidNode.getNodeRole().toString()),
         druidServerToUse.getTier(),
         currentSize,
-        druidServerToUse.getMaxSize()
+        druidServerToUse.getMaxSize(),
+        NullHandling.defaultLongValue()
hmm. shouldn't this always be zero? Data nodes don't have any leader.
ideally it should always be null, because they don't have the concept of leadership at all, and then you can distinguish between things which are leaders, things which are not leaders, and things which don't have leaders. `NullHandling.defaultLongValue()` is null when `druid.generic.useDefaultValueForNull=false`, but in default mode numbers cannot be null, so it is 0 there.
yeah. that works. 👍
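The behavior described above can be sketched as a toy in Java. This is an illustration only: the real `NullHandling.defaultLongValue()` reads `druid.generic.useDefaultValueForNull` from Druid's runtime configuration, whereas the boolean parameter here is an assumption made for the sketch.

```java
public class NullHandlingSketch {
    // Toy model of NullHandling.defaultLongValue(): in SQL-compatible mode
    // (useDefaultValueForNull=false) numeric columns may be null, so the
    // default is null; in default mode numbers cannot be null, so it is 0.
    static Long defaultLongValue(boolean useDefaultValueForNull) {
        return useDefaultValueForNull ? Long.valueOf(0L) : null;
    }

    public static void main(String[] args) {
        System.out.println(defaultLongValue(true));   // default mode
        System.out.println(defaultLongValue(false));  // SQL-compatible mode
    }
}
```

This is why a data node's `is_leader` value collapses to 0 in default mode: the "no leadership concept" null gets coerced to the default long value.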
@@ -159,6 +160,7 @@
         .add("tier", ValueType.STRING)
         .add("curr_size", ValueType.LONG)
         .add("max_size", ValueType.LONG)
+        .add("is_leader", ValueType.LONG)
not sure why this is a LONG-typed column and not a STRING type, which could have the values "true", "false", and "not-applicable"?
I guess it's to be consistent with other boolean columns such as `is_published` in the segments table. It is useful for those boolean columns in the segments table to be longs so that we can easily compute sums of them. For this column, I think it doesn't have to be long, but I don't mind either way.
Yeah, what @jihoonson said was my reasoning for using long, to be consistent with other boolean-ish columns, but I'm not super attached to it and can change it if that is better.
thanks for explaining, in the big scheme of things it hardly matters... consider it a nit
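The summability argument is easy to see with a query. A hedged sketch, assuming the `is_leader` column lands as described in this PR:

```sql
-- With is_leader as a LONG, leaders can be counted directly;
-- a STRING 'true'/'false' column would need a CASE expression first.
SELECT server_type, SUM(is_leader) AS leaders
FROM sys.servers
WHERE server_type IN ('coordinator', 'overlord')
GROUP BY server_type
```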
LGTM, thanks.
+1 after CI
…ter, use docker-compose down to stop cluster
I made another change to this PR after it was approved, which was to split the docker-compose file selection logic out of cluster startup (`docker-compose $(getComposeArgs) up -d`). The main reason for this change was to allow stopping the cluster with `docker-compose $(getComposeArgs) down`.
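A hedged sketch of that split; the function name `getComposeArgs` comes from the comment above, but the file names, environment variable, and helper functions here are assumptions, not the PR's actual script:

```shell
# Compose-file selection lives in its own function so that start and
# stop are guaranteed to operate on the same set of compose files.
getComposeArgs() {
  if [ "$DRUID_INTEGRATION_TEST_GROUP" = "high-availability" ]; then
    echo "-f docker-compose.high-availability.yml"
  else
    echo "-f docker-compose.yml"
  fi
}

# Start and stop both reuse the selection instead of duplicating it.
start_cluster() { docker-compose $(getComposeArgs) up -d; }
stop_cluster()  { docker-compose $(getComposeArgs) down; }
```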
thanks for the review @abhishekagarwal87, @himanshug, and @jihoonson 👍
…he#10680)

* integration test for coordinator and overlord leadership, added sys.servers is_leader column
* docs
* remove not needed
* fix comments
* fix compile heh
* oof
* revert unintended
* fix tests, split out docker-compose file selection from starting cluster, use docker-compose down to stop cluster
* fixes
* style
* dang
* heh
* scripts are hard
* fix spelling
* fix thing that must not matter since was already wrong ip, log when test fails
* needs more heap
* fix merge
* less aggro
Description
This PR adds integration tests that try to get some coverage of coordinator and overlord leadership changes, using queries against system tables and then cycling which containers are running to force leadership changes. The goal is to get some coverage for `DruidLeadershipClient` in the integration tests and avoid any sort of regressions if possible. Once Kubernetes integration tests are in place, I imagine we could run these tests against K8s discovery instead of Curator-based discovery as well, since #10544 has now gone in.

To aid in my sanity while testing this stuff, I have added `is_leader` to `sys.servers`. It is a long column which returns 1 if the server is the leader and 0 if it is not (for coordinators and overlords); for services which do not have the concept of leadership, it returns the default long value (0 in default mode, null if `druid.generic.useDefaultValueForNull=false`).

The integration tests add a new test group, 'high-availability', which has a special docker-compose file that brings up a cluster with 1 router, 1 broker, 2 overlords, and 2 coordinators (plus ZooKeeper/Kafka and the metadata store). The tests check which containers are the leaders, issue some system tables queries which should exercise the leadership clients against both the current overlord and coordinator leaders, and then restart containers to force leadership changes, repeating this process a few times.
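For instance, a test can look up the current leaders with a query along these lines (a sketch; the PR's actual query and expected-response templates may differ):

```sql
-- Which coordinator and overlord containers are currently leaders?
SELECT server, server_type
FROM sys.servers
WHERE server_type IN ('coordinator', 'overlord') AND is_leader = 1
```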
I modified the base docker-compose file to set the hostnames to the container names, and set `druid.host` to the same, so that the tests could refer to hosts by hostname instead of container IP address (which is what `druid.host` defaults to if not specified otherwise). This was because I also had to plumb this through to the integration test config so that I could fill in the correct internal host information for test code and query/expected-response templates.

In other docker changes, I removed many of the `links` sections of the docker-compose file for integration tests, which as far as I can tell are deprecated and not necessary.

Finally, I fixed a funny race condition that I think could really only happen when doing something like this in docker and starting multiple coordinators at the same time. Both containers would detect that the basic auth extension's default auth state had not been initialized, one would lose the race to initialize it and explode out of lifecycle start, causing the service to die before starting. This probably isn't a big deal even in a real system, because if the process gets started again it would succeed (the state would be initialized on the second pass), but our integration test configs do not auto-restart (which is noisy on purpose, I think). So instead I wrapped the initialization in a retry which skips it and continues startup if the duplicate-initialization explosion is detected.
This PR has: