integration test for coordinator and overlord leadership client #10680

clintropolis · 2020-12-15T05:58:39Z

Description

This PR adds integration tests that cover coordinator and overlord leadership changes by running queries against system tables, then cycling which containers are running to force leadership changes. The goal is to get some coverage for DruidLeadershipClient in the integration tests and avoid regressions where possible. Once Kubernetes integration tests are in place, I imagine we could also run these tests against k8s discovery instead of Curator-based discovery, since #10544 has now gone in.

To aid my sanity while testing this stuff, I have added is_leader to sys.servers. It is a long column that returns 1 if the server is the leader and 0 if it is not (for coordinators and overlords). For services that do not have the concept of leadership, it returns the default long value (0 in default mode, null if druid.generic.useDefaultValueForNull=false).
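A minimal sketch of the is_leader value rules described above. The class `IsLeaderSketch` and the helper `isLeaderValue` are hypothetical names used only for illustration; they are not part of Druid, and the actual change relies on NullHandling.defaultLongValue() inside SystemSchema.

```java
// Hypothetical illustration only; not the actual SystemSchema code.
final class IsLeaderSketch {

    /**
     * Computes the value an is_leader cell would hold under the rules in the description.
     *
     * @param hasLeadership          true for coordinators and overlords
     * @param isLeader               whether this server currently holds leadership
     * @param useDefaultValueForNull mirrors druid.generic.useDefaultValueForNull
     */
    static Long isLeaderValue(boolean hasLeadership, boolean isLeader, boolean useDefaultValueForNull) {
        if (hasLeadership) {
            return isLeader ? Long.valueOf(1L) : Long.valueOf(0L);
        }
        // Services without a leadership concept get the "default long value":
        // 0 in default mode, null when SQL-compatible null handling is enabled.
        return useDefaultValueForNull ? Long.valueOf(0L) : null;
    }

    public static void main(String[] args) {
        System.out.println(isLeaderValue(true, true, true));    // leader
        System.out.println(isLeaderValue(true, false, true));   // non-leader coordinator/overlord
        System.out.println(isLeaderValue(false, false, true));  // e.g. a data node, default mode
        System.out.println(isLeaderValue(false, false, false)); // e.g. a data node, null-aware mode
    }
}
```

Note that in default mode a non-leader coordinator and a data node both show 0, so the three-way distinction is only visible when druid.generic.useDefaultValueForNull=false.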

The integration tests add a new test group, 'high-availability', which has a special docker-compose file that brings up a cluster with 1 router, 1 broker, 2 overlords, and 2 coordinators (plus zk/kafka and a metadata store). The tests check which containers are the leaders, issue some system table queries that should exercise the leadership clients against both the current overlord and coordinator leaders, and then restart the containers to force a leadership change, repeating this process a few times.

I modified the base docker-compose file to set the hostnames to the container names, and set druid.host to the same, so that the tests can refer to hosts by hostname instead of container IP address (which is what druid.host defaults to if not specified otherwise). I also had to plumb this through to the integration test config so that I could fill in the correct internal host information for test code and query/expected response templates.

In other docker stuff, I removed many of the links sections of the docker-compose file for integration tests, which afaict are deprecated and not necessary.

Finally, I fixed a funny race condition that I think could really only happen when doing something like this in docker and starting multiple coordinators at the same time. It occurs when initializing the basic auth extension's default auth state: both containers detect that it has not been initialized, one loses the race, and it explodes out of lifecycle start, causing the service to die before starting. This probably isn't a big deal even in a real system, because if the process gets started again it succeeds, since the state is already initialized on the 2nd pass. But our integration test configs do not auto-restart (which is noisy on purpose, I think), so instead I just wrapped the initialization in a retry which will skip and continue startup if the duplicate initialization explosion is detected.
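The retry-and-skip idea could be sketched roughly as follows. This is an assumption-laden illustration, not the code in this PR: `InitRetrySketch`, `DuplicateInitException`, and `initWithRetry` are hypothetical names, and the real basic auth extension may detect the duplicate initialization differently.

```java
// Hypothetical illustration only; names do not come from the Druid codebase.
final class InitRetrySketch {

    /** Stand-in for whatever error signals "someone else already initialized this". */
    static final class DuplicateInitException extends RuntimeException {
        DuplicateInitException(String message) {
            super(message);
        }
    }

    /**
     * Runs the initializer, retrying on transient failures.
     *
     * @return true if this call performed the initialization, false if it was skipped
     *         because another process had already won the race
     */
    static boolean initWithRetry(Runnable initializer, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                initializer.run();
                return true;
            } catch (DuplicateInitException e) {
                // The other node initialized first; nothing left to do, so continue startup.
                return false;
            } catch (RuntimeException e) {
                // Treat anything else as transient and try again.
                lastFailure = e;
            }
        }
        throw lastFailure;
    }

    public static void main(String[] args) {
        boolean initialized = initWithRetry(() -> {
            throw new DuplicateInitException("default auth already initialized");
        }, 3);
        System.out.println("performed init: " + initialized);
    }
}
```

The key design point is that losing the race is treated as success for startup purposes, while other failures still surface after the retries are exhausted.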

This PR has:

- been self-reviewed.
  - using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
- added documentation for new or modified features or behaviors.
- added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- added or updated version, license, or notice information in licenses.yaml
- added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
- added integration tests.
- been tested in a test Druid cluster.

zhangyue19921010 · 2020-12-15T06:22:39Z

Nice idea! I have also made a PR, #10669, which runs ITs for Druid on K8s. Maybe it can be helpful here :)

abhishekagarwal87 · 2020-12-15T07:27:22Z

sql/src/main/java/org/apache/druid/sql/calcite/schema/SystemSchema.java

@@ -586,7 +638,8 @@ public TableType getJdbcTableType()
          StringUtils.toLowerCase(discoveryDruidNode.getNodeRole().toString()),
          druidServerToUse.getTier(),
          currentSize,
-          druidServerToUse.getMaxSize()
+          druidServerToUse.getMaxSize(),
+          NullHandling.defaultLongValue()


hmm. shouldn't this always be zero? Data nodes don't have any leader.

ideally it should always be null because they don't have the concept of leadership at all and then you can distinguish between things which are leader, are not leader, and which don't have leaders. NullHandling.defaultLongValue() is null when druid.generic.useDefaultValueForNull=false, but in default mode numbers cannot be null so it is 0 there.

yeah. that works. 👍

himanshug · 2020-12-15T23:12:08Z

sql/src/main/java/org/apache/druid/sql/calcite/schema/SystemSchema.java

@@ -159,6 +160,7 @@
      .add("tier", ValueType.STRING)
      .add("curr_size", ValueType.LONG)
      .add("max_size", ValueType.LONG)
+      .add("is_leader", ValueType.LONG)


not sure why this is a LONG type variable and not STRING type which could have values "true", "false" and "not-applicable" ?

I guess it's to be consistent with other boolean types such as is_published in the segments table. It is useful for those boolean columns in the segments table to be long so that we can easily compute sums of them. For this column, I think it doesn't have to be long, but I don't mind either.

Yeah, what @jihoonson said was my reasoning for using long: to be consistent with other boolean-ish columns. But I'm not super attached to it and can change it if that is better.

thanks for explaining, in the big scheme of things it hardly matters... consider it a nit
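To illustrate the point about summing long-valued boolean columns, here is a small sketch that counts leaders by summing an is_leader column while ignoring nulls. `LeaderCountSketch` and `countLeaders` are hypothetical names, and the list simply stands in for column values returned by a sys.servers query.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical illustration only; the list stands in for is_leader column values.
final class LeaderCountSketch {

    /** Sums a 0/1 long column, skipping nulls from services with no leadership concept. */
    static long countLeaders(List<Long> isLeaderColumn) {
        return isLeaderColumn.stream()
            .filter(Objects::nonNull)
            .mapToLong(Long::longValue)
            .sum();
    }

    public static void main(String[] args) {
        // Two coordinators, two overlords, and one service without leadership (null).
        List<Long> column = Arrays.asList(1L, 0L, 1L, 0L, null);
        System.out.println(countLeaders(column));
    }
}
```

A string column with values such as "true", "false", and "not-applicable" would need a conditional count instead of a plain sum.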

himanshug · 2020-12-15T23:13:08Z

LGTM, thanks.

jihoonson

+1 after CI

clintropolis · 2020-12-16T21:51:39Z

I made another change to this PR after it was approved, which was to split out the logic in docker_run_cluster.sh which picks which compose file to use for the tests into a new file docker_compose_args.sh which provides the function getComposeArgs.

In docker_run_cluster.sh it is used like

docker-compose $(getComposeArgs) up -d

and the main reason for this change was to modify stop_cluster.sh so that it could do the same-ish thing to stop the cluster

docker-compose $(getComposeArgs) down

clintropolis · 2020-12-18T06:49:41Z

thanks for the review @abhishekagarwal87, @himanshug, and @jihoonson 👍

clintropolis added 3 commits December 14, 2020 21:26

integration test for coordinator and overlord leadership, added sys.servers is_leader column

2e605eb

docs

be04821

remove not needed

c7e8b1f

clintropolis added the Area - Testing, Area - ZooKeeper/Curator, WIP, Area - SQL, and Area - Operations labels Dec 15, 2020

fix comments

caa2205

abhishekagarwal87 reviewed Dec 15, 2020

fix compile heh

8b369ac

himanshug self-requested a review December 15, 2020 16:31

oof

c6a917c

himanshug reviewed Dec 15, 2020

revert unintended

ed27049

jihoonson approved these changes Dec 16, 2020

clintropolis added 8 commits December 16, 2020 01:46

fix tests, split out docker-compose file selection from starting cluster, use docker-compose down to stop cluster

3d1ffcf

fixes

b6fcc2f

style

9f750cf

dang

216b06a

heh

8989482

scripts are hard

3c3d2cf

fix spelling

4e59bd4

fix thing that must not matter since was already wrong ip, log when test fails

433960a

himanshug approved these changes Dec 16, 2020

clintropolis added 2 commits December 16, 2020 15:36

needs more heap

2f36491

Merge remote-tracking branch 'upstream/master' into ha-leadership-integration-tests

518721d

clintropolis added 3 commits December 16, 2020 23:44

fix merge

5dea9e9

less aggro

470dec3

Merge remote-tracking branch 'upstream/master' into ha-leadership-integration-tests

abd5c95

clintropolis added Release Notes and removed WIP labels Dec 18, 2020

clintropolis merged commit da0eaba into apache:master Dec 18, 2020

clintropolis deleted the ha-leadership-integration-tests branch December 18, 2020 06:50

clintropolis mentioned this pull request Dec 18, 2020

fix integration test override config #10694

Merged

jihoonson added this to the 0.21.0 milestone Jan 4, 2021

jihoonson mentioned this pull request Jan 5, 2021

Support to show leader of coordinators in Services web console view #10418

Closed

jihoonson mentioned this pull request Jan 13, 2021

[Draft] 0.21.0 Release Notes #10752

Closed

FrankChen021 mentioned this pull request Mar 5, 2021

Show leader nodes in Services Tab #10951

Merged

himanshug mentioned this pull request Mar 9, 2021

k8s discovery module: fix issue for druid.host being more than 63chars not permitted as k8s resource label value #10961

Merged

himanshug mentioned this pull request May 5, 2021

kubernetes-druid-discovery-extension: stress-test overlord/coordinator leadership changes #11205

Closed
