Add start_time column to sys.servers #13358

a2l007 · 2022-11-11T22:02:09Z

Fixes #12090

Description

Adds a new column start_time to sys.servers that captures the time at which the server was added to the cluster.

Key changed/added classes in this PR

DiscoveryDruidNode
SystemSchema

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
been tested in a test Druid cluster.

LakshSingla

Thanks for the contribution! The Java code changes LGTM barring a minor question.

LakshSingla · 2022-11-15T09:02:08Z

server/src/main/java/org/apache/druid/discovery/DiscoveryDruidNode.java

+  public DateTime getStartTime()
+  {
+    return startTime;
+  }


Why is startTime omitted while verifying the equality of this class?

Since a DiscoveryDruidNode object is primarily identified by its DruidNode, role and the service map, I wanted to preserve the equality condition. Also, start_time might not be concrete enough to decide equality between two DiscoveryDruidNode objects when the other field values are the same. What do you think?

Thanks for the explanation! I think it's better to keep it the current way.
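
For readers following the equality discussion above, here is a minimal sketch of the idea, not the actual Druid code. NodeIdentitySketch is a hypothetical, simplified stand-in for DiscoveryDruidNode: the DruidNode is reduced to a string, the service map to a Map<String, String>, and java.time.Instant is assumed in place of the DateTime type the real getter returns. It only shows equals() and hashCode() ignoring the start time.

```java
import java.time.Instant;
import java.util.Map;
import java.util.Objects;

// Hypothetical, simplified stand-in for DiscoveryDruidNode (not the real Druid class).
final class NodeIdentitySketch
{
  private final String druidNode;              // stands in for the DruidNode object
  private final String nodeRole;
  private final Map<String, String> services;  // stands in for the service map
  private final Instant startTime;             // informational; excluded from equality

  NodeIdentitySketch(String druidNode, String nodeRole, Map<String, String> services, Instant startTime)
  {
    this.druidNode = druidNode;
    this.nodeRole = nodeRole;
    this.services = services;
    this.startTime = startTime;
  }

  Instant getStartTime()
  {
    return startTime;
  }

  @Override
  public boolean equals(Object o)
  {
    if (this == o) {
      return true;
    }
    if (!(o instanceof NodeIdentitySketch)) {
      return false;
    }
    NodeIdentitySketch that = (NodeIdentitySketch) o;
    // startTime is deliberately not compared.
    return druidNode.equals(that.druidNode)
           && nodeRole.equals(that.nodeRole)
           && services.equals(that.services);
  }

  @Override
  public int hashCode()
  {
    // Uses the same fields as equals() so the two stay consistent.
    return Objects.hash(druidNode, nodeRole, services);
  }

  public static void main(String[] args)
  {
    NodeIdentitySketch a =
        new NodeIdentitySketch("localhost:8081", "coordinator", Map.of(), Instant.ofEpochMilli(1000));
    NodeIdentitySketch b =
        new NodeIdentitySketch("localhost:8081", "coordinator", Map.of(), Instant.ofEpochMilli(1003));
    // Same identity fields, different start times.
    System.out.println(a.equals(b));
  }
}
```

Under these assumptions, two objects that differ only in start time compare equal and hash identically, which is the property the thread agrees to preserve.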

vogievetsky · 2022-11-15T19:43:28Z

Oh damn! so cool! I can not wait to add an "uptime" column to the web console!
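
As a rough illustration of how an "uptime" value could be derived from start_time, here is a small sketch. It is written in Java for consistency with the rest of the PR rather than in the web console's TypeScript, and it assumes start_time is exposed as an ISO-8601 timestamp string; UptimeSketch and its methods are hypothetical names, not part of Druid.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper; assumes start_time is an ISO-8601 instant such as "2022-11-11T22:02:09Z".
final class UptimeSketch
{
  // Elapsed time between the reported start time and the supplied "now".
  static Duration uptime(String startTime, Instant now)
  {
    return Duration.between(Instant.parse(startTime), now);
  }

  // Renders a duration as days, hours and minutes.
  static String format(Duration d)
  {
    long days = d.toDays();
    long hours = d.toHours() % 24;
    long minutes = d.toMinutes() % 60;
    return days + "d " + hours + "h " + minutes + "m";
  }

  public static void main(String[] args)
  {
    Instant now = Instant.parse("2022-11-15T19:43:28Z");
    System.out.println(format(uptime("2022-11-11T22:02:09Z", now))); // prints "3d 21h 41m"
  }
}
```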

vogievetsky · 2022-11-15T19:45:08Z

web-console/src/views/services-view/services-view.tsx

@@ -67,14 +67,25 @@ const allColumns: string[] = [
  'Current size',
  'Max size',
  'Usage',
+  'Start Time',


Please capitalize as Start time

vogievetsky · 2022-11-15T19:45:42Z

web-console/src/views/services-view/services-view.tsx

@@ -510,6 +524,14 @@ ORDER BY
              }
            },
          },
+          {
+            Header: 'Start Time',


Ditto re: capitalization

vogievetsky · 2022-11-15T19:49:16Z

web-console/src/views/services-view/services-view.tsx

@@ -211,6 +224,7 @@ ORDER BY
        curr_size: s.currSize,
        max_size: s.maxSize,
        tls_port: -1,
+        start_time: s.start_time,


Did you also add start_time to the /druid/coordinator/v1/servers?simple response? If so, you should also update https://github.com/apache/druid/blob/master/docs/operations/api-reference.md#L496; if not, then update the line of code above.

Good catch, removed it.

vogievetsky · 2022-11-16T18:23:51Z

web-console/src/views/services-view/services-view.tsx

@@ -67,14 +67,25 @@ const allColumns: string[] = [
  'Current size',
  'Max size',
  'Usage',
+  'Start time',
  'Detail',
  ACTION_COLUMN_LABEL,
 ];

 const tableColumns: Record<CapabilitiesMode, string[]> = {
  'full': allColumns,
  'no-sql': allColumns,


OK, so since you are not updating /druid/coordinator/v1/servers?simple, you should make sure that Start time is not in the no-sql column list. no-sql is the mode used when the user has no SQL access, so it falls back to the old endpoint. You should set no-sql to:

['Service', 'Type', 'Tier', 'Host', 'Port', 'Current size', 'Max size', 'Usage', 'Detail', ACTION_COLUMN_LABEL]

Also, at this point you can inline the allColumns constant, as its reason for existence was to set full and no-sql to the same thing.

vogievetsky

👍 for everything but the Java (I did not review the Java code - only TS + general idea). Thank you for promptly responding to feedback!

a2l007 · 2022-11-17T19:24:55Z

@vogievetsky Thanks for the review. Do you know what is going on with the Travis job failure: web console end-to-end test?
It builds fine locally, but Travis keeps failing that job.

vogievetsky · 2022-11-17T22:18:17Z

not sure, will have a look in a bit.

a2l007 · 2022-11-23T00:14:55Z

I found an issue when the coordinator starts up with druid.coordinator.asOverlord.enabled set to true. In this case, the coordinator and overlord services are announced twice, each time with a different instance of DiscoveryDruidNode. This breaks the announcer flow because each DiscoveryDruidNode object has a slightly different startTime, which causes a mismatch between the node bytes announced at the path.
Converting this PR to a draft until I find a better way to fix this.

vogievetsky · 2022-11-23T23:09:48Z

Was that what was causing the e2e failures?

a2l007 · 2022-11-28T17:23:02Z

Was that what was causing the e2e failures?

@vogievetsky I believe so, since we run coordinator with asOverlord enabled in our builds.

vogievetsky · 2023-03-20T18:07:07Z

What is the status of this PR: is it good to merge if conflicts are resolved?

a2l007 · 2023-03-20T18:39:11Z

@vogievetsky Sorry for the delay, I'll find some time this week to fix up this PR

a2l007 · 2023-03-24T22:53:19Z

@vogievetsky @abhishekagarwal87 @LakshSingla Sorry for the delay in fixing up the conflicts for this PR, but it'd be great if you could take a quick look at this again.

abhishekagarwal87 · 2023-03-29T13:29:51Z

server/src/main/java/org/apache/druid/curator/announcement/Announcer.java

@@ -314,7 +313,7 @@ public void childEvent(CuratorFramework client, PathChildrenCacheEvent event) th
        if (oldBytes == null) {
          created = true;
        } else if (!Arrays.equals(oldBytes, bytes)) {
-          throw new IAE("Cannot reannounce different values under the same path");
+          log.error("Ignoring attempt to announce different values under same path");


what is the rationale behind this change?

When the coordinator is run in overlord mode, Announcer.announce() is called twice since the Announcer module is part of both the coordinator and overlord lifecycle modules. Without this patch, the second call is a no-op since the existing DiscoveryDruidNode bytes announced at the path are the same as the node bytes in the second call.
With this patch, the start time is now part of DiscoveryDruidNode, so in some cases there could be a millisecond delay between the two announce calls. This causes the node objects to be different, and the second announce call fails due to the validation check.
I couldn't find another scenario where it would be useful to fail the process in this condition and so I'm logging it here instead. Let me know if you have any thoughts on this approach.
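
The failure mode described above can be sketched as follows. AnnouncerSketch is a hypothetical in-memory stand-in, not Druid's Announcer (which works against ZooKeeper through Curator): it keeps announced payloads in a map and throws a plain IllegalArgumentException where the real code throws IAE. The payload format is also made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the announcer's "same path, same bytes" check.
final class AnnouncerSketch
{
  private final Map<String, byte[]> announced = new HashMap<>();

  // Returns true for a first announcement, false for an identical re-announcement,
  // and throws when different bytes are announced under an existing path.
  boolean announce(String path, byte[] bytes)
  {
    byte[] oldBytes = announced.putIfAbsent(path, bytes);
    if (oldBytes == null) {
      return true;
    }
    if (!Arrays.equals(oldBytes, bytes)) {
      throw new IllegalArgumentException("Cannot reannounce different values under the same path");
    }
    return false;
  }

  // Made-up payload: once the start time is serialized, a 1 ms difference changes the bytes.
  static byte[] payload(String nodeType, long startTimeMillis)
  {
    String json = "{\"nodeType\":\"" + nodeType + "\",\"startTime\":" + startTimeMillis + "}";
    return json.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args)
  {
    AnnouncerSketch announcer = new AnnouncerSketch();
    String path = "/druid/internal-discovery/COORDINATOR/localhost:8081"; // illustrative path only
    announcer.announce(path, payload("coordinator", 1000L));
    announcer.announce(path, payload("coordinator", 1000L)); // identical bytes: no-op
    try {
      announcer.announce(path, payload("coordinator", 1001L)); // one millisecond later
    }
    catch (IllegalArgumentException e) {
      System.out.println("Second announcement rejected: " + e.getMessage());
    }
  }
}
```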

It's called 4 times. From my local logs of a run:

2023-03-22T13:24:51,527 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"coordinator","services":{}}].
2023-03-22T13:24:51,530 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"overlord","services":{}}].
2023-03-22T13:24:51,530 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"coordinator","services":{}}].
2023-03-22T13:24:51,530 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"overlord","services":{}}].

I debugged this a bit and you are right about it being called from two modules. I think that we can also skip the duplicate announcement when Overlord is not in standalone mode. That should fix the problem you ran into, and we keep this assert in place. What do you think?

I tried this change and it's working fine.

diff --git a/services/src/main/java/org/apache/druid/cli/CliOverlord.java b/services/src/main/java/org/apache/druid/cli/CliOverlord.java
index e4383c673c..79b90e63b5 100644
--- a/services/src/main/java/org/apache/druid/cli/CliOverlord.java
+++ b/services/src/main/java/org/apache/druid/cli/CliOverlord.java
@@ -267,13 +267,13 @@ public class CliOverlord extends ServerRunnable
             if (standalone) {
               LifecycleModule.register(binder, Server.class);
-            }
-            bindAnnouncer(
-                binder,
-                IndexingService.class,
-                DiscoverySideEffectsProvider.create()
-            );
+              bindAnnouncer(
+                  binder,
+                  IndexingService.class,
+                  DiscoverySideEffectsProvider.create()
+              );
+            }
             Jerseys.addResource(binder, SelfDiscoveryResource.class);
             LifecycleModule.registerKey(binder, Key.get(SelfDiscoveryResource.class));

Thanks, that seems like the better solution. I've tested out the changes and it works as expected.
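
To make the effect of the suggested CliOverlord change easier to see, here is a toy model of the control flow in the diff. OverlordBindingSketch and its step names are hypothetical; it only records which setup steps would run for a given standalone flag and does not use Guice or any Druid classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the branch in the diff; the step names are labels, not real API calls.
final class OverlordBindingSketch
{
  // Before the change: the announcer is bound whether or not the Overlord is standalone.
  static List<String> before(boolean standalone)
  {
    List<String> steps = new ArrayList<>();
    if (standalone) {
      steps.add("register Server lifecycle");
    }
    steps.add("bind IndexingService announcer");
    steps.add("add SelfDiscoveryResource");
    return steps;
  }

  // After the change: the announcer is bound only in standalone mode.
  static List<String> after(boolean standalone)
  {
    List<String> steps = new ArrayList<>();
    if (standalone) {
      steps.add("register Server lifecycle");
      steps.add("bind IndexingService announcer");
    }
    steps.add("add SelfDiscoveryResource");
    return steps;
  }

  public static void main(String[] args)
  {
    // Overlord embedded in the coordinator (standalone == false).
    System.out.println("before: " + before(false));
    System.out.println("after:  " + after(false));
  }
}
```

In this model, an embedded Overlord no longer binds its own announcer, so the duplicate announcement never reaches the validation check, while a standalone Overlord behaves as before.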

abhishekagarwal87 · 2023-04-04T07:51:45Z

@a2l007 - the PR is ready to go. I just had one question on a change that you made in this PR.

docs/querying/sql-metadata-tables.md

abhishekagarwal87 · 2023-04-14T09:54:29Z

Thank you @a2l007. I merged this PR. This missed the cut for the 26 milestone. If you want this in the 26 release, please create a backport PR against the 26.0.0 branch.

Server start time

232631f

kfaraz added the Area - Operations label Nov 14, 2022

LakshSingla reviewed Nov 15, 2022

vogievetsky reviewed Nov 15, 2022

vogievetsky added the Area - Web Console label Nov 15, 2022

a2l007 added 3 commits November 15, 2022 16:41

Merge branch 'master' of github.com:apache/druid into servercreatedtime

d80dd80

Spelling fixes

737c77c

Changes to test

f9f4415

vogievetsky reviewed Nov 16, 2022

Fix services view for nosql

cdc0ed7

vogievetsky approved these changes Nov 17, 2022

abhishekagarwal87 approved these changes Nov 22, 2022

Merge branch 'master' of github.com:apache/druid into servercreatedtime

6def47d

a2l007 marked this pull request as draft November 23, 2022 00:15

a2l007 added 2 commits November 28, 2022 09:20

Log attempt to announce diff values under same path

f1c46d3

Merge branch 'master' of github.com:apache/druid into servercreatedtime

b3e5a9f

Checkstyle

4aaa647

Merge branch 'master' of github.com:apache/druid into servercreatedtime

3523071

github-actions bot added the Area - Documentation label Mar 21, 2023

Default start time for int tests

1a9c5e9

a2l007 marked this pull request as ready for review March 22, 2023 20:30

abhishekagarwal87 reviewed Mar 29, 2023

techdocsmith reviewed Apr 6, 2023

docs/querying/sql-metadata-tables.md

a2l007 added 3 commits April 10, 2023 16:37

Merge branch 'master' of github.com:apache/druid into servercreatedtime

ded8668

Fix doc comments

839f2a5

bind announcer only in standalone mode

e2ae733

abhishekagarwal87 added the Release Notes label Apr 14, 2023

abhishekagarwal87 merged commit e3c160f into apache:master Apr 14, 2023

abhishekagarwal87 added this to the 27.0 milestone Jul 19, 2023

vogievetsky mentioned this pull request Aug 3, 2023

DO NOT MERGE - 27.0.0 WIP release notes #14600

Closed

churromorales pushed a commit to churromorales/druid that referenced this pull request Sep 13, 2023

Add start_time column to sys.servers (apache#13358)

33092ed
