Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add start_time column to sys.servers #13358

Merged
merged 14 commits into from
Apr 14, 2023

Conversation

a2l007
Copy link
Contributor

@a2l007 a2l007 commented Nov 11, 2022

Fixes #12090

Description

Adds a new column start_time to sys.servers that captures the time at which the server was added to the cluster.


Key changed/added classes in this PR
  • DiscoveryDruidNode
  • SystemSchema

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

Copy link
Contributor

@LakshSingla LakshSingla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! The Java code changes LGTM barring a minor question.

public DateTime getStartTime()
{
return startTime;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is startTime omitted while verifying the equality of this class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since a DiscoveryDruidNode object is primarily identified by its DruidNode, role and the service map, I wanted to preserve the equality condition. Also, start_time might not be a concrete enough to decide the equality between two DiscoveryDruidNode objects, if the other field values are the same. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! I think it's better to keep it the current way.

@vogievetsky
Copy link
Contributor

Oh damn! so cool! I can not wait to add an "uptime" column to the web console!

@@ -67,14 +67,25 @@ const allColumns: string[] = [
'Current size',
'Max size',
'Usage',
'Start Time',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please capitalize as Start time

@@ -510,6 +524,14 @@ ORDER BY
}
},
},
{
Header: 'Start Time',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto re: capitalization

@@ -211,6 +224,7 @@ ORDER BY
curr_size: s.currSize,
max_size: s.maxSize,
tls_port: -1,
start_time: s.start_time,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you add start_time to the /druid/coordinator/v1/servers?simple response also? If so you should update https://github.com/apache/druid/blob/master/docs/operations/api-reference.md#L496 also, if not then update the line of code above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, removed it.

@@ -67,14 +67,25 @@ const allColumns: string[] = [
'Current size',
'Max size',
'Usage',
'Start time',
'Detail',
ACTION_COLUMN_LABEL,
];

const tableColumns: Record<CapabilitiesMode, string[]> = {
'full': allColumns,
'no-sql': allColumns,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so since you are not updating /druid/coordinator/v1/servers?simple you should make sure that Start time is no in the no-sql column list. no-sql is the mode used when the user had no SQL access so it falls back to the old endpoint. You should set no-sql to:

[
  'Service',
  'Type',
  'Tier',
  'Host',
  'Port',
  'Current size',
  'Max size',
  'Usage',
  'Detail',
  ACTION_COLUMN_LABEL,
];

Also at this point you can inline the allColumns constant as it's reasons for existence was to set full and no-sql to the same thing.

Copy link
Contributor

@vogievetsky vogievetsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for everything but the Java (I did not review the Java code - only TS + general idea). Thank you for promptly responding to feedback!

@a2l007
Copy link
Contributor Author

a2l007 commented Nov 17, 2022

@vogievetsky Thanks for the review. Do you know what is going on with the Travis job failure: web console end-to-end test ?
It builds fine locally but Travis seems to keep failing that job.

@vogievetsky
Copy link
Contributor

not sure, will have a look in a bit.

@a2l007
Copy link
Contributor Author

a2l007 commented Nov 23, 2022

I found an issue when coordinator starts up with druid.coordinator.asOverlord.enabled as true. In this case, the coordinator and overlord services are announced twice and in each case with a different instance of DiscoveryDruidNode. This breaks the announcer flow as each of the DiscoveryDruidNode objects have a slightly different startTime which causes a mismatch between the node bytes announced at the path.
Converting this PR to a draft until I find a better way to fix this.

@a2l007 a2l007 marked this pull request as draft November 23, 2022 00:15
@vogievetsky
Copy link
Contributor

Was that what was causing the e2e failures?

@a2l007
Copy link
Contributor Author

a2l007 commented Nov 28, 2022

Was that what was causing the e2e failures?

@vogievetsky I believe so, since we run coordinator with asOverlord enabled in our builds.

@vogievetsky
Copy link
Contributor

What is the status of this PR: is it good to merge if conflicts are resolved?

@a2l007
Copy link
Contributor Author

a2l007 commented Mar 20, 2023

@vogievetsky Sorry for the delay, I'll find some time this week to fix up this PR

@a2l007 a2l007 marked this pull request as ready for review March 22, 2023 20:30
@a2l007
Copy link
Contributor Author

a2l007 commented Mar 24, 2023

@vogievetsky @abhishekagarwal87 @LakshSingla Sorry for the delay in fixing up the conflicts for this PR, but it'd be great if you could take a quick look at this again.

@@ -314,7 +313,7 @@ public void childEvent(CuratorFramework client, PathChildrenCacheEvent event) th
if (oldBytes == null) {
created = true;
} else if (!Arrays.equals(oldBytes, bytes)) {
throw new IAE("Cannot reannounce different values under the same path");
log.error("Ignoring attempt to announce different values under same path");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the rationale behind this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the coordinator is run in overlord mode, Announcer.announce() is called twice since the Announcer module is part of the coordinator and overlord lifecycle modules. The second call is a no-op since the existing DiscoveryDruidNode bytes announced at the path is the same as the node bytes in the second call.
With this patch, the start time is now part of DiscoveryDruidNode and so in some cases, there could be a millisecond delay between the two announce calls. This causes the node objects to be different and the second announce call fails due to the validation check.
I couldn't find another scenario where it would be useful to fail the process in this condition and so I'm logging it here instead. Let me know if you have any thoughts on this approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's called 4 times. From my local logs of a run

2023-03-22T13:24:51,527 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"coordinator","services":{}}].
2023-03-22T13:24:51,530 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"overlord","services":{}}].
2023-03-22T13:24:51,530 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"coordinator","services":{}}].
2023-03-22T13:24:51,530 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"overlord","services":{}}].

I debugged this a bit and you are right about it being called from two modules. I think that we can also skip the duplicate announcement when Overlord is not in standalone mode. That should fix the problem you ran into. And we keep this assert in place. what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this change and it's working fine.

diff --git a/services/src/main/java/org/apache/druid/cli/CliOverlord.java b/services/src/main/java/org/apache/druid/cli/CliOverlord.java
index e4383c673c..79b90e63b5 100644
--- a/services/src/main/java/org/apache/druid/cli/CliOverlord.java
+++ b/services/src/main/java/org/apache/druid/cli/CliOverlord.java
@@ -267,13 +267,13 @@ public class CliOverlord extends ServerRunnable
 
             if (standalone) {
               LifecycleModule.register(binder, Server.class);
-            }
 
-            bindAnnouncer(
-                binder,
-                IndexingService.class,
-                DiscoverySideEffectsProvider.create()
-            );
+              bindAnnouncer(
+                      binder,
+                      IndexingService.class,
+                      DiscoverySideEffectsProvider.create()
+              );
+            }
 
             Jerseys.addResource(binder, SelfDiscoveryResource.class);
             LifecycleModule.registerKey(binder, Key.get(SelfDiscoveryResource.class));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that seems like the better solution. I've tested out the changes and it works as expected.

@abhishekagarwal87
Copy link
Contributor

@a2l007 - the PR is ready to go. I just had one question on a change that you made in this PR.

@abhishekagarwal87 abhishekagarwal87 merged commit e3c160f into apache:master Apr 14, 2023
@abhishekagarwal87
Copy link
Contributor

thank you @a2l007. I merged this PR. This missed the cut for the 26 milestone. if you want this in 26 release, please create a backport PR against 26.0.0 branch.

@abhishekagarwal87 abhishekagarwal87 added this to the 27.0 milestone Jul 19, 2023
churromorales pushed a commit to churromorales/druid that referenced this pull request Sep 13, 2023
Adds a new column start_time to sys.servers that captures the time at which the server was added to the cluster.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a service start_time column, so you know when each of your services was started.
6 participants