-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add start_time column to sys.servers #13358
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution! The Java code changes LGTM barring a minor question.
public DateTime getStartTime() | ||
{ | ||
return startTime; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is startTime omitted while verifying the equality of this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since a DiscoveryDruidNode object is primarily identified by its DruidNode, role and the service map, I wanted to preserve the equality condition. Also, start_time might not be a concrete enough to decide the equality between two DiscoveryDruidNode objects, if the other field values are the same. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation! I think it's better to keep it the current way.
Oh damn! so cool! I can not wait to add an "uptime" column to the web console! |
@@ -67,14 +67,25 @@ const allColumns: string[] = [ | |||
'Current size', | |||
'Max size', | |||
'Usage', | |||
'Start Time', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please capitalize as Start time
@@ -510,6 +524,14 @@ ORDER BY | |||
} | |||
}, | |||
}, | |||
{ | |||
Header: 'Start Time', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto re: capitalization
@@ -211,6 +224,7 @@ ORDER BY | |||
curr_size: s.currSize, | |||
max_size: s.maxSize, | |||
tls_port: -1, | |||
start_time: s.start_time, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you add start_time
to the /druid/coordinator/v1/servers?simple
response also? If so you should update https://github.com/apache/druid/blob/master/docs/operations/api-reference.md#L496 also, if not then update the line of code above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, removed it.
@@ -67,14 +67,25 @@ const allColumns: string[] = [ | |||
'Current size', | |||
'Max size', | |||
'Usage', | |||
'Start time', | |||
'Detail', | |||
ACTION_COLUMN_LABEL, | |||
]; | |||
|
|||
const tableColumns: Record<CapabilitiesMode, string[]> = { | |||
'full': allColumns, | |||
'no-sql': allColumns, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so since you are not updating /druid/coordinator/v1/servers?simple
you should make sure that Start time
is no in the no-sql
column list. no-sql
is the mode used when the user had no SQL access so it falls back to the old endpoint. You should set no-sql
to:
[
'Service',
'Type',
'Tier',
'Host',
'Port',
'Current size',
'Max size',
'Usage',
'Detail',
ACTION_COLUMN_LABEL,
];
Also at this point you can inline the allColumns
constant as it's reasons for existence was to set full
and no-sql
to the same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for everything but the Java (I did not review the Java code - only TS + general idea). Thank you for promptly responding to feedback!
@vogievetsky Thanks for the review. Do you know what is going on with the Travis job failure: web console end-to-end test ? |
not sure, will have a look in a bit. |
I found an issue when coordinator starts up with |
Was that what was causing the e2e failures? |
@vogievetsky I believe so, since we run coordinator with asOverlord enabled in our builds. |
What is the status of this PR: is it good to merge if conflicts are resolved? |
@vogievetsky Sorry for the delay, I'll find some time this week to fix up this PR |
@vogievetsky @abhishekagarwal87 @LakshSingla Sorry for the delay in fixing up the conflicts for this PR, but it'd be great if you could take a quick look at this again. |
@@ -314,7 +313,7 @@ public void childEvent(CuratorFramework client, PathChildrenCacheEvent event) th | |||
if (oldBytes == null) { | |||
created = true; | |||
} else if (!Arrays.equals(oldBytes, bytes)) { | |||
throw new IAE("Cannot reannounce different values under the same path"); | |||
log.error("Ignoring attempt to announce different values under same path"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the rationale behind this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the coordinator is run in overlord mode, Announcer.announce() is called twice since the Announcer module is part of the coordinator and overlord lifecycle modules. The second call is a no-op since the existing DiscoveryDruidNode bytes announced at the path is the same as the node bytes in the second call.
With this patch, the start time is now part of DiscoveryDruidNode and so in some cases, there could be a millisecond delay between the two announce calls. This causes the node objects to be different and the second announce call fails due to the validation check.
I couldn't find another scenario where it would be useful to fail the process in this condition and so I'm logging it here instead. Let me know if you have any thoughts on this approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's called 4 times. From my local logs of a run
2023-03-22T13:24:51,527 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"coordinator","services":{}}].
2023-03-22T13:24:51,530 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"overlord","services":{}}].
2023-03-22T13:24:51,530 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"coordinator","services":{}}].
2023-03-22T13:24:51,530 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Announced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"overlord","services":{}}].
I debugged this a bit and you are right about it being called from two modules. I think that we can also skip the duplicate announcement when Overlord is not in standalone mode. That should fix the problem you ran into. And we keep this assert in place. what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this change and it's working fine.
diff --git a/services/src/main/java/org/apache/druid/cli/CliOverlord.java b/services/src/main/java/org/apache/druid/cli/CliOverlord.java
index e4383c673c..79b90e63b5 100644
--- a/services/src/main/java/org/apache/druid/cli/CliOverlord.java
+++ b/services/src/main/java/org/apache/druid/cli/CliOverlord.java
@@ -267,13 +267,13 @@ public class CliOverlord extends ServerRunnable
if (standalone) {
LifecycleModule.register(binder, Server.class);
- }
- bindAnnouncer(
- binder,
- IndexingService.class,
- DiscoverySideEffectsProvider.create()
- );
+ bindAnnouncer(
+ binder,
+ IndexingService.class,
+ DiscoverySideEffectsProvider.create()
+ );
+ }
Jerseys.addResource(binder, SelfDiscoveryResource.class);
LifecycleModule.registerKey(binder, Key.get(SelfDiscoveryResource.class));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that seems like the better solution. I've tested out the changes and it works as expected.
@a2l007 - the PR is ready to go. I just had one question on a change that you made in this PR. |
thank you @a2l007. I merged this PR. This missed the cut for the 26 milestone. if you want this in 26 release, please create a backport PR against 26.0.0 branch. |
Adds a new column start_time to sys.servers that captures the time at which the server was added to the cluster.
Fixes #12090
Description
Adds a new column
start_time
to sys.servers that captures the time at which the server was added to the cluster.Key changed/added classes in this PR
DiscoveryDruidNode
SystemSchema
This PR has: