-
Notifications
You must be signed in to change notification settings - Fork 348
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: summary metrics #6477
feat: summary metrics #6477
Conversation
EXPLAIN ANALYZE output of 1000 trials
|
e228d37
to
6c18e7f
Compare
}, | ||
}, | ||
TotalBatches: 1, | ||
EndTime: time.Now().AddDate(0, 0, -1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: is it past because of the end time here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the query only looks at end time to know if we need to recalculate a trial's summary metrics.
|
||
ALTER TABLE trials | ||
ADD COLUMN IF NOT EXISTS summary_metrics jsonb NOT NULL DEFAULT '{}', | ||
ADD COLUMN IF NOT EXISTS summary_metrics_timestamp timestamptz DEFAULT NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: do we need DEFAULT NULL here? I think the default will be null anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we don't need it, removed it.
ADD COLUMN IF NOT EXISTS summary_metrics_timestamp timestamptz DEFAULT NULL; | ||
|
||
-- Invalidate summary_metrics_timestamp for trials that have a metric added since. | ||
WITH max_training as ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: this sql file is a little difficult to read because it's so long. it might be good to have a couple of comments in the PR or in the file about what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some high level comments of what each CTE does.
m[strconv.Itoa(j)] = rand.Float64() //nolint: gosec | ||
} | ||
|
||
metrics = append(metrics, step{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: why aren't you using AddTrainingMetrics and AddValidationMetrics here? if the goal here is to generate metrics for a trial can we make use of the functions from here: #6222?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a lot quicker to generate with data for the test with this. I was considering deleting this just needed it for a test and wanted to make sure the test database generation was reasonable.
Its quicker since it uses batch inserts, skips batch metrics, skips rbac checks related queries, and also skips archiving checks.
WHEN (metrics->'avg_metrics'->name)::text = '"Infinity"'::text THEN 'number' | ||
WHEN (metrics->'avg_metrics'->name)::text = '"-Infinity"'::text THEN 'number' | ||
WHEN (metrics->'avg_metrics'->name)::text = '"NaN"'::text THEN 'number' | ||
ELSE jsonb_typeof(metrics->'avg_metrics'->name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
came up in testing top trials by metric
old code would sometimes let people get away with report the string "1.0"
and we would cast it to the float 1.0
in some queries.
should this migration and ingestion code treat "1.0"
as 1.0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, I don't think this is a kind of behavior we should support. that doesn't seem to me like something a user would rely upon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
f5c323e
to
e23f9c4
Compare
@ioga should be ready for review now |
master/internal/db/postgres_trial.go
Outdated
"count": 1, | ||
} | ||
} else { | ||
// Check if the metric had a non numeric value in the past. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
q: when would this happen?
math.Max(summaryMetric["max"].(float64), metricValue)) | ||
summaryMetric["sum"] = replaceSpecialFloatsWithString( | ||
summaryMetric["sum"].(float64) + metricValue) | ||
// Go parsing odditity treats JSON whole numbers as floats. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's javascript fault: there're no integer values, only floating point numbers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great work.
@ioga added types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
084de9d
to
7829e2c
Compare
Description
Adds summary metrics migration and ingestion code.
Test Plan
Integration test plus test database generated for benchmarking
Spot test run an experiment and look in the database for
summary_metrics
Commentary (optional)
Test cluster generated with 10k trials each with 25k steps and 2.5k validations reporting 5 different metrics.
This should be a decent overestimate on what we expect.
156 seconds
to compute which metrics will need to be incrementally updated (will run every migration)1.4 seconds
to update 1 single trial13.8 seconds
to update 10 trials3.0 minutes
to update 100 trials39.4 minutes
to update 1000 trials4.2 hours
when the full database updatedQuery plan shown below
Checklist
docs/release-notes/
.See Release Note for details.
Ticket