feat: summary metrics #6477

NicholasBlaskey · 2023-04-05T19:32:25Z

Description

Adds summary metrics migration and ingestion code.

Test Plan

Integration test, plus a test database generated for benchmarking.

Spot test: run an experiment and check the database for summary_metrics:

SELECT summary_metrics FROM trials;

Commentary (optional)

Test cluster generated with 10k trials, each with 25k steps and 2.5k validations, reporting 5 different metrics.

This should be a decent overestimate of what we expect.

156 seconds to compute which metrics need to be incrementally updated (runs on every migration)
1.4 seconds to update a single trial
13.8 seconds to update 10 trials
3.0 minutes to update 100 trials
39.4 minutes to update 1000 trials
4.2 hours to update the full database

Query plan shown below

Checklist

Changes have been manually QA'd
User-facing API changes need the "User-facing API Change" label.
Release notes should be added as a separate file under docs/release-notes/.
See Release Note for details.
Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

NicholasBlaskey · 2023-04-05T19:41:26Z

EXPLAIN ANALYZE output for the 1000-trial update

Update on trials  (cost=105974293190770960.00..105974293190770976.00 rows=200 width=306) (actual time=2361002.548..2361003.329 rows=0 loops=1)
   CTE training_trial_metrics
     ->  GroupAggregate  (cost=105186359243412064.00..105966659514393120.00 rows=40000 width=68) (actual time=1642094.969..1642134.016 rows=5005 loops=1)
           Group Key: typed.trial_id, typed.name
           ->  Sort  (cost=105186359243412064.00..105342419297608176.00 rows=62424021678443808 width=76) (actual time=1642094.867..1642106.566 rows=5005 loops=1)
                 Sort Key: typed.trial_id, typed.name
                 Sort Method: quicksort  Memory: 584kB
                 ->  Subquery Scan on typed  (cost=52423435890008880.00..57261297570088280.00 rows=62424021678443808 width=76) (actual time=1576286.007..1642077.312 rows=5005 loops=1)
                       ->  GroupAggregate  (cost=52423435890008880.00..56637057353303840.00 rows=62424021678443808 width=76) (actual time=1576286.003..1642068.199 rows=5005 loops=1)
                             Group Key: (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text))), (CASE WHEN ((((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))))::text = '"Infinity"'::text) THEN 'number'::text WHEN ((((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))))::text = '"-Infinity"'::text) THEN 'number'::text WHEN ((((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))))::text = '"NaN"'::text) THEN 'number'::text ELSE jsonb_typeof(((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text))))) END), raw_steps.trial_id
                             ->  Sort  (cost=52423435890008880.00..52579495944204992.00 rows=62424021678443808 width=68) (actual time=1576273.313..1618415.467 rows=125125000 loops=1)
                                   Sort Key: (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text))), (CASE WHEN ((((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))))::text = '"Infinity"'::text) THEN 'number'::text WHEN ((((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))))::text = '"-Infinity"'::text) THEN 'number'::text WHEN ((((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))))::text = '"NaN"'::text) THEN 'number'::text ELSE jsonb_typeof(((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text))))) END), raw_steps.trial_id
                                   Sort Method: external merge  Disk: 3183280kB
                                   ->  Nested Loop  (cost=653418029.17..6845371125493768.00 rows=62424021678443808 width=68) (actual time=743501.088..1345399.227 rows=125125000 loops=1)
                                         Join Filter: (CASE WHEN ((((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))))::text = '"Infinity"'::text) THEN 'number'::text WHEN ((((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))))::text = '"-Infinity"'::text) THEN 'number'::text WHEN ((((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))))::text = '"NaN"'::text) THEN 'number'::text ELSE jsonb_typeof(((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text))))) END IS NOT NULL)
                                         ->  Hash Join  (cost=710.53..9913995.84 rows=25047497 width=162) (actual time=304491.255..328435.286 rows=25025000 loops=1)
                                               Hash Cond: (raw_steps.trial_id = trials_1.id)
                                               ->  Seq Scan on raw_steps  (cost=0.00..9256757.16 rows=250000016 width=162) (actual time=0.026..299802.735 rows=250000000 loops=1)
                                                     Filter: (NOT archived)
                                               ->  Hash  (cost=698.01..698.01 rows=1002 width=4) (actual time=3.009..3.010 rows=1002 loops=1)
                                                     Buckets: 1024  Batches: 1  Memory Usage: 44kB
                                                     ->  Seq Scan on trials trials_1  (cost=0.00..698.01 rows=1002 width=4) (actual time=0.010..2.762 rows=1002 loops=1)
                                                           Filter: (summary_metrics_timestamp IS NULL)
                                                           Rows Removed by Filter: 8999
                                         ->  Materialize  (cost=653417318.63..720634625.63 rows=2504749700 width=32) (actual time=0.018..0.019 rows=5 loops=25025000)
                                               ->  Unique  (cost=653417318.63..665941067.13 rows=2504749700 width=32) (actual time=439009.729..466547.242 rows=5 loops=1)
                                                     ->  Sort  (cost=653417318.63..659679192.88 rows=2504749700 width=32) (actual time=439009.723..456243.823 rows=125125000 loops=1)
                                                           Sort Key: (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))
                                                           Sort Method: external merge  Disk: 1469224kB
                                                           ->  ProjectSet  (cost=710.53..22688219.31 rows=2504749700 width=32) (actual time=4.170..332663.530 rows=125125000 loops=1)
                                                                 ->  Hash Join  (cost=710.53..9913995.84 rows=25047497 width=158) (actual time=4.133..300994.375 rows=25025000 loops=1)
                                                                       Hash Cond: (raw_steps_1.trial_id = trials_2.id)
                                                                       ->  Seq Scan on raw_steps raw_steps_1  (cost=0.00..9256757.16 rows=250000016 width=162) (actual time=0.009..276630.486 rows=250000000 loops=1)
                                                                             Filter: (NOT archived)
                                                                       ->  Hash  (cost=698.01..698.01 rows=1002 width=4) (actual time=3.947..3.948 rows=1002 loops=1)
                                                                             Buckets: 1024  Batches: 1  Memory Usage: 44kB
                                                                             ->  Seq Scan on trials trials_2  (cost=0.00..698.01 rows=1002 width=4) (actual time=0.012..3.771 rows=1002 loops=1)
                                                                                   Filter: (summary_metrics_timestamp IS NULL)
                                                                                   Rows Removed by Filter: 8999
   CTE training_numeric_trial_metrics
     ->  CTE Scan on training_trial_metrics  (cost=0.00..800.00 rows=200 width=36) (actual time=1642095.001..1642153.440 rows=5005 loops=1)
           Filter: (nonumbers IS NULL)
   CTE training_trial_metric_aggs
     ->  HashAggregate  (cost=9536260.43..9536660.43 rows=40000 width=68) (actual time=1979318.811..1979321.420 rows=5005 loops=1)
           Group Key: ntm.name, ntm.trial_id, (count(1)), (sum((((raw_steps_2.metrics -> 'avg_metrics'::text) ->> ntm.name))::double precision)), (min((((raw_steps_2.metrics -> 'avg_metrics'::text) ->> ntm.name))::double precision)), (max((((raw_steps_2.metrics -> 'avg_metrics'::text) ->> ntm.name))::double precision))
           ->  Append  (cost=9534458.43..9535660.43 rows=40000 width=68) (actual time=1979302.041..1979307.106 rows=5005 loops=1)
                 ->  HashAggregate  (cost=9534458.43..9534460.43 rows=200 width=68) (actual time=1979301.994..1979305.185 rows=5005 loops=1)
                       Group Key: ntm.name, ntm.trial_id
                       ->  Nested Loop  (cost=0.57..9295861.99 rows=5302143 width=194) (actual time=1642098.707..1754744.385 rows=125125000 loops=1)
                             ->  CTE Scan on training_numeric_trial_metrics ntm  (cost=0.00..4.00 rows=200 width=36) (actual time=1642095.022..1642168.127 rows=5005 loops=1)
                             ->  Index Scan using steps_trial_id_total_batches_run_id_unique on raw_steps raw_steps_2  (cost=0.57..46214.18 rows=26511 width=162) (actual time=0.026..19.431 rows=25000 loops=5005)
                                   Index Cond: (trial_id = ntm.trial_id)
                                   Filter: ((NOT archived) AND (((metrics -> 'avg_metrics'::text) -> ntm.name) IS NOT NULL))
                 ->  CTE Scan on training_trial_metrics training_trial_metrics_1  (cost=0.00..800.00 rows=39800 width=68) (actual time=1.264..1.265 rows=0 loops=1)
                       Filter: (nonumbers IS NOT NULL)
                       Rows Removed by Filter: 5005
   CTE latest_training
     ->  Nested Loop  (cost=19412979.18..20477496.83 rows=12523700 width=68) (actual time=352280.817..368470.283 rows=5005 loops=1)
           ->  Subquery Scan on s  (cost=19412979.17..20227022.83 rows=125237 width=162) (actual time=352280.572..368432.594 rows=1001 loops=1)
                 Filter: (s.rank = 1)
                 Rows Removed by Filter: 25023999
                 ->  WindowAgg  (cost=19412979.17..19913929.11 rows=25047497 width=191) (actual time=352280.523..366661.354 rows=25025000 loops=1)
                       ->  Sort  (cost=19412979.17..19475597.92 rows=25047497 width=170) (actual time=352280.375..357177.956 rows=25025000 loops=1)
                             Sort Key: raw_steps_3.trial_id, raw_steps_3.end_time DESC
                             Sort Method: external merge  Disk: 4520936kB
                             ->  Hash Join  (cost=710.53..9913995.84 rows=25047497 width=170) (actual time=282656.849..294993.415 rows=25025000 loops=1)
                                   Hash Cond: (raw_steps_3.trial_id = trials_3.id)
                                   ->  Seq Scan on raw_steps raw_steps_3  (cost=0.00..9256757.16 rows=250000016 width=170) (actual time=0.094..269736.997 rows=250000000 loops=1)
                                         Filter: (NOT archived)
                                   ->  Hash  (cost=698.01..698.01 rows=1002 width=4) (actual time=30.110..30.111 rows=1002 loops=1)
                                         Buckets: 1024  Batches: 1  Memory Usage: 44kB
                                         ->  Seq Scan on trials trials_3  (cost=0.00..698.01 rows=1002 width=4) (actual time=0.140..29.960 rows=1002 loops=1)
                                               Filter: (summary_metrics_timestamp IS NULL)
                                               Rows Removed by Filter: 8999
           ->  Function Scan on jsonb_each unpacked  (cost=0.01..1.00 rows=100 width=64) (actual time=0.031..0.031 rows=5 loops=1001)
   CTE training_combined_latest_agg
     ->  Merge Full Join  (cost=3271771.90..3522545.90 rows=12523700 width=100) (actual time=2347807.991..2347813.685 rows=5005 loops=1)
           Merge Cond: ((tma.trial_id = lt.trial_id) AND (tma.name = lt.name))
           ->  Sort  (cost=3857.54..3957.54 rows=40000 width=68) (actual time=1979329.294..1979330.089 rows=5005 loops=1)
                 Sort Key: tma.trial_id, tma.name
                 Sort Method: quicksort  Memory: 584kB
                 ->  CTE Scan on training_trial_metric_aggs tma  (cost=0.00..800.00 rows=40000 width=68) (actual time=1979318.937..1979324.809 rows=5005 loops=1)
           ->  Materialize  (cost=3267914.36..3330532.86 rows=12523700 width=68) (actual time=368478.461..368479.438 rows=5005 loops=1)
                 ->  Sort  (cost=3267914.36..3299223.61 rows=12523700 width=68) (actual time=368478.448..368478.807 rows=5005 loops=1)
                       Sort Key: lt.trial_id, lt.name
                       Sort Method: quicksort  Memory: 584kB
                       ->  CTE Scan on latest_training lt  (cost=0.00..250474.00 rows=12523700 width=68) (actual time=352280.821..368474.772 rows=5005 loops=1)
   CTE training_trial_metrics_final
     ->  HashAggregate  (cost=375711.00..375713.00 rows=200 width=36) (actual time=2347870.713..2347871.450 rows=1001 loops=1)
           Group Key: training_combined_latest_agg.trial_id
           ->  CTE Scan on training_combined_latest_agg  (cost=0.00..250474.00 rows=12523700 width=100) (actual time=2347808.016..2347816.381 rows=5005 loops=1)
   CTE validation_trial_metrics
     ->  GroupAggregate  (cost=7555611511387.69..7633642323047.95 rows=40000 width=68) (actual time=9359.126..9363.250 rows=5005 loops=1)
           Group Key: typed_1.trial_id, typed_1.name
           ->  Sort  (cost=7555611511387.69..7571217673619.74 rows=6242464892821 width=76) (actual time=9359.119..9360.194 rows=5005 loops=1)
                 Sort Key: typed_1.trial_id, typed_1.name
                 Sort Method: quicksort  Memory: 584kB
                 ->  Subquery Scan on typed_1  (cost=3803506692631.65..4287297721825.28 rows=6242464892821 width=76) (actual time=8723.062..9356.700 rows=5005 loops=1)
                       ->  GroupAggregate  (cost=3803506692631.65..4224873072897.07 rows=6242464892821 width=76) (actual time=8723.061..9356.030 rows=5005 loops=1)
                             Group Key: (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text))), (CASE WHEN ((((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))))::text = '"Infinity"'::text) THEN 'number'::text WHEN ((((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))))::text = '"-Infinity"'::text) THEN 'number'::text WHEN ((((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))))::text = '"NaN"'::text) THEN 'number'::text ELSE jsonb_typeof(((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text))))) END), raw_validations.trial_id
                             ->  Sort  (cost=3803506692631.65..3819112854863.70 rows=6242464892821 width=68) (actual time=8722.927..9119.791 rows=1251250 loops=1)
                                   Sort Key: (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text))), (CASE WHEN ((((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))))::text = '"Infinity"'::text) THEN 'number'::text WHEN ((((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))))::text = '"-Infinity"'::text) THEN 'number'::text WHEN ((((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))))::text = '"NaN"'::text) THEN 'number'::text ELSE jsonb_typeof(((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text))))) END), raw_validations.trial_id
                                   Sort Method: external merge  Disk: 31904kB
                                   ->  Nested Loop  (cost=5046164.33..684548752558.24 rows=6242464892821 width=68) (actual time=1325.319..7009.356 rows=1251250 loops=1)
                                         Join Filter: (CASE WHEN ((((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))))::text = '"Infinity"'::text) THEN 'number'::text WHEN ((((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))))::text = '"-Infinity"'::text) THEN 'number'::text WHEN ((((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))))::text = '"NaN"'::text) THEN 'number'::text ELSE jsonb_typeof(((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text))))) END IS NOT NULL)
                                         ->  Nested Loop  (cost=0.43..42448.40 rows=250476 width=170) (actual time=3.087..133.964 rows=250250 loops=1)
                                               ->  Seq Scan on trials trials_4  (cost=0.00..698.01 rows=1002 width=4) (actual time=0.023..3.763 rows=1002 loops=1)
                                                     Filter: (summary_metrics_timestamp IS NULL)
                                                     Rows Removed by Filter: 8999
                                               ->  Index Scan using ix_validations_trial_id on raw_validations  (cost=0.43..39.05 rows=262 width=170) (actual time=0.011..0.097 rows=250 loops=1002)
                                                     Index Cond: (trial_id = trials_4.id)
                                                     Filter: (NOT archived)
                                         ->  Materialize  (cost=5046163.90..5718339.90 rows=25047600 width=32) (actual time=0.005..0.007 rows=5 loops=250250)
                                               ->  Unique  (cost=5046163.90..5171401.90 rows=25047600 width=32) (actual time=1322.187..1588.828 rows=5 loops=1)
                                                     ->  Sort  (cost=5046163.90..5108782.90 rows=25047600 width=32) (actual time=1322.184..1488.577 rows=1251250 loops=1)
                                                           Sort Key: (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))
                                                           Sort Method: external merge  Disk: 14728kB
                                                           ->  ProjectSet  (cost=0.43..170191.16 rows=25047600 width=32) (actual time=0.149..783.893 rows=1251250 loops=1)
                                                                 ->  Nested Loop  (cost=0.43..42448.40 rows=250476 width=166) (actual time=0.105..463.345 rows=250250 loops=1)
                                                                       ->  Seq Scan on trials trials_5  (cost=0.00..698.01 rows=1002 width=4) (actual time=0.011..3.855 rows=1002 loops=1)
                                                                             Filter: (summary_metrics_timestamp IS NULL)
                                                                             Rows Removed by Filter: 8999
                                                                       ->  Index Scan using ix_validations_trial_id on raw_validations raw_validations_1  (cost=0.43..39.05 rows=262 width=170) (actual time=0.035..0.429 rows=250 loops=1002)
                                                                             Index Cond: (trial_id = trials_5.id)
                                                                             Filter: (NOT archived)
   CTE validation_numeric_trial_metrics
     ->  CTE Scan on validation_trial_metrics  (cost=0.00..800.00 rows=200 width=36) (actual time=9359.130..9365.224 rows=5005 loops=1)
           Filter: (nonumbers IS NULL)
   CTE validation_trial_metric_aggs
     ->  HashAggregate  (cost=13066.75..13466.75 rows=40000 width=68) (actual time=12337.528..12339.521 rows=5005 loops=1)
           Group Key: ntm_1.name, ntm_1.trial_id, (count(1)), (sum((((raw_validations_2.metrics -> 'validation_metrics'::text) ->> ntm_1.name))::double precision)), (min((((raw_validations_2.metrics -> 'validation_metrics'::text) ->> ntm_1.name))::double precision)), (max((((raw_validations_2.metrics -> 'validation_metrics'::text) ->> ntm_1.name))::double precision))
           ->  Append  (cost=11264.75..12466.75 rows=40000 width=68) (actual time=12332.656..12335.233 rows=5005 loops=1)
                 ->  HashAggregate  (cost=11264.75..11266.75 rows=200 width=68) (actual time=12332.655..12334.335 rows=5005 loops=1)
                       Group Key: ntm_1.name, ntm_1.trial_id
                       ->  Nested Loop  (cost=0.43..8915.07 rows=52215 width=202) (actual time=9359.176..10233.923 rows=1251250 loops=1)
                             ->  CTE Scan on validation_numeric_trial_metrics ntm_1  (cost=0.00..4.00 rows=200 width=36) (actual time=9359.132..9366.810 rows=5005 loops=1)
                             ->  Index Scan using ix_validations_trial_id on raw_validations raw_validations_2  (cost=0.43..41.95 rows=261 width=170) (actual time=0.006..0.144 rows=250 loops=5005)
                                   Index Cond: (trial_id = ntm_1.trial_id)
                                   Filter: ((NOT archived) AND (((metrics -> 'validation_metrics'::text) -> ntm_1.name) IS NOT NULL))
                 ->  CTE Scan on validation_trial_metrics validation_trial_metrics_1  (cost=0.00..800.00 rows=39800 width=68) (actual time=0.527..0.527 rows=0 loops=1)
                       Filter: (nonumbers IS NOT NULL)
                       Rows Removed by Filter: 5005
   CTE latest_validation
     ->  Nested Loop  (cost=87168.98..97813.45 rows=125200 width=68) (actual time=476.201..650.369 rows=5005 loops=1)
           ->  Subquery Scan on s_1  (cost=87168.98..95309.45 rows=1252 width=170) (actual time=476.160..645.191 rows=1001 loops=1)
                 Filter: (s_1.rank = 1)
                 Rows Removed by Filter: 249249
                 ->  WindowAgg  (cost=87168.98..92178.50 rows=250476 width=199) (actual time=476.157..626.806 rows=250250 loops=1)
                       ->  Sort  (cost=87168.98..87795.17 rows=250476 width=178) (actual time=476.131..530.063 rows=250250 loops=1)
                             Sort Key: raw_validations_3.trial_id, raw_validations_3.end_time DESC
                             Sort Method: external merge  Disk: 47176kB
                             ->  Nested Loop  (cost=0.43..42448.40 rows=250476 width=178) (actual time=0.056..117.858 rows=250250 loops=1)
                                   ->  Seq Scan on trials trials_6  (cost=0.00..698.01 rows=1002 width=4) (actual time=0.028..3.139 rows=1002 loops=1)
                                         Filter: (summary_metrics_timestamp IS NULL)
                                         Rows Removed by Filter: 8999
                                   ->  Index Scan using ix_validations_trial_id on raw_validations raw_validations_3  (cost=0.43..39.05 rows=262 width=178) (actual time=0.006..0.083 rows=250 loops=1002)
                                         Index Cond: (trial_id = trials_6.id)
                                         Filter: (NOT archived)
           ->  Function Scan on jsonb_each unpacked_1  (cost=0.01..1.00 rows=100 width=64) (actual time=0.004..0.004 rows=5 loops=1001)
   CTE validation_combined_latest_agg
     ->  Merge Full Join  (cost=22100.15..24904.15 rows=125200 width=100) (actual time=12997.620..13003.246 rows=5005 loops=1)
           Merge Cond: ((tma_1.trial_id = lt_1.trial_id) AND (tma_1.name = lt_1.name))
           ->  Sort  (cost=3857.54..3957.54 rows=40000 width=68) (actual time=12344.086..12344.907 rows=5005 loops=1)
                 Sort Key: tma_1.trial_id, tma_1.name
                 Sort Method: quicksort  Memory: 584kB
                 ->  CTE Scan on validation_trial_metric_aggs tma_1  (cost=0.00..800.00 rows=40000 width=68) (actual time=12337.531..12341.449 rows=5005 loops=1)
           ->  Materialize  (cost=18242.61..18868.61 rows=125200 width=68) (actual time=653.524..654.524 rows=5005 loops=1)
                 ->  Sort  (cost=18242.61..18555.61 rows=125200 width=68) (actual time=653.521..653.882 rows=5005 loops=1)
                       Sort Key: lt_1.trial_id, lt_1.name
                       Sort Method: quicksort  Memory: 584kB
                       ->  CTE Scan on latest_validation lt_1  (cost=0.00..2504.00 rows=125200 width=68) (actual time=476.203..651.711 rows=5005 loops=1)
   CTE validation_trial_metrics_final
     ->  HashAggregate  (cost=3756.00..3758.00 rows=200 width=36) (actual time=13059.239..13059.819 rows=1001 loops=1)
           Group Key: validation_combined_latest_agg.trial_id
           ->  CTE Scan on validation_combined_latest_agg  (cost=0.00..2504.00 rows=125200 width=100) (actual time=12997.621..13005.151 rows=5005 loops=1)
   CTE validation_training_combined_json
     ->  Hash Full Join  (cost=6.50..19.50 rows=200 width=36) (actual time=2360933.376..2360953.042 rows=1001 loops=1)
           Hash Cond: (ttm.trial_id = vtm.trial_id)
           ->  CTE Scan on training_trial_metrics_final ttm  (cost=0.00..4.00 rows=200 width=36) (actual time=2347870.805..2347873.120 rows=1001 loops=1)
           ->  Hash  (cost=4.00..4.00 rows=200 width=36) (actual time=13062.278..13062.278 rows=1001 loops=1)
                 Buckets: 1024  Batches: 1  Memory Usage: 763kB
                 ->  CTE Scan on validation_trial_metrics_final vtm  (cost=0.00..4.00 rows=200 width=36) (actual time=13059.241..13060.237 rows=1001 loops=1)
   ->  Hash Join  (cost=823.02..827.55 rows=200 width=306) (actual time=2360942.414..2360965.699 rows=1001 loops=1)
         Hash Cond: (vtcj.trial_id = trials.id)
         ->  CTE Scan on validation_training_combined_json vtcj  (cost=0.00..4.00 rows=200 width=96) (actual time=2360933.490..2360955.605 rows=1001 loops=1)
         ->  Hash  (cost=698.01..698.01 rows=10001 width=214) (actual time=8.781..8.782 rows=10001 loops=1)
               Buckets: 16384  Batches: 1  Memory Usage: 2519kB
               ->  Seq Scan on trials  (cost=0.00..698.01 rows=10001 width=214) (actual time=0.056..4.834 rows=10001 loops=1)
 Planning time: 17.958 ms
 Execution time: 2362683.854 ms
(190 rows)

nrajanee · 2023-04-06T15:39:56Z

master/internal/db/postgres_checkpoints_intg_test.go

+			},
+		},
+		TotalBatches: 1,
+		EndTime:      time.Now().AddDate(0, 0, -1),


question: is it past because of the end time here?

Yeah the query only looks at end time to know if we need to recalculate a trial's summary metrics.

nrajanee · 2023-04-06T15:50:44Z

master/static/migrations/20230405164440_add-summary-metrics.tx.up.sql

+
+ALTER TABLE trials
+    ADD COLUMN IF NOT EXISTS summary_metrics jsonb NOT NULL DEFAULT '{}',
+    ADD COLUMN IF NOT EXISTS summary_metrics_timestamp timestamptz DEFAULT NULL;


question: do we need DEFAULT NULL here? I think the default will be null anyway?

Yeah we don't need it, removed it.

nrajanee · 2023-04-06T15:59:07Z

master/static/migrations/20230405164440_add-summary-metrics.tx.up.sql

+    ADD COLUMN IF NOT EXISTS summary_metrics_timestamp timestamptz DEFAULT NULL;
+
+-- Invalidate summary_metrics_timestamp for trials that have a metric added since.
+WITH max_training as (


suggestion: this SQL file is a little difficult to read because it's so long. It might be good to have a couple of comments in the PR or in the file about what's going on.

Added some high-level comments on what each CTE does.

nrajanee · 2023-04-06T16:02:16Z

master/internal/db/postgres_checkpoints_intg_test.go

+			m[strconv.Itoa(j)] = rand.Float64() //nolint: gosec
+		}
+
+		metrics = append(metrics, step{


question: why aren't you using AddTrainingMetrics and AddValidationMetrics here? if the goal here is to generate metrics for a trial can we make use of the functions from here: #6222?

It is a lot quicker to generate data for the test this way. I was considering deleting this; I just needed it for a test and wanted to make sure the test database generation was reasonable.

It's quicker since it uses batch inserts, skips batch metrics, skips RBAC-related queries, and also skips archiving checks.
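
To illustrate why batch inserts help here, the sketch below builds one multi-row INSERT statement instead of issuing one statement per step. It is a generic illustration, not the helper used in this PR: `buildBatchInsert` is a hypothetical name, and the `raw_steps` column list in `main` is an assumption based on the column names visible in the query plan.

```go
package main

import (
	"fmt"
	"strings"
)

// buildBatchInsert returns a single multi-row INSERT statement using
// PostgreSQL-style numbered placeholders ($1, $2, ...).
// Hypothetical helper: the table and column names are supplied by the caller.
func buildBatchInsert(table string, cols []string, rows int) string {
	var b strings.Builder
	fmt.Fprintf(&b, "INSERT INTO %s (%s) VALUES ", table, strings.Join(cols, ", "))
	n := 1 // next placeholder number
	for r := 0; r < rows; r++ {
		if r > 0 {
			b.WriteString(", ")
		}
		b.WriteString("(")
		for c := range cols {
			if c > 0 {
				b.WriteString(", ")
			}
			fmt.Fprintf(&b, "$%d", n)
			n++
		}
		b.WriteString(")")
	}
	return b.String()
}

func main() {
	// Assumed column names, for illustration only.
	fmt.Println(buildBatchInsert("raw_steps", []string{"trial_id", "metrics"}, 2))
}
```

One statement carrying many rows avoids a round trip per row, which is presumably where most of the speedup in test data generation comes from.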

NicholasBlaskey · 2023-04-07T14:33:54Z

master/static/migrations/20230405164440_add-summary-metrics.tx.up.sql

+        WHEN (metrics->'avg_metrics'->name)::text = '"Infinity"'::text THEN 'number'
+        WHEN (metrics->'avg_metrics'->name)::text = '"-Infinity"'::text THEN 'number'
+        WHEN (metrics->'avg_metrics'->name)::text = '"NaN"'::text THEN 'number'
+        ELSE jsonb_typeof(metrics->'avg_metrics'->name)


This came up while testing top trials by metric.

The old code would sometimes let people get away with reporting the string "1.0", and we would cast it to the float 1.0 in some queries.

Should this migration and ingestion code treat "1.0" as 1.0?

no, I don't think this is a kind of behavior we should support. that doesn't seem to me like something a user would rely upon.
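
The classification rule discussed in this thread can be sketched in Go as follows: the strings "Infinity", "-Infinity", and "NaN" count as numbers, while any other string, including "1.0", stays a string. This is a minimal illustration of the CASE expression quoted above, assuming metrics have been decoded with `encoding/json` into `any` values; `metricType` is a hypothetical name and not the ingestion code from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metricType mirrors the CASE expression in the migration: the strings
// "Infinity", "-Infinity" and "NaN" are treated as numbers; every other
// value gets the type name jsonb_typeof reports for that kind of JSON value.
// Hypothetical helper for illustration only.
func metricType(v any) string {
	switch x := v.(type) {
	case nil:
		return "null"
	case bool:
		return "boolean"
	case float64:
		return "number"
	case string:
		switch x {
		case "Infinity", "-Infinity", "NaN":
			return "number"
		}
		// A numeric-looking string such as "1.0" is deliberately not a number.
		return "string"
	case []any:
		return "array"
	case map[string]any:
		return "object"
	default:
		return "unknown"
	}
}

func main() {
	// Made-up metric names and values, used only to exercise the function.
	raw := []byte(`{"loss": 0.25, "acc": "1.0", "grad_norm": "Infinity"}`)
	var metrics map[string]any
	if err := json.Unmarshal(raw, &metrics); err != nil {
		panic(err)
	}
	for _, name := range []string{"loss", "acc", "grad_norm"} {
		fmt.Println(name, metricType(metrics[name]))
	}
}
```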

nrajanee

lgtm

NicholasBlaskey · 2023-04-12T20:40:39Z

@ioga should be ready for review now

ioga · 2023-04-14T19:02:53Z

master/internal/db/postgres_trial.go

+				"count": 1,
+			}
+		} else {
+			// Check if the metric had a non numeric value in the past.


q: when would this happen?

ioga · 2023-04-14T19:04:22Z

master/internal/db/postgres_trial.go

+				math.Max(summaryMetric["max"].(float64), metricValue))
+			summaryMetric["sum"] = replaceSpecialFloatsWithString(
+				summaryMetric["sum"].(float64) + metricValue)
+			// Go parsing odditity treats JSON whole numbers as floats.


it's JavaScript's fault: there are no integer values, only floating-point numbers

ioga

great work.

NicholasBlaskey · 2023-04-20T16:30:29Z

@ioga added types

ioga

lgtm

cla-bot bot added the cla-signed label Apr 5, 2023

NicholasBlaskey force-pushed the unified_metrics_migration branch 2 times, most recently from e228d37 to 6c18e7f Compare April 5, 2023 21:01

NicholasBlaskey requested review from ioga, hamidzr and nrajanee April 5, 2023 21:01

NicholasBlaskey changed the title ~~Unified metrics migration~~ summary metrics migration Apr 6, 2023

nrajanee reviewed Apr 6, 2023

NicholasBlaskey commented Apr 7, 2023

nrajanee reviewed Apr 7, 2023

NicholasBlaskey force-pushed the unified_metrics_migration branch from f5c323e to e23f9c4 Compare April 12, 2023 20:36

NicholasBlaskey changed the title ~~summary metrics migration~~ feat: summary metrics Apr 12, 2023

NicholasBlaskey changed the title ~~feat: summary metrics~~ feat: summary metrics Apr 12, 2023

NicholasBlaskey marked this pull request as ready for review April 12, 2023 20:40

NicholasBlaskey requested a review from a team as a code owner April 12, 2023 20:40

ioga approved these changes Apr 14, 2023

gt2345 mentioned this pull request Apr 15, 2023

feat: Support order for experiment search API #6479

Merged

4 tasks

NicholasBlaskey requested a review from ioga April 20, 2023 16:30

ioga approved these changes Apr 24, 2023

NicholasBlaskey added 3 commits April 25, 2023 09:59

feat: summary metrics

591407b

bumped to top

022ef90

adds types

7829e2c

NicholasBlaskey force-pushed the unified_metrics_migration branch from 084de9d to 7829e2c Compare April 25, 2023 13:59

NicholasBlaskey and others added 2 commits April 25, 2023 10:02

bump to top

83237d5

Merge branch 'main' into unified_metrics_migration

5f7f852

NicholasBlaskey merged commit 61a804a into determined-ai:main Apr 25, 2023

tayritenour pushed a commit to tayritenour/determined that referenced this pull request Apr 25, 2023

feat: summary metrics (determined-ai#6477)

43d62ca

dannysauer added this to the 0.22.0 milestone Feb 6, 2024
