
feat: summary metrics #6477

Merged

Contributor

@NicholasBlaskey NicholasBlaskey commented Apr 5, 2023

Description

Adds summary metrics migration and ingestion code.

Test Plan

Integration test, plus a test database generated for benchmarking.

Spot test: run an experiment, then look in the database for summary_metrics:

SELECT summary_metrics FROM trials;

Commentary (optional)

Test cluster generated with 10k trials, each with 25k steps and 2.5k validations, reporting 5 different metrics.

This should be a decent overestimate of what we expect.

  • 156 seconds to compute which metrics need to be incrementally updated (this step runs on every migration)
  • 1.4 seconds to update a single trial
  • 13.8 seconds to update 10 trials
  • 3.0 minutes to update 100 trials
  • 39.4 minutes to update 1000 trials
  • 4.2 hours to update the full database
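
For a rough sense of how the update scales, the totals above can be divided by the batch size. The Go sketch below does only that arithmetic; the helper name `perTrialSeconds` is illustrative and not part of this PR.

```go
package main

import "fmt"

// perTrialSeconds returns the average update time per trial, in seconds.
// Hypothetical helper for illustration only.
func perTrialSeconds(trials int, totalSeconds float64) float64 {
	return totalSeconds / float64(trials)
}

func main() {
	// Batch sizes and wall-clock totals copied from the list above.
	rows := []struct {
		trials int
		total  float64 // seconds
	}{
		{1, 1.4},
		{10, 13.8},
		{100, 3.0 * 60},
		{1000, 39.4 * 60},
		{10000, 4.2 * 3600},
	}
	for _, r := range rows {
		fmt.Printf("%5d trials: %.2f s per trial\n", r.trials, perTrialSeconds(r.trials, r.total))
	}
}
```

The per-trial cost stays within roughly 1.4 to 2.4 seconds across all batch sizes, so the total time grows approximately linearly with the number of trials.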

Query plan shown below

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

@cla-bot cla-bot bot added the cla-signed label Apr 5, 2023
@NicholasBlaskey
Contributor Author

EXPLAIN ANALYZE output for 1000 trials

Update on trials  (cost=105974293190770960.00..105974293190770976.00 rows=200 width=306) (actual time=2361002.548..2361003.329 rows=0 loops=1)
   CTE training_trial_metrics
     ->  GroupAggregate  (cost=105186359243412064.00..105966659514393120.00 rows=40000 width=68) (actual time=1642094.969..1642134.016 rows=5005 loops=1)
           Group Key: typed.trial_id, typed.name
           ->  Sort  (cost=105186359243412064.00..105342419297608176.00 rows=62424021678443808 width=76) (actual time=1642094.867..1642106.566 rows=5005 loops=1)
                 Sort Key: typed.trial_id, typed.name
                 Sort Method: quicksort  Memory: 584kB
                 ->  Subquery Scan on typed  (cost=52423435890008880.00..57261297570088280.00 rows=62424021678443808 width=76) (actual time=1576286.007..1642077.312 rows=5005 loops=1)
                       ->  GroupAggregate  (cost=52423435890008880.00..56637057353303840.00 rows=62424021678443808 width=76) (actual time=1576286.003..1642068.199 rows=5005 loops=1)
                             Group Key: (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text))), (CASE WHEN ((((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))))::text = '"Infinity"'::text) THEN 'number'::text WHEN ((((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))))::text = '"-Infinity"'::text) THEN 'number'::text WHEN ((((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))))::text = '"NaN"'::text) THEN 'number'::text ELSE jsonb_typeof(((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text))))) END), raw_steps.trial_id
                             ->  Sort  (cost=52423435890008880.00..52579495944204992.00 rows=62424021678443808 width=68) (actual time=1576273.313..1618415.467 rows=125125000 loops=1)
                                   Sort Key: (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text))), (CASE WHEN ((((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))))::text = '"Infinity"'::text) THEN 'number'::text WHEN ((((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))))::text = '"-Infinity"'::text) THEN 'number'::text WHEN ((((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))))::text = '"NaN"'::text) THEN 'number'::text ELSE jsonb_typeof(((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text))))) END), raw_steps.trial_id
                                   Sort Method: external merge  Disk: 3183280kB
                                   ->  Nested Loop  (cost=653418029.17..6845371125493768.00 rows=62424021678443808 width=68) (actual time=743501.088..1345399.227 rows=125125000 loops=1)
                                         Join Filter: (CASE WHEN ((((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))))::text = '"Infinity"'::text) THEN 'number'::text WHEN ((((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))))::text = '"-Infinity"'::text) THEN 'number'::text WHEN ((((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))))::text = '"NaN"'::text) THEN 'number'::text ELSE jsonb_typeof(((raw_steps.metrics -> 'avg_metrics'::text) -> (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text))))) END IS NOT NULL)
                                         ->  Hash Join  (cost=710.53..9913995.84 rows=25047497 width=162) (actual time=304491.255..328435.286 rows=25025000 loops=1)
                                               Hash Cond: (raw_steps.trial_id = trials_1.id)
                                               ->  Seq Scan on raw_steps  (cost=0.00..9256757.16 rows=250000016 width=162) (actual time=0.026..299802.735 rows=250000000 loops=1)
                                                     Filter: (NOT archived)
                                               ->  Hash  (cost=698.01..698.01 rows=1002 width=4) (actual time=3.009..3.010 rows=1002 loops=1)
                                                     Buckets: 1024  Batches: 1  Memory Usage: 44kB
                                                     ->  Seq Scan on trials trials_1  (cost=0.00..698.01 rows=1002 width=4) (actual time=0.010..2.762 rows=1002 loops=1)
                                                           Filter: (summary_metrics_timestamp IS NULL)
                                                           Rows Removed by Filter: 8999
                                         ->  Materialize  (cost=653417318.63..720634625.63 rows=2504749700 width=32) (actual time=0.018..0.019 rows=5 loops=25025000)
                                               ->  Unique  (cost=653417318.63..665941067.13 rows=2504749700 width=32) (actual time=439009.729..466547.242 rows=5 loops=1)
                                                     ->  Sort  (cost=653417318.63..659679192.88 rows=2504749700 width=32) (actual time=439009.723..456243.823 rows=125125000 loops=1)
                                                           Sort Key: (jsonb_object_keys((raw_steps_1.metrics -> 'avg_metrics'::text)))
                                                           Sort Method: external merge  Disk: 1469224kB
                                                           ->  ProjectSet  (cost=710.53..22688219.31 rows=2504749700 width=32) (actual time=4.170..332663.530 rows=125125000 loops=1)
                                                                 ->  Hash Join  (cost=710.53..9913995.84 rows=25047497 width=158) (actual time=4.133..300994.375 rows=25025000 loops=1)
                                                                       Hash Cond: (raw_steps_1.trial_id = trials_2.id)
                                                                       ->  Seq Scan on raw_steps raw_steps_1  (cost=0.00..9256757.16 rows=250000016 width=162) (actual time=0.009..276630.486 rows=250000000 loops=1)
                                                                             Filter: (NOT archived)
                                                                       ->  Hash  (cost=698.01..698.01 rows=1002 width=4) (actual time=3.947..3.948 rows=1002 loops=1)
                                                                             Buckets: 1024  Batches: 1  Memory Usage: 44kB
                                                                             ->  Seq Scan on trials trials_2  (cost=0.00..698.01 rows=1002 width=4) (actual time=0.012..3.771 rows=1002 loops=1)
                                                                                   Filter: (summary_metrics_timestamp IS NULL)
                                                                                   Rows Removed by Filter: 8999
   CTE training_numeric_trial_metrics
     ->  CTE Scan on training_trial_metrics  (cost=0.00..800.00 rows=200 width=36) (actual time=1642095.001..1642153.440 rows=5005 loops=1)
           Filter: (nonumbers IS NULL)
   CTE training_trial_metric_aggs
     ->  HashAggregate  (cost=9536260.43..9536660.43 rows=40000 width=68) (actual time=1979318.811..1979321.420 rows=5005 loops=1)
           Group Key: ntm.name, ntm.trial_id, (count(1)), (sum((((raw_steps_2.metrics -> 'avg_metrics'::text) ->> ntm.name))::double precision)), (min((((raw_steps_2.metrics -> 'avg_metrics'::text) ->> ntm.name))::double precision)), (max((((raw_steps_2.metrics -> 'avg_metrics'::text) ->> ntm.name))::double precision))
           ->  Append  (cost=9534458.43..9535660.43 rows=40000 width=68) (actual time=1979302.041..1979307.106 rows=5005 loops=1)
                 ->  HashAggregate  (cost=9534458.43..9534460.43 rows=200 width=68) (actual time=1979301.994..1979305.185 rows=5005 loops=1)
                       Group Key: ntm.name, ntm.trial_id
                       ->  Nested Loop  (cost=0.57..9295861.99 rows=5302143 width=194) (actual time=1642098.707..1754744.385 rows=125125000 loops=1)
                             ->  CTE Scan on training_numeric_trial_metrics ntm  (cost=0.00..4.00 rows=200 width=36) (actual time=1642095.022..1642168.127 rows=5005 loops=1)
                             ->  Index Scan using steps_trial_id_total_batches_run_id_unique on raw_steps raw_steps_2  (cost=0.57..46214.18 rows=26511 width=162) (actual time=0.026..19.431 rows=25000 loops=5005)
                                   Index Cond: (trial_id = ntm.trial_id)
                                   Filter: ((NOT archived) AND (((metrics -> 'avg_metrics'::text) -> ntm.name) IS NOT NULL))
                 ->  CTE Scan on training_trial_metrics training_trial_metrics_1  (cost=0.00..800.00 rows=39800 width=68) (actual time=1.264..1.265 rows=0 loops=1)
                       Filter: (nonumbers IS NOT NULL)
                       Rows Removed by Filter: 5005
   CTE latest_training
     ->  Nested Loop  (cost=19412979.18..20477496.83 rows=12523700 width=68) (actual time=352280.817..368470.283 rows=5005 loops=1)
           ->  Subquery Scan on s  (cost=19412979.17..20227022.83 rows=125237 width=162) (actual time=352280.572..368432.594 rows=1001 loops=1)
                 Filter: (s.rank = 1)
                 Rows Removed by Filter: 25023999
                 ->  WindowAgg  (cost=19412979.17..19913929.11 rows=25047497 width=191) (actual time=352280.523..366661.354 rows=25025000 loops=1)
                       ->  Sort  (cost=19412979.17..19475597.92 rows=25047497 width=170) (actual time=352280.375..357177.956 rows=25025000 loops=1)
                             Sort Key: raw_steps_3.trial_id, raw_steps_3.end_time DESC
                             Sort Method: external merge  Disk: 4520936kB
                             ->  Hash Join  (cost=710.53..9913995.84 rows=25047497 width=170) (actual time=282656.849..294993.415 rows=25025000 loops=1)
                                   Hash Cond: (raw_steps_3.trial_id = trials_3.id)
                                   ->  Seq Scan on raw_steps raw_steps_3  (cost=0.00..9256757.16 rows=250000016 width=170) (actual time=0.094..269736.997 rows=250000000 loops=1)
                                         Filter: (NOT archived)
                                   ->  Hash  (cost=698.01..698.01 rows=1002 width=4) (actual time=30.110..30.111 rows=1002 loops=1)
                                         Buckets: 1024  Batches: 1  Memory Usage: 44kB
                                         ->  Seq Scan on trials trials_3  (cost=0.00..698.01 rows=1002 width=4) (actual time=0.140..29.960 rows=1002 loops=1)
                                               Filter: (summary_metrics_timestamp IS NULL)
                                               Rows Removed by Filter: 8999
           ->  Function Scan on jsonb_each unpacked  (cost=0.01..1.00 rows=100 width=64) (actual time=0.031..0.031 rows=5 loops=1001)
   CTE training_combined_latest_agg
     ->  Merge Full Join  (cost=3271771.90..3522545.90 rows=12523700 width=100) (actual time=2347807.991..2347813.685 rows=5005 loops=1)
           Merge Cond: ((tma.trial_id = lt.trial_id) AND (tma.name = lt.name))
           ->  Sort  (cost=3857.54..3957.54 rows=40000 width=68) (actual time=1979329.294..1979330.089 rows=5005 loops=1)
                 Sort Key: tma.trial_id, tma.name
                 Sort Method: quicksort  Memory: 584kB
                 ->  CTE Scan on training_trial_metric_aggs tma  (cost=0.00..800.00 rows=40000 width=68) (actual time=1979318.937..1979324.809 rows=5005 loops=1)
           ->  Materialize  (cost=3267914.36..3330532.86 rows=12523700 width=68) (actual time=368478.461..368479.438 rows=5005 loops=1)
                 ->  Sort  (cost=3267914.36..3299223.61 rows=12523700 width=68) (actual time=368478.448..368478.807 rows=5005 loops=1)
                       Sort Key: lt.trial_id, lt.name
                       Sort Method: quicksort  Memory: 584kB
                       ->  CTE Scan on latest_training lt  (cost=0.00..250474.00 rows=12523700 width=68) (actual time=352280.821..368474.772 rows=5005 loops=1)
   CTE training_trial_metrics_final
     ->  HashAggregate  (cost=375711.00..375713.00 rows=200 width=36) (actual time=2347870.713..2347871.450 rows=1001 loops=1)
           Group Key: training_combined_latest_agg.trial_id
           ->  CTE Scan on training_combined_latest_agg  (cost=0.00..250474.00 rows=12523700 width=100) (actual time=2347808.016..2347816.381 rows=5005 loops=1)
   CTE validation_trial_metrics
     ->  GroupAggregate  (cost=7555611511387.69..7633642323047.95 rows=40000 width=68) (actual time=9359.126..9363.250 rows=5005 loops=1)
           Group Key: typed_1.trial_id, typed_1.name
           ->  Sort  (cost=7555611511387.69..7571217673619.74 rows=6242464892821 width=76) (actual time=9359.119..9360.194 rows=5005 loops=1)
                 Sort Key: typed_1.trial_id, typed_1.name
                 Sort Method: quicksort  Memory: 584kB
                 ->  Subquery Scan on typed_1  (cost=3803506692631.65..4287297721825.28 rows=6242464892821 width=76) (actual time=8723.062..9356.700 rows=5005 loops=1)
                       ->  GroupAggregate  (cost=3803506692631.65..4224873072897.07 rows=6242464892821 width=76) (actual time=8723.061..9356.030 rows=5005 loops=1)
                             Group Key: (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text))), (CASE WHEN ((((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))))::text = '"Infinity"'::text) THEN 'number'::text WHEN ((((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))))::text = '"-Infinity"'::text) THEN 'number'::text WHEN ((((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))))::text = '"NaN"'::text) THEN 'number'::text ELSE jsonb_typeof(((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text))))) END), raw_validations.trial_id
                             ->  Sort  (cost=3803506692631.65..3819112854863.70 rows=6242464892821 width=68) (actual time=8722.927..9119.791 rows=1251250 loops=1)
                                   Sort Key: (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text))), (CASE WHEN ((((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))))::text = '"Infinity"'::text) THEN 'number'::text WHEN ((((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))))::text = '"-Infinity"'::text) THEN 'number'::text WHEN ((((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))))::text = '"NaN"'::text) THEN 'number'::text ELSE jsonb_typeof(((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text))))) END), raw_validations.trial_id
                                   Sort Method: external merge  Disk: 31904kB
                                   ->  Nested Loop  (cost=5046164.33..684548752558.24 rows=6242464892821 width=68) (actual time=1325.319..7009.356 rows=1251250 loops=1)
                                         Join Filter: (CASE WHEN ((((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))))::text = '"Infinity"'::text) THEN 'number'::text WHEN ((((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))))::text = '"-Infinity"'::text) THEN 'number'::text WHEN ((((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))))::text = '"NaN"'::text) THEN 'number'::text ELSE jsonb_typeof(((raw_validations.metrics -> 'validation_metrics'::text) -> (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text))))) END IS NOT NULL)
                                         ->  Nested Loop  (cost=0.43..42448.40 rows=250476 width=170) (actual time=3.087..133.964 rows=250250 loops=1)
                                               ->  Seq Scan on trials trials_4  (cost=0.00..698.01 rows=1002 width=4) (actual time=0.023..3.763 rows=1002 loops=1)
                                                     Filter: (summary_metrics_timestamp IS NULL)
                                                     Rows Removed by Filter: 8999
                                               ->  Index Scan using ix_validations_trial_id on raw_validations  (cost=0.43..39.05 rows=262 width=170) (actual time=0.011..0.097 rows=250 loops=1002)
                                                     Index Cond: (trial_id = trials_4.id)
                                                     Filter: (NOT archived)
                                         ->  Materialize  (cost=5046163.90..5718339.90 rows=25047600 width=32) (actual time=0.005..0.007 rows=5 loops=250250)
                                               ->  Unique  (cost=5046163.90..5171401.90 rows=25047600 width=32) (actual time=1322.187..1588.828 rows=5 loops=1)
                                                     ->  Sort  (cost=5046163.90..5108782.90 rows=25047600 width=32) (actual time=1322.184..1488.577 rows=1251250 loops=1)
                                                           Sort Key: (jsonb_object_keys((raw_validations_1.metrics -> 'validation_metrics'::text)))
                                                           Sort Method: external merge  Disk: 14728kB
                                                           ->  ProjectSet  (cost=0.43..170191.16 rows=25047600 width=32) (actual time=0.149..783.893 rows=1251250 loops=1)
                                                                 ->  Nested Loop  (cost=0.43..42448.40 rows=250476 width=166) (actual time=0.105..463.345 rows=250250 loops=1)
                                                                       ->  Seq Scan on trials trials_5  (cost=0.00..698.01 rows=1002 width=4) (actual time=0.011..3.855 rows=1002 loops=1)
                                                                             Filter: (summary_metrics_timestamp IS NULL)
                                                                             Rows Removed by Filter: 8999
                                                                       ->  Index Scan using ix_validations_trial_id on raw_validations raw_validations_1  (cost=0.43..39.05 rows=262 width=170) (actual time=0.035..0.429 rows=250 loops=1002)
                                                                             Index Cond: (trial_id = trials_5.id)
                                                                             Filter: (NOT archived)
   CTE validation_numeric_trial_metrics
     ->  CTE Scan on validation_trial_metrics  (cost=0.00..800.00 rows=200 width=36) (actual time=9359.130..9365.224 rows=5005 loops=1)
           Filter: (nonumbers IS NULL)
   CTE validation_trial_metric_aggs
     ->  HashAggregate  (cost=13066.75..13466.75 rows=40000 width=68) (actual time=12337.528..12339.521 rows=5005 loops=1)
           Group Key: ntm_1.name, ntm_1.trial_id, (count(1)), (sum((((raw_validations_2.metrics -> 'validation_metrics'::text) ->> ntm_1.name))::double precision)), (min((((raw_validations_2.metrics -> 'validation_metrics'::text) ->> ntm_1.name))::double precision)), (max((((raw_validations_2.metrics -> 'validation_metrics'::text) ->> ntm_1.name))::double precision))
           ->  Append  (cost=11264.75..12466.75 rows=40000 width=68) (actual time=12332.656..12335.233 rows=5005 loops=1)
                 ->  HashAggregate  (cost=11264.75..11266.75 rows=200 width=68) (actual time=12332.655..12334.335 rows=5005 loops=1)
                       Group Key: ntm_1.name, ntm_1.trial_id
                       ->  Nested Loop  (cost=0.43..8915.07 rows=52215 width=202) (actual time=9359.176..10233.923 rows=1251250 loops=1)
                             ->  CTE Scan on validation_numeric_trial_metrics ntm_1  (cost=0.00..4.00 rows=200 width=36) (actual time=9359.132..9366.810 rows=5005 loops=1)
                             ->  Index Scan using ix_validations_trial_id on raw_validations raw_validations_2  (cost=0.43..41.95 rows=261 width=170) (actual time=0.006..0.144 rows=250 loops=5005)
                                   Index Cond: (trial_id = ntm_1.trial_id)
                                   Filter: ((NOT archived) AND (((metrics -> 'validation_metrics'::text) -> ntm_1.name) IS NOT NULL))
                 ->  CTE Scan on validation_trial_metrics validation_trial_metrics_1  (cost=0.00..800.00 rows=39800 width=68) (actual time=0.527..0.527 rows=0 loops=1)
                       Filter: (nonumbers IS NOT NULL)
                       Rows Removed by Filter: 5005
   CTE latest_validation
     ->  Nested Loop  (cost=87168.98..97813.45 rows=125200 width=68) (actual time=476.201..650.369 rows=5005 loops=1)
           ->  Subquery Scan on s_1  (cost=87168.98..95309.45 rows=1252 width=170) (actual time=476.160..645.191 rows=1001 loops=1)
                 Filter: (s_1.rank = 1)
                 Rows Removed by Filter: 249249
                 ->  WindowAgg  (cost=87168.98..92178.50 rows=250476 width=199) (actual time=476.157..626.806 rows=250250 loops=1)
                       ->  Sort  (cost=87168.98..87795.17 rows=250476 width=178) (actual time=476.131..530.063 rows=250250 loops=1)
                             Sort Key: raw_validations_3.trial_id, raw_validations_3.end_time DESC
                             Sort Method: external merge  Disk: 47176kB
                             ->  Nested Loop  (cost=0.43..42448.40 rows=250476 width=178) (actual time=0.056..117.858 rows=250250 loops=1)
                                   ->  Seq Scan on trials trials_6  (cost=0.00..698.01 rows=1002 width=4) (actual time=0.028..3.139 rows=1002 loops=1)
                                         Filter: (summary_metrics_timestamp IS NULL)
                                         Rows Removed by Filter: 8999
                                   ->  Index Scan using ix_validations_trial_id on raw_validations raw_validations_3  (cost=0.43..39.05 rows=262 width=178) (actual time=0.006..0.083 rows=250 loops=1002)
                                         Index Cond: (trial_id = trials_6.id)
                                         Filter: (NOT archived)
           ->  Function Scan on jsonb_each unpacked_1  (cost=0.01..1.00 rows=100 width=64) (actual time=0.004..0.004 rows=5 loops=1001)
   CTE validation_combined_latest_agg
     ->  Merge Full Join  (cost=22100.15..24904.15 rows=125200 width=100) (actual time=12997.620..13003.246 rows=5005 loops=1)
           Merge Cond: ((tma_1.trial_id = lt_1.trial_id) AND (tma_1.name = lt_1.name))
           ->  Sort  (cost=3857.54..3957.54 rows=40000 width=68) (actual time=12344.086..12344.907 rows=5005 loops=1)
                 Sort Key: tma_1.trial_id, tma_1.name
                 Sort Method: quicksort  Memory: 584kB
                 ->  CTE Scan on validation_trial_metric_aggs tma_1  (cost=0.00..800.00 rows=40000 width=68) (actual time=12337.531..12341.449 rows=5005 loops=1)
           ->  Materialize  (cost=18242.61..18868.61 rows=125200 width=68) (actual time=653.524..654.524 rows=5005 loops=1)
                 ->  Sort  (cost=18242.61..18555.61 rows=125200 width=68) (actual time=653.521..653.882 rows=5005 loops=1)
                       Sort Key: lt_1.trial_id, lt_1.name
                       Sort Method: quicksort  Memory: 584kB
                       ->  CTE Scan on latest_validation lt_1  (cost=0.00..2504.00 rows=125200 width=68) (actual time=476.203..651.711 rows=5005 loops=1)
   CTE validation_trial_metrics_final
     ->  HashAggregate  (cost=3756.00..3758.00 rows=200 width=36) (actual time=13059.239..13059.819 rows=1001 loops=1)
           Group Key: validation_combined_latest_agg.trial_id
           ->  CTE Scan on validation_combined_latest_agg  (cost=0.00..2504.00 rows=125200 width=100) (actual time=12997.621..13005.151 rows=5005 loops=1)
   CTE validation_training_combined_json
     ->  Hash Full Join  (cost=6.50..19.50 rows=200 width=36) (actual time=2360933.376..2360953.042 rows=1001 loops=1)
           Hash Cond: (ttm.trial_id = vtm.trial_id)
           ->  CTE Scan on training_trial_metrics_final ttm  (cost=0.00..4.00 rows=200 width=36) (actual time=2347870.805..2347873.120 rows=1001 loops=1)
           ->  Hash  (cost=4.00..4.00 rows=200 width=36) (actual time=13062.278..13062.278 rows=1001 loops=1)
                 Buckets: 1024  Batches: 1  Memory Usage: 763kB
                 ->  CTE Scan on validation_trial_metrics_final vtm  (cost=0.00..4.00 rows=200 width=36) (actual time=13059.241..13060.237 rows=1001 loops=1)
   ->  Hash Join  (cost=823.02..827.55 rows=200 width=306) (actual time=2360942.414..2360965.699 rows=1001 loops=1)
         Hash Cond: (vtcj.trial_id = trials.id)
         ->  CTE Scan on validation_training_combined_json vtcj  (cost=0.00..4.00 rows=200 width=96) (actual time=2360933.490..2360955.605 rows=1001 loops=1)
         ->  Hash  (cost=698.01..698.01 rows=10001 width=214) (actual time=8.781..8.782 rows=10001 loops=1)
               Buckets: 16384  Batches: 1  Memory Usage: 2519kB
               ->  Seq Scan on trials  (cost=0.00..698.01 rows=10001 width=214) (actual time=0.056..4.834 rows=10001 loops=1)
 Planning time: 17.958 ms
 Execution time: 2362683.854 ms
(190 rows)

@NicholasBlaskey NicholasBlaskey force-pushed the unified_metrics_migration branch 2 times, most recently from e228d37 to 6c18e7f Compare April 5, 2023 21:01
@NicholasBlaskey NicholasBlaskey changed the title from "Unified metrics migration" to "summary metrics migration" Apr 6, 2023
},
},
TotalBatches: 1,
EndTime: time.Now().AddDate(0, 0, -1),
Contributor


question: is it past because of the end time here?

Contributor Author


Yeah, the query only looks at end time to decide whether a trial's summary metrics need to be recalculated.
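
As a rough illustration of that rule (not the migration's actual SQL), the staleness check can be sketched in Go. The function `needsRecompute`, its parameters, and the `demo` helper are hypothetical names; the sketch assumes a NULL `summary_metrics_timestamp` means "not yet computed or invalidated".

```go
package main

import (
	"fmt"
	"time"
)

// needsRecompute reports whether a trial's summary metrics are stale.
// summaryTS is nil when summary_metrics_timestamp is NULL (assumed to mean
// never computed or invalidated); latestEnd is the most recent end_time
// among the trial's reported metrics.
func needsRecompute(summaryTS *time.Time, latestEnd time.Time) bool {
	if summaryTS == nil {
		return true
	}
	return latestEnd.After(*summaryTS)
}

// demo evaluates three illustrative cases against fixed timestamps.
func demo() [3]bool {
	now := time.Date(2023, time.April, 6, 12, 0, 0, 0, time.UTC)
	yesterday := now.AddDate(0, 0, -1)
	return [3]bool{
		needsRecompute(nil, yesterday),  // never computed
		needsRecompute(&now, yesterday), // metric is older than the summary
		needsRecompute(&yesterday, now), // metric is newer than the summary
	}
}

func main() {
	fmt.Println(demo())
}
```

A metric whose end time is a day in the past, as in the test above, falls into the second case when the summary was computed later.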


ALTER TABLE trials
ADD COLUMN IF NOT EXISTS summary_metrics jsonb NOT NULL DEFAULT '{}',
ADD COLUMN IF NOT EXISTS summary_metrics_timestamp timestamptz DEFAULT NULL;
Contributor


question: do we need DEFAULT NULL here? I think the default will be null anyway?

Contributor Author


Yeah, we don't need it; removed it.

ADD COLUMN IF NOT EXISTS summary_metrics_timestamp timestamptz DEFAULT NULL;

-- Invalidate summary_metrics_timestamp for trials that have a metric added since.
WITH max_training as (
Contributor


suggestion: this SQL file is a little difficult to read because it's so long. It might be good to have a couple of comments in the PR or in the file about what's going on.

Contributor Author


Added some high-level comments on what each CTE does.

m[strconv.Itoa(j)] = rand.Float64() //nolint: gosec
}

metrics = append(metrics, step{
Contributor


question: why aren't you using AddTrainingMetrics and AddValidationMetrics here? if the goal here is to generate metrics for a trial can we make use of the functions from here: #6222?

Contributor Author


It is a lot quicker to generate data for the test this way. I was considering deleting this; I just needed it for a test and wanted to make sure the test database generation was reasonable.

It's quicker since it uses batch inserts, skips batch metrics, skips RBAC-check-related queries, and also skips archiving checks.
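
To illustrate the batch-insert part only, here is a minimal Go sketch that builds one multi-row INSERT statement instead of issuing one statement per row. `buildBatchInsert` is a hypothetical helper, and the column list in `main` is an assumption for the example, not necessarily what the test generator in this PR uses.

```go
package main

import (
	"fmt"
	"strings"
)

// buildBatchInsert returns a single multi-row INSERT statement with
// PostgreSQL-style positional placeholders ($1, $2, ...) for `rows` rows.
// Table and column names are interpolated as-is, so they must be trusted.
func buildBatchInsert(table string, columns []string, rows int) string {
	var sb strings.Builder
	fmt.Fprintf(&sb, "INSERT INTO %s (%s) VALUES ", table, strings.Join(columns, ", "))
	arg := 1
	for r := 0; r < rows; r++ {
		if r > 0 {
			sb.WriteString(", ")
		}
		sb.WriteString("(")
		for c := range columns {
			if c > 0 {
				sb.WriteString(", ")
			}
			fmt.Fprintf(&sb, "$%d", arg)
			arg++
		}
		sb.WriteString(")")
	}
	return sb.String()
}

func main() {
	// Assumed columns, for illustration only.
	fmt.Println(buildBatchInsert("raw_steps", []string{"trial_id", "total_batches", "metrics"}, 2))
}
```

One statement carrying many rows avoids a round trip per row, which is the main reason batch inserts are faster for generating large test databases.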

WHEN (metrics->'avg_metrics'->name)::text = '"Infinity"'::text THEN 'number'
WHEN (metrics->'avg_metrics'->name)::text = '"-Infinity"'::text THEN 'number'
WHEN (metrics->'avg_metrics'->name)::text = '"NaN"'::text THEN 'number'
ELSE jsonb_typeof(metrics->'avg_metrics'->name)
Contributor Author


This came up in testing top trials by metric.

The old code would sometimes let people get away with reporting the string "1.0", and we would cast it to the float 1.0 in some queries.

Should this migration and ingestion code treat "1.0" as 1.0?

Contributor


No, I don't think this is the kind of behavior we should support. That doesn't seem to me like something a user would rely upon.
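
The outcome of this thread can be sketched in Go: a classifier that mirrors the CASE expression quoted above, treating the strings "Infinity", "-Infinity", and "NaN" as numbers while leaving any other string, including "1.0", as a string. `metricType` is a hypothetical name, and this is an illustration of the rule rather than the PR's ingestion code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metricType mirrors the CASE expression above: the JSON strings "Infinity",
// "-Infinity" and "NaN" count as numbers; every other value is reported with
// the same type names jsonb_typeof uses.
func metricType(raw []byte) (string, error) {
	var v interface{}
	if err := json.Unmarshal(raw, &v); err != nil {
		return "", err
	}
	switch t := v.(type) {
	case string:
		switch t {
		case "Infinity", "-Infinity", "NaN":
			return "number", nil
		}
		return "string", nil // "1.0" stays a string, it is not cast to 1.0
	case float64:
		return "number", nil
	case bool:
		return "boolean", nil
	case nil:
		return "null", nil
	case []interface{}:
		return "array", nil
	case map[string]interface{}:
		return "object", nil
	}
	return "", fmt.Errorf("unexpected JSON value %T", v)
}

func main() {
	for _, raw := range []string{`1.0`, `"1.0"`, `"NaN"`, `"-Infinity"`} {
		t, err := metricType([]byte(raw))
		fmt.Printf("%-12s -> %s (err: %v)\n", raw, t, err)
	}
}
```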

Contributor

@nrajanee nrajanee left a comment


lgtm

@NicholasBlaskey NicholasBlaskey changed the title from "summary metrics migration" to "feat: summary metrics" Apr 12, 2023
@NicholasBlaskey NicholasBlaskey changed the title from "feat: summary metrics" to "feat: summary metrics" Apr 12, 2023
@NicholasBlaskey NicholasBlaskey marked this pull request as ready for review April 12, 2023 20:40
@NicholasBlaskey NicholasBlaskey requested a review from a team as a code owner April 12, 2023 20:40
@NicholasBlaskey
Contributor Author

@ioga should be ready for review now

"count": 1,
}
} else {
// Check if the metric had a non numeric value in the past.
Contributor


q: when would this happen?

math.Max(summaryMetric["max"].(float64), metricValue))
summaryMetric["sum"] = replaceSpecialFloatsWithString(
summaryMetric["sum"].(float64) + metricValue)
// Go parsing oddity treats JSON whole numbers as floats.
Contributor


it's JavaScript's fault: there are no integer values, only floating-point numbers
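
A small Go sketch of the behavior being discussed: when `encoding/json` decodes into an `interface{}` value, every JSON number arrives as `float64`, so a whole-number field such as a count has to be asserted as `float64` before converting. `decodeCount` is a hypothetical helper for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeCount decodes a JSON object and returns its "count" field as an int.
// encoding/json stores every JSON number as float64 when the target is
// interface{}, so the value is asserted as float64 and then converted.
func decodeCount(raw []byte) (int, error) {
	var m map[string]interface{}
	if err := json.Unmarshal(raw, &m); err != nil {
		return 0, err
	}
	f, ok := m["count"].(float64) // an .(int) assertion would fail here
	if !ok {
		return 0, fmt.Errorf("count is %T, not a JSON number", m["count"])
	}
	return int(f), nil
}

func main() {
	n, err := decodeCount([]byte(`{"count": 1}`))
	fmt.Println(n, err)
}
```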

Contributor

@ioga ioga left a comment


great work.

@NicholasBlaskey
Contributor Author

@ioga added types

Contributor

@ioga ioga left a comment


lgtm

@NicholasBlaskey NicholasBlaskey merged commit 61a804a into determined-ai:main Apr 25, 2023
tayritenour pushed a commit to tayritenour/determined that referenced this pull request Apr 25, 2023
@dannysauer dannysauer added this to the 0.22.0 milestone Feb 6, 2024

4 participants