
feat: non-blocking metrics reports [MD-144] #9107

Merged: 10 commits merged into main from non-blocking-metrics on Apr 18, 2024

Conversation

azhou-determined
Contributor

@azhou-determined azhou-determined commented Apr 4, 2024

Description

Report metrics in a background thread so that reporting does not block training.

Introduce core.MetricsContext as a centralized place for all metric reporting.
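
For reviewers who want the gist of the approach, here is a minimal sketch of the general queue-plus-background-thread pattern; the names and details below (_Reporter, post_metrics, the sentinel) are illustrative only, not the actual MetricsContext API. The real implementation also propagates background exceptions back to the caller via an error queue, discussed in the review comments below.

import queue
import threading


class _Reporter(threading.Thread):
    """Illustrative background shipper: drains a queue and posts each item."""

    def __init__(self, post_metrics):
        # Named thread for easier debugging, per review feedback below.
        super().__init__(name="MetricsShipperThread")
        self._inbound = queue.Queue()
        self._post_metrics = post_metrics  # e.g. a call into the master's REST API

    def report(self, metrics):
        # Called from the training thread; enqueues and returns immediately.
        self._inbound.put(metrics)

    def run(self):
        while True:
            item = self._inbound.get()
            if item is None:  # sentinel enqueued by close()
                return
            self._post_metrics(item)

    def close(self):
        # Flush everything already queued, then stop the thread.
        self._inbound.put(None)
        self.join()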

Test Plan

This PR makes all metrics reporting happen in the background and affects many parts of the Python code. A few areas that would be helpful to test:

Core API

Submit a Core API experiment that reports a lot of metrics in quick succession.

metrics.py

import logging

import determined as det
from determined import core


def main(core_context):
    for batch in range(100):
        steps_completed = batch + 1
        if steps_completed % 5 == 0:
            core_context.train.report_training_metrics(
                steps_completed=steps_completed, metrics={"x": batch}
            )
        if steps_completed % 10 == 0:
            core_context.train.report_validation_metrics(
                steps_completed=steps_completed, metrics={"x": batch}
            )


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG, format=det.LOG_FORMAT)
    with core.init() as core_context:
        main(core_context=core_context)
        print(f"Finished 'training' loop")
    print(f"Finished reporting metrics. Exiting.")

metrics.yaml

name: metrics
entrypoint: python3 metrics.py

searcher:
   name: single
   metric: x
   max_length: 1

max_restarts: 0

The trial logs should show the Finished 'training' loop print statement, then take a few seconds while the remaining metrics finish reporting (and the metrics view updates accordingly). The debug logs should not show HTTP POSTs (i.e. urllib3.connectionpool: http://host.docker.internal:8080 "POST /api/v1/trials/197/metrics) immediately/synchronously after each determined.core: report_training_metrics log. Since metrics are now reported in the background, most POST logs should appear between the Finished 'training' loop and Finished reporting metrics. Exiting. print statements.

Exception handling

Modify the above script to throw an exception partway through the loop:

import logging

import determined as det
from determined import core


def main(core_context):
    for batch in range(100):
+       if batch == 10:
+           raise ValueError("test exception!")
        
        steps_completed = batch + 1
        if steps_completed % 5 == 0:
            core_context.train.report_training_metrics(
                steps_completed=steps_completed, metrics={"x": batch}
            )
        if steps_completed % 10 == 0:
            core_context.train.report_validation_metrics(
                steps_completed=steps_completed, metrics={"x": batch}
            )


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG, format=det.LOG_FORMAT)
    with core.init() as core_context:
        main(core_context=core_context)
        print(f"Finished 'training' loop")
    print(f"Finished reporting metrics. Exiting.")

The logs should show the exception being thrown, and the experiment should terminate with an error rather than hang. The metrics reported before the exception should show up (only 1 in this case).

Detached mode

Run this script locally:

import logging
import os
import random

import determined as det
from determined.experimental import core_v2  # assumed import path for core_v2

os.environ["DET_MASTER"] = "MASTER_URL"

def main():
    core_v2.init(
        defaults=core_v2.DefaultConfig(
            name="unmanaged-1-singleton",
        ),
    )
    for i in range(100):
        core_v2.train.report_training_metrics(
            steps_completed=i, metrics={"loss": random.random()}
        )

        if (i + 1) % 10 == 0:
            core_v2.train.report_validation_metrics(
                steps_completed=i, metrics={"loss": random.random()}
            )
    print("Waiting for metrics reporting...")
    core_v2.close()


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG, format=det.LOG_FORMAT)
    main()

The training loop should finish almost instantly, and you should see the Waiting for metrics reporting... print statement for a few seconds while the remaining metrics finish reporting (and the metrics view updates accordingly). The trial should then exit successfully.

Exception handling

Modify the above script to throw an exception partway through metrics reporting:

import logging
import os
import random

import determined as det
from determined.experimental import core_v2  # assumed import path for core_v2

os.environ["DET_MASTER"] = "MASTER_URL"

def main():
    core_v2.init(
        defaults=core_v2.DefaultConfig(
            name="unmanaged-1-singleton",
        ),
    )
    for i in range(100):
+       if i == 10:
+           raise ValueError("test exception!")
        core_v2.train.report_training_metrics(
            steps_completed=i, metrics={"loss": random.random()}
        )

        if (i + 1) % 10 == 0:
            core_v2.train.report_validation_metrics(
                steps_completed=i, metrics={"loss": random.random()}
            )
    print("Waiting for metrics reporting...")
    core_v2.close()


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG, format=det.LOG_FORMAT)
    main()

The logs should show the exception being thrown, and the experiment should terminate with an error rather than hang. The metrics reported before the exception (steps 0-9 in this case) should show up.

Commentary (optional)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket


netlify bot commented Apr 4, 2024

Deploy Preview for determined-ui canceled.

Name                    Link
🔨 Latest commit        b229537
🔍 Latest deploy log    https://app.netlify.com/sites/determined-ui/deploys/662014320b88b0000880166c


codecov bot commented Apr 4, 2024

Codecov Report

Attention: Patch coverage is 85.10638% with 21 lines in your changes missing coverage. Please review.

Project coverage is 45.52%. Comparing base (0fc247c) to head (b229537).
Report is 16 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9107      +/-   ##
==========================================
+ Coverage   45.48%   45.52%   +0.03%     
==========================================
  Files        1197     1199       +2     
  Lines      147556   147646      +90     
  Branches     2438     2437       -1     
==========================================
+ Hits        67121    67209      +88     
- Misses      80203    80205       +2     
  Partials      232      232              
Flag       Coverage Δ
backend    43.68% <ø> (-0.07%) ⬇️
harness    64.17% <85.10%> (+0.15%) ⬆️
web        35.41% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files                                     Coverage Δ
harness/determined/core/__init__.py       100.00% <100.00%> (ø)
harness/tests/core/test_metrics.py        100.00% <100.00%> (ø)
harness/determined/core/_train.py         40.90% <33.33%> (+1.08%) ⬆️
harness/determined/core/_context.py       57.63% <50.00%> (+0.08%) ⬆️
harness/determined/core/_profiler.py      57.56% <14.28%> (+5.08%) ⬆️
harness/determined/core/_metrics.py       88.37% <88.37%> (ø)

... and 6 files with indirect coverage changes

Comment on lines 44 to 47
        # Check for thread exceptions here since we're not polling.
        if not self._error_queue.empty():
            err_msg = self._error_queue.get(block=False)
            logger.error(f"Error reporting metrics: {err_msg}")
            raise err_msg

Contributor

should we also check/flush the error queue on close?

e.g. what if user

  1. logs only one metric
  2. it fails in the background
  3. user logs no more and exits

it'd be good to wait for that report to flush & log error, if any.

Contributor Author

good idea. done.
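
For illustration, a close-time error check along the lines suggested above might look roughly like this. This is only a sketch reusing the _shipper, _error_queue, and logger names from the snippets quoted in this review; the merged code may differ.

    def close(self) -> None:
        # Wait for in-flight reports to finish before checking for errors, so a
        # failure on the very last report is still surfaced to the user.
        self._shipper.stop()
        self._shipper.join()
        if not self._error_queue.empty():
            err = self._error_queue.get(block=False)
            logger.error(f"Error reporting metrics: {err}")
            raise err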

Comment on lines 65 to 108
class _TrialMetrics:
    def __init__(
        self,
        group: str,
        steps_completed: int,
        metrics: Dict[str, Any],
        batch_metrics: Optional[List[Dict[str, Any]]] = None,
    ):
        self.group = group
        self.steps_completed = steps_completed
        self.metrics = metrics
        self.batch_metrics = batch_metrics

Contributor

don't we have something like this class elsewhere?

Contributor Author

We have determined.common.experimental.metrics.TrialMetrics, but it's under the experimental namespace.
The profiler (determined.core._profiler) had a similar object for the same purpose, but I've refactored the profiler to use the MetricsContext now.


    def close(self) -> None:
        self._shipper.stop()
        self._shipper.join()

Contributor

Instead of a plain join, can we do a join with timeout logic similar to https://github.com/determined-ai/determined/blob/main/harness/determined/core/_log_shipper.py#L107?

Way too many times we've had cases where the process is stuck on a background-thread join because of a networking hangup or something, and it's hard to debug every time. So I'd suggest that all non-daemon threads we join must have a timeout and some logging along the lines of "it took too long and we gave up".

Contributor Author

Good point. Done; it's gated behind a check for whether the inbound queue is empty, because if the core context is waiting on metrics to finish reporting, we can't just time it out.

If the queue is not empty and there's a hang, I'd wager it's better to risk the hang than to risk dropping some metric reports.

Contributor

If the core context is waiting on metrics to finish reporting but they are stuck for a long time, and that join takes a while, it'd still be opaque to the user why their script doesn't exit. I'd suggest we also join with a timeout in a loop and print a log message (after 10 seconds) saying that we are waiting for it to finish.

Contributor Author

done
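
For reference, the join-with-timeout-and-log behavior agreed on in this thread can be sketched roughly as follows. This is illustrative only; the 10-second interval comes from the comment above, and the function and logger names are assumptions rather than the merged code.

import logging
import threading

logger = logging.getLogger("determined.core")

_LOG_INTERVAL_SECONDS = 10


def _join_with_logging(thread: threading.Thread) -> None:
    # Join in a loop so a hung shipper thread is visible to the user instead of
    # silently blocking process exit.
    waited = 0
    while thread.is_alive():
        thread.join(timeout=_LOG_INTERVAL_SECONDS)
        if thread.is_alive():
            waited += _LOG_INTERVAL_SECONDS
            logger.warning(f"Still waiting for metrics to finish reporting ({waited}s)...")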

-            stepsCompleted=steps_completed,
-            trialId=self._trial_id,
-            trialRunId=self._run_id,
+        self._metrics.report(

Contributor

General question: do we have sufficient automated testing for this?

  1. Do we have an existing unit test, or is it possible to have a unit test checking the completeness/correctness of metrics sent from the harness?
  2. If that's not possible: do we have an e2e test that would check these?

Contributor Author

Good question. The full end-to-end journey of metrics goes something like:
training code -> trial APIs -> Core API (now with async logic) -> POST APIs to master -> master-side handling -> read APIs

Individual trials have unit tests (test_tf_keras_trial.py, test_pytorch_trial.py) that check the Trial API generates the expected metrics (training code -> trial APIs), but we don't have unit tests for the Core API -> master APIs leg of metrics reporting. I'll write one.

Beyond that, the end-to-end training code -> metrics read APIs path is covered by e2e_tests/.../test_metrics.py.

Contributor Author

added unit tests for the new MetricsContext
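
As a rough illustration of what such a unit test can assert (this is not the contents of harness/tests/core/test_metrics.py; it reuses the hypothetical _Reporter sketch from the Description above):

def test_reports_queued_before_close_are_flushed():
    posted = []

    shipper = _Reporter(post_metrics=posted.append)
    shipper.start()
    for step in range(50):
        shipper.report({"steps_completed": step, "metrics": {"x": step}})
    shipper.close()

    # Everything queued before close() should have been shipped, in order.
    assert posted == [
        {"steps_completed": step, "metrics": {"x": step}} for step in range(50)
    ]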

        self._trial_id = trial_id
        self._run_id = run_id

        super().__init__()

Contributor

lgtm, one last debugging convenience

Suggested change:
-        super().__init__()
+        super().__init__(name="MetricsShipperThread")

@azhou-determined azhou-determined merged commit 0bc13d8 into main Apr 18, 2024
74 of 86 checks passed
@azhou-determined azhou-determined deleted the non-blocking-metrics branch April 18, 2024 20:01
JComins000 pushed a commit that referenced this pull request Apr 22, 2024
- introduce ``_MetricsContext`` into Core API as a centralized place for metrics reporting, which reports metrics in a background thread
- refactor `core.train` and `core.profiler` to report metrics through the new ``_MetricsContext``