feat: non-blocking metrics reports [MD-144] #9107
Conversation
✅ Deploy Preview for determined-ui canceled.
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #9107      +/-   ##
==========================================
+ Coverage   45.48%   45.52%   +0.03%
==========================================
  Files        1197     1199       +2
  Lines      147556   147646      +90
  Branches     2438     2437       -1
==========================================
+ Hits        67121    67209      +88
- Misses      80203    80205       +2
  Partials      232      232
harness/determined/core/_metrics.py
Outdated
# Check for thread exceptions here since we're not polling.
if not self._error_queue.empty():
    err_msg = self._error_queue.get(block=False)
    logger.error(f"Error reporting metrics: {err_msg}")
    raise err_msg
should we also check/flush the error queue on close?
e.g. what if the user
- logs only one metric
- it fails in the background
- logs no more and exits
it'd be good to wait for that report to flush & log the error, if any.
good idea. done.
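For illustration, a minimal sketch of the close-time error flush discussed above; the function and attribute names here are assumptions, not the actual implementation:

```python
import logging
import queue

logger = logging.getLogger(__name__)


def check_and_raise_errors(error_queue: "queue.Queue[BaseException]") -> None:
    """Surface any exception recorded by the background shipper thread."""
    if not error_queue.empty():
        err = error_queue.get(block=False)
        logger.error(f"Error reporting metrics: {err}")
        raise err


# In close(), after signalling the shipper to stop and joining it, calling
# check_and_raise_errors(self._error_queue) ensures that a failure on the very
# last report still surfaces instead of being silently dropped at exit.
```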
harness/determined/core/_metrics.py
Outdated
class _TrialMetrics:
    def __init__(
        self,
        group: str,
        steps_completed: int,
        metrics: Dict[str, Any],
        batch_metrics: Optional[List[Dict[str, Any]]] = None,
    ):
        self.group = group
        self.steps_completed = steps_completed
        self.metrics = metrics
        self.batch_metrics = batch_metrics
don't we have something like this class elsewhere?
we have determined.common.experimental.metrics.TrialMetrics, but it's under the experimental namespace.
the profiler (determined.core._profiler) had a similar object for the same purpose, but i've refactored the profiler to use the MetricsContext now.
harness/determined/core/_metrics.py
Outdated
def close(self) -> None:
    self._shipper.stop()
    self._shipper.join()
instead of a plain join, can we do a join with timeout logic, similar to https://github.com/determined-ai/determined/blob/main/harness/determined/core/_log_shipper.py#L107?
way too many times we've had cases where the process is stuck on a bg thread join because there's a networking hangup or something, and it's hard to debug every time. so I'd suggest all non-daemon threads which are joined must have a timeout and some sort of logging for "it took too long and we gave up".
good point. done, it's gated behind a check for whether the inbound queue is empty, because if the core context is waiting on metrics to finish reporting, we can't just time it out.
if the queue is not empty and there's a hang, i'd wager it's probably better to risk the hang than to risk killing some metric reports.
if the core context is waiting on metrics to finish reporting, but they are stuck for a long time, and that join takes a while, it'd still be opaque for a user why their script doesn't exit. I'd suggest we should also join with a timeout in a loop, and print a log message (after 10 seconds) that we are waiting for that to finish.
done
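A rough sketch of the join-with-timeout loop being described; the constants, parameter names, and log messages are assumptions, not the actual implementation:

```python
import logging
import queue
import threading
import time

logger = logging.getLogger(__name__)


def join_with_timeout(
    shipper: threading.Thread,
    inbound: "queue.Queue",
    log_interval: float = 10,
    give_up_after: float = 60,
) -> None:
    # Join in a loop so a hung shipper is visible to the user instead of the
    # script silently refusing to exit.
    start = time.time()
    while shipper.is_alive():
        shipper.join(timeout=log_interval)
        if not shipper.is_alive():
            return
        elapsed = time.time() - start
        logger.warning(f"Still waiting for metrics to finish reporting ({elapsed:.0f}s)...")
        # Only give up when there is nothing left to ship; if reports are still
        # queued, keep waiting rather than risk dropping them.
        if inbound.empty() and elapsed > give_up_after:
            logger.error("Timed out waiting for the metrics shipper thread; giving up.")
            return
```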
stepsCompleted=steps_completed,
trialId=self._trial_id,
trialRunId=self._run_id,
self._metrics.report(
general question: do we have sufficient automated testing for this?
- do we have an existing unit test, or is it possible to have a unit test testing the completeness/correctness of metrics sent from the harness?
- if not possible: do we have an e2e test which would check these?
good question. the full end-to-end journey of metrics probably goes something like:

training code -> trial APIs -> core API (now with async logic) -> POST APIs to master -> master side stuff -> read APIs

individual trials have unit tests (test_tf_keras_trial.py, test_pytorch_trial.py) that test the Trial API generates the expected metrics (training code -> trial APIs), but we don't have unit tests for the core API -> master APIs leg of metrics reporting. i'll write one.

beyond that, the end-to-end training code -> metrics read APIs path is covered by e2e_tests/.../test_metrics.py
added unit tests for the new MetricsContext
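For illustration, a self-contained sketch of the kind of unit test meant here, written at the pattern level (plain queue plus worker thread) rather than against the real MetricsContext API:

```python
import queue
import threading
from typing import Any, Dict, List


def test_background_reports_all_arrive() -> None:
    # Everything put on the inbound queue must be "shipped" by the worker
    # thread before shutdown completes, with order preserved.
    inbound: "queue.Queue" = queue.Queue()
    posted: List[Dict[str, Any]] = []

    def worker() -> None:
        while (item := inbound.get()) is not None:
            posted.append(item)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    for step in range(100):
        inbound.put({"steps_completed": step, "metrics": {"loss": 1.0 / (step + 1)}})
    inbound.put(None)  # sentinel: drain remaining items, then exit
    t.join(timeout=10)
    assert [p["steps_completed"] for p in posted] == list(range(100))
```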
harness/determined/core/_metrics.py
Outdated
self._trial_id = trial_id
self._run_id = run_id

super().__init__()
lgtm, one last debugging convenience:

- super().__init__()
+ super().__init__(name="MetricsShipperThread")
Force-pushed from 1b9c8af to 352bec5.
- introduce `_MetricsContext` into Core API as a centralized place for metrics reporting, which reports metrics in a background thread
- refactor `core.train` and `core.profiler` to report metrics through the new `_MetricsContext`
Description
Report metrics in a background thread to not block training.
Introduce `core.MetricsContext` as a centralized place for all metric reporting.
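For readers unfamiliar with the pattern, here is a minimal sketch of queue-plus-background-thread reporting; the class and method names below are illustrative assumptions, not the actual `_MetricsContext` implementation:

```python
import queue
import threading
from typing import Any, Dict, Optional


class _MetricsShipperSketch(threading.Thread):
    """Illustrative only: ships queued metric reports from a background thread."""

    def __init__(self, post_fn) -> None:
        super().__init__(name="MetricsShipperThread", daemon=True)
        self._inbound: "queue.Queue[Optional[Dict[str, Any]]]" = queue.Queue()
        self._errors: "queue.Queue[BaseException]" = queue.Queue()
        self._post_fn = post_fn  # callable that POSTs one metrics payload to the master

    def report(self, group: str, steps_completed: int, metrics: Dict[str, Any]) -> None:
        # Enqueue and return immediately so the training loop never blocks on I/O.
        self._inbound.put(
            {"group": group, "steps_completed": steps_completed, "metrics": metrics}
        )

    def stop(self) -> None:
        self._inbound.put(None)  # sentinel: flush remaining reports, then exit

    def run(self) -> None:
        while True:
            item = self._inbound.get()
            if item is None:
                return
            try:
                self._post_fn(item)
            except Exception as e:
                # Surface background failures to the main thread via the error queue.
                self._errors.put(e)
```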
Test Plan

This PR makes all metrics reporting happen in the background and affects many parts of the Python code. A few areas that would be helpful to test:
Core API
submit a core API experiment that reports a lot of metrics quickly in succession (metrics.py and metrics.yaml; a hypothetical sketch of such a script follows below).
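The actual metrics.py / metrics.yaml from the PR are not included in this extract; the following is a hypothetical stand-in using the public Core API, just to illustrate the shape of the test:

```python
# metrics.py (hypothetical stand-in, not the script attached to the PR)
import logging

import determined as det

logging.basicConfig(level=logging.DEBUG)  # so urllib3 POST logs are visible

with det.core.init() as core_context:
    for step in range(1000):
        # Report as fast as possible; with non-blocking reporting, this loop
        # should finish well before all POSTs reach the master.
        core_context.train.report_training_metrics(
            steps_completed=step, metrics={"loss": 1.0 / (step + 1)}
        )
    print(f"Finished 'training' loop")
print("Finished reporting metrics. Exiting.")
```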
the trial logs should show the `print(f"Finished 'training' loop")` print statement for a few seconds, while the remaining metrics finish reporting (and the metrics view updates accordingly). the debug logs should not show HTTP POSTs (i.e. `urllib3.connectionpool: http://host.docker.internal:8080 "POST /api/v1/trials/197/metrics"`) immediately/synchronously following each `determined.core: report_training_metrics` log. since these are reported in the background now, most POST logs should appear between the `Finished 'training' loop` and `Finished reporting metrics. Exiting.` print statements.

Exception handling
Modify the above script to throw an exception at some point in the loop, for example:
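The PR's actual snippet is not shown here; a hypothetical modification of the loop in the sketch above, failing after the first successful report so exactly one metric should appear:

```python
    for step in range(1000):
        if step == 1:
            # Simulate a user error after one successful report.
            raise RuntimeError("simulated failure mid-training")
        core_context.train.report_training_metrics(
            steps_completed=step, metrics={"loss": 1.0 / (step + 1)}
        )
```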
the logs should show the exception being thrown, and the experiment should terminate (not hang) with an error. the metrics reported before the exception should show up (only 1 in this case)
Detached mode
run this script locally:
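The actual script is not included in this extract; a hypothetical stand-in using the detached-mode (core_v2) API might look like this, assuming the master address and credentials are configured in the environment:

```python
# detached_metrics.py (hypothetical stand-in, not the script attached to the PR)
from determined.experimental import core_v2

core_v2.init(
    defaults=core_v2.DefaultConfig(name="detached-metrics-test"),
)

for step in range(10):
    core_v2.train.report_training_metrics(
        steps_completed=step, metrics={"loss": 1.0 / (step + 1)}
    )

print("Waiting for metrics reporting...")
core_v2.close()  # blocks until the background shipper flushes the queued reports
```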
the script should run instantaneously, and you should see the `Waiting for metrics reporting...` print statement for a few seconds, while the remaining metrics finish reporting (and the metrics view updates accordingly). the trial should then exit successfully.

Exception handling
modify the above script to throw an exception at some point during metrics reporting:
the logs should show the exception being thrown, and the experiment should terminate (not hang) with an error. the metrics reported before the exception should show up ([0-9] in this case)
Checklist

- docs/release-notes/: see Release Note for details.

Ticket

MD-144