
[ML] Add extra debug logging to enable end-to-end profiling of jobs #29857

Open
2 of 6 tasks
elasticmachine opened this issue Jan 17, 2018 · 4 comments
Labels
>enhancement :ml Machine learning Team:ML Meta label for the ML team

Comments


elasticmachine commented Jan 17, 2018

Original comment by @droberts195:

When an ML job is running, time can be spent in the following areas:

  • Searching Elasticsearch indices for input LINK REDACTED
  • Pre-processing this input in the ML Java code prior to sending it to the ML C++ process
  • (Possibly) categorization inside the C++ process
  • "Data gathering" in the anomaly detection part of the C++ process
  • End-of-bucket processing in the C++ process
  • Result processing in the ML Java code

@richcollier has found that it is extremely hard to pinpoint which of these processing phases is responsible for an ML job running slower than real time at a customer site.

We calculate and store the end-of-bucket processing time in the C++ anomaly detection code, but time spent in other areas is not easy to determine (other than by using a profiler in a development environment).

Such troubleshooting would be greatly helped by the following instrumentation:

  1. Debug messages at the beginning and end of every Elasticsearch search that the datafeed does
  2. Debug messages at the beginning and end of post_data processing
  3. Some sort of instrumentation of the categorization code in the C++ process, with debug logging to report periodically how long it is taking

Item (3) is the hardest here, as the categorization and data gathering are both done in sequence per input record. Using a millisecond timer to time the categorization part is probably not accurate enough, and our current nanosecond timer is quite slow on some platforms (Windows).
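If this instrumentation is ever attempted, one option (not proposed in this issue, just a sketch to make the trade-off concrete) is to pay the expensive timer cost on only a sample of records and report an estimate periodically. The actual categorization code lives in the C++ process, so this Java version is purely illustrative and every name in it is made up:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Hypothetical sampled timer: pays the nanosecond-timer cost on only one in
// every SAMPLE_EVERY records and periodically logs an estimate of the total
// time spent in categorization.
public final class SampledCategorizationTimer {

    private static final Logger logger = LogManager.getLogger(SampledCategorizationTimer.class);
    private static final int SAMPLE_EVERY = 100;
    private static final long REPORT_EVERY = 100_000;

    private long recordCount = 0;
    private long sampledNanos = 0;

    public void time(Runnable categorizeRecord) {
        recordCount++;
        if (recordCount % SAMPLE_EVERY == 0) {
            long start = System.nanoTime();
            categorizeRecord.run();
            sampledNanos += System.nanoTime() - start;
        } else {
            categorizeRecord.run();
        }
        if (recordCount % REPORT_EVERY == 0) {
            // Scale the sampled time up to an estimate covering all records.
            long estimatedMillis = sampledNanos * SAMPLE_EVERY / 1_000_000;
            logger.debug("categorization took roughly {}ms over {} records", estimatedMillis, recordCount);
        }
    }
}
```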

But even if just items (1) and (2) are added, it will improve our ability to troubleshoot certain performance problems at customer sites.
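To make items (1) and (2) concrete, here is a minimal sketch of the kind of begin/end debug messages meant, on the Java side; `DebugTiming`, its method and the log message format are illustrative assumptions rather than existing Elasticsearch ML code:

```java
import java.util.function.Supplier;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Hypothetical helper that brackets a datafeed search or a post_data call
// with begin/end debug messages so the elapsed wall-clock time is visible
// in the logs.
public final class DebugTiming {

    private static final Logger logger = LogManager.getLogger(DebugTiming.class);

    private DebugTiming() {}

    public static <T> T logged(String jobId, String phase, Supplier<T> action) {
        long start = System.currentTimeMillis();
        logger.debug("[{}] {} started", jobId, phase);
        try {
            return action.get();
        } finally {
            logger.debug("[{}] {} finished in {}ms", jobId, phase, System.currentTimeMillis() - start);
        }
    }
}
```

Wrapping each datafeed search and each chunk sent to post_data in something like `DebugTiming.logged(jobId, "datafeed search", ...)` would yield the begin/end pairs needed to attribute time to those phases from the logs alone.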

@elasticmachine
Collaborator Author

Original comment by @richcollier:

thanks for writing this up @droberts195 - I came here to do it but found your entry first!

@elasticmachine elasticmachine added :ml Machine learning >enhancement labels Apr 25, 2018
@przemekwitek przemekwitek self-assigned this May 22, 2019
@droberts195
Contributor

After discussing with @sophiec20, we decided that, rather than just making this information available via debug messages, it would be nice to incorporate it into the output of the get job stats and get datafeed stats APIs.

In the first instance we will add average_bucket_processing_time_ms to the job stats and some measure of search time to the datafeed stats. Depending on what is possible, this could be either the total time spent searching since the datafeed was created or the average search time per bucket. (In the event that only the total search time is available, the troubleshooter will have to look up the number of buckets in the job stats and divide the total time by it to get a number commensurate with the average bucket processing time.)

@droberts195
Contributor

average_bucket_processing_time_ms can be found by doing an aggregation on the bucket_processing_time_ms field of the ML bucket results.
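For instance, something along these lines with the Java client would compute that average (a sketch only; the `.ml-anomalies-*` index pattern, the `result_type: bucket` filter and the exact field name are assumptions to be checked against the real bucket result mapping):

```java
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public final class BucketTimingQuery {

    // Builds a search that averages bucket_processing_time_ms over all
    // bucket results of the given job.
    public static SearchRequest averageBucketProcessingTime(String jobId) {
        SearchSourceBuilder source = new SearchSourceBuilder()
            .size(0)  // only the aggregation is needed, not the hits
            .query(QueryBuilders.boolQuery()
                .filter(QueryBuilders.termQuery("job_id", jobId))
                .filter(QueryBuilders.termQuery("result_type", "bucket")))
            .aggregation(AggregationBuilders.avg("average_bucket_processing_time_ms")
                .field("bucket_processing_time_ms"));
        return new SearchRequest(".ml-anomalies-*").source(source);
    }
}
```

The same figure can of course be obtained ad hoc with a plain avg aggregation in a REST search while troubleshooting.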

@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)
