[ML] Reimplement established model memory #35263

This is the 6.6/6.7 implementation of a master node service that tracks the native process memory requirement of each ML job that has an associated native process. The new ML memory tracker service is used only once the whole cluster has been upgraded to at least version 6.6. In mixed version clusters the old mechanism, established model memory stored on the job in cluster state, is still used. This means that the old (and complex) code that keeps established model memory up to date on the job object cannot be removed yet. When this change is forward ported to 7.0, the old way of keeping established model memory updated will be removed.
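The mixed version fallback described above amounts to a check on the oldest node in the cluster. The following is a minimal sketch of that decision only, not the code in this PR: the class `MemorySourceSketch`, the enum `Source`, the method `choose`, and the integer version encoding (major * 100 + minor, so 6.6 is 606) are all hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class MemorySourceSketch {

    /** Which mechanism supplies a job's memory requirement (hypothetical names). */
    enum Source { MEMORY_TRACKER, ESTABLISHED_MODEL_MEMORY }

    /** Hypothetical encoding of version 6.6 as major * 100 + minor. */
    static final int V_6_6 = 606;

    /**
     * Use the new tracker only if every node is on 6.6 or later.
     * A single older node forces the old cluster state mechanism.
     */
    static Source choose(List<Integer> nodeVersions) {
        for (int version : nodeVersions) {
            if (version < V_6_6) {
                return Source.ESTABLISHED_MODEL_MEMORY;
            }
        }
        return Source.MEMORY_TRACKER;
    }

    public static void main(String[] args) {
        // Fully upgraded cluster: the memory tracker is used.
        System.out.println(choose(Arrays.asList(606, 607)));
        // One node still on 6.5: fall back to established model memory.
        System.out.println(choose(Arrays.asList(605, 606)));
    }
}
```

Keying the decision on the minimum node version is what keeps the old code path alive until 7.0, since any 6.x cluster may still contain a pre-6.6 node during a rolling upgrade.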

…emory

Persistent tasks recheck allocations when custom metadata changes. Updating a timestamp in the ML metadata therefore gives persistent tasks whose allocation was deferred pending a memory refresh another chance to select an ML node.

Adding this test showed that there was a problem with using Instant.now() in a cluster state update. It is possible that the cluster state update can get applied on more than one master node if the cluster is unstable, and it's essential that they both apply exactly the same change. Therefore I changed the discriminant to the old cluster state version plus one. The actual time is not required. We just need a field whose change will kick persistent tasks.
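The determinism problem described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the code from this PR: `RefreshMarkerSketch`, `State`, `refreshMarker`, `bumpWithWallClock` and `bumpWithVersion` are hypothetical names, and the cluster state is reduced to a version number plus one marker field.

```java
import java.time.Instant;

public class RefreshMarkerSketch {

    /** Hypothetical, heavily simplified stand-in for a cluster state. */
    static final class State {
        final long version;
        final long refreshMarker;

        State(long version, long refreshMarker) {
            this.version = version;
            this.refreshMarker = refreshMarker;
        }
    }

    /**
     * Problematic variant: the result depends on when and where the update runs,
     * so two nodes applying the same update can produce different states.
     */
    static State bumpWithWallClock(State old) {
        return new State(old.version + 1, Instant.now().toEpochMilli());
    }

    /**
     * Deterministic variant: the marker is derived only from the input state
     * (old version plus one), so every node computes the same result.
     */
    static State bumpWithVersion(State old) {
        return new State(old.version + 1, old.version + 1);
    }

    public static void main(String[] args) {
        State old = new State(41, 17);
        State a = bumpWithVersion(old);
        State b = bumpWithVersion(old);
        // Both applications agree on the new marker value.
        System.out.println(a.refreshMarker == b.refreshMarker);
        // The marker still changed, which is all that is needed to trigger a recheck.
        System.out.println(a.refreshMarker != old.refreshMarker);
    }
}
```

The marker's value carries no meaning here. It only has to differ from the previous value so that the metadata change is noticed.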

Also added test logging in case it fails again

Commits on Nov 5, 2018

Commits on Nov 7, 2018

Commits on Nov 8, 2018

Commits on Nov 9, 2018

Commits on Nov 12, 2018