[ML] Reimplement established model memory #35263

Commits on Nov 5, 2018

  1. [ML] Reimplement established model memory

    This is the 6.6/6.7 implementation of a master node service to
    keep track of the native process memory requirement of each ML
    job with an associated native process.
    
    The new ML memory tracker service works when the whole cluster
    is upgraded to at least version 6.6.  For mixed version clusters
    the old mechanism of established model memory stored on the job
    in cluster state is used.  This means that the old (and complex)
    code to keep established model memory up to date on the job object
    cannot yet be removed.  When this change is forward ported to 7.0
    the old way of keeping established model memory updated will be
    removed.
    droberts195 committed Nov 5, 2018
    36b8dbd
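
The choice between the two mechanisms described in the commit message above could be sketched roughly as follows. This is only an illustration, not the code from the PR: the class `MemoryRequirementSketch`, the method `memoryRequirementBytes`, the constant `TRACKER_MIN_VERSION`, and the use of plain maps are all hypothetical. Returning an empty result when no value is known is also an assumption.

```java
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical sketch: class, method, and constant names are illustrative only.
final class MemoryRequirementSketch {

    // Assumed numeric id standing in for "version 6.6".
    static final int TRACKER_MIN_VERSION = 60600;

    // Picks the source of a job's memory requirement from the lowest node version in the cluster.
    static OptionalLong memoryRequirementBytes(
            String jobId,
            int minNodeVersionInCluster,
            Map<String, Long> trackerBytes,
            Map<String, Long> establishedBytesOnJob) {
        // Fully upgraded cluster: use the memory tracker.
        // Mixed version cluster: use the value stored on the job in cluster state.
        Map<String, Long> source = minNodeVersionInCluster >= TRACKER_MIN_VERSION
                ? trackerBytes
                : establishedBytesOnJob;
        Long bytes = source.get(jobId);
        // Empty means "not known from this source" (an assumption of this sketch).
        return bytes == null ? OptionalLong.empty() : OptionalLong.of(bytes);
    }
}
```

In this sketch a single node older than the threshold switches every lookup back to the value stored on the job, which mirrors why the old update code cannot yet be removed in 6.x.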

Commits on Nov 7, 2018

  1. f860da1

Commits on Nov 8, 2018

  1. d4124d4
  2. Kick persistent tasks after memory requirement refresh

    Persistent tasks will recheck allocations when custom metadata
    changes, so updating a timestamp in the ML metadata will enable
    persistent tasks whose allocation was deferred to allow for a
    memory refresh to have another go at selecting an ML node.
    droberts195 committed Nov 8, 2018
    f30c8cc
  3. Improve comments

    droberts195 committed Nov 8, 2018
    10a1467
  4. Fix bug and add test

    droberts195 committed Nov 8, 2018
    7ac0883
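
The "kick" described in the commit "Kick persistent tasks after memory requirement refresh" relies on a value in the ML metadata changing. A minimal model of that idea, assuming a single numeric marker, might look like the following. The names `PersistentTaskKickSketch`, `Rechecker`, and `onClusterState` are hypothetical, and the real persistent tasks framework is not modelled here.

```java
// Hypothetical sketch: models "recheck allocations when custom metadata changes".
final class PersistentTaskKickSketch {

    static final class Rechecker {
        private long lastSeenMarker;
        private int recheckCount;

        Rechecker(long initialMarker) {
            this.lastSeenMarker = initialMarker;
        }

        // Called for each new cluster state; returns true if allocations were rechecked.
        boolean onClusterState(long mlMetadataMarker) {
            if (mlMetadataMarker == lastSeenMarker) {
                return false; // metadata unchanged, deferred tasks stay deferred
            }
            lastSeenMarker = mlMetadataMarker;
            recheckCount++;   // stand-in for "try again to select an ML node"
            return true;
        }

        int recheckCount() {
            return recheckCount;
        }
    }
}
```

The point of the sketch is that the marker's value carries no meaning of its own. Only the fact that it differs from the previous value matters.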

Commits on Nov 9, 2018

  1. Add an integration test

    Adding this test showed that there was a problem with using
    Instant.now() in a cluster state update.  It is possible that
    the cluster state update can get applied on more than one master
    node if the cluster is unstable, and it's essential that they
    both apply exactly the same change.  Therefore I changed the
    discriminant to the old cluster state version plus one.  The
    actual time is not required.  We just need a field whose change
    will kick persistent tasks.
    droberts195 committed Nov 9, 2018
    7492b97
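
The reasoning in the commit message above is that a cluster state update should be a pure function of the old state, so that two nodes applying it to the same input produce the same output. A small sketch of the two alternatives, with hypothetical names (`DeterministicUpdateSketch`, `markerFromClock`, `markerFromOldVersion`), could be:

```java
import java.time.Instant;

// Hypothetical sketch contrasting a clock-based marker with a version-based one.
final class DeterministicUpdateSketch {

    // Non-deterministic: two nodes calling this at different moments can get different values.
    static long markerFromClock() {
        return Instant.now().toEpochMilli();
    }

    // Deterministic: depends only on the old cluster state version.
    static long markerFromOldVersion(long oldClusterStateVersion) {
        return oldClusterStateVersion + 1;
    }
}
```

Because the version-based marker depends only on its argument, any node that applies the update to the same old state computes the same new value. It also always differs from the old version, which is all that is needed to kick persistent tasks.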

Commits on Nov 12, 2018

  1. Fix applying diff

    Also added test logging in case the test fails again.
    droberts195 committed Nov 12, 2018
    4e004e4