Incremental model selection #288
Conversation
else:
    x_key, y_key = seen[j]
return Future(x_key), Future(y_key)
I don't recommend looking too deeply at this function. It's unclean and likely to go away in the future.
Thanks for this implementation @mrocklin. I've glanced at the code, but need to take time to thoroughly review. Does this clarify the ideas in #221 (comment)?
It implements starting/stopping work based on the current estimate of good/bad models (which I agree should be done in #221). Is there anything else I should pay close attention to for #221?
Oh, and the gist isn't loading for me.
for i in range(time_step, next_time_step):
    X_future, y_future = get_futures(i + 1)
    model = client.submit(_partial_fit, model, X_future, y_future,
                          fit_params, priority=-i + meta['score'])
score = client.submit(_score, model, X_test, y_test, scorer,
                      priority=-time_step + meta['score'])
Nice – the `priority` keyword is useful, and I wasn't aware of the implementation.

I should spend some time thinking about priority. I think we want to use a multiplicative ratio to try to remove scale: `priority=score / i`.
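For concreteness, a minimal sketch of that multiplicative variant applied to the submit calls above (hypothetical; it assumes the same `client`, `meta`, and loop variables as the diff, and the `+ 1` guards against dividing by zero on the first step):

# hypothetical multiplicative priority: score scaled by progress, so the
# ranking is scale-free rather than offset by the time step
model = client.submit(_partial_fit, model, X_future, y_future,
                      fit_params, priority=meta['score'] / (i + 1))
score = client.submit(_score, model, X_test, y_test, scorer,
                      priority=meta['score'] / (time_step + 1))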
I recommend that, before we develop further on either solution, we play
a bit with concrete problems to see what is important.
Eventually my guess is that this implementation will grow to the point
where you can use it as a piece to build out the full algorithm that you
want without having to do any low-level dask work. However, I'm going to
claim that we don't yet know what algorithm you'll want to build before
first having a few more serious examples to play with.
After looking at the code, here's a simplified example of the interface I have in mind:

>>> scores  # example simplified input
{'model-1': [0.3, 0.4, 0.5, 0.6, 0.7],  # evolving
 'model-2': [0.2, 0.6, 0.6, 0.6, 0.6],  # converged
 'model-3': [0.1, 0.1, 0.1, 0.1, 0.1],  # bad
}

>>> out  # example output
{'model-1': 3,  # evolve forward three more steps, then give us back the score
 'model-2': 0,  # keep this around, but don't bother evolving further
 # 'model-3': ...  # omit model 3 to forget about it
}
from toolz import topk

def update(scores):
    """ Something like successive halving for hyperband """
    best = topk(len(scores) // 2, scores, key=lambda k: scores[k][-1])
    additional_steps = int(...)  # placeholder: derived from the halving schedule
    return {ident: additional_steps for ident in best}
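As a sanity check, applying this to the simplified `scores` example above keeps only the top half (here a single model); the step count is whatever the halving schedule dictates:

>>> scores = {'model-1': [0.3, 0.4, 0.5, 0.6, 0.7],
...           'model-2': [0.2, 0.6, 0.6, 0.6, 0.6],
...           'model-3': [0.1, 0.1, 0.1, 0.1, 0.1]}
>>> update(scores)   # with additional_steps = 3, say
{'model-1': 3}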
def update(scores):
    """ Something like Incremental.fit """
    assert len(scores) == 1
    [(ident, v)] = scores.items()
    if any(score > v[-10] + tolerance for score in v[-10:]):
        return {ident: 1}  # do one more time step
    else:
        return {ident: 0}  # no more work to do, signals termination
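A quick illustration of the plateau check (assuming a module-level `tolerance`, as the sketch does):

>>> tolerance = 0.01
>>> update({'model-1': [0.5] * 10})   # plateaued: stop
{'model-1': 0}
>>> update({'model-1': [0.40, 0.42, 0.45, 0.47, 0.50,
...                     0.52, 0.55, 0.57, 0.60, 0.62]})   # still improving
{'model-1': 1}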
def update(scores):
    """ Something like RandomSearchCV """
    out = {}
    for ident, v in scores.items():
        if any(score > v[-10] + tolerance for score in v[-10:]):
            out[ident] = 1  # keep evolving this one
        else:
            out[ident] = 0  # don't evolve this one any further, it's done
    return out

The full fit function would have something like the following inputs:

def fit(model, parameter_list, X_train, y_train, X_test, y_test,
        update, fit_params, scorer):
    ...

This may not cover these cases though. Critique welcome before I invest time into this.
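And a hypothetical call inside a tornado coroutine, tying the pieces together (the estimator and parameter list are illustrative; the three return values follow the `raise gen.Return((info, models, history))` visible later in the diff):

from sklearn.linear_model import SGDClassifier

params = [{'alpha': 1e-4}, {'alpha': 1e-3}, {'alpha': 1e-2}]
info, models, history = yield fit(SGDClassifier(), params,
                                  X_train, y_train, X_test, y_test,
                                  update, fit_params={}, scorer=None)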
I simplified the example above significantly in an edit. Better to look on GitHub rather than depend on e-mail.
This looks like a disciplined way of controlling when to call `partial_fit`, and I think this looks good to me. I would like to store state inside `update`, though.

I think we'd want to control the inputs that `update` sees, something like:

>>> scores
{'model-1': [0.3, 0.4, 0.5],
 'model-2': [0.5, 0.5, 0.5]}

>>> model_meta
{'model-1': {...},
 'model-2': {...}}

>>> all_meta  # for HyperbandCV
{'scorer': ...,
 'X_test': ...,
 'y_test': ...}

where `model_meta` holds per-model metadata and `all_meta` holds shared metadata. (I'm also not seeing why each model's entry in `scores` needs to be a list of floats rather than a single most-recent float.) Then the call signature for `update` would grow to take these dicts as well.
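A sketch of that richer signature (hypothetical; the three arguments mirror the dicts above):

def update(scores, model_meta, all_meta):
    # decide the number of additional partial_fit steps per model,
    # possibly consulting per-model and shared metadata
    ...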
I don't understand this. Which float? In principle there is no one thing that determines model quality here. We're a bit lower level.
What state would you want to propagate?
> In the context of successive halving, update is called at the end of every epoch

The convention I proposed above is that the user passes back the number of additional steps to go forward in each model before calling `update` again. A number of 0 means that no further work is scheduled for that model.
Yes, I agree that some collection of metadata is likely to be important.
So the user would also pass functions to include in the `_partial_fit` and `_score` calls?

def _partial_fit(model_and_meta, X, y, fit_params, meta_func=None):
    ...

def _score(model_and_meta, X, y, scorer, meta_func=None):
    ...
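For instance, a hypothetical `meta_func` that stamps each call with wall-clock time (illustrative only; the real hook would be whatever we agree on):

import time

def meta_func(meta):
    # return an updated copy of the per-model metadata dict
    meta = dict(meta)
    meta['timestamp'] = time.time()
    return meta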
I'm probably misusing the term fit here. That's just the name of my full function that does all of the training work.
I got confused by naming/etc. but am starting to understand (and disregard most of #288 (comment); it's ill-informed).

I think 'not present' should default to 'do nothing': I'd only want to specify the state for some subset of the models, not all the models. For Hyperband I'd specify, for some subset of the models, which should be trained more and which should be killed.
I'm not sure about some of the motivation behind your thoughts above. For the desire to call `update` more often, I'm guessing that this comes from asynchronous Hyperband? Similarly I suspect that this was your motivation for having "no change" be the default in `update`. But I'm not sure. If you can include the motivation behind your thoughts that would be useful.
Or, putting this another way: I believe that I've presented a sufficient approach to successive halving in my `update` function above. I believe that the speculative execution should handle the underlying desires behind the move to async hyperband. I've removed the prioritization for now for simplicity, but it will be easy to put it back. If I'm missing something and the approach is unlikely to work for some reason, please let me know.
Asynchronous Hyperband is part of my motivation but not all of it. This is an adaptive training function, one that decides to train models based on prior evaluations and stored metadata. Wikipedia has a good example of the model updating after seeing one evaluation, which is what I'm used to seeing in adaptive algorithms.
This won't perform well for synchronous Hyperband. There are a couple of embarrassingly parallel loops in synchronous Hyperband, and there's no reason for one loop to wait on another. It will work well for one successive halving call in synchronous Hyperband, though. This could be avoided if there's some way to specify the different Hyperband loops as independent, at least for synchronous Hyperband. This could work for asynchronous Hyperband too, as long as additional arguments are passed.
I think this should be supported as well. That's the simplest interface I would expect.
Notebook implementing prototypes of Incremental.fit, RandomSearch, and SuccessiveHalving:
Understood. And in principle I agree that we might gain from making decisions whenever we find new information. To start out though with my current constraints (finding solutions to the above problems) I seem to be able to implement most things about as efficiently as desired. If we choose to get fancier we can get fancier easily. This seems to suffice for now though?
How would you interleave calls? The paper specifies a nested loop. That would be trivial to implement here. We can definitely launch the various successive-halving computations concurrently if desired. Or perhaps you're referring to some way to fuse and share computations between successive halving computations?
> Just checking, 81 is just an example number here, right? One might ask for thousands I think?

I believe that 81 is just a number they use in a table to show an example of how things might evolve, and what they use in their LeNet benchmark. I don't think it's a magic number, please let me know if I'm missing something.
It's definitely possible to run multiple of these at the same time, though they won't be sharing work. I can imagine that that would be helpful if we have abundant parallelism, which seems possible. To make this concrete, something like the following would work:

@gen.coroutine
def hyperband(...):
    yield [successive_halving(...) for _ in ...]

The Dask client will then run this process many times from the same thread.
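A slightly fuller sketch of that idea (hypothetical names; `successive_halving` is assumed to be a coroutine like `fit`):

from tornado import gen

@gen.coroutine
def hyperband(params, X, y, brackets):
    # launch every successive-halving bracket concurrently; the event
    # loop interleaves their cluster work instead of running a nested loop
    results = yield [successive_halving(params, X, y, bracket=b)
                     for b in brackets]
    raise gen.Return(results)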
Also, I'm running this notebook using my cull-before-send branch in dask/distributed.
Correct. Sorry for the confusion.
Then for synchronous Hyperband I am satisfied. My other concerns go out the window. There might be something more with tpot or Bayesian sampling, but I'll look at the code first.
Could you push your code? I take it the notebook was run with modifications to this branch?
Ah! Indeed. I was operating on the wrong branch. Pushed!
I recommend the graph view of the dashboard when executing. My current plan is to leave this as-is until some sort of case study arrives to benchmark it against. I'm inclined to experiment a bit with successive_halving to see how different decay rates behave, and then use that as motivation for hyperband after we have some practical experience. Presumably UI bits would come after that.
Pushed some formatting changes. Will take a closer look later (tomorrow most likely).
""" Create a model by cloning and then setting params """ | ||
with log_errors(pdb=True): | ||
model = clone(model).set_params(**params) | ||
return model, {"ident": ident, "params": params, "time_step": -1} |
I got confused with `time_step` here. I thought it referred to the number of times `update` was called, not the number of times `partial_fit` was called.
This looks pretty good. A lot of my concerns are addressed. Making this work with asynchronous Hyperband will be some work, but not an incredible amount.
Yeah, I'm happy to set up concurrent execution of successive halving for hyperband (or walk someone else through it) after we have a bit more understanding of how random search and successive halving perform. I don't think that building hyperband is likely to be a priority until then.
This implements a simple approach to do hyper-parameter optimization on incrementally trained models. It is heavily inspired by (and steals content from) dask#221 but attempts less.

Currently it implements a function ``fit``, which takes in a model, training and testing data, and a parameter space. It creates many models and trains them on a block of data, scores them, and then keeps only the better performing ones. It continues to do this until only one is selected. This function is not intended for public users, but is hopefully something we can play with as we eventually build a full solution.

It is decently configurable today and we will need to find sane defaults. Notably, this function does not take opinions on the following:

1. How testing data should be separated from training data; currently they are both explicitly provided
2. How many parameters to start with, or how to decrease this over time

These are given as inputs with an integer and a function respectively.

Additionally it does not, but probably should, provide configurations for the following:

1. How to select models from the current group
2. How to decide on stopping criteria
3. Any ability to actively decide the number of surviving models based on the scores received so far

I believe that this implementation is a decent start though and should be a conversation piece. I think that it makes some decisions that we don't want to think about yet (online determination of how many models to keep) while giving us knobs for other parameters that I think we'll want to play with (model decay functions, how to handle testing, etc.).
This removes logic from the fit function to make it less about successive halving and more general, to support Incremental.fit and RandomSearchCV.
Force-pushed from b213809 to aaaf7e8.
I thought I discovered some bugs, but it turns out I didn't.
It'd be nice to support keyboard interrupts in this, and preserve the models/infos/etc. on exit. My use case is a user firing a keyboard interrupt and still recovering the best model's score/related info.
Agreed. That's a nice idea. I'll take a look at that soonish.
"soonish" meaning in the next few days
@TomAugspurger do you have thoughts on the right way to make a RandomSearch version of this that is API-familiar to a Scikit-Learn user? Is there even an established API for this kind of thing?
> @TomAugspurger do you have thoughts on the right way to make a RandomSearch version of this that is API-familiar to a Scikit-Learn user?
Does something roughly like this make sense?

class IncrementalSearch(BaseEstimator):
    def __init__(self, estimator, params, additional_calls, X_test, y_test):
        self.estimator = estimator
        self.params = params
        self.additional_calls = additional_calls
        self.X_test = X_test
        self.y_test = y_test

    def fit(self, X, y=None, **fit_params):
        info, model, history = fit(self.estimator, self.params, X, y,
                                   self.X_test, self.y_test,
                                   self.additional_calls, **fit_params)
        self.info_ = info
        self.model_ = model
        self.history_ = history
        return self  # scikit-learn convention

The only strange part is passing the validation set to the init. That could maybe be done as a fit param, and plucked from `fit_params` in `fit`. I'm not sure there's anything in scikit-learn that takes a validation set as an argument (maybe @ogrisel knows otherwise). Estimators like SGDClassifier will internally create a validation set from the training data when `early_stopping=True`. Perhaps we could do the same?
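A hypothetical usage sketch of that class (names illustrative; `additional_calls` would be one of the `update`-style functions from earlier in this thread):

from sklearn.linear_model import SGDClassifier

params = {'alpha': [1e-4, 1e-3, 1e-2]}
search = IncrementalSearch(SGDClassifier(), params, additional_calls,
                           X_test, y_test)
search.fit(X_train, y_train, classes=[0, 1])
search.info_, search.model_, search.history_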
raise gen.Return((info, models, history))


def fit(*args, **kwargs):
Would be best to make this the actual signature.
    Numpy array or small dask array. Should fit in memory.
y_test : Array
    Numpy array or small dask array. Should fit in memory.
start : int
`start` was removed, I think. Add `additional_calls`.
>>> X_train = X[100000:]
>>> y_train = y[100000:]

>>> info, model, history = yield fit(model, params,
Need to update.
Would it make sense to factor this out from the Hyperband work going on now?
Yes, I think so. But I'm not sure what order we should do things in. Does that conform with others' expectations? I'm happy to take over the work from those that are busy with other things.
I've updated the docstring considerably, including a runnable example. Tests will fail until dask/distributed#2232 is merged in.
Yes, I think that ideally something like Hyperband would look something like the following:

class Hyperband(IncrementalSearch):
    def __init__(self, **hyperband_specific_kwargs):
        ...

    def additional_calls(self, scores):
        ...

If this existed I think that it would be easier to experiment with this sort of thing. I should write up documentation for this PR, but I might want to wait until such a super-class exists.
I would be happy to leave the sklearn-specific conventions to someone else :)
OK then, so how's this for a plan:

1. Merge dask/distributed#2232
2. Merge this
3. I take a pass at an IncrementalSearch base class today
4. We rebase Hyperband on IncrementalSearch

And followup items for documentation, examples, etc. of IncrementalSearch. The goal being to maximize the effectiveness of @stsievert's time :)
Sounds good to me
@@ -218,10 +219,23 @@ def get_futures(partial_fit_calls):

new_scores = list(_scores2.values())

models = {k: client.submit(operator.getitem, v, 0) for k, v in models.items()}
This will break the current Hyperband implementation BTW
Perfect – I had this idea when playing with `stop_on_plateau`, but didn't bring it up. This should also inherit from `DaskBaseSearchCV`, right?
That's my current plan. Should have something later this afternoon.
Good to merge any time @TomAugspurger. I'll leave this to you.
if isinstance(X_test, da.Array):
    X_test = client.compute(X_test)
else:
    y_test = yield client.scatter(y_test)
Should this be `X_test = yield client.scatter(X_test)`?
Yes, it probably should. My mistake.
Updated that, and made the data a bit smaller in the hope of making it run a bit faster.
commit 41b0af2
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Sep 5 11:11:20 2018 -0500

    [WIP]: Base class for estimators using incremental fit

commit 80f0b14
Author: Matthew Rocklin <mrocklin@gmail.com>
Date:   Wed Sep 5 14:44:12 2018 -0400

    Incremental model selection (dask#288)

    This implements a simple approach to do hyper-parameter optimization on
    incrementally trained models. It is heavily inspired by (and steals
    content from) dask#221 but attempts less.
This implements a simple approach to do hyper-parameter optimization on
incrementally trained models. It is heavily inspired by (and steals content
from) #221 but holds fewer opinions, and doesn't attempt to implement
scikit-learn APIs. I don't intend for this to be user-facing, or necessarily
get merged, but I hope that it enables conversation around the topic.
Currently it implements a function `fit`, which takes in a model, training
and testing data, and a parameter space. It creates many models and trains
them on a block of data, scores them, and then keeps only the better
performing ones, training them on subsequent blocks of data. It continues to
do this until only one is selected.
This function is not intended for public users, but is hopefully something we
can play with as we eventually build a full solution. It is decently
configurable today and we will need to find sane defaults. Notably, this
function does not take opinions on the following:

1. How testing data should be separated from training data; currently they are both explicitly provided
2. How many parameters to start with, or how to decrease this over time

Additionally it does not, but probably should, provide configurations for the
following:

1. How to select models from the current group
2. How to decide on stopping criteria
3. Any ability to actively decide the number of surviving models based on the scores received so far
I believe that this implementation is a decent start though and should be OK
as a conversation piece. I think that it makes some decisions that we don't
want to think about yet (online determination of how many models to keep)
while giving us knobs for other parameters that I think we'll want to play
with (model decay functions, how to handle testing, etc.).
It is decently well tuned from a dask perspective. It optimistically tries
moving forward on models between rounds and then cancels that work if it's
not necessary. This depends on dask master. Performance is also a bit better
with dask/distributed#2100 merged in (but that isn't necessary).
Notebook: https://gist.github.com/3a33bed7e7289550bfea8a48af702c82
It would be nice to settle on a sane default for this operation, which I understand is commonly needed.
I suspect that we should play a bit with some real problems using this to see what ends up being important before doing much more development work.