
[R-package] callbacks per #892 #1264

Merged
merged 26 commits into dmlc:master on Jul 3, 2016

Conversation

khotilov
Member

@khotilov commented Jun 9, 2016

The majority of the stuff is working, but some things are still raw. The code docs need some more work, and some changes in how things are handled might be needed (see below, and also the TODO markers in the code). I'm PR-ing it mainly to get some feedback.
Note that I won't be able to do any work on this over the coming weekend.

Work notes:

  • Added callbacks that are mostly similar to those in Python, plus the cb.save_model one (see the usage sketch after this list). The 'cb.' prefix is purely for convenience when using tab-completion, and I'm open to other naming suggestions. Some callbacks have a "finalizer" functionality.
  • Instead of the reset_learning_rates callback, added a more general cb.reset_parameters
  • The cb.save_model callback uses save_period=0 to save the last model (the changed default save_period=NULL in xgb.train would not enable saving). It also allows including the iteration number in the file name.
  • The cb.early_stop callback allows setting a specific metric column name to use. If it is not set, the last evaluation column is used; choosing the last one is more natural, and the Python API does that too.
  • Added training continuation to xgb.train
  • Cleaning: made the training flow better structured and commented, with reusable blocks, and the code more R-ish
  • WARNING: return value change for the train and cv methods. My goal was to have a somewhat unified output interface between the two that would also be what an R user would normally expect: a list of various output data items (as opposed to the previous part-list/part-attributes mix, or CV's output changing to a different structure when prediction in CV is required).
  • Changed iteration index to be consistently 1-indexed in R
  • Added formatted print methods for xgb.Booster and xgb.cv.result
  • Changed to use the roxygen-handled useDynLib library loading mechanism in place of doing it manually via onLoad/onUnload. That plays better with devtools.
  • Merged the documentation of xgboost and xgb.Booster to reduce the redundancy
  • Added extra examples to the docs of xgb.train and xgboost
  • Added a bunch of additional unit tests
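
For example, a rough usage sketch (the exact signatures and defaults here are illustrative assumptions, not the final interface):

```r
# Rough usage sketch: callback names as described above; signatures are
# illustrative assumptions, not the final API.
library(xgboost)
data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

bst <- xgb.train(
  params = list(objective = "binary:logistic", eta = 0.3),
  data = dtrain,
  nrounds = 10,
  watchlist = list(train = dtrain),
  callbacks = list(
    # save_period = 0: save only the final model
    cb.save_model(save_period = 0, save_name = "xgboost.final.model"),
    # one eta value per boosting round instead of a single fixed value
    cb.reset_parameters(list(eta = seq(0.3, 0.1, length.out = 10)))
  )
)
```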

List of some things and questions to consider:

  • predictions from CV: 0) leave them as they are; 1) create a callback for that; 2) return the folds' models and have a separate predict method to use with the cv-output object? I'm torn between options 1 and 2.
  • It might make sense to ensure that the early stopping callback is the last
  • Why not use a limit on the number of iterations in predict, instead of on the number of trees?
  • I was using the style option of "single statement after 'if' goes on a new line with no curly brackets". I find that easier to read when there are many if-checks. But I can change it to the "lots-of-{}" style if you wish.
  • There is still a fair amount of parameter name style inconsistencies among the R package methods. Would it be a good time to make those consistent? Or even as close to python interface as makes sense? (can leave the old ones for some time with deprecation warning)
  • In addition to the current 'synchronous' CV method, I would like to add some time later a simple 'sequential' CV-wrapper for xgb.train, which would be better suited for the classic CV-based model evaluation. I'm mentioning it here because it might be done either as a separate strategy option within xgb.cv, or as a separate function (I prefer the latter). Since it might affect the current xgb.cv interface, I'm looking for some thoughts on that.

@tqchen
Member

tqchen commented Jun 10, 2016

LGTM, waiting for @hetong007 to review

@hetong007
Member

Hi @khotilov, thanks for the great contribution. I am reading the code; it is a bit long, so I will discuss it with you tomorrow.

It is an important improvement to the R package, so please also add your name to the author list in DESCRIPTION :)

@hetong007
Member

My thoughts on your questions:

  • predictions from CV: The first option might be the best, since I don't see any other benefit of exposing the models from each fold to users.
  • Number of iterations: If num_parallel_tree is set, then the number of trees is different from the number of iterations (@tqchen is this statement correct?). I think controlling the number of iterations makes more sense.
  • I prefer the style where else is wrapped by two brackets, as in } else {, so that there is no chance of the "unexpected else" error (a minimal example follows this list). Another minor point: having brackets from the beginning makes future changes easier.
  • It will be great if the parameter names are updated with deprecation warnings.
  • For the sequential CV-wrapper, can you explain more on that point? I might not have understood why it is "better suited for the classic CV-based model evaluation". Thank you.
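
To illustrate the "unexpected else" point with a minimal example: at the top level (e.g. when a script is sourced), R parses an unbraced if body as a complete statement, so a bare else on the next line is a syntax error; keeping } else { on one line avoids this.

```r
x <- 2

# Fails at top level with "Error: unexpected 'else'":
# `if (x > 0) y <- 1` parses as a complete statement,
# so the following `else` has nothing to attach to.
if (x > 0)
  y <- 1
else
  y <- -1

# Always safe: `}` and `else` share a line, so the parser
# knows the statement is not finished yet.
if (x > 0) {
  y <- 1
} else {
  y <- -1
}
```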

Some comments on the TODOs in the code:

  • For cb.reset_parameters: restrict the set only if you think some parameters had better stay fixed; otherwise I guess we could just leave it unrestricted.
  • For print.xgb.Booster: The set of parameters is recorded in xgb.train. If you want to print the current value of each parameter, we could manually construct a function initialized with the default values of the parameters, and then track the changes to the parameters in the process.
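
A minimal sketch of that R-side idea (the function name and the default values below are hypothetical, just to show the mechanism):

```r
# Hypothetical sketch: overlay user-supplied parameters on known defaults,
# so that the current value of every parameter can be printed.
xgb.effective.params <- function(user_params) {
  defaults <- list(eta = 0.3, max_depth = 6, gamma = 0, subsample = 1)
  modifyList(defaults, user_params)  # user values override the defaults
}

str(xgb.effective.params(list(eta = 0.1, max_depth = 4)))
# eta and max_depth reflect the input; gamma and subsample keep defaults
```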

Another concern is that, since there are a lot of changes, we might need to mention them to all the users. A new tutorial would be very helpful after these changes have been merged.

@tqchen
Member

tqchen commented Jun 12, 2016

@hetong007 can you give a list of actionable items, so that after these are changed we can merge it in? This would help us merge this change in quicker.

@hetong007
Member

hetong007 commented Jun 13, 2016

@khotilov If you think my comments above make sense, then the list of actions before we merge it in will be:

  • Create a call back for prediction in CV
  • Wrap if statements with brackets
  • Change the parameter style, and add deprecation warnings
  • Document the return values for xgb.train/xgb.cv
  • Add your name to the DESCRIPTION file

I think the above changes make this commit consistent, and we can merge it in at that point.

After the commit, there are several things for us to do:

  • Write a vignette about this change, preferably with a comprehensive introduction on a simple example.
  • Control the number of iterations instead of the number of trees
  • Add 'sequential' CV
  • Add the function to print the parameters and their current values.

@khotilov
Member Author

Tong:
Thanks for reviewing! I am still travelling and will reply to your
questions in a day or two.
Regards,
Vadim

@khotilov
Member Author

Below are my comments and questions:

  • On CV predictions: I'll make it into a callback, with an option to keep the models. My interest in the CV models is usually for debugging purposes, but some folks seem to utilize them even more broadly.
  • #iterations vs #trees: I see I was too brief in my initial statement on the subject. The idea is to use, say, an nrounds parameter in predict as the primary user-controlled knob for limiting the number of boosting iterations (see the sketch after this list). The ntreelimit would still need to be there for things like random forest, and when it is set, it could take priority over nrounds. With ntreelimit only, we have a minimal but perhaps too compressed/overloaded interface that requires a user to be quite aware of the internal structure of a model.
  • On unifying the style of parameter names: I'll start with the obvious outliers and then we can discuss how far we might want to go. Other than just the style, my practical issue is that due to many parameter names being different, it is currently awkward to share pieces of code between R and Py.
  • On the style of callback names: I'm seeing my mix of a dot and underscores in there as more and more ugly... Should I simply use dot separators, as in all the other xgboost method names?
  • Limiting the parameter set in cb.reset_parameters: it is probably not worth it for now.
  • Function to print the parameters and their current values: do you mean here that we need a C++ function that can pull in all the current model parameters and their default values, or did I misunderstand?
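
A sketch of what I mean (illustrative: ntreelimit is the existing knob, while nrounds in predict is only the proposal):

```r
library(xgboost)
data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

# A "boosted random forest": each boosting iteration grows 4 parallel trees.
bst <- xgb.train(params = list(objective = "binary:logistic",
                               num_parallel_tree = 4,
                               subsample = 0.6, colsample_bytree = 0.6),
                 data = dtrain, nrounds = 5)

# To predict with only the first 2 *iterations*, the user has to know the
# internal model structure: 2 iterations x 4 parallel trees = 8 trees.
pred <- predict(bst, dtrain, ntreelimit = 8)
# A `nrounds = 2` knob in predict would express the same thing directly.
```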

On the synchronous vs sequential CV... The current synchronous approach has a few downsides:

  • high memory use due to creation of data copies for each split, that have to be simultaneously kept in memory
  • it's somewhat hard to inject a data pre-processing step
  • the procedure design makes the estimation of the optimal number of iterations simple, but we cannot perform, e.g., early stopping within each fold.

And what I mean by the "classic CV-based model evaluation" is when models for each fold are built in a very similar fashion to how a non-CV model would be built, preventing any possibility of information leakage between the folds. I understand that pre-packaged CV methods are usually somewhat limited, and I am no stranger to writing my own custom CV-based workflows. But my hope is that it should be possible to have a fairly generic but simple CV-wrapper for xgb.train that would help limit the amount of custom boilerplate code and would not have the downsides mentioned above.
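
To make that concrete, here is roughly what such a wrapper could boil down to (a sketch that assumes the early-stopping interface from this PR, not a proposed implementation):

```r
library(xgboost)
data(agaricus.train, package = "xgboost")
x <- agaricus.train$data
y <- agaricus.train$label

set.seed(42)
k <- 5
folds <- split(sample(nrow(x)), rep_len(seq_len(k), nrow(x)))

# Each fold's model is trained just as a standalone model would be, with its
# own early stopping; a pre-processing step could be injected inside the loop.
fold_models <- lapply(folds, function(test_idx) {
  dtrain <- xgb.DMatrix(x[-test_idx, ], label = y[-test_idx])
  dtest  <- xgb.DMatrix(x[test_idx, ],  label = y[test_idx])
  xgb.train(params = list(objective = "binary:logistic", eta = 0.3),
            data = dtrain, nrounds = 100,
            watchlist = list(test = dtest),
            early_stopping_rounds = 5, verbose = 0)
})
sapply(fold_models, function(m) m$best_score)
```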

@hetong007
Member

Hi @khotilov, thank you for the explanations.

To reply to your comments:

  • On sharing code between R and Python: in my understanding, sharing code between these two languages usually just means sharing the training parameters. Therefore we only need to make sure the parameter names are the same, or similar enough.
  • I prefer to use more dots in the callback names, so that people know they are functions.
  • Function to print the parameters: we could do it in C++. Another, faster way is to do it in R: since we know the default values of these parameters, we could update the values from the input. Doing it in R also means easier maintenance.

For the CV-related:

  • The first two points make sense to me. I just don't get the point of doing "early stopping within each fold": in that case the number of iterations is likely to differ among the folds, so it cannot guide the following training steps.
  • I think this part of the proposal is meaningful, but it just doesn't belong to this specific callback topic. Can you please open a new issue for it? That way our discussion can be found and organized more easily.

Please let me know your thoughts.

@tqchen
Member

tqchen commented Jun 22, 2016

any updates on this thread?

@khotilov
Member Author

Last weekend was too hectic for me to get this out, but the action items from the list are mostly done (except the documentation). I'll get on it tomorrow evening.

And I'll open separate issues for CV and parameters later.

@khotilov
Member Author

I've added some more tests and fixed some kinks:

  • moved the CV prediction into a callback, and changed the callback names to dots.style;
  • changed several parameters to underscore_style and added a deprecation warning method;
  • added some more documentation and examples;
  • made some code pieces simpler;
  • consolidated all the importFrom's in one place;
  • added a reshape parameter to predict.
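
A quick sketch of the reshape usage on a toy multiclass problem (behavior as described: one row per observation, one column per class):

```r
library(xgboost)
x <- as.matrix(iris[, 1:4])
y <- as.integer(iris$Species) - 1  # class labels must be 0-based: 0, 1, 2

bst <- xgboost(data = x, label = y, nrounds = 5, verbose = 0,
               objective = "multi:softprob", num_class = 3)

pred <- predict(bst, x)                  # flat vector: 150 rows * 3 classes
pmat <- predict(bst, x, reshape = TRUE)  # 150 x 3 matrix of class probabilities
dim(pmat)  # 150 3
```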

@hetong007
Member

Hi @khotilov, thanks for the work! I am currently attending the useR! 2016 conference and am busy with my slides :P

Will review the code ASAP.

@khotilov mentioned this pull request Jul 2, 2016
@tqchen
Member

tqchen commented Jul 3, 2016

@hetong007 Let's aim to merge this soon, now that useR! has passed

@hetong007
Member

@khotilov It seems good to me now. Can you merge the latest version and ping me? Then I will merge it. Thank you!

@khotilov
Member Author

khotilov commented Jul 3, 2016

@hetong007 the merge is complete

@hetong007 merged commit 44ed6d5 into dmlc:master Jul 3, 2016
@hetong007
Member

Merged, thanks.

@tqchen
Member

tqchen commented Jul 3, 2016

Thanks for bringing this in, @khotilov. As a next step, do you mind creating a tutorial explaining how this feature can be used for various advanced tasks? We can put it into the documentation as well as a blog post.

@tqchen added this to the v0.6.1 milestone Jul 6, 2016
@tqchen modified the milestones: v0.6, v0.6.1 Jul 29, 2016
@tqchen mentioned this pull request Jul 29, 2016
lock bot locked as resolved and limited the conversation to collaborators Jan 19, 2019