You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The module lightgbm.dask contains model objects which have very similar APIs to those from lightgbm.sklearn, and which directly inherit from them. For example, lightgbm.dask.DaskLGBMRegressor inherits from lightgbm.sklearn.DaskLGBMRegressor.
We currently take advantage of this inheritance to also inherit doc strings. For example:
+'client : dask.distributed.Client or None, optional (default=None)\n'
+' '*12+'Dask client.\n'
+' '*8+_init_score+_after_init_score)
That code reads the parent class's __doc__ and then patches it with some Dask-specific details.
There are still many places in the API docs which should be updated to show information specific to the Dask module. For example, the docs for the X, y, group, and sample_weight arguments to .fit() and .predict() should be updated to say that they expect Dask inputs (Dask Array, Dask DataFrame, Dask Series).
Motivation
This change would make the API docs a reliable source of information for how to structure inputs for the Dask module, which would make it easier for users to get started. This has value as a complement to a long-form tutorial (#3814).
Description
I can think of a few different ways to accomplish this.
Option 1: custom templates
We could write documentation templates that are shared between the classes. Here's an oversimplified example.
__fit_params_template="""Build a gradient boosting model from the training set (X, y). Parameters ---------- X : {x_type} of shape = [n_samples, n_features] Input feature matrix. y : {y_type} shape = [n_samples] The target values (class labels in classification, real numbers in regression). sample_weight : {weight_type} of shape = [n_samples] or None, optional (default=None) Weights of training data. init_score : {init_score_type} of shape = [n_samples] or None, optional (default=None) Init score of training data."""__DaskCollectionDescription="A Dask DataFrame, Dask Array, or Dask Series"__LocalArrayDescription="A pandas dataframe, numpy array, or sparse matrix"classLGBMClassifier:
deffit(...):
passfit.__doc__=__fit_params_template.format(
x_type=__LocalArrayDescription,
y_type=__LocalArrayDescription,
weight_type=__LocalArrayDescription,
init_score_type=__LocalArrayDescription,
)
classDaskLGBMClassifier:
deffit(...):
passfit.__doc__=__fit_params_template.format(
x_type=__DaskCollectionDescription,
y_type=__DaskCollectionDescription,
weight_type=__DaskCollectionDescription,
init_score_type=__DaskCollectionDescription,
)
Something similar is done in the R package to avoid writing the same documentation in multiple places in source control, even though it is used in multiple places in the
If the only difference we anticipate in the API docs is about the types of inputs, we could stop relying on typehints in docstrings, and instead use sphinx-autodoc-typehints. This extension creates the typehints in Sphinx docs automatically based on the hints in the code, without you needing to write them in docstrings. I think that might allow us to just inherit docstrings and get the hinting stuff for free.
Option 3: just copy the docstrings and manually keep them in sync
We could stop having the Dask interface inherit the docstrings from lightgbm.sklearn, and instead just literally copy the sklearn docs into the Dask module. Then the type hints and any other information could be changed in the Dask module freely and easily.
For any of you reading this, let me know if you have strong opinions or can think of other options!
I personally prefer Option 1 or Option 3. Options that let us keep the same documentation style that's already in use seem the least disruptive. I don't like Option 2 because I expect that the differences between Dask and sklearn docs will be more than just type hint differences.
I understand why Dask does that...they are wrapping external libraries that can change without their control, and where people can have different combinations of versions installed (e.g. different versions of numpy and dask at any one time).
Since we're talking about only code inside lightgbm, I don't think we should settle for that. It's within our control to make the docs consistent and correct.
But how will it help to deal with different args?
Well client is the only extra arg we have right now, and I don't anticipate there being so many others that we need a very scalable solution to handling the set of keyword arguments being different. I don't think it would be too bad for the template to have a {client_doc} which is just set to an empty string in fit.__doc__.format() on lightgbm.sklearn.LGBMClassifier.fit() and other sklearn methods.
Summary
The module
lightgbm.dask
contains model objects which have very similar APIs to those fromlightgbm.sklearn
, and which directly inherit from them. For example,lightgbm.dask.DaskLGBMRegressor
inherits fromlightgbm.sklearn.DaskLGBMRegressor
.We currently take advantage of this inheritance to also inherit doc strings. For example:
LightGBM/python-package/lightgbm/dask.py
Lines 452 to 457 in b4b1b75
That code reads the parent class's
__doc__
and then patches it with some Dask-specific details.There are still many places in the API docs which should be updated to show information specific to the Dask module. For example, the docs for the
X
,y
,group
, andsample_weight
arguments to.fit()
and.predict()
should be updated to say that they expect Dask inputs (Dask Array, Dask DataFrame, Dask Series).Motivation
This change would make the API docs a reliable source of information for how to structure inputs for the Dask module, which would make it easier for users to get started. This has value as a complement to a long-form tutorial (#3814).
Description
I can think of a few different ways to accomplish this.
Option 1: custom templates
We could write documentation templates that are shared between the classes. Here's an oversimplified example.
Something similar is done in the R package to avoid writing the same documentation in multiple places in source control, even though it is used in multiple places in the
LightGBM/R-package/R/lightgbm.R
Line 1 in b4b1b75
LightGBM/R-package/R/lgb.train.R
Line 4 in b4b1b75
Option 2: use
sphinx-autodoc-typehints
If the only difference we anticipate in the API docs is about the types of inputs, we could stop relying on typehints in docstrings, and instead use sphinx-autodoc-typehints. This extension creates the typehints in Sphinx docs automatically based on the hints in the code, without you needing to write them in docstrings. I think that might allow us to just inherit docstrings and get the hinting stuff for free.
Option 3: just copy the docstrings and manually keep them in sync
We could stop having the Dask interface inherit the docstrings from
lightgbm.sklearn
, and instead just literally copy the sklearn docs into the Dask module. Then the type hints and any other information could be changed in the Dask module freely and easily.References
The text was updated successfully, but these errors were encountered: