Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] [dask] [docs] Override details in API docs that differ between dask and sklearn interfaces #3871

Closed
jameslamb opened this issue Jan 27, 2021 · 3 comments

Comments

@jameslamb
Copy link
Collaborator

Summary

The module lightgbm.dask contains model objects which have very similar APIs to those from lightgbm.sklearn, and which directly inherit from them. For example, lightgbm.dask.DaskLGBMRegressor inherits from lightgbm.sklearn.DaskLGBMRegressor.

We currently take advantage of this inheritance to also inherit doc strings. For example:

_base_doc = LGBMClassifier.fit.__doc__
_before_init_score, _init_score, _after_init_score = _base_doc.partition('init_score :')
fit.__doc__ = (_before_init_score
+ 'client : dask.distributed.Client or None, optional (default=None)\n'
+ ' ' * 12 + 'Dask client.\n'
+ ' ' * 8 + _init_score + _after_init_score)

That code reads the parent class's __doc__ and then patches it with some Dask-specific details.

There are still many places in the API docs which should be updated to show information specific to the Dask module. For example, the docs for the X, y, group, and sample_weight arguments to .fit() and .predict() should be updated to say that they expect Dask inputs (Dask Array, Dask DataFrame, Dask Series).

image

Motivation

This change would make the API docs a reliable source of information for how to structure inputs for the Dask module, which would make it easier for users to get started. This has value as a complement to a long-form tutorial (#3814).

Description

I can think of a few different ways to accomplish this.

Option 1: custom templates

We could write documentation templates that are shared between the classes. Here's an oversimplified example.

__fit_params_template = """
Build a gradient boosting model from the training set (X, y).

    Parameters
    ----------
    X : {x_type} of shape = [n_samples, n_features]
        Input feature matrix.
    y : {y_type} shape = [n_samples]
        The target values (class labels in classification, real numbers in regression).
    sample_weight : {weight_type} of shape = [n_samples] or None, optional (default=None)
        Weights of training data.
    init_score : {init_score_type} of shape = [n_samples] or None, optional (default=None)
        Init score of training data.
"""

__DaskCollectionDescription = "A Dask DataFrame, Dask Array, or Dask Series"
__LocalArrayDescription = "A pandas dataframe, numpy array, or sparse matrix"

class LGBMClassifier:
    def fit(...):
        pass

    fit.__doc__ = __fit_params_template.format(
        x_type=__LocalArrayDescription,
        y_type=__LocalArrayDescription,
        weight_type=__LocalArrayDescription,
        init_score_type=__LocalArrayDescription,
    )

class DaskLGBMClassifier:
    def fit(...):
        pass

    fit.__doc__ = __fit_params_template.format(
        x_type=__DaskCollectionDescription,
        y_type=__DaskCollectionDescription,
        weight_type=__DaskCollectionDescription,
        init_score_type=__DaskCollectionDescription,
    )

Something similar is done in the R package to avoid writing the same documentation in multiple places in source control, even though it is used in multiple places in the

Option 2: use sphinx-autodoc-typehints

If the only difference we anticipate in the API docs is about the types of inputs, we could stop relying on typehints in docstrings, and instead use sphinx-autodoc-typehints. This extension creates the typehints in Sphinx docs automatically based on the hints in the code, without you needing to write them in docstrings. I think that might allow us to just inherit docstrings and get the hinting stuff for free.

Option 3: just copy the docstrings and manually keep them in sync

We could stop having the Dask interface inherit the docstrings from lightgbm.sklearn, and instead just literally copy the sklearn docs into the Dask module. Then the type hints and any other information could be changed in the Dask module freely and easily.

References

@jameslamb
Copy link
Collaborator Author

For any of you reading this, let me know if you have strong opinions or can think of other options!

I personally prefer Option 1 or Option 3. Options that let us keep the same documentation style that's already in use seem the least disruptive. I don't like Option 2 because I expect that the differences between Dask and sklearn docs will be more than just type hint differences.

@StrikerRUS
Copy link
Collaborator

I "like" how official Dask documentation deals with the problem! 😃

image

I think Option #1 is preferable for me. But how will it help to deal with different args? For example, client argument in Dask API.

@jameslamb
Copy link
Collaborator Author

I understand why Dask does that...they are wrapping external libraries that can change without their control, and where people can have different combinations of versions installed (e.g. different versions of numpy and dask at any one time).

Since we're talking about only code inside lightgbm, I don't think we should settle for that. It's within our control to make the docs consistent and correct.

But how will it help to deal with different args?

Well client is the only extra arg we have right now, and I don't anticipate there being so many others that we need a very scalable solution to handling the set of keyword arguments being different. I don't think it would be too bad for the template to have a {client_doc} which is just set to an empty string in fit.__doc__.format() on lightgbm.sklearn.LGBMClassifier.fit() and other sklearn methods.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants