[RFC] [dask] [docs] Override details in API docs that differ between dask and sklearn interfaces #3871

jameslamb · 2021-01-27T22:19:12Z

Summary

The module lightgbm.dask contains model objects which have very similar APIs to those from lightgbm.sklearn, and which directly inherit from them. For example, lightgbm.dask.DaskLGBMRegressor inherits from lightgbm.sklearn.LGBMRegressor.

We currently take advantage of this inheritance to also inherit doc strings. For example:

LightGBM/python-package/lightgbm/dask.py

Lines 452 to 457 in b4b1b75

    
    _base_doc = LGBMClassifier.fit.__doc__
    _before_init_score, _init_score, _after_init_score = _base_doc.partition('init_score :')
    fit.__doc__ = (_before_init_score
                   + 'client : dask.distributed.Client or None, optional (default=None)\n'
                   + ' ' * 12 + 'Dask client.\n'
                   + ' ' * 8 + _init_score + _after_init_score)

That code reads the parent class's __doc__ and then patches it with some Dask-specific details.
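To make the patching mechanism concrete, here is a minimal, self-contained sketch of the same str.partition approach. The classes Base and DaskChild are hypothetical stand-ins, not the real LightGBM classes, and the parameter descriptions are shortened for illustration.

```python
class Base:
    def fit(self, X, y, init_score=None):
        """Fit the model.

        Parameters
        ----------
        X : array-like
            Features.
        y : array-like
            Target.
        init_score : array-like or None, optional (default=None)
            Init score of training data.
        """
        return self


class DaskChild(Base):
    def fit(self, X, y, init_score=None, client=None):
        return self

    # Split the parent's docstring at the 'init_score :' entry.
    # _before ends with the 8 spaces of indentation that preceded the marker.
    _base_doc = Base.fit.__doc__
    _before, _marker, _after = _base_doc.partition('init_score :')

    # Insert the extra 'client' entry, then restore the indentation
    # in front of the original 'init_score :' entry.
    fit.__doc__ = (
        _before
        + 'client : object or None, optional (default=None)\n'
        + ' ' * 12 + 'Dask client.\n'
        + ' ' * 8 + _marker + _after
    )


print(DaskChild.fit.__doc__)
```

One fragility of this approach: if the marker text is not found, str.partition returns the whole string followed by two empty strings, so the new entry would silently be appended at the end of the docstring instead of raising an error.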

There are still many places in the API docs which should be updated to show information specific to the Dask module. For example, the docs for the X, y, group, and sample_weight arguments to .fit() and .predict() should be updated to say that they expect Dask inputs (Dask Array, Dask DataFrame, Dask Series).

Motivation

This change would make the API docs a reliable source of information for how to structure inputs for the Dask module, which would make it easier for users to get started. This has value as a complement to a long-form tutorial (#3814).

Description

I can think of a few different ways to accomplish this.

Option 1: custom templates

We could write documentation templates that are shared between the classes. Here's an oversimplified example.

# Single leading underscores are used here because names with two leading
# underscores are name-mangled when referenced inside a class body.
_fit_params_template = """
Build a gradient boosting model from the training set (X, y).

    Parameters
    ----------
    X : {x_type} of shape = [n_samples, n_features]
        Input feature matrix.
    y : {y_type} of shape = [n_samples]
        The target values (class labels in classification, real numbers in regression).
    sample_weight : {weight_type} of shape = [n_samples] or None, optional (default=None)
        Weights of training data.
    init_score : {init_score_type} of shape = [n_samples] or None, optional (default=None)
        Init score of training data.
"""

_DaskCollectionDescription = "A Dask DataFrame, Dask Array, or Dask Series"
_LocalArrayDescription = "A pandas DataFrame, numpy array, or sparse matrix"

class LGBMClassifier:
    def fit(self, X, y, sample_weight=None, init_score=None):
        pass

    # Fill the shared template with descriptions of local (in-memory) inputs.
    fit.__doc__ = _fit_params_template.format(
        x_type=_LocalArrayDescription,
        y_type=_LocalArrayDescription,
        weight_type=_LocalArrayDescription,
        init_score_type=_LocalArrayDescription,
    )

class DaskLGBMClassifier:
    def fit(self, X, y, sample_weight=None, init_score=None):
        pass

    # Same template, filled with descriptions of Dask collections.
    fit.__doc__ = _fit_params_template.format(
        x_type=_DaskCollectionDescription,
        y_type=_DaskCollectionDescription,
        weight_type=_DaskCollectionDescription,
        init_score_type=_DaskCollectionDescription,
    )

Something similar is done in the R package to avoid writing the same documentation in multiple places in source control, even though it is used in multiple places:

LightGBM/R-package/R/lightgbm.R

Line 1 in b4b1b75

#' @name lgb_shared_params

LightGBM/R-package/R/lgb.train.R

Line 4 in b4b1b75

#' @inheritParams lgb_shared_params

Option 2: use `sphinx-autodoc-typehints`

If the only difference we anticipate in the API docs is about the types of inputs, we could stop relying on typehints in docstrings, and instead use sphinx-autodoc-typehints. This extension creates the typehints in Sphinx docs automatically based on the hints in the code, without you needing to write them in docstrings. I think that might allow us to just inherit docstrings and get the hinting stuff for free.
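As a rough illustration of the idea behind Option 2, the sketch below keeps types out of the docstring and puts them in the signature instead, so a subclass can inherit the docstring while declaring different input types. LocalArray, DaskCollection, Regressor, and DaskRegressor are hypothetical stand-ins, and the standard-library calls inspect.getdoc and typing.get_type_hints only approximate what documentation tooling would read; this is not a description of how sphinx-autodoc-typehints is implemented.

```python
import inspect
import typing


class LocalArray:
    """Hypothetical stand-in for a local type such as a numpy array."""


class DaskCollection:
    """Hypothetical stand-in for a Dask Array, DataFrame, or Series."""


class Regressor:
    def fit(self, X: LocalArray, y: LocalArray):
        """Build a gradient boosting model from the training set (X, y).

        Parameters
        ----------
        X
            Input feature matrix.
        y
            The target values.
        """
        return self


class DaskRegressor(Regressor):
    # No docstring here: only the annotations differ from the parent.
    def fit(self, X: DaskCollection, y: DaskCollection):
        return self


# inspect.getdoc falls back to the parent's docstring for a bound method
# whose own __doc__ is None.
inherited_doc = inspect.getdoc(DaskRegressor().fit)

# The type information comes from the overriding signature, not the docstring.
hints = typing.get_type_hints(DaskRegressor.fit)

print(inherited_doc.splitlines()[0])
print(hints)
```

The trade-off is that anything beyond types, such as an extra argument or Dask-specific behavior notes, would still need its own docstring text.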

Option 3: just copy the docstrings and manually keep them in sync

We could stop having the Dask interface inherit the docstrings from lightgbm.sklearn, and instead just literally copy the sklearn docs into the Dask module. Then the type hints and any other information could be changed in the Dask module freely and easily.

References

DaskLGBMClassifier docs, for reference: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.DaskLGBMClassifier.html#lightgbm.DaskLGBMClassifier .
This issue was inspired by this comment: [dask] Add type hints in Dask package #3866 (comment)


jameslamb · 2021-01-27T22:20:45Z

For any of you reading this, let me know if you have strong opinions or can think of other options!

I personally prefer Option 1 or Option 3. Options that let us keep the same documentation style that's already in use seem the least disruptive. I don't like Option 2 because I expect that the differences between Dask and sklearn docs will be more than just type hint differences.

StrikerRUS · 2021-01-27T22:42:24Z

I "like" how official Dask documentation deals with the problem! 😃

I think Option #1 is preferable for me. But how will it help to deal with different args? For example, the client argument in the Dask API.

jameslamb · 2021-01-27T22:53:35Z

I understand why Dask does that...they are wrapping external libraries that can change without their control, and where people can have different combinations of versions installed (e.g. different versions of numpy and dask at any one time).

Since we're talking about only code inside lightgbm, I don't think we should settle for that. It's within our control to make the docs consistent and correct.

> But how will it help to deal with different args?

Well client is the only extra arg we have right now, and I don't anticipate there being so many others that we need a very scalable solution to handling the set of keyword arguments being different. I don't think it would be too bad for the template to have a {client_doc} which is just set to an empty string in fit.__doc__.format() on lightgbm.sklearn.LGBMClassifier.fit() and other sklearn methods.
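A minimal sketch of that {client_doc} idea, assuming a shortened template and illustrative type descriptions (the names _fit_template, _client_doc, sklearn_fit_doc, and dask_fit_doc are made up for this example):

```python
# The placeholder sits at the start of a line and carries its own trailing
# newline, so formatting with an empty string leaves no blank line behind.
_fit_template = """Build a gradient boosting model from the training set (X, y).

Parameters
----------
X : {x_type} of shape = [n_samples, n_features]
    Input feature matrix.
{client_doc}init_score : {x_type} of shape = [n_samples] or None, optional (default=None)
    Init score of training data.
"""

_client_doc = (
    "client : dask.distributed.Client or None, optional (default=None)\n"
    "    Dask client.\n"
)

# sklearn-style docs: the client entry is dropped entirely.
sklearn_fit_doc = _fit_template.format(x_type="numpy array", client_doc="")

# Dask-style docs: the client entry is spliced in before init_score.
dask_fit_doc = _fit_template.format(x_type="Dask Array", client_doc=_client_doc)

print(sklearn_fit_doc)
print(dask_fit_doc)
```

This stays manageable for one or two optional entries; with many differing arguments the template would accumulate placeholders and become harder to read.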

jameslamb added question feature request doc dask labels Jan 27, 2021

jameslamb mentioned this issue Jan 27, 2021

[dask] Add type hints in Dask package #3866

Merged

jameslamb mentioned this issue Jan 28, 2021

v3.2.0 release #3872

Merged

jameslamb mentioned this issue Feb 9, 2021

[dask] [docs] Fix inaccuracies in API docs for Dask module (fixes #3871) #3930

Merged

jameslamb closed this as completed in 06ed433 Feb 9, 2021
