
[docs] documented crash when a categorical value is bigger than max int32 #1376

Merged
13 commits merged on May 21, 2018
5 changes: 2 additions & 3 deletions README.md
@@ -105,12 +105,11 @@ Microsoft Open Source Code of Conduct

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

-Reference Paper
----------------
+Reference Papers
+----------------

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. "[LightGBM: A Highly Efficient Gradient Boosting Decision Tree](https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree)". In Advances in Neural Information Processing Systems (NIPS), pp. 3149-3157. 2017.

Qi Meng, Guolin Ke, Taifeng Wang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, Tieyan Liu. "[A Communication-Efficient Parallel Algorithm for Decision Tree](http://papers.nips.cc/paper/6380-a-communication-efficient-parallel-algorithm-for-decision-tree)". Advances in Neural Information Processing Systems 29 (NIPS 2016).

Huan Zhang, Si Si and Cho-Jui Hsieh. "[GPU Acceleration for Large-scale Tree Boosting](https://arxiv.org/abs/1706.08359)". arXiv:1706.08359, 2017.

6 changes: 3 additions & 3 deletions docs/Advanced-Topics.rst
@@ -15,13 +15,13 @@ Missing Value Handle
Categorical Feature Support
---------------------------

-- LightGBM can offer a good accuracy when using native categorical features. Not like simply one-hot coding, LightGBM can find the optimal split of categorical features.
-  Such an optimal split can provide the much better accuracy than one-hot coding solution.
+- LightGBM can offer good accuracy when using native categorical features. Unlike simple one-hot encoding, LightGBM can find the optimal split of categorical features.
+  Such an optimal split can provide much better accuracy than a one-hot encoding solution.

- Use ``categorical_feature`` to specify the categorical features.
Refer to the parameter ``categorical_feature`` in `Parameters <./Parameters.rst>`__.

-- Converting to ``int`` type is needed first, and there is support for non-negative numbers only.
+- Converting to ``int`` type is needed first, and there is support for non-negative numbers only. Also, all values should be less than ``Int32.MaxValue`` (2147483647).
  It is better to convert them into a continuous range (see the sketch after this diff).

- Use ``min_data_per_group``, ``cat_smooth`` to deal with over-fitting
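A minimal sketch of the conversion advice above, assuming a pandas DataFrame with a made-up ``city`` column:

```python
import pandas as pd

df = pd.DataFrame({'city': ['tokyo', 'paris', 'tokyo', 'oslo']})

# Encode each category as a non-negative int in the continuous
# range 0..n_categories-1, as the docs recommend.
df['city'] = df['city'].astype('category').cat.codes
print(df['city'].tolist())  # [2, 1, 2, 0]
```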
8 changes: 8 additions & 0 deletions docs/FAQ.rst
@@ -107,6 +107,14 @@ LightGBM

--------------

+- **Question 9**: When I try to specify some column as categorical via the ``categorical_feature`` parameter, I get a segmentation fault in LightGBM.

+- **Solution 9**: You are probably trying to pass a column with very large values (for instance, some IDs) via the ``categorical_feature`` parameter.
+  In LightGBM, categorical features are limited to the int32 range, so you cannot pass values greater than ``Int32.MaxValue`` (2147483647) as categorical features
+  (see `Microsoft/LightGBM#1359 <https://github.com/Microsoft/LightGBM/issues/1359>`__). You should first convert them into integers in the range from zero to the number of categories (see the sketch after this diff).

+--------------

R-package
~~~~~~~~~

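As a hedged illustration of Solution 9, here is one way to remap ID-like values into the zero-based range LightGBM expects; the ``user_id`` values are hypothetical:

```python
import pandas as pd

# IDs above Int32.MaxValue (2147483647) would crash LightGBM
# if passed as categorical values.
user_id = pd.Series([9000000001, 5000000007, 9000000001])

# pd.factorize maps each distinct value to a code in 0..n_categories-1.
codes, uniques = pd.factorize(user_id)
print(codes)  # [0 1 0] -- now safe to use as a categorical feature
```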
2 changes: 1 addition & 1 deletion docs/Features.rst
@@ -63,7 +63,7 @@ So, LightGBM can use an additional parameter ``max_depth`` to limit depth of tree
Optimal Split for Categorical Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-We often convert the categorical features into one-hot coding.
+We often convert the categorical features into one-hot encoding.
However, it is not a good solution for a tree learner.
The reason is that, for high-cardinality categorical features, the tree will grow very unbalanced and needs to be very deep to achieve good accuracy.

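A toy sketch of the cardinality blow-up that motivates this; the ID-like column is hypothetical:

```python
import pandas as pd

s = pd.Series(['id%d' % i for i in range(10000)])  # high-cardinality feature

# One-hot encoding creates one column per category, so the tree
# learner faces 10000 near-empty binary splits.
onehot = pd.get_dummies(s)
print(onehot.shape)  # (10000, 10000)
```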
2 changes: 2 additions & 0 deletions docs/Parameters.rst
@@ -441,6 +441,8 @@ IO Parameters

- **Note**: only supports categorical features with ``int`` type. The index starts from ``0`` and does not count the label column

+- **Note**: all values should be less than ``Int32.MaxValue`` (2147483647)

- **Note**: the negative values will be treated as **missing values**

- ``predict_raw_score``, default=\ ``false``, type=bool, alias=\ ``raw_score``, ``is_predict_raw_score``
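For the CLI, the new note translates to a config entry roughly like the sketch below; the file name and column indices are illustrative, not from the PR:

```
# train.conf -- illustrative snippet
task = train
data = train.txt
# zero-based feature indices, not counting the label column;
# values in these columns must stay below 2147483647
categorical_feature = 0,1,2
```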
2 changes: 1 addition & 1 deletion docs/Quick-Start.rst
@@ -29,7 +29,7 @@ Some columns could be ignored.
Categorical Feature Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~

-LightGBM can use categorical features directly (without one-hot coding).
+LightGBM can use categorical features directly (without one-hot encoding).
The experiment on `Expo data`_ shows about 8x speed-up compared with one-hot encoding.

For the setting details, please refer to `Parameters <./Parameters.rst>`__.
1 change: 1 addition & 0 deletions python-package/lightgbm/basic.py
@@ -603,6 +603,7 @@ def __init__(self, data, label=None, reference=None,
If list of int, interpreted as indices.
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
+All values should be less than int32 max value (2147483647).
params: dict or None, optional (default=None)
Other parameters.
free_raw_data: bool, optional (default=True)
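A minimal sketch of the documented ``Dataset`` argument; the synthetic data is purely illustrative:

```python
import numpy as np
import lightgbm as lgb

X = np.random.rand(100, 3)
X[:, 0] = np.random.randint(0, 5, size=100)  # int-coded categorical column
y = np.random.randint(0, 2, size=100)

# Declare column 0 categorical by index; its codes are small
# non-negative ints, far below the int32 limit.
train_set = lgb.Dataset(X, label=y, categorical_feature=[0])
```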
2 changes: 2 additions & 0 deletions python-package/lightgbm/engine.py
@@ -53,6 +53,7 @@ def train(params, train_set, num_boost_round=100,
If list of int, interpreted as indices.
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
+All values should be less than int32 max value (2147483647).
early_stopping_rounds: int or None, optional (default=None)
Activates early stopping. The model will train until the validation score stops improving.
Requires at least one validation data and one metric. If there's more than one, will check all of them.
@@ -354,6 +355,7 @@ def cv(params, train_set, num_boost_round=100,
If list of int, interpreted as indices.
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
+All values should be less than int32 max value (2147483647).
early_stopping_rounds: int or None, optional (default=None)
Activates early stopping. CV error needs to decrease at least
every ``early_stopping_rounds`` round(s) to continue.
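A sketch of the same argument on ``train`` (``cv`` accepts it identically); parameters and data are illustrative:

```python
import numpy as np
import lightgbm as lgb

X = np.random.rand(200, 4)
X[:, 1] = np.random.randint(0, 10, size=200)  # int-coded categories
y = np.random.randint(0, 2, size=200)

params = {'objective': 'binary', 'verbose': -1}
train_set = lgb.Dataset(X, label=y)

# Mark column 1 categorical at train time via the documented parameter.
booster = lgb.train(params, train_set, num_boost_round=10,
                    categorical_feature=[1])
```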
1 change: 1 addition & 0 deletions python-package/lightgbm/sklearn.py
@@ -341,6 +341,7 @@ def fit(self, X, y,
If list of int, interpreted as indices.
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
+All values should be less than int32 max value (2147483647).
callbacks : list of callback functions or None, optional (default=None)
List of callback functions that are applied at each iteration.
See Callbacks in Python API for more information.
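Finally, a sketch of the scikit-learn interface's ``fit`` parameter, again on synthetic data:

```python
import numpy as np
from lightgbm import LGBMClassifier

X = np.random.rand(150, 3)
X[:, 2] = np.random.randint(0, 4, size=150)  # int-coded categories
y = np.random.randint(0, 2, size=150)

clf = LGBMClassifier(n_estimators=10)
# Column 2 is categorical; its values must be non-negative ints
# below int32 max (2147483647).
clf.fit(X, y, categorical_feature=[2])
```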