
[docs] documented crash when a categorical value is bigger than max int32 #1376

Merged
13 commits merged on May 21, 2018
5 changes: 2 additions & 3 deletions README.md
@@ -105,12 +105,11 @@ Microsoft Open Source Code of Conduct

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

-Reference Paper
----------------
+Reference Papers
+----------------

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. "[LightGBM: A Highly Efficient Gradient Boosting Decision Tree](https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree)". In Advances in Neural Information Processing Systems (NIPS), pp. 3149-3157. 2017.

Qi Meng, Guolin Ke, Taifeng Wang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, Tieyan Liu. "[A Communication-Efficient Parallel Algorithm for Decision Tree](http://papers.nips.cc/paper/6380-a-communication-efficient-parallel-algorithm-for-decision-tree)". Advances in Neural Information Processing Systems 29 (NIPS 2016).

Huan Zhang, Si Si and Cho-Jui Hsieh. "[GPU Acceleration for Large-scale Tree Boosting](https://arxiv.org/abs/1706.08359)". arXiv:1706.08359, 2017.

6 changes: 3 additions & 3 deletions docs/Advanced-Topics.rst
@@ -15,13 +15,13 @@ Missing Value Handle
Categorical Feature Support
---------------------------

-- LightGBM can offer a good accuracy when using native categorical features. Not like simply one-hot coding, LightGBM can find the optimal split of categorical features.
-  Such an optimal split can provide the much better accuracy than one-hot coding solution.
+- LightGBM can offer good accuracy when using native categorical features. Unlike simple one-hot encoding, LightGBM can find the optimal split of categorical features.
+  Such an optimal split can provide much better accuracy than a one-hot encoding solution.

- Use ``categorical_feature`` to specify the categorical features.
Refer to the parameter ``categorical_feature`` in `Parameters <./Parameters.rst>`__.

-- Converting to ``int`` type is needed first, and there is support for non-negative numbers only.
+- Converting to ``int`` type is needed first, and there is support for non-negative numbers only. Also, all values should be less than ``Int32.MaxValue`` (2147483647).
  It is better to convert them into a continuous range (see the sketch after this diff).

- Use ``min_data_per_group``, ``cat_smooth`` to deal with over-fitting
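A minimal sketch of the conversion advice above, assuming a pandas DataFrame with a made-up ``city`` column:

```python
import pandas as pd

df = pd.DataFrame({'city': ['tokyo', 'paris', 'tokyo', 'oslo']})

# Encode each category as a non-negative int in the continuous
# range 0..n_categories-1, as the docs recommend.
df['city'] = df['city'].astype('category').cat.codes
print(df['city'].tolist())  # [2, 1, 2, 0]
```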
8 changes: 8 additions & 0 deletions docs/FAQ.rst
@@ -107,6 +107,14 @@ LightGBM

--------------

+- **Question 9**: When I try to specify some column as categorical via the ``categorical_feature`` parameter, I get a segmentation fault in LightGBM.

+- **Solution 9**: You are probably trying to pass a column with very large values (for instance, some IDs) via the ``categorical_feature`` parameter.
+  In LightGBM, categorical features are limited to the int32 range, so you cannot pass values greater than ``Int32.MaxValue`` (2147483647) as categorical features
+  (see `Microsoft/LightGBM#1359 <https://github.com/Microsoft/LightGBM/issues/1359>`__). You should first convert them into integers in the range from zero to the number of categories (see the sketch after this diff).

+--------------

R-package
~~~~~~~~~

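As a hedged illustration of Solution 9, here is one way to remap ID-like values into the zero-based range LightGBM expects; the ``user_id`` values are hypothetical:

```python
import pandas as pd

# IDs above Int32.MaxValue (2147483647) would crash LightGBM
# if passed as categorical values.
user_id = pd.Series([9000000001, 5000000007, 9000000001])

# pd.factorize maps each distinct value to a code in 0..n_categories-1.
codes, uniques = pd.factorize(user_id)
print(codes)  # [0 1 0] -- now safe to use as a categorical feature
```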
2 changes: 1 addition & 1 deletion docs/Features.rst
@@ -63,7 +63,7 @@ So, LightGBM can use an additional parameter ``max_depth`` to limit depth of tree
Optimal Split for Categorical Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-We often convert the categorical features into one-hot coding.
+We often convert the categorical features into one-hot encoding.
However, it is not a good solution for a tree learner.
The reason is that, for high-cardinality categorical features, the tree will grow very unbalanced and needs to be very deep to achieve good accuracy.

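A toy sketch of the cardinality blow-up that motivates this; the ID-like column is hypothetical:

```python
import pandas as pd

s = pd.Series(['id%d' % i for i in range(10000)])  # high-cardinality feature

# One-hot encoding creates one column per category, so the tree
# learner faces 10000 near-empty binary splits.
onehot = pd.get_dummies(s)
print(onehot.shape)  # (10000, 10000)
```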
2 changes: 2 additions & 0 deletions docs/Parameters.rst
@@ -441,6 +441,8 @@ IO Parameters

- **Note**: only supports categorical features with ``int`` type. The index starts from ``0`` and does not count the label column

+- **Note**: all values should be less than ``Int32.MaxValue`` (2147483647)

- **Note**: the negative values will be treated as **missing values**

- ``predict_raw_score``, default=\ ``false``, type=bool, alias=\ ``raw_score``, ``is_predict_raw_score``
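For the CLI, the new note translates to a config entry roughly like the sketch below; the file name and column indices are illustrative, not from the PR:

```
# train.conf -- illustrative snippet
task = train
data = train.txt
# zero-based feature indices, not counting the label column;
# values in these columns must stay below 2147483647
categorical_feature = 0,1,2
```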
2 changes: 1 addition & 1 deletion docs/Quick-Start.rst
@@ -29,7 +29,7 @@ Some columns could be ignored.
Categorical Feature Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~

-LightGBM can use categorical features directly (without one-hot coding).
+LightGBM can use categorical features directly (without one-hot encoding).
The experiment on `Expo data`_ shows about 8x speed-up compared with one-hot encoding.

For the setting details, please refer to `Parameters <./Parameters.rst>`__.
1 change: 1 addition & 0 deletions python-package/lightgbm/basic.py
@@ -603,6 +603,7 @@ def __init__(self, data, label=None, reference=None,
If list of int, interpreted as indices.
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
+All values should be less than int32 max value (2147483647).
params: dict or None, optional (default=None)
Other parameters.
free_raw_data: bool, optional (default=True)
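A minimal sketch of the documented ``Dataset`` argument; the synthetic data is purely illustrative:

```python
import numpy as np
import lightgbm as lgb

X = np.random.rand(100, 3)
X[:, 0] = np.random.randint(0, 5, size=100)  # int-coded categorical column
y = np.random.randint(0, 2, size=100)

# Declare column 0 categorical by index; its codes are small
# non-negative ints, far below the int32 limit.
train_set = lgb.Dataset(X, label=y, categorical_feature=[0])
```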
2 changes: 2 additions & 0 deletions python-package/lightgbm/engine.py
@@ -53,6 +53,7 @@ def train(params, train_set, num_boost_round=100,
If list of int, interpreted as indices.
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
+All values should be less than int32 max value (2147483647).
early_stopping_rounds: int or None, optional (default=None)
Activates early stopping. The model will train until the validation score stops improving.
Requires at least one validation data and one metric. If there's more than one, will check all of them.
@@ -354,6 +355,7 @@ def cv(params, train_set, num_boost_round=100,
If list of int, interpreted as indices.
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
+All values should be less than int32 max value (2147483647).
early_stopping_rounds: int or None, optional (default=None)
Activates early stopping. CV error needs to decrease at least
every ``early_stopping_rounds`` round(s) to continue.
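A sketch of the same argument on ``train`` (``cv`` accepts it identically); parameters and data are illustrative:

```python
import numpy as np
import lightgbm as lgb

X = np.random.rand(200, 4)
X[:, 1] = np.random.randint(0, 10, size=200)  # int-coded categories
y = np.random.randint(0, 2, size=200)

params = {'objective': 'binary', 'verbose': -1}
train_set = lgb.Dataset(X, label=y)

# Mark column 1 categorical at train time via the documented parameter.
booster = lgb.train(params, train_set, num_boost_round=10,
                    categorical_feature=[1])
```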
1 change: 1 addition & 0 deletions python-package/lightgbm/sklearn.py
@@ -341,6 +341,7 @@ def fit(self, X, y,
If list of int, interpreted as indices.
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
+All values should be less than int32 max value (2147483647).
callbacks : list of callback functions or None, optional (default=None)
List of callback functions that are applied at each iteration.
See Callbacks in Python API for more information.
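Finally, a sketch of the scikit-learn interface's ``fit`` parameter, again on synthetic data:

```python
import numpy as np
from lightgbm import LGBMClassifier

X = np.random.rand(150, 3)
X[:, 2] = np.random.randint(0, 4, size=150)  # int-coded categories
y = np.random.randint(0, 2, size=150)

clf = LGBMClassifier(n_estimators=10)
# Column 2 is categorical; its values must be non-negative ints
# below int32 max (2147483647).
clf.fit(X, y, categorical_feature=[2])
```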