Categorical features with large int cause segmentation fault #1359

Closed
qmick opened this issue May 5, 2018 · 2 comments · Fixed by #1376

qmick commented May 5, 2018

Environment info

Operating System: Ubuntu server 16.04 64bit
CPU: Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz * 2
C++/Python/R version: Python 3.5.2

Error Message:

Python output:

/home/zhang/.local/lib/python3.5/site-packages/lightgbm/basic.py:1038: UserWarning: categorical_feature in Dataset is overridden. New categorical_feature is ['item_id', 'user_id']
warnings.warn('categorical_feature in Dataset is overridden. New categorical_feature is {}'.format(sorted(list(categorical_feature))))
[LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
[LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
[LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
[1] 69368 segmentation fault python3 train.py

GDB output:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `python3 train.py'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 LightGBM::BinMapper::FindBin (this=, values=, num_sample_values=, total_sample_cnt=3, max_bin=255, min_data_in_bin=3, min_split_data=20,
bin_type=LightGBM::CategoricalBin, use_missing=true, zero_as_missing=false) at /home/zhang/lightgbm/LightGBM/src/io/bin.cpp:322
322 if (distinct_values_int[0] == 0) {
[Current thread is 1 (Thread 0x7f22a3957700 (LWP 44144))]

Reproducible examples

import lightgbm as lgb
import pandas as pd

data = {'user_id':[4505772604969228686, 2692638157208937547, 5247924392014515924],
        'item_id': [3412720377098676069, 3412720377098676069, 3412720377098676069]}
df = pd.DataFrame(data=data)

lgb_train = lgb.Dataset(df, label=[0, 1, 1])
params = {
    'objective': 'binary',
    'metric': 'binary_logloss'
}

gbm = lgb.train(params, lgb_train, categorical_feature=['user_id', 'item_id'])

Steps to reproduce

  1. Run example above

Possible reason

It seems to be caused by a Python int to C++ int conversion error: large Python ints become negative on the C++ side. If all values in a DataFrame column are too large, which is common for ID features, they are all treated as missing values. The vector distinct_values_int is then empty, and distinct_values_int[0] causes an access violation.
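
Not part of the original report, but a quick check of the overflow claim: every ID in the reproducible example is far beyond the signed 32-bit range, so the conversion described above cannot preserve the values. A minimal sketch, assuming only NumPy:

import numpy as np

# 2147483647 -- the largest value a C++ 32-bit signed int can hold
int32_max = np.iinfo(np.int32).max
ids = [4505772604969228686, 2692638157208937547, 5247924392014515924]

# every ID exceeds the 32-bit range, so it cannot survive the
# Python-int-to-C++-int conversion intact
print([v > int32_max for v in ids])   # [True, True, True]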

Using sklearn.preprocessing.LabelEncoder works around the problem, but I think this should be fixed, or at least raise a Python error instead of a segmentation fault, since the crash kills the Python notebook kernel.
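
A minimal sketch of the LabelEncoder workaround mentioned above, reusing the column names from the reproducible example; the preprocessing loop is illustrative user-side code, not something LightGBM does for you:

import lightgbm as lgb
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = {'user_id': [4505772604969228686, 2692638157208937547, 5247924392014515924],
        'item_id': [3412720377098676069, 3412720377098676069, 3412720377098676069]}
df = pd.DataFrame(data=data)

# remap each huge ID column to small consecutive integers (0..n_unique-1)
# before the values ever reach the C++ side
for col in ['user_id', 'item_id']:
    df[col] = LabelEncoder().fit_transform(df[col])

lgb_train = lgb.Dataset(df, label=[0, 1, 1])
params = {'objective': 'binary', 'metric': 'binary_logloss'}
gbm = lgb.train(params, lgb_train, categorical_feature=['user_id', 'item_id'])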

@guolinke
Collaborator

guolinke commented May 5, 2018

@StrikerRUS I think we can check this on the Python side.

@qmick For categorical features, using consecutive integers starting from zero is the most efficient encoding for LightGBM, and we only support 32-bit int on the C++ side. When the value range exceeds 32 bits, using categorical features is very slow (as are other solutions).
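
For illustration only (this is user-side preprocessing, not LightGBM API): the contiguous-from-zero encoding described above can also be obtained with pandas category codes, e.g.:

import pandas as pd

df = pd.DataFrame({'user_id': [4505772604969228686, 2692638157208937547, 5247924392014515924]})

# pandas assigns each distinct value a code in 0..n_unique-1,
# which fits comfortably in a 32-bit int
df['user_id'] = df['user_id'].astype('category').cat.codes
print(df['user_id'].tolist())   # e.g. [1, 0, 2]; codes follow the sorted category order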

@StrikerRUS
Collaborator

@guolinke I'll try, but I can't promise to do it fast.

@StrikerRUS self-assigned this May 6, 2018
lock bot locked as resolved and limited conversation to collaborators Mar 12, 2020