
[docs] documented crash in case categorical values is bigger max int32 #1376

Merged
13 commits merged into master on May 21, 2018

Conversation

StrikerRUS
Collaborator

@StrikerRUS StrikerRUS commented May 18, 2018

Closed #1359.

This PR doesn't fix the case when a large categorical value comes from a file (i.e. data is a string with a path to the file). @guolinke, is it possible to check this on the Python side? Also, I don't know whether we need to check data for the prediction task. I mean, if there are no such large values in the training dataset, what is the chance of meeting them in the test dataset?

In addition, I doubt the necessity of these time-consuming checks...
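
(For illustration only, a minimal sketch of the kind of check under discussion, assuming data is a 2-D numpy array and categorical_indices is an iterable of column indices; both names are placeholders and this is not the PR's actual code.)

import numpy as np

MAX_INT32 = (1 << 31) - 1

def has_large_categorical(data, categorical_indices):
    # True if any categorical column contains a value above the int32 maximum.
    cols = list(categorical_indices)
    if not cols:
        return False
    return bool((data[:, cols] > MAX_INT32).any())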

@StrikerRUS StrikerRUS requested a review from guolinke May 18, 2018 11:46
@guolinke
Collaborator

@StrikerRUS Is the current check time-consuming? Is that because it iterates over all the int values?

@StrikerRUS
Collaborator Author

StrikerRUS commented May 19, 2018

@guolinke In my opinion, yes. On my ordinary home computer the results are:

import numpy as np
import scipy.sparse as sp

MAX_INT32 = (1 << 31) - 1

# dense case
data = np.random.rand(100000, 100) * 100
data[4][2] = 1 << 31

%%timeit
(data[:, [1, 2, 4]] > MAX_INT32).any()
# 100 loops, best of 3: 2.83 ms per loop

# sparse (CSR) case
data = sp.random(100000, 100, density=0.4, format='csr') * 100
data[4, 2] = 1 << 31

%%timeit
(data[:, [1, 2, 4]] > MAX_INT32).nnz
# 100 loops, best of 3: 16.8 ms per loop

I'll try to find a more efficient solution.

Also, please share your thoughts about checking the test data and the data loaded from a file.

@Laurae2
Contributor

Laurae2 commented May 19, 2018

@StrikerRUS In which case would we have a categorical value larger than int32? That's a huge cardinality (it should start from 0 anyway).

I could imagine such a use case when targeting users for ads, but having 2+ billion distinct users is a lot...

@StrikerRUS
Collaborator Author

@Laurae2 The case was described in #1359. It involves treating IDs (which are random values greater than max int32 and don't start from 0) as categories.

@Laurae2
Contributor

Laurae2 commented May 19, 2018

@StrikerRUS This seems more like a preprocessing issue than a LightGBM issue.

In R, to avoid such issues, we use a rule generator that can be applied to new datasets: https://github.com/Microsoft/LightGBM/blob/master/R-package/R/lgb.prepare_rules.R

It transforms the categorical features into numeric features starting from 0 (0 is NA, 1..cardinality are the categorical values) so they can be used afterwards as categorical features in LightGBM and other libraries. This issue exists for many other machine learning libraries as well.
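
(For reference, a rough Python analogue of that rule-based remapping, assuming pandas is available; the helper names and the 0-for-NA convention mirror the R description above and are purely illustrative.)

import pandas as pd

def make_rules(series):
    # Map each observed category to 1..cardinality; 0 is reserved for NA/unseen values.
    levels = series.dropna().unique()
    return {value: code for code, value in enumerate(levels, start=1)}

def apply_rules(series, rules):
    # Apply the mapping; NA and unseen values become 0.
    return series.map(rules).fillna(0).astype(int)

# Fit the rules on the training column, then reuse them on new data.
train_ids = pd.Series([9000000000, 12, 9000000000, None])   # 9000000000 > max int32
rules = make_rules(train_ids)
encoded_train = apply_rules(train_ids, rules)                # [1, 2, 1, 0]
encoded_test = apply_rules(pd.Series([12, 777]), rules)      # unseen 777 -> 0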

@StrikerRUS
Collaborator Author

@Laurae2 At present, the same thing is done only for pandas DataFrames in the Python-package. In my opinion it's not a LightGBM issue either, but at least we should document this.

@guolinke
Collaborator

@StrikerRUS @Laurae2
I agree. There is no reason to incur such a great additional cost.

@StrikerRUS
Collaborator Author

StrikerRUS commented May 19, 2018

@guolinke I've just found a way to speed it up, but the improvement is practically unnoticeable.

So, what's the conclusion? Reject all checks and enhance the documentation, right?

@guolinke
Collaborator

@StrikerRUS
If the cost of checking is small, we can add it. Otherwise, we enhance the documentation.

@StrikerRUS
Collaborator Author

@guolinke OK. Then I'll commit a slightly more efficient check and ask you to measure the time cost on any real dataset of significant size (unfortunately, I can't do it right now because my SSD is completely full).

@StrikerRUS
Collaborator Author

Done!
Please compare the time consumption of this branch and master on any reasonably big real dataset.

@guolinke
I think that enhancing the docs is needed anyway, even if we perform the checks, because we don't preprocess data loaded directly from a file or any prediction data. I'll write the notes soon.

@StrikerRUS
Collaborator Author

Whatever the timing results show, notes about the maximum values of categorical features have been added to the docs and docstrings in the last commit.

@StrikerRUS StrikerRUS changed the title [python] prevent crash in case categorical values is bigger max int32 [docs][python] prevent crash in case categorical values is bigger max int32 May 19, 2018
"""
Initialize data from a CSR matrix.
"""
if len(csr.indices) != len(csr.data):
raise ValueError('Length mismatch: {} vs {}'.format(len(csr.indices), len(csr.data)))
self.handle = ctypes.c_void_p()
if categorical_indices is not None and len(categorical_indices) != 0 and csr[:, list(categorical_indices)].max() > MAX_INT32:
Collaborator

Will this be faster than csr.data.max()?
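
(For comparison, a self-contained sketch of the two candidate checks being weighed here; the data is synthetic and the variable names are illustrative. csr.data.max() scans only the stored values without building a sliced matrix, but it looks at every column, not just the categorical ones.)

import scipy.sparse as sp

MAX_INT32 = (1 << 31) - 1
csr = sp.random(1000, 50, density=0.1, format='csr') * 100
categorical_indices = [1, 2, 4]

# Slice out only the categorical columns, then take the max (builds a new sparse matrix).
per_column_check = csr[:, list(categorical_indices)].max() > MAX_INT32

# Take the max over all stored values (no slicing, but it also flags large values in non-categorical columns).
whole_matrix_check = csr.data.max() > MAX_INT32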

@guolinke
Collaborator

@StrikerRUS
I just tested it.

Case 1, without column indices:

import numpy as np
import scipy.sparse as sp
import time

MAX_INT32 = (1 << 31) - 1
loop = 200
# dense
data = np.random.rand(1000000, 100) * 100
start = time.time()
for _ in range(loop):
    t = data.max()
end = time.time()
print((end - start) / loop)

# sparse, CSR
data = sp.rand(100000, 10000, 0.01).tocsr()
start = time.time()
for _ in range(loop):
    t = data.max()
end = time.time()
print((end - start) / loop)

# sparse, CSC
data = sp.rand(100000, 10000, 0.01).tocsc()
start = time.time()
for _ in range(loop):
    t = data.max()
end = time.time()
print((end - start) / loop)

All of them are fast; the time costs are:

0.0329028058052063    # dense
0.003496755361557007  # CSR
0.00352044939994812   # CSC

Case 2, with column indices:

import numpy as np
import scipy.sparse as sp
import time

MAX_INT32 = (1 << 31) - 1
loop = 20
# dense
data = np.random.rand(1000000, 100) * 100
catcols = list(range(0, 10)) + list(range(50, 60)) + list(range(80, 100))
start = time.time()
for _ in range(loop):
    t = data[:, catcols].max()
end = time.time()
print((end - start) / loop)

# sparse, CSR
data = sp.rand(100000, 10000, 0.01).tocsr()
start = time.time()
for _ in range(loop):
    t = data[:, catcols].max()
end = time.time()
print((end - start) / loop)

# sparse, CSC
data = sp.rand(100000, 10000, 0.01).tocsc()
start = time.time()
for _ in range(loop):
    t = data[:, catcols].max()
end = time.time()
print((end - start) / loop)

It is about 10x slower for dense and CSR (CSC is unaffected):

0.3913775205612183    # dense
0.022434639930725097  # CSR
0.002259492874145508  # CSC

@StrikerRUS
Collaborator Author

@guolinke Many thanks!

So, remove all checks and leave only the notes in the docs, right?

@guolinke
Collaborator

Yeah, sure.

@StrikerRUS
Collaborator Author

Done!

@StrikerRUS StrikerRUS changed the title [docs][python] prevent crash in case categorical values is bigger max int32 [docs] documented crash in case categorical values is bigger max int32 May 20, 2018
@StrikerRUS StrikerRUS merged commit a0c6941 into master May 21, 2018
@StrikerRUS StrikerRUS deleted the int32 branch May 21, 2018 22:54
@lock lock bot locked as resolved and limited conversation to collaborators Mar 11, 2020
Successfully merging this pull request may close these issues.

Categorical features with large int cause segmentation fault