
[docs] documented crash in case categorical values is bigger max int32 #1376

Merged
13 commits merged into master on May 21, 2018

Conversation

StrikerRUS
Collaborator

@StrikerRUS StrikerRUS commented May 18, 2018

Closed #1359.

This PR doesn't fix the case when a large categorical value comes from a file (i.e. data is a string with a path to the file). @guolinke, is it possible to check this on the Python side? Also, I don't know whether we need to check data for the prediction task. I mean, if there are no such large values in the training dataset, what is the chance of meeting them in the test dataset?

In addition, I doubt the necessity of these time-consuming checks...
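
(For illustration only, a minimal sketch of the kind of check under discussion, assuming data is a 2-D numpy array and categorical_indices is an iterable of column indices; both names are placeholders and this is not the PR's actual code.)

import numpy as np

MAX_INT32 = (1 << 31) - 1

def has_large_categorical(data, categorical_indices):
    # True if any categorical column contains a value above the int32 maximum.
    cols = list(categorical_indices)
    if not cols:
        return False
    return bool((data[:, cols] > MAX_INT32).any())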

@StrikerRUS StrikerRUS requested a review from guolinke May 18, 2018 11:46
@guolinke
Collaborator

@StrikerRUS Is the current check time-consuming? Is that because it iterates over all the int values?

@StrikerRUS
Collaborator Author

StrikerRUS commented May 19, 2018

@guolinke In my opinion, yes. On my ordinary home computer the results are:

import numpy as np
import scipy.sparse as sp

MAX_INT32 = (1 << 31) - 1

# dense case
data = np.random.rand(100000, 100) * 100
data[4][2] = 1 << 31

%%timeit
(data[:, [1, 2, 4]] > MAX_INT32).any()
# 100 loops, best of 3: 2.83 ms per loop

# sparse (CSR) case
data = sp.random(100000, 100, density=0.4, format='csr') * 100
data[4, 2] = 1 << 31

%%timeit
(data[:, [1, 2, 4]] > MAX_INT32).nnz
# 100 loops, best of 3: 16.8 ms per loop

I'll try to find a more efficient solution.

Also, please share your thoughts about checking the test data and the data loaded from a file.

@Laurae2
Contributor

Laurae2 commented May 19, 2018

@StrikerRUS In which case would we have a categorical value larger than int32? That's a huge cardinality (it should start from 0 anyway).

I could imagine such a use case when targeting users for ads, but having 2+ billion distinct users is a lot...

@StrikerRUS
Collaborator Author

@Laurae2 The case was described in #1359. It involves treating IDs (which are random values greater than max int32 and don't start from 0) as categories.

@Laurae2
Contributor

Laurae2 commented May 19, 2018

@StrikerRUS This seems more like a preprocessing issue than a LightGBM issue.

In R, to avoid such issues, we use a rule generator that can be applied to new datasets: https://github.com/Microsoft/LightGBM/blob/master/R-package/R/lgb.prepare_rules.R

It transforms the categorical features into numeric features starting from 0 (0 is NA, 1..cardinality are the categorical values) so they can be used afterwards as categorical features in LightGBM and other libraries. This issue exists for many other machine learning libraries as well.
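
(For reference, a rough Python analogue of that rule-based remapping, assuming pandas is available; the helper names and the 0-for-NA convention mirror the R description above and are purely illustrative.)

import pandas as pd

def make_rules(series):
    # Map each observed category to 1..cardinality; 0 is reserved for NA/unseen values.
    levels = series.dropna().unique()
    return {value: code for code, value in enumerate(levels, start=1)}

def apply_rules(series, rules):
    # Apply the mapping; NA and unseen values become 0.
    return series.map(rules).fillna(0).astype(int)

# Fit the rules on the training column, then reuse them on new data.
train_ids = pd.Series([9000000000, 12, 9000000000, None])   # 9000000000 > max int32
rules = make_rules(train_ids)
encoded_train = apply_rules(train_ids, rules)                # [1, 2, 1, 0]
encoded_test = apply_rules(pd.Series([12, 777]), rules)      # unseen 777 -> 0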

@StrikerRUS
Collaborator Author

@Laurae2 At present, the same thing is done only for pandas DataFrames in the Python-package. In my opinion it's not a LightGBM issue either, but at least we should document this.

@guolinke
Collaborator

@StrikerRUS @Laurae2
I agree. There is no reason to incur such a great additional cost.

@StrikerRUS
Collaborator Author

StrikerRUS commented May 19, 2018

@guolinke I've just found a way to speed it up, but the improvement is practically unnoticeable.

So, what's the conclusion? Reject all checks and enhance the documentation, right?

@guolinke
Collaborator

@StrikerRUS
If the cost of checking is small, we can add it. Otherwise, we enhance the documentation.

@StrikerRUS
Collaborator Author

@guolinke OK. Then I'll commit a slightly more efficient check and ask you to measure the time cost on any real dataset of significant size (unfortunately, I can't do it right now because my SSD is completely full).

@StrikerRUS
Collaborator Author

Done!
Please compare the time consumption of this branch and master on any reasonably big real dataset.

@guolinke
I think that enhancing the docs is needed anyway, even if we perform the checks, because we don't preprocess data loaded directly from a file or any prediction data. I'll write the notes soon.

@StrikerRUS
Collaborator Author

Whatever the timing results show, notes about the maximum values of categorical features have been added to the docs and docstrings in the last commit.

@StrikerRUS StrikerRUS changed the title [python] prevent crash in case categorical values is bigger max int32 [docs][python] prevent crash in case categorical values is bigger max int32 May 19, 2018
"""
Initialize data from a CSR matrix.
"""
if len(csr.indices) != len(csr.data):
raise ValueError('Length mismatch: {} vs {}'.format(len(csr.indices), len(csr.data)))
self.handle = ctypes.c_void_p()
if categorical_indices is not None and len(categorical_indices) != 0 and csr[:, list(categorical_indices)].max() > MAX_INT32:
Collaborator

Will this be faster than csr.data.max()?
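
(For comparison, a self-contained sketch of the two candidate checks being weighed here; the data is synthetic and the variable names are illustrative. csr.data.max() scans only the stored values without building a sliced matrix, but it looks at every column, not just the categorical ones.)

import scipy.sparse as sp

MAX_INT32 = (1 << 31) - 1
csr = sp.random(1000, 50, density=0.1, format='csr') * 100
categorical_indices = [1, 2, 4]

# Slice out only the categorical columns, then take the max (builds a new sparse matrix).
per_column_check = csr[:, list(categorical_indices)].max() > MAX_INT32

# Take the max over all stored values (no slicing, but it also flags large values in non-categorical columns).
whole_matrix_check = csr.data.max() > MAX_INT32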

@guolinke
Collaborator

@StrikerRUS
I just tested it.

Case 1, without column indices:

import numpy as np
import scipy.sparse as sp
import time

MAX_INT32 = (1 << 31) - 1
loop = 200
# dense
data = np.random.rand(1000000, 100) * 100
start = time.time()
for _ in range(loop):
    t = data.max()
end = time.time()
print((end - start) / loop)

# sparse, CSR
data = sp.rand(100000, 10000, 0.01).tocsr()
start = time.time()
for _ in range(loop):
    t = data.max()
end = time.time()
print((end - start) / loop)

# sparse, CSC
data = sp.rand(100000, 10000, 0.01).tocsc()
start = time.time()
for _ in range(loop):
    t = data.max()
end = time.time()
print((end - start) / loop)

All of them are fast; the time costs are:

0.0329028058052063    # dense
0.003496755361557007  # CSR
0.00352044939994812   # CSC

Case 2, with column indices:

import numpy as np
import scipy.sparse as sp
import time

MAX_INT32 = (1 << 31) - 1
loop = 20
# dense
data = np.random.rand(1000000, 100) * 100
catcols = list(range(0, 10)) + list(range(50, 60)) + list(range(80, 100))
start = time.time()
for _ in range(loop):
    t = data[:, catcols].max()
end = time.time()
print((end - start) / loop)

# sparse, CSR
data = sp.rand(100000, 10000, 0.01).tocsr()
start = time.time()
for _ in range(loop):
    t = data[:, catcols].max()
end = time.time()
print((end - start) / loop)

# sparse, CSC
data = sp.rand(100000, 10000, 0.01).tocsc()
start = time.time()
for _ in range(loop):
    t = data[:, catcols].max()
end = time.time()
print((end - start) / loop)

It is about 10x slower for dense and CSR (CSC is unaffected):

0.3913775205612183    # dense
0.022434639930725097  # CSR
0.002259492874145508  # CSC

@StrikerRUS
Collaborator Author

@guolinke Many thanks!

So, remove all checks and leave only the notes in the docs, right?

@guolinke
Collaborator

Yeah, sure.

@StrikerRUS
Collaborator Author

Done!

@StrikerRUS StrikerRUS changed the title [docs][python] prevent crash in case categorical values is bigger max int32 [docs] documented crash in case categorical values is bigger max int32 May 20, 2018
@StrikerRUS StrikerRUS merged commit a0c6941 into master May 21, 2018
@StrikerRUS StrikerRUS deleted the int32 branch May 21, 2018 22:54
@lock lock bot locked as resolved and limited conversation to collaborators Mar 11, 2020
Successfully merging this pull request may close these issues.

Categorical features with large int cause segmentation fault