[Fixed] KeyError due to more extreme values in the test set (in some cases, test-set feature values larger than those in the training set cause a KeyError) #1

Closed
songshijun007 opened this issue Feb 14, 2020 · 2 comments

Comments

@songshijun007

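Hello, author!
I ran into a case where the feature discretization intervals learned on the training set have no matching entry for some values in the test set, which raises the error KeyError: 450~inf. After tracing the error, I made a quick-and-dirty change to the source file LogisticRegressionScoreCard.py: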
def map_np(array, dictionary):
    """Map function for a numpy array.

    Parameters
    ----------
    array: numpy.array, shape (number of examples,)
        The array of data to map values to.

    dictionary: dict
        The dictionary object.

    Return
    ----------
    result: numpy.array, shape (number of examples,)
        The mapped result.
    """
    result = []
    for e in array:
        try:
            result.append(dictionary[e])
        except KeyError:
            # Workaround: fall back to 0 when the interval has no entry
            result.append(0)
    # Original implementation: return [dictionary[e] for e in array]
    return result
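
For readers comparing options, the same fallback can be written more compactly with dict.get. This is only a sketch of the workaround above, not the library's code: the name map_np_with_default, the default parameter, and the "-inf~450" and "450~900" labels are made up for illustration. Note that a silent default of 0 hides the missing interval rather than fixing it.

```python
import numpy as np

def map_np_with_default(array, dictionary, default=0):
    """Map each element through `dictionary`; unseen keys get `default`.

    Hypothetical helper for illustration, not part of scorecardbundle.
    """
    return np.array([dictionary.get(e, default) for e in array])

# Illustrative score table: the "450~inf" interval was never seen in training.
score_dict = {"-inf~450": 12, "450~900": 30}
intervals = np.array(["-inf~450", "450~inf", "450~900"])

scores = map_np_with_default(intervals, score_dict)
print(scores)
```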

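The test now runs. What puzzles me is that one variable's discretization interval spans from negative infinity to positive infinity, so a value should never fail to find its interval. Could it be that this interval never received such a value in the training set?

Finally, thank you for writing this useful library!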

@Lantianzz
Owner

Lantianzz commented Apr 5, 2020

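Hello, I am very sorry that it took me almost two months to reply. I have not been checking this mailbox often since the New Year, so your message got buried... I only came across it by chance today. Has the problem been resolved? Thank you very much for using what I wrote; I hope the late reply did not cause you too much inconvenience...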

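1. "KeyError: 450~inf"

I looked at the relevant code. Is "KeyError: 450~inf" raised by the line scores = map_np(intervals, score_dict) in the _applyScoreCard function? After checking carefully, I found that in a few cases the intervals produced by ChiMerge during feature discretization do not form a continuous range from negative infinity to positive infinity. This bug has been fixed in the latest version; you can upgrade with pip install scorecardbundle -U.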

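In detail: when a feature has few unique values in the training set (for example, all values are 0), the largest of the boundaries used to split the intervals equals the feature's maximum value. The original code did deliberately append inf, but because this library uses left-open, right-closed intervals throughout, the feature's maximum value is assigned to the interval xxx~max rather than xxx~inf. If a larger number appears in the test set, subsequent steps that rely on these intervals, such as WOE and the scorecard, raise an error. Adjusting ChiMerge is therefore the once-and-for-all fix for this bug.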
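
The failure mode can be reproduced in a few lines. The following is a minimal sketch, not the library's code: assign_intervals is a hypothetical helper, the label format simply mimics the "450~inf" style from the error message, and the scores are placeholders. It assumes left-open, right-closed intervals as described above.

```python
import numpy as np

def assign_intervals(values, boundaries):
    """Label each value with its left-open, right-closed interval (lo, hi].

    Hypothetical helper for illustration, not part of scorecardbundle.
    """
    edges = np.concatenate(([-np.inf], boundaries))
    # With right=True, digitize returns i such that edges[i-1] < x <= edges[i].
    idx = np.digitize(values, edges, right=True)
    return [f"{float(edges[i - 1]):g}~{float(edges[i]):g}" for i in idx]

# Training feature is constant, so the largest finite boundary equals its maximum.
train = np.zeros(5)
boundaries = np.array([train.max(), np.inf])

# The score table only knows the intervals that occurred in training.
train_labels = assign_intervals(train, boundaries)
score_dict = {label: 1.0 for label in sorted(set(train_labels))}  # placeholder scores

# A larger test value lands in an interval the table has never seen.
test_labels = assign_intervals(np.array([0.0, 3.0]), boundaries)
missing = [label for label in test_labels if label not in score_dict]
print(score_dict, missing)
```

Looking up any label in missing with score_dict[label] is what raises the KeyError.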

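2. Discretization interval ranging from negative infinity to positive infinity

This library currently uses the ChiMerge algorithm to discretize features. The algorithm looks at how the response rate of the dependent variable differs across a feature's value intervals, merges similar intervals, and keeps only intervals that differ significantly in a statistical sense. If none of a feature's intervals differ significantly, ChiMerge merges them into a single interval, -inf~inf. In that case the feature can be considered to lack discriminative power and can be dropped. If you still want to keep it, set ChiMerge's min_intervals parameter to 2 or more; merging then stops once only min_intervals intervals remain, so the feature is output with multiple intervals.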
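
To make the stopping rule concrete, here is a simplified sketch of chi-square-based merging. It is not the library's ChiMerge implementation: the function names, the [negatives, positives] count layout, and the threshold of 3.841 (the 0.05 critical value of chi-square with one degree of freedom) are assumptions chosen for illustration.

```python
import numpy as np

def chi2_adjacent(a, b):
    """Pearson chi-square statistic for two adjacent intervals.

    a and b are [negatives, positives] count pairs.
    """
    table = np.array([a, b], dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()
    mask = expected > 0  # skip empty cells to avoid division by zero
    return float((((table - expected) ** 2)[mask] / expected[mask]).sum())

def merge_intervals(counts, threshold=3.841, min_intervals=1):
    """Merge the most similar adjacent pair until every adjacent pair differs
    significantly or only `min_intervals` intervals remain."""
    counts = [list(c) for c in counts]
    while len(counts) > max(min_intervals, 1):
        chi2 = [chi2_adjacent(counts[i], counts[i + 1])
                for i in range(len(counts) - 1)]
        i = int(np.argmin(chi2))
        if chi2[i] >= threshold:
            break  # all remaining neighbours differ significantly
        counts[i] = [counts[i][0] + counts[i + 1][0],
                     counts[i][1] + counts[i + 1][1]]
        del counts[i + 1]
    return counts

# Three intervals with the same 10% response rate: nothing distinguishes them.
counts = [[90, 10], [180, 20], [45, 5]]
one = merge_intervals(counts)                   # collapses to a single interval
two = merge_intervals(counts, min_intervals=2)  # merging stops at two intervals
print(one, two)
```

With the default setting the three intervals collapse into one, which corresponds to the -inf~inf case; with min_intervals=2 the merging stops early and two intervals survive.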

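Finally, thank you very much for the feedback. This library is my first open-source project, and looking at it now, I have been maintaining it too infrequently... Sorry again.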

@Lantianzz
Owner

Lantianzz commented Apr 6, 2020

For others who also encounter a KeyError (e.g. KeyError: 450~inf) caused by more extreme values in the test set: this issue has been resolved in the newest release. To avoid this bug, run pip install scorecardbundle -U to update to the newest version.

Here is an explanation of what happened. When a feature has fewer unique values than the min_intervals parameter (e.g. all values of the feature are 0), the largest interval boundary may equal the maximum value of the feature. In this case, although I deliberately added 'inf' to the boundaries, the maximum value of the feature is still assigned to the interval "xxx~max" rather than "xxx~inf", since all intervals used in this module are closed on the right. This causes a KeyError in WOE, Scorecard and other subsequent steps that rely on intervals when the test set contains values larger than the maximum value in the training set. Therefore, adjusting ChiMerge fixes this bug at its root.

Thanks songshijun007 again for bringing up this bug.

@Lantianzz Lantianzz changed the title from "Not within the interval range" to "[Future Fix] KeyError due to more extreme values in the test set in some cases, test-set feature values larger than those in the training set cause a KeyError" Apr 6, 2020
@Lantianzz Lantianzz changed the title to "[Future Fix] KeyError due to more extreme values in the test set (in some cases, test-set feature values larger than those in the training set cause a KeyError)" Apr 6, 2020
@Lantianzz Lantianzz changed the title to "[Fixed] KeyError due to more extreme values in the test set (in some cases, test-set feature values larger than those in the training set cause a KeyError)" Feb 5, 2021