Using MultiLabelBinarizer #79

Open
IdoZehori opened this issue Jan 17, 2018 · 7 comments

@IdoZehori

Hey,

I ran into a problem when trying to perform k-hot encoding with sklearn's MultiLabelBinarizer: the conversion fails with the error below.

How do you suggest dealing with columns with multiple categorical features?

Jan 17, 2018 4:52:08 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Jan 17, 2018 4:52:08 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 95 ms.
Jan 17, 2018 4:52:08 PM org.jpmml.sklearn.Main run
INFO: Converting..
Jan 17, 2018 4:52:08 PM sklearn2pmml.PMMLPipeline encodePMML
WARNING: Attribute 'sklearn2pmml.PMMLPipeline.target_fields' is not set. Assuming y as the name of the target field
Jan 17, 2018 4:52:08 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: The value object (Python class sklearn.preprocessing.label.MultiLabelBinarizer) is not a supported Transformer
	at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
	at com.google.common.collect.Lists$TransformingRandomAccessList$1.transform(Lists.java:638)
	at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
	at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:72)
	at sklearn.Initializer.encodeFeatures(Initializer.java:53)
	at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:81)
	at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:147)
	at org.jpmml.sklearn.Main.run(Main.java:145)
	at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.Transformer
	at java.lang.Class.cast(Class.java:3369)
	at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:41)
	... 8 more

Exception in thread "main" java.lang.IllegalArgumentException: The value object (Python class sklearn.preprocessing.label.MultiLabelBinarizer) is not a supported Transformer
	at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
	at com.google.common.collect.Lists$TransformingRandomAccessList$1.transform(Lists.java:638)
	at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
	at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:72)
	at sklearn.Initializer.encodeFeatures(Initializer.java:53)
	at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:81)
	at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:147)
	at org.jpmml.sklearn.Main.run(Main.java:145)
	at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.Transformer
	at java.lang.Class.cast(Class.java:3369)
	at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:41)
	... 8 more
Process failed: The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams

@vruusmann
Member

how do you suggest dealing with columns with multiple categorical features?

I don't quite understand the inner workings of MultiLabelBinarizer. Can you 1) explain the main functional difference between LabelBinarizer and MultiLabelBinarizer, and 2) share a code example where MultiLabelBinarizer has a legitimate use? For the latter, you could use the Audit dataset (binary classification problem Adjusted ~ .).

I'd be happy to introduce MultiLabelBinarizer support into SkLearn2PMML/JPMML-SkLearn after that.

@IdoZehori
Author

Basically, what I need is to transform a column that holds an iterable into a k-hot-encoding type mapping.
Here is a toy example:

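    import pandas as pd
    from sklearn.preprocessing import MultiLabelBinarizer

    # Each row holds a collection of labels rather than a single scalar value
    data = pd.DataFrame()
    iterColumn = [['a', 'b'], ['a', 'c'], ['b', 'c']]

    data['iterColumn'] = iterColumn
    data['y'] = 1
    print(data)
    # One output column per distinct label (a, b, c)
    print(MultiLabelBinarizer().fit_transform(data['iterColumn']))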

That prints:

  iterColumn  y
0     [a, b]  1
1     [a, c]  1
2     [b, c]  1

[[1 1 0]
 [1 0 1]
 [0 1 1]]

And you can then easily use some sklearn classifier from there.

@vruusmann
Member

Thanks - I think I've got the basic idea of MultiLabelBinarizer now.

In a nutshell, "iterColumn" is a collection-type feature/column, and the MultiLabelBinarizer transformation performs a "collection contains"-query on it (the first column of transformation results corresponds to "collection contains a?", the second to "collection contains b?", etc).
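To make the "collection contains" reading concrete, here is a minimal pure-Python sketch of that query. It is not how MultiLabelBinarizer is implemented; the helper name `contains_matrix` is made up for illustration, and the labels are assumed to be sortable so that the column order is deterministic.

```python
def contains_matrix(collections):
    """Answer "collection contains label?" for every label seen in the data."""
    # Hypothetical helper: collect the distinct labels in sorted order
    labels = sorted({label for collection in collections for label in collection})
    # One row per collection, one 0/1 column per label
    rows = [[int(label in collection) for label in labels]
            for collection in collections]
    return labels, rows


labels, rows = contains_matrix([['a', 'b'], ['a', 'c'], ['b', 'c']])
print(labels)
for row in rows:
    print(row)
```

On the toy data from this thread, the rows match the matrix printed in the earlier comment.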

Collection-type features are a bit problematic from the PMML perspective, because PMML typically operates with scalar-type features only.

I guess the same "features should be scalars" limitation applies to the Scikit-Learn framework as well. You can have collection-type features in the incoming dataset, but you must transform them to scalar-type features at the very beginning of your Scikit-Learn pipeline.

I will need to think about possible technical solutions. I could probably introduce collection-type feature support into the JPMML family of software pretty easily, but it would be pretty difficult to get it approved by DMG.org (which is responsible for maintaining the PMML standard).

@vruusmann
Member

Coming back to your original question - how to deal with columns with multiple categorical features - then the temporary workaround would be to employ the following two-stage workflow:

  1. Take the original dataset, and "explode" single collection-type columns to multiple scalar-type columns (eg. using the MultiLabelBinarizer transformation). Do not do any other feature engineering in this step.
  2. Take the "exploded" dataset, and work with it as usual (feature transformation, estimation).

SkLearn2PMML/JPMML-SkLearn is currently able to handle the second stage. You would need to maintain a separate Python/Java solution for handling the first stage.
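A rough sketch of the two stages in Python, under some assumptions: the target values, the `iterColumn_` column prefix, and the choice of DecisionTreeClassifier are made up for illustration, and a plain Scikit-Learn `Pipeline` stands in for the pipeline that would later be converted.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; the target values are made up for illustration.
raw = pd.DataFrame({
    "iterColumn": [["a", "b"], ["a", "c"], ["b", "c"]],
    "y": [0, 1, 1],
})

# Stage 1 (outside the pipeline): explode the collection-type column
# into one 0/1 scalar column per label. No other feature engineering here.
mlb = MultiLabelBinarizer()
indicators = mlb.fit_transform(raw["iterColumn"])
columns = ["iterColumn_" + str(label) for label in mlb.classes_]
exploded = pd.DataFrame(indicators, columns=columns, index=raw.index)
exploded["y"] = raw["y"]

# Stage 2: the pipeline only ever sees scalar-type columns.
pipeline = Pipeline([("classifier", DecisionTreeClassifier(random_state=0))])
pipeline.fit(exploded[columns], exploded["y"])
print(exploded)
```

The same stage 1 logic (the fitted label order in `mlb.classes_`) would have to be reproduced wherever the model is scored, since it is not part of the converted pipeline.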

Despite the bad situation/outlook, let's keep this issue open; it will remind me to think more about it.

@vruusmann
Member

Another issue, where the original dataset contains collection-type features: jpmml/jpmml-sklearn#62

@IdoZehori
Author

Thanks for the quick response!
Can you think of some workaround I can try? Maybe changing the input type to a dictionary, so that instead of [1, 2, 3] I would have {1:1, 2:1, 3:1}, or something of that nature?

@mathlf2015

@IdoZehori I ran into the same problem. Could you tell me how you finally dealt with it?
