CategoricalDomain decorator doesn't work with nullable pandas Int64 column #420

Open
ghost opened this issue May 14, 2024 · 3 comments

@ghost

ghost commented May 14, 2024

I'm working with a pandas DataFrame that uses the nullable pandas Int64 data type for its integer columns, because missing values are represented as pd.NA.

I have reduced the data set for testing purposes to just 2 columns:

cat_col      Int64
num_col    float64
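For anyone reproducing this, a frame with the same dtypes can be built along the lines of the sketch below. The sample values are made up for illustration; only the column names and dtypes come from the report.

```python
import pandas as pd

# Hypothetical sample values; only the column names and dtypes match the report.
X = pd.DataFrame({
    "cat_col": pd.array([1, 2, None, 3], dtype="Int64"),  # None is stored as pd.NA
    "num_col": [0.5, 1.5, 2.5, 3.5],
})

print(X.dtypes)

# The nullable column is backed by a pandas masked array rather than a numpy ndarray.
backing_type = type(X["cat_col"].array).__name__
print(backing_type)
```

The backing array type is the same class that the Java stack trace further down complains about.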

The code is as follows:

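from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline
from xgboost import XGBClassifier

# X is the two-column DataFrame described above; y holds the target labels.
df_mapper = DataFrameMapper([
    (["num_col"], [ContinuousDomain()]),
    (["cat_col"], [CategoricalDomain(dtype="category")]),
], input_df=True, df_out=True)

xgb = XGBClassifier(enable_categorical=True)

pipeline = PMMLPipeline([
    ("mapper", df_mapper),
    ("classifier", xgb),
])

pipeline.fit(X, y)
sklearn2pmml(pipeline, "test.pmml")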

which results in the following error:

Standard output is empty
Standard error:
Exception in thread "main" org.jpmml.python.AttributeException: Attribute 'pandas.core.indexes.base.data.data' has an unsupported value (Python class pandas.core.arrays.integer.IntegerArray)
	at org.jpmml.python.CastFunction.apply(CastFunction.java:48)
	at org.jpmml.python.PythonObject.get(PythonObject.java:180)
	at pandas.core.Index$NDArrayData.getData(Index.java:162)
	at pandas.core.Index$NDArrayData.getValues(Index.java:156)
	at pandas.core.Index.getValues(Index.java:76)
	at pandas.core.Index.getArrayContent(Index.java:52)
	at org.jpmml.python.PythonObject.getArray(PythonObject.java:324)
	at org.jpmml.python.PythonObject.getObjectArray(PythonObject.java:364)
	at sklearn2pmml.decoration.DiscreteDomain.getDataValues(DiscreteDomain.java:150)
	at sklearn2pmml.decoration.DiscreteDomain.getDataType(DiscreteDomain.java:66)
	at sklearn.Transformer.updateFeatures(Transformer.java:101)
	at sklearn.Transformer.encode(Transformer.java:75)
	at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:67)
	at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:48)
	at sklearn.Initializer.encode(Initializer.java:59)
	at sklearn.Composite.encodeFeatures(Composite.java:112)
	at sklearn.Composite.initFeatures(Composite.java:255)
	at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:113)
	at com.sklearn2pmml.Main.run(Main.java:80)
	at com.sklearn2pmml.Main.main(Main.java:65)
Caused by: java.lang.ClassCastException: Cannot cast pandas.core.MaskedArray to numpy.core.NDArray
	at java.base/java.lang.Class.cast(Class.java:4067)
	at org.jpmml.python.CastFunction.apply(CastFunction.java:45)

If I change the data type of cat_col to the standard numpy int64, it works without any error. However, I cannot change the source that produces the DataFrame: it always uses the pandas Int64 data type because the data can contain missing values.
Also, if I use the ContinuousDomain decorator for cat_col, the error disappears (but then the column is no longer treated as categorical).
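To illustrate why casting to numpy int64 is not a general fix, the sketch below (pandas only, with made-up values) shows that the cast is rejected as soon as a missing value is present.

```python
import pandas as pd

# Hypothetical column with one missing value.
col = pd.Series([1, 2, None], dtype="Int64", name="cat_col")

try:
    col.astype("int64")
    cast_ok = True
except ValueError as exc:
    # pandas refuses to turn pd.NA into a plain integer.
    cast_ok = False
    print(f"cast failed: {exc}")

# Once the missing values are dropped, the same cast succeeds.
dense = col.dropna().astype("int64")
print(dense.dtype)
```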

@vruusmann
Member

My internal TODO list has quite a few open items about the pandas.Int64Dtype data type support.

The current issue can be corrected by initializing the CategoricalDomain.data_values_ attribute using the data_values constructor parameter:

cat_domain = CategoricalDomain(data_values = [[1, 2, 3]])
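If the valid categories are not known up front, one way to derive them from the column is sketched below. The helper name `distinct_values` and the sample data are hypothetical; only pandas is used, so the result is a plain Python list.

```python
import pandas as pd

def distinct_values(column: pd.Series) -> list:
    """Return the sorted, non-missing distinct values as a plain Python list."""
    return sorted(set(column.dropna().tolist()))

# Hypothetical nullable column with a duplicate and a missing value.
col = pd.Series([3, 1, None, 2, 1], dtype="Int64")
values = distinct_values(col)
print(values)
```

The resulting list could then be passed in the nested form shown above, e.g. `CategoricalDomain(data_values = [values])`.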

Alternatively, a problematic attribute may be manually simplified to a dense/unmasked numpy array:

cat_domain.data_values_ = numpy.asarray(cat_domain.data_values_.tolist())
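The effect of this simplification can be checked in isolation. The sketch below uses a pandas Series as a stand-in for the fitted attribute; that is an assumption, since the thread does not show the exact type of data_values_ beyond what the stack trace implies.

```python
import numpy
import pandas as pd

# Stand-in for a masked, nullable-integer attribute (assumption, see above).
masked = pd.Series([1, 2, 3], dtype="Int64")
dense = numpy.asarray(masked.tolist())
print(type(dense).__name__, dense.dtype.kind)

# If a missing value survives, the result falls back to an object array.
with_na = pd.Series([1, None, 3], dtype="Int64")
mixed = numpy.asarray(with_na.tolist())
print(mixed.dtype)
```

So the round trip through `tolist()` only yields a plain integer array when the values contain no pd.NA.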

Perhaps such simplification should be applied automatically by the CategoricalDomain.fit(X, y) method.

@ghost
Author

ghost commented May 15, 2024

Thanks for the quick response.
The suggested solution resolves the issue.

ghost closed this as completed May 15, 2024
@vruusmann
Member

Reopening, because the conversion should succeed without any manual intervention.

vruusmann reopened this May 15, 2024