ExpressionTransformer should try to rectify feature type information #397

Open
woodly0 opened this issue Oct 6, 2023 · 5 comments

@woodly0

woodly0 commented Oct 6, 2023

Hello Villu,

it's been a while and I hope you're fine. I've come back with more questions.
Let's start with some code:

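import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import ExpressionTransformer, MatchesTransformer

# create some data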
X = pd.DataFrame(
    {
        "numbers": [1, 2, 3, 40, 5],
        "colors": ["yellow ", "blue", "BLACK", "green", "red"],
    }
)

# create a simple mapper
mapper = DataFrameMapper(
    [
        (
            ["colors"],
            [
                # CategoricalDomain(dtype=str),
                ExpressionTransformer("X[0].lower()"),
                MatchesTransformer("green"),
            ],
            {"alias": "color_green"},
        )
    ],
    df_out=True,
    default=False,
)

The following pipeline doesn't make much sense from a machine learning point of view, but it shows the issue very well:

pmml_pipe = PMMLPipeline(
    [
        ("mapper", mapper)
    ]
)
# fit and transform
pmml_pipe.fit_transform(X)

# export as PMML
sklearn2pmml(pmml_pipe, "output.pmml", with_repr=True)

In Python, everything works as expected. Now the issue is within the generated output.pmml file, where you can find the following:

<DataDictionary>
	<DataField name="colors" optype="continuous" dataType="double"/>
</DataDictionary>

Knowing that the input has an infinite number of possible values, how can I set this data type to "string"?

@vruusmann
Member

It's pretty unusual to start the pipeline with an ExpressionTransformer object. There should be some "clarifications" in front of it.

The simplest way to make a clarification is to provide a feature specification using one of the SkLearn2PMML decorators (e.g. sklearn2pmml.decoration.CategoricalDomain, OrdinalDomain or ContinuousDomain).

You already have CategoricalDomain in place, but have it commented out. You probably didn't like that it captured the valid value space of your X dataset:

<DataDictionary>
	<DataField name="colors" optype="categorical" dataType="string">
		<Value value="BLACK"/>
		<Value value="blue"/>
		<Value value="green"/>
		<Value value="red"/>
		<Value value="yellow "/>
	</DataField>
</DataDictionary>

Well, if you don't like the valid value space information, then simply disable it using the with_data = False flag:

color_transformers = [
	# THIS!
	CategoricalDomain(dtype = str, with_data = False),
	ExpressionTransformer("X[0].lower()"),
	MatchesTransformer("green"),
]
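
To confirm what a decorator changes in the exported file, the DataDictionary can be inspected with the Python standard library. This is only a sketch: `data_field_types` is a hypothetical helper, and the inline `sample` document is an assumption about what the export looks like with `with_data = False`, not actual converter output.

```python
import xml.etree.ElementTree as ET

def data_field_types(root):
    """Map each DataField name to its (optype, dataType) pair."""
    fields = {}
    for elem in root.iter():
        # Drop the "{namespace}" prefix so the check works for any PMML schema version
        if elem.tag.rsplit("}", 1)[-1] == "DataField":
            fields[elem.get("name")] = (elem.get("optype"), elem.get("dataType"))
    return fields

# Assumed shape of the DataDictionary when no valid values are recorded
sample = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
	<DataDictionary>
		<DataField name="colors" optype="categorical" dataType="string"/>
	</DataDictionary>
</PMML>"""

print(data_field_types(ET.fromstring(sample)))
# → {'colors': ('categorical', 'string')}
```

For a real export, the same helper can be called as `data_field_types(ET.parse("output.pmml").getroot())`.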

@vruusmann vruusmann changed the title from 'Declare input dataType="string"' to 'ExpressionTransformer should try to rectify feature type information' on Oct 6, 2023

@vruusmann
Member

vruusmann commented Oct 6, 2023

> It's pretty unusual to start the pipeline with an ExpressionTransformer object. There should be some "clarifications" in front of it.

The ExpressionTransformer can triangulate its position in the pipeline by observing if there are any wildcard features (ie. org.jpmml.converter.WildcardFeature objects) among the arguments.

If there are, then it should make an effort to rectify their types. For example, if there are string methods being called on a wildcard feature, then it's reasonable to assume that the type of this feature should be categorical+string (instead of continuous+double).

It is likely that such type rectification should happen during the expression parsing phase, which means that the code change should land in the JPMML-Python library instead.
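
The idea of inferring a string type from string method calls can be illustrated in plain Python with the `ast` module. This is a rough sketch and not the JPMML-Python implementation (which is a Java library): the function name `infer_string_columns` and the `STRING_METHODS` subset are hypothetical, and it assumes Python 3.9+ (where a subscript index is stored directly as an `ast.Constant`).

```python
import ast

# Illustrative subset of str methods; not the list used by any real converter
STRING_METHODS = {"lower", "upper", "strip", "startswith", "endswith", "replace"}

def infer_string_columns(expr):
    """Return the indices i for which the expression calls a str method on X[i]."""
    columns = set()
    for node in ast.walk(ast.parse(expr, mode="eval")):
        # Only method calls of the form <something>.<method>(...) are of interest
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        target = node.func.value
        if (
            node.func.attr in STRING_METHODS
            and isinstance(target, ast.Subscript)
            and isinstance(target.value, ast.Name)
            and target.value.id == "X"
            and isinstance(target.slice, ast.Constant)
        ):
            columns.add(target.slice.value)
    return columns

print(infer_string_columns("X[0].lower()"))  # → {0}
print(infer_string_columns("X[0] + X[1]"))   # → set()
```

A column reported here would be a candidate for categorical+string instead of the default continuous+double.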

@woodly0
Author

woodly0 commented Oct 10, 2023

> Well, if you don't like the valid value space information, then simply disable it using the with_data = False flag

This was exactly what I was looking for. Thank you!

> The ExpressionTransformer can triangulate its position in the pipeline by observing if there are any wildcard features (ie. org.jpmml.converter.WildcardFeature objects) among the arguments.

So you are saying that it is still OK to start the pipeline with an ExpressionTransformer or should it be generally avoided?

@vruusmann
Member

> So you are saying that it is still OK to start the pipeline with an ExpressionTransformer or should it be generally avoided?

The ExpressionTransformer works best with non-wildcard features.

The easiest way to convert a wildcard feature (which has the continuous+double type) to a non-wildcard feature is to use SkLearn2PMML decorators.

@vruusmann
Member

Alternatively, the ExpressionTransformer should simply raise a value error when there are wildcard features among the arguments.

IMO, it's better to have the conversion fail, rather than to have it produce an invalid/incomplete PMML document.
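
Until the converter fails on its own, a similar fail-fast check could be approximated on the Python side before exporting. The sketch below is an assumption-heavy illustration: `check_expression_steps` is a hypothetical helper, it recognizes steps purely by class name, and the three empty classes are stand-ins so that the example runs without SkLearn2PMML installed.

```python
# Stand-in classes; in real code these would come from sklearn2pmml
class CategoricalDomain: pass
class ExpressionTransformer: pass
class MatchesTransformer: pass

def check_expression_steps(steps):
    """Raise ValueError if an ExpressionTransformer is not preceded by a *Domain decorator."""
    seen_domain = False
    for step in steps:
        name = type(step).__name__
        if name.endswith("Domain"):
            seen_domain = True
        elif name == "ExpressionTransformer" and not seen_domain:
            raise ValueError(
                "ExpressionTransformer would receive an untyped (wildcard) column; "
                "put a Domain decorator in front of it"
            )

# Passes silently: the column is typed before the expression is applied
check_expression_steps([CategoricalDomain(), ExpressionTransformer(), MatchesTransformer()])

# Raises ValueError: the expression is the first step
try:
    check_expression_steps([ExpressionTransformer(), MatchesTransformer()])
except ValueError as err:
    print("rejected:", err)
```

The helper would be called on each per-column transformer list (such as `color_transformers`) before it is handed to the mapper.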
