Comparing predictions with modified feature values in a pmml pipeline #427

Closed
sulavtimilsina opened this issue Aug 3, 2024 · 5 comments

Comments

@sulavtimilsina

Is there any way to make two predictions? First, the model predicts with the original day value;
let's say we get prediction p. Now suppose I want to replace 'Sun' with 'Thu', keeping all other features the same, and
make a prediction in the PMML pipeline to get prediction q.
Now we want to compare p and q: if p < q, then set p = q + 1.

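import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.cross_reference import Memory, make_memorizer_union
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain

df = sns.load_dataset('tips')
X = df.drop('tip', axis=1)
y = df['tip']
# Shared memory, so that the raw "day" value can be recalled after prediction
memory = Memory()
mapper = DataFrameMapper([
    (['total_bill'], ContinuousDomain()),
    (['sex'], [CategoricalDomain(), LabelEncoder()]),
    (['smoker'], [CategoricalDomain(), LabelEncoder()]),
    (['day'], [CategoricalDomain(), make_memorizer_union(memory, names = ["day-str"]), OrdinalEncoder()]),
    (['time'], [CategoricalDomain(), OrdinalEncoder()]),
    (['size'], [CategoricalDomain(), OrdinalEncoder()]),
], df_out=True)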

Is there any way we can achieve this within the pipeline, so that we can save the pipeline to a file and use it for inference later?

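from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn2pmml.cross_reference import make_recaller_union
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import ExpressionTransformer

pipeline = PMMLPipeline([
    ('mapper', mapper),
    ("regressor", LinearRegression())
], predict_transformer = Pipeline([
    ('recaller1', make_recaller_union(memory = memory, names = ["day-str"], position = "first")),
    ('day_filter', ExpressionTransformer("'Thu' if X[0] == 'Sun' else X[0]")) ## I don't know how to implement it here, so can you help me achieve this inside the pipeline itself?
])
)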
@sulavtimilsina changed the title from "Comparing Predictions with Modified Feature Values in a PMML Pipeline" to "Comparing predictions with modified feature values in a pmml pipeline" on Aug 3, 2024
@vruusmann
Member

This is a tricky one. The difficulty lies in the "make two predictions with the same model, first for the original dataset and then for the slightly modified dataset" part, not in the post-processing part.

This task is especially difficult to solve using Scikit-Learn APIs.

Is there a way to simplify the pipeline somehow? What is the intended type of the estimator class? The example above uses LinearRegression; is this so, or will there be some other (non-linear) model type in use? Also, what is the mining function: regression or classification?

I'm asking, because if I needed to solve this problem for the LinearRegression case, then I would train a simple model, and then extract the coefficients associated with day=Thu and day=Sun category levels from it. I would then calculate their difference, and conditionally adjust the predicted value by it during post-processing.
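
A minimal sketch of that post-processing idea, in plain Python. It assumes that "day" is one-hot encoded, so that every category level has its own coefficient. The coefficient values and the names coef_by_day and adjust_prediction are hypothetical, not taken from a fitted model.

```python
# Hypothetical per-level coefficients, as they might be read off a fitted
# LinearRegression with a one-hot encoded "day" feature.
coef_by_day = {"Fri": 0.10, "Sat": 0.05, "Sun": 0.20, "Thu": 0.45}


def adjust_prediction(p, day, source="Sun", target="Thu", bump=1.0):
    """Apply the rule "if p < q then p = q + 1" without a second model call.

    p is the prediction for the original row. For a linear model, swapping
    the day level from source to target shifts the prediction by the
    difference of the two level coefficients, which gives q directly.
    """
    if day != source:
        return p
    q = p + (coef_by_day[target] - coef_by_day[source])
    return q + bump if p < q else p


adjusted_sun = adjust_prediction(3.0, "Sun")   # q is larger than p here, so q + 1 is returned
unchanged_fri = adjust_prediction(3.0, "Fri")  # not a "Sun" row, left as-is
```

Because the shift between the two levels is a constant, the whole rule collapses into a single conditional expression on the predicted value, which is the kind of logic a post-processing step can carry.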

If the model type is something other than LinearRegression, then it would be necessary to develop a custom Scikit-Learn meta-estimator, and a matching SkLearn2PMML/JPMML-SkLearn converter.

@vruusmann
Member

If the model type is something other than LinearRegression, then it would be necessary to develop a custom Scikit-Learn meta-estimator

Something along those lines:

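import numpy
from sklearn.base import BaseEstimator, RegressorMixin, clone

class MyComparingPredictor(RegressorMixin, BaseEstimator):

  def __init__(self, estimator):
    self.estimator = estimator

  def fit(self, X, y):
    # Fit a fresh copy, so that the estimator passed in stays untouched
    self.estimator_ = clone(self.estimator)
    self.estimator_.fit(X, y)
    return self

  def predict(self, X):
    # First prediction: original feature values
    y = self.estimator_.predict(X)
    # Very simplified approach: set "day" to "Thu" for every row,
    # working on a copy so that the caller's data is not modified
    X = X.copy()
    X['day'] = "Thu"
    y_thu = self.estimator_.predict(X)
    # Very simplified approach: keep the larger of the two predictions
    return numpy.maximum(y, y_thu)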

The idea of introducing a MyComparingPredictor meta-estimator class is to make it possible to perform two elementary predict operations during a single (PMML)Pipeline.predict(X) method call.

I don't see any easy way of achieving the same using Scikit-Learn built-in estimator and meta-estimator classes.
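
For illustration, here is a self-contained sketch of how such a meta-estimator might be exercised with plain Scikit-Learn on toy data. Everything in it is an assumption made for the example: the class name ComparingRegressor, its column/source/target/bump parameters, the toy dataset, and the one-hot encoding of "day". Unlike the simplified snippet above, it overrides only the "Sun" rows and applies the "if p < q then p = q + 1" rule. Converting such a class to PMML would still need a matching custom converter.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class ComparingRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical meta-estimator that predicts twice per predict() call."""

    def __init__(self, estimator, column="day", source="Sun", target="Thur", bump=1.0):
        self.estimator = estimator
        self.column = column
        self.source = source
        self.target = target
        self.bump = bump

    def fit(self, X, y):
        # Fit a clone, leaving the estimator passed to __init__ unfitted
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def predict(self, X):
        # p: prediction for the original rows
        p = self.estimator_.predict(X)
        # q: prediction after replacing source with target, on a copy of X
        is_source = (X[self.column] == self.source).to_numpy()
        X_alt = X.copy()
        X_alt.loc[is_source, self.column] = self.target
        q = self.estimator_.predict(X_alt)
        # Rule from the question: if p < q then p = q + bump (source rows only)
        return np.where(is_source & (p < q), q + self.bump, p)


# Toy data, constructed so that tip = 0.1 * total_bill + 2 on "Thur" rows
X = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 15.0, 25.0],
    "day": ["Sun", "Thur", "Sun", "Thur", "Sun", "Thur"],
})
y = np.array([1.0, 4.0, 3.0, 6.0, 1.5, 4.5])

inner = Pipeline([
    ("encode", ColumnTransformer(
        [("day", OneHotEncoder(), ["day"])], remainder="passthrough")),
    ("regressor", LinearRegression()),
])

model = ComparingRegressor(inner).fit(X, y)
predictions = model.predict(X)
print(predictions)
```

The inner estimator is a complete pipeline, so the raw "day" string is still available at the point where the second prediction is made. Any regressor could take the place of LinearRegression in the inner pipeline.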

@sulavtimilsina
Author

Is there a way to simplify the pipeline somehow? What is the intended type of the estimator class? The example above uses LinearRegression; is this so, or will there be some other (non-linear) model type in use? Also, what is the mining function: regression or classification?

I cannot see any alternative to this idea. I just want to implement the dynamic comparison of predictions between two inputs as per the day_of_week value, make changes if the result doesn't satisfy my condition, and wrap everything up inside the pipeline so that I can save it as an XML file for future inference as a black-box model.

I want to implement it using an XGBoost regressor right now. But if it's possible to implement using linear functions like a linear regressor, that would be fine too.

@vruusmann
Member

I cannot see any alternative to this idea

I'm not challenging your idea here. I'm just saying that, depending on the mining function/type of your model, there are different paths available.

For example, when using a LinearRegression model, it's possible to make everything work using the standard (PMML)Pipeline API. In contrast, when using XGBClassifier, you need to implement a custom meta-estimator, as demonstrated above.

I just want to implement the dynamic comparison of predictions between two inputs as per the day_of_week value

How would you implement the "prediction with two inputs" part using Scikit-Learn core APIs? Feel free to ignore the PMML conversion part right now (because that's easy).

@vruusmann
Member

There doesn't seem to be any interest in this issue anymore, so closing it with some final comments.

First, temporal features (such as "day", "time", etc.) that are represented as strings are not suitable input for Scikit-Learn's OrdinalEncoder transformer. The trouble is that OE sorts the category levels lexicographically, which destroys their intended ordering. This is hard to notice in Python, but is painfully obvious in PMML.

For example, the fitted OE transformer translates to the following PMML markup:

<DerivedField name="encoder(day)" optype="categorical" dataType="double">
	<MapValues outputColumn="data:output">
		<FieldColumnPair field="day" column="data:input"/>
		<InlineTable>
			<row>
				<data:input>Fri</data:input>
				<data:output>0.0</data:output>
			</row>
			<row>
				<data:input>Sat</data:input>
				<data:output>1.0</data:output>
			</row>
			<row>
				<data:input>Sun</data:input>
				<data:output>2.0</data:output>
			</row>
			<row>
				<data:input>Thur</data:input>
				<data:output>3.0</data:output>
			</row>
		</InlineTable>
	</MapValues>
</DerivedField>

Note the effective ordering of category levels: "Fri" < "Sat" < "Sun" < "Thur".
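
The same ordering can be reproduced in a few lines of plain Python. This sketch only mimics the mapping with sorted(); it does not call OrdinalEncoder itself, and the variable names are made up for the example.

```python
# The four "day" levels, in no particular order
days_in_data = ["Sun", "Sat", "Thur", "Fri"]

# Lexicographic ordering, matching the InlineTable above
lexicographic = {day: float(i) for i, day in enumerate(sorted(set(days_in_data)))}

# Intended (calendar) ordering, spelled out explicitly
calendar_order = ["Thur", "Fri", "Sat", "Sun"]
calendar = {day: float(i) for i, day in enumerate(calendar_order)}

print(lexicographic)
print(calendar)
```

In Scikit-Learn, OrdinalEncoder accepts a categories argument that lists the levels in the desired order, which is one way to keep the calendar ordering.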

Anyway, the above means that the "Thur" level is one notch higher than the "Sun" level. It can be seen from the RegressionModel/RegressionTable element, that the beta coefficient associated with this OE-encoded feature is -0.005437194272218469:

<RegressionTable intercept="0.9125054862129849">
	<NumericPredictor name="total_bill" coefficient="0.09409416367475298"/>
	<NumericPredictor name="continuous(encoder(sex))" coefficient="-0.027662269634805225"/>
	<NumericPredictor name="continuous(encoder(smoker))" coefficient="-0.08640116920553004"/>
	<NumericPredictor name="continuous(encoder(day))" coefficient="-0.005437194272218469"/>
	<NumericPredictor name="continuous(encoder(time))" coefficient="0.0023380980487601316"/>
	<NumericPredictor name="continuous(encoder(size))" coefficient="0.18066261819092744"/>
</RegressionTable>
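
The effect of the day coefficient can be checked by hand. The sketch below re-implements the RegressionTable above as a plain Python function, using the intercept and coefficients exactly as listed. The function name predict and the encoded values chosen for the other features are made-up placeholders; only the "day" codes come from the InlineTable shown earlier.

```python
INTERCEPT = 0.9125054862129849
COEFFICIENTS = {
    "total_bill": 0.09409416367475298,
    "sex": -0.027662269634805225,
    "smoker": -0.08640116920553004,
    "day": -0.005437194272218469,
    "time": 0.0023380980487601316,
    "size": 0.18066261819092744,
}
# Ordinal codes of the "day" levels, from the InlineTable
DAY_CODES = {"Fri": 0.0, "Sat": 1.0, "Sun": 2.0, "Thur": 3.0}


def predict(encoded_row):
    """Evaluate the linear model on already-encoded feature values."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * value for name, value in encoded_row.items())


# Hypothetical encoded row; only "day" differs between the two cases
row_sun = {"total_bill": 16.99, "sex": 0.0, "smoker": 0.0,
           "day": DAY_CODES["Sun"], "time": 0.0, "size": 1.0}
row_thur = dict(row_sun, day=DAY_CODES["Thur"])

p = predict(row_sun)
q = predict(row_thur)
print(q - p)  # equals the "day" coefficient, up to floating-point rounding
```

Since "Thur" is encoded exactly one step above "Sun", the difference q - p does not depend on the other feature values.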

Now suppose I want to replace 'Sun' with 'Thu', keeping all other features the same, and
make a prediction in the PMML pipeline to get prediction q.

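Putting the above two observations together, it can be seen that the prediction for the "Thur" case (q) can be obtained by adding the coefficient -0.005437194272218469 to the predicted value of the "Sun" case (p), that is, q = p - 0.005437194272218469.

Now we want to compare p and q: if p < q, then set p = q + 1.

Because the coefficient is negative, q is always slightly smaller than p, so the condition p < q never holds for this particular fit; it would always hold if the coefficient were positive. When the adjustment is applied, the post-processing logic is (X[0] + (1 - 0.005437194272218469)) if X['day'] == "Sun" else X[0], which replaces p with q + 1 for "Sun" rows only.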
