Full Sklearn pipeline as External Model #53

Open
jp-varela opened this issue May 21, 2024 · 1 comment

Comments

@jp-varela

Hello,
I am trying to register a custom sklearn pipeline in a PiML experiment, but I am getting this error:
  File "/tmp/ipykernel_35500/19077422.py", line 76, in objective
    exp.register(piml_pipeline, "pipeline")
  File "piml/api.py", line 2691, in piml.api.Experiment.register
  File "piml/workflow/model_train_api.py", line 61, in piml.workflow.model_train_api.ModelAPI.register_model
  File "piml/workflow/pipeline.py", line 123, in piml.workflow.pipeline.ModelPipeline.get_data
  ValueError: could not convert string to float: 'DUMMY STR'

It seems that get_data expects the input data to be preprocessed already, but all my preprocessing steps are included in the sklearn pipeline. I want to keep the entire pipeline as a single object because I am going to test multiple pipelines with distinct preprocessing methods. The issue seems to be that there is a categorical column; that should be the problem, I think.

Here is the code I used:

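  import numpy as np
  from catboost import CatBoostClassifier
  from piml import Experiment
  from sklearn.compose import ColumnTransformer
  from sklearn.impute import SimpleImputer
  from sklearn.pipeline import Pipeline

  # NUMERICAL_COLS, CATEGORICAL_COLS and cat_features_idxs are defined elsewhere.

  # Define model
  model_pipeline = Pipeline([("model", CatBoostClassifier(verbose=0, cat_features=cat_features_idxs))])

  # Impute numerical columns with the mean and categorical columns with the mode
  pre_processing_pipeline = Pipeline([
      ('imputers',
       ColumnTransformer(transformers=[
           ('numerical_imputer', SimpleImputer(missing_values=np.nan, strategy='mean'), NUMERICAL_COLS),
           ('categorical_imputer', SimpleImputer(missing_values=None, strategy='most_frequent'), CATEGORICAL_COLS)
       ])
      ),
  ])

  # Concat Pipelines
  pipeline = Pipeline([
      ('pre_processing', pre_processing_pipeline),
      ('model', model_pipeline)
  ])

  # Fit the pipeline, predict and evaluate
  pipeline.fit(X_train_, y_train_)

  # Wrap the fitted pipeline and register it in PiML
  exp = Experiment()
  piml_pipeline = exp.make_pipeline(pipeline, task_type="classification", train_x=X_train_, train_y=y_train_, test_x=X_val_, test_y=y_val_)
  exp.register(piml_pipeline, "pipeline")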

Is there a way for me to make it work?
Thanks 😄

@ZebinYang
Collaborator

Hi @jp-varela

The issue is unlikely to be related to sklearn's Pipeline. Any kind of model can be wrapped and registered in PiML; see the example here: https://selfexplainml.github.io/PiML-Toolbox/_build/html/auto_examples/1_train/plot_2_register_2_arbitrary.html#sphx-glr-auto-examples-1-train-plot-2-register-2-arbitrary-py.

The current version of PiML still assumes the data is float or integer. If the data has string values, you need to convert it to float or integer before registering it into PiML.
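
As a rough sketch of that conversion (not an official PiML recipe), one option is to encode the string columns as integer codes with sklearn's OrdinalEncoder before calling exp.make_pipeline. The toy frames, the column names, and the encode_categoricals helper below are hypothetical stand-ins for X_train_, X_val_ and CATEGORICAL_COLS from the question:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data standing in for X_train_ / X_val_.
X_train = pd.DataFrame({"age": [31.0, 45.0, 27.0],
                        "city": ["Lisbon", "Porto", "Lisbon"]})
X_val = pd.DataFrame({"age": [52.0, 38.0],
                      "city": ["Porto", "Faro"]})
categorical_cols = ["city"]  # assumed list of string-valued columns

# Learn the category -> code mapping on the training data only.
# Categories unseen during fit are mapped to -1 at transform time.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(X_train[categorical_cols])


def encode_categoricals(df):
    """Return a copy of df whose categorical columns hold integer codes."""
    out = df.copy()
    codes = encoder.transform(df[categorical_cols]).astype(int)
    for i, col in enumerate(categorical_cols):
        out[col] = codes[:, i]
    return out


X_train_num = encode_categoricals(X_train)
X_val_num = encode_categoricals(X_val)
```

The encoded frames contain only numeric columns, so they could be passed as train_x and test_x in place of the raw data. Missing values in the string columns would need to be imputed before encoding, and any preprocessing step inside the pipeline that expects raw strings would have to be adjusted to work on the codes.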
