
XGBoostClassifier multiclass objective #9

Open
paulbir opened this issue May 20, 2024 · 7 comments

@paulbir

paulbir commented May 20, 2024

Currently "objective" parameter for the XGBoostClassifier is limited to "reg:squarederror", "reg:tweedie", "reg:linear", "reg:logistic", "binary:logistic", "binary:logitraw". And these values are even forced with validation code:

        if kkey == "objective":
            if not isinstance(parameters["objective"], str):
                raise TypeError("'objective' must be of type str")
            if parameters["objective"] not in [
                "reg:squarederror",
                "reg:tweedie",
                "reg:linear",
                "reg:logistic",
                "binary:logistic",
                "binary:logitraw",
            ]:
                raise ValueError(
                    """'objective' must either be 'reg:squarederror', """
                    """'reg:tweedie', 'reg:linear', 'reg:logistic', """
                    """'binary:logistic', or 'binary:logitraw'"""
                )

This code was clearly added when XGBoost already supported multiclass classification. So why can't I use an objective like "multi:softmax"? Or is there some workaround for multiclass classification?

@liuzicheng1987
Contributor

Hi @paulbir,

the issue is not so much XGBoost itself, but the feature learning algorithms. It can be very tricky to build features for very high-dimensional targets.

You can use getml.data.make_target_columns(…), as exemplified in the CORA notebook:

https://nbviewer.org/github/getml/getml-demo/blob/master/cora.ipynb
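
A minimal sketch of the pattern used in that notebook (assuming a getml DataFrame df with a categorical column "label" holding the class):

import getml

# make_target_columns turns each unique value of "label" into its own
# binary target column, so the problem becomes L one-vs-all tasks.
df_split = getml.data.make_target_columns(df, "label")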

@paulbir
Author

paulbir commented May 22, 2024

Hi @liuzicheng1987, thanks for your reply.

I have a target with 3 classes. This is a multiclass problem, so the objective should be multi:softmax or multi:softprob, but only binary targets are allowed.

@srnnkls
Collaborator

srnnkls commented May 22, 2024

Thanks for your question, @paulbir.

You can just materialize your features using transform, use the native XGBoost Python API, and construct the DMatrix from the numpy arrays (or pandas DataFrames) returned by transform. This way, you can bring any ML algorithm you want.
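
A rough sketch of that workflow (hypothetical names; assumes a fitted pipeline pipe, a container with train/test subsets, and integer class labels y_train/y_test taken from the population table):

import xgboost as xgb

# Materialize the learned features as plain arrays.
X_train = pipe.transform(container.train)
X_test = pipe.transform(container.test)

# Build DMatrices and train with a multiclass objective.
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {"objective": "multi:softmax", "num_class": 3}
booster = xgb.train(params, dtrain, num_boost_round=100)
preds = booster.predict(dtest)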

@paulbir
Author

paulbir commented May 22, 2024

Hi @srnnkls. My goal is to create new features using the Relboost method. In all the notebook examples I can see that the pipeline is created like this:

pipe = getml.pipeline.Pipeline(
    data_model=time_series.data_model,
    tags=["memory=15", "logistic regression"],
    feature_learners=[feature_learner],
    predictors=[predictor],
)

And the docs do not explicitly state whether the predictor is necessary for the feature engineering itself, so I usually set it to getml.predictors.XGBoostClassifier, assuming it is needed in some intermediate step internally.

But what I am wondering now is: do I really need to set the predictors parameter if all I want is feature engineering?

@liuzicheng1987
Contributor

liuzicheng1987 commented May 22, 2024

@paulbir, no, if all you are interested in is the features, you don't really need predictors. It's nice to have predictors for things such as calculating feature importances, but they are not necessary for the feature engineering.
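
A minimal sketch of a features-only pipeline (same names as in your example above, with the predictors simply omitted):

pipe = getml.pipeline.Pipeline(
    data_model=time_series.data_model,
    feature_learners=[feature_learner],  # no predictors needed for feature engineering
)
pipe.fit(container.train)
features = pipe.transform(container.train)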

@paulbir
Author

paulbir commented May 22, 2024

@liuzicheng1987 thanks. So I have no issues with predictors anymore then.

@Jogala
Collaborator

Jogala commented Jun 5, 2024

Let us define the number of class labels as L.

@paulbir just to clarify:

You can do multiclass classification using getml with getml.predictors.XGBoostClassifier(objective="binary:logistic"). What getml does is build a set of features for each class label and train a separate predictor for each of them, i.e. a one-vs-all approach per class label. If you call v = pipe.predict(container.test) on such a pipeline, then v_ij is the probability that the i-th row belongs to label j and not any other label. This procedure requires declaring multiple targets, as @liuzicheng1987 mentioned in his answer, using the make_target_columns function.
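
As a sketch of the prediction step (assuming pipe was fitted on targets created with make_target_columns):

import numpy as np

# predict returns one probability column per class label (one-vs-all).
probs = pipe.predict(container.test)  # shape: (n_rows, L)

# v_ij is the probability that row i belongs to label j; taking the
# most probable column gives the multiclass prediction.
predicted_label = np.argmax(probs, axis=1)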

You can also follow @srnnkls' approach: use the getml pipeline to construct the features and then call pipe.transform(container.test) to generate the flat population table containing the combined L feature sets. After that, you can train an XGBoost predictor with `"objective": "multi:softmax"`. Note, however, that you cannot use a new split! You have to use the train/test partition that was used to generate the features in the first place; otherwise you will leak training data into the test data.

An example can be found here: https://github.com/Jogala/cora under scripts/ml_all.py

Note that using the L-times one-vs-all approach, I achieved slightly better results on that example. Overall, we outperform the best predictor in this ranking: https://paperswithcode.com/sota/node-classification-on-cora. Using the split of the leading paper (accuracy = 90.16%), we reach 91%.

If you have further questions, you can also drop me an email.
