
XGBoostClassifier multiclass objective #9

Open
paulbir opened this issue May 20, 2024 · 7 comments

@paulbir

paulbir commented May 20, 2024

Currently "objective" parameter for the XGBoostClassifier is limited to "reg:squarederror", "reg:tweedie", "reg:linear", "reg:logistic", "binary:logistic", "binary:logitraw". And these values are even forced with validation code:

        if kkey == "objective":
            if not isinstance(parameters["objective"], str):
                raise TypeError("'objective' must be of type str")
            if parameters["objective"] not in [
                "reg:squarederror",
                "reg:tweedie",
                "reg:linear",
                "reg:logistic",
                "binary:logistic",
                "binary:logitraw",
            ]:
                raise ValueError(
                    """'objective' must either be 'reg:squarederror', """
                    """'reg:tweedie', 'reg:linear', 'reg:logistic', """
                    """'binary:logistic', or 'binary:logitraw'"""
                )

This code was clearly added when XGBoost already supported multiclass classification. So why can't I use an objective like "multi:softmax"? Or is there some workaround for multiclass classification?

@liuzicheng1987
Contributor

Hi @paulbir,

the issue is not so much XGBoost itself, but the feature learning algorithms. It can be very tricky to build features for very high-dimensional targets.

You can use getml.data.make_target_columns(…), as exemplified in the CORA notebook:

https://nbviewer.org/github/getml/getml-demo/blob/master/cora.ipynb
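
A minimal sketch of the pattern used in that notebook (assuming a getml DataFrame df with a categorical column "label" holding the class):

import getml

# make_target_columns turns each unique value of "label" into its own
# binary target column, so the problem becomes L one-vs-all tasks.
df_split = getml.data.make_target_columns(df, "label")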

@paulbir
Author

paulbir commented May 22, 2024

Hi @liuzicheng1987, thanks for your reply.

I have a target with 3 classes. This is a multiclass problem, so the objective should be multi:softmax or multi:softprob, but only binary targets are allowed.

@srnnkls
Collaborator

srnnkls commented May 22, 2024

Thanks for your question, @paulbir.

You can just materialize your features using transform, use the native XGBoost Python API, and construct the DMatrix from the numpy arrays (or pandas DataFrames) returned by transform. This way, you can bring any ML algorithm you want.
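
A rough sketch of that workflow (hypothetical names; assumes a fitted pipeline pipe, a container with train/test subsets, and integer class labels y_train/y_test taken from the population table):

import xgboost as xgb

# Materialize the learned features as plain arrays.
X_train = pipe.transform(container.train)
X_test = pipe.transform(container.test)

# Build DMatrices and train with a multiclass objective.
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {"objective": "multi:softmax", "num_class": 3}
booster = xgb.train(params, dtrain, num_boost_round=100)
preds = booster.predict(dtest)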

@paulbir
Author

paulbir commented May 22, 2024

Hi @srnnkls. My goal is to create new features using the Relboost method. In all the notebook examples I can see that the pipeline is created like this:

pipe = getml.pipeline.Pipeline(
    data_model=time_series.data_model,
    tags=["memory=15", "logistic regression"],
    feature_learners=[feature_learner],
    predictors=[predictor],
)

And the docs do not explicitly state whether the predictor is necessary for the feature engineering itself, so I usually set it to getml.predictors.XGBoostClassifier, assuming it is needed in some intermediate step internally.

But what I am wondering now is: do I really need to set the predictors parameter if all I want is feature engineering?

@liuzicheng1987
Contributor

liuzicheng1987 commented May 22, 2024

@paulbir, no, if all you are interested in is the features, you don't really need predictors. It's nice to have predictors for things such as calculating feature importances, but they are not necessary for the feature engineering.
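
A minimal sketch of a features-only pipeline (same names as in your example above, with the predictors simply omitted):

pipe = getml.pipeline.Pipeline(
    data_model=time_series.data_model,
    feature_learners=[feature_learner],  # no predictors needed for feature engineering
)
pipe.fit(container.train)
features = pipe.transform(container.train)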

@paulbir
Author

paulbir commented May 22, 2024

@liuzicheng1987 thanks. So I have no issues with predictors anymore then.

@Jogala
Collaborator

Jogala commented Jun 5, 2024

Let us define the number of class labels as L.

@paulbir just to clarify:

You can do multiclass classification using getml with getml.predictors.XGBoostClassifier(objective="binary:logistic"). What getml does is build a set of features for each class label and train a separate predictor for each of them, i.e. a one-vs-all approach per class label. If you call v = pipe.predict(container.test) on such a pipeline, then v_ij is the probability that the i-th row belongs to label j and not any other label. This procedure requires declaring multiple targets, as @liuzicheng1987 mentioned in his answer, using the make_target_columns function.
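
As a sketch of the prediction step (assuming pipe was fitted on targets created with make_target_columns):

import numpy as np

# predict returns one probability column per class label (one-vs-all).
probs = pipe.predict(container.test)  # shape: (n_rows, L)

# v_ij is the probability that row i belongs to label j; taking the
# most probable column gives the multiclass prediction.
predicted_label = np.argmax(probs, axis=1)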

You can also follow @srnnkls' approach: use the getml pipeline to construct the features and then call pipe.transform(container.test) to generate the flat population table containing the combined L feature sets. After that, you can train an XGBoost predictor with `"objective": "multi:softmax"`. Note, however, that you cannot use a new split! You have to use the train/test partition that was used to generate the features in the first place; otherwise you will leak training data into the test data.

An example can be found here: https://github.com/Jogala/cora under scripts/ml_all.py

Note that using the L-times one-vs-all approach, I achieved slightly better results on that example. Overall, we outperform the best predictor in this ranking: https://paperswithcode.com/sota/node-classification-on-cora. Using the split of the leading paper (accuracy = 90.16%), we reach 91%.

If you have further questions, you can also drop me an email.
