
ENH Persistence: Add support for LightGBM and XGBoost #244

Merged

Conversation


@BenjaminBossan commented Dec 8, 2022


Resolves #226

Description

LightGBM already worked out of the box, so for this, only tests were added. For XGBoost and CatBoost, extra code was added to support them.

Implementation

Dependencies were extended to include xgboost, lightgbm, and catboost for testing, so that they should be covered by CI. If those libraries are not installed, the tests should be gracefully skipped.

For xgboost, the xgboost-specific persistence format was chosen, which can be used through estimator.save_model and estimator.load_model. I could not verify (yet) that this format is 100% secure. However, since it's a binary JSON file (.ubj), it should be as secure as JSON. More information here:

https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html
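
As a minimal sketch of that format in use (only the public save_model/load_model API mentioned above; the file name is arbitrary):

import xgboost

X, y = [[0, 1], [2, 3]], [0, 1]
clf = xgboost.XGBClassifier(n_estimators=2).fit(X, y)

# The ".ubj" suffix selects the binary (UBJSON) variant of the JSON-based format.
clf.save_model("model.ubj")

loaded = xgboost.XGBClassifier()
loaded.load_model("model.ubj")
assert list(loaded.predict(X)) == list(clf.predict(X))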

The actual implementation for xgboost mirrors the one for numpy arrays, which also stores its data in an external file. However, I did not apply any caching here, as it seems it wouldn't be useful.

I decided to create a new test file, test_external.py, to put the new tests, instead of cramming more into test_persist.py. Since some functions from test_persist.py were required here, I moved them to a new _utils.py module.

Interestingly, I found some potential bugs where some values returned by get_params() may differ for xgboost models after loading. This is independent of skops. Those edge cases are documented but were excluded from tests, as it appears that they should be inconsequential. We may want to report them.

Commit notes:
- 1.7 not supported for Python 3.7
- Fails when no GPU is available, and I couldn't find a way to detect from Python whether a GPU is available (short of trying and catching the error).
@BenjaminBossan BenjaminBossan changed the title Add support for LightGBM and XGBoost Persistence: Add support for LightGBM and XGBoost Dec 8, 2022
@BenjaminBossan BenjaminBossan added the persistence Secure persistence feature label Dec 8, 2022
@BenjaminBossan
Collaborator Author

I removed the xgboost bug (?) docstring, since I couldn't find a way to exclude it from running with doctest. Replicating it here so that it doesn't get lost. Ideally, someone with more experience in xgboost could comment on whether this really is a bug or not. If so, we could submit it to the xgboost issue tracker.

>>> import xgboost
>>> print(xgboost.__version__)
1.7.1
>>> estimator = xgboost.XGBClassifier(tree_method=None)
>>> X, y = [[0, 1], [2, 3]], [0, 1]
>>> estimator.fit(X, y)
XGBClassifier(...)
>>> print(estimator.tree_method)
None
>>> print(estimator.get_params()["tree_method"])
exact
>>> # after save/load roundtrip, tree_method's get_params value changes from exact to auto
>>> import tempfile
>>> tmp_file = f"{tempfile.mkdtemp()}.ubj"
>>> estimator.save_model(tmp_file)
>>> estimator.load_model(tmp_file)
>>> print(estimator.tree_method)
None
>>> print(estimator.get_params()["tree_method"])
auto
>>> # a similar thing happens with gpu_id
>>> estimator = xgboost.XGBClassifier(tree_method='gpu_hist', booster='gbtree')
>>> estimator.fit(X, y)
XGBClassifier(...)
>>> print(estimator.gpu_id)
None
>>> print(estimator.get_params()["gpu_id"])
0
>>> # after save/load roundtrip, gpu_id's get_params value changes from 0 to -1
>>> estimator.save_model(tmp_file)
>>> # for gpu_id, the estimator needs to be re-initialized for the effect to occur
>>> estimator = xgboost.XGBClassifier()
>>> estimator.load_model(tmp_file)
>>> print(estimator.gpu_id)
None
>>> print(estimator.get_params()["gpu_id"])
-1

@@ -0,0 +1,145 @@
import warnings
Collaborator Author

Everything in here is unmodified, just moved around.

@BenjaminBossan
Collaborator Author

CI is green, so ready for review @skops-dev/maintainers

I would take a look at catboost separately to keep the PR smaller. I included lightgbm here since it didn't require any code changes.

Member

@adrinjalali adrinjalali left a comment


Could you maybe explain why going through the __dict__ wouldn't work for xgboost? I suppose because the attributes are not C extension types, but rather pure C types?

I would like us to understand what save/load of xgboost does, before adding this. We should document that somewhere.

Apart from this core, we plan to support machine learning libraries commonly
used by the community. So far, those are:

- `LightGBM <lightgbm.readthedocs.io/>`_ (scikit-learn API)
Member


shouldn't we add xgboost here as a result of this PR?

Collaborator Author


will do

Comment on lines 55 to 62
# list of constructors is hard-coded for higher security
constructors = {
"XGBClassifier": xgboost.XGBClassifier,
"XGBRegressor": xgboost.XGBRegressor,
"XGBRFClassifier": xgboost.XGBRFClassifier,
"XGBRFRegressor": xgboost.XGBRFRegressor,
"XGBRanker": xgboost.XGBRanker,
}
Member


why can't we use the type saved in __class__ here?

Collaborator Author


My thinking was that it's more secure, but we could also store the class itself and use that here.
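
A hedged sketch of that alternative, assuming the dumped state carried the class object itself (the "class" key here is hypothetical, not the actual skops schema):

import xgboost

# Dump side (hypothetical): record the class itself instead of its name.
state = {"class": xgboost.XGBClassifier}

# Load side: instantiate directly from the stored class.
cls = state["class"]
instance = cls.__new__(cls)
print(type(instance).__name__)  # XGBClassifier

The hard-coded mapping was kept because it restricts loading to a known allow-list of estimator classes.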

try:
# xgboost doesn't allow to save to a memory buffer, so take roundtrip
# through temp file
tmp_file = f"{tempfile.mkdtemp()}.ubj"
Member


can we use a context manager to make sure it's cleaned up after the fact?

Collaborator Author


I'll take a look.
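
A sketch of the context-manager variant being suggested, using tempfile.TemporaryDirectory from the standard library so the file is removed once the block exits:

import os
import tempfile

import xgboost

X, y = [[0, 1], [2, 3]], [0, 1]
estimator = xgboost.XGBClassifier(n_estimators=2).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_file = os.path.join(tmp_dir, "model.ubj")
    estimator.save_model(tmp_file)
    with open(tmp_file, "rb") as f:
        content = f.read()  # raw bytes to be stored by skops
# The directory and file are gone here; `content` survives.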


if self.type == "xgboost-format": # fitted
# load_model works with bytearray, so no temp file necessary here
instance.load_model(bytearray(self.children["content"].getvalue()))
Member


I really don't like this. I can't easily figure out what this actually does, and whether it's anywhere near safe.

Collaborator Author


This is the equivalent of what we do with numpy arrays (in fact, the whole approach is analogous):

class NdArrayNode(Node):
    def __init__(...):
        ...
        if self.type == "numpy":
            self.children = {
                "content": io.BytesIO(load_context.src.read(state["file"]))
            }

    def _construct(self):
        if self.type == "numpy":
            content = np.load(self.children["content"], allow_pickle=False)
        ...

The difference is that, unfortunately, load_model cannot work with bytes; the content first needs to be converted to a bytearray.

I could also load the model from a temp file, but I thought this approach would be better.

Collaborator Author

@BenjaminBossan BenjaminBossan left a comment


Could you maybe explain why going through the __dict__ wouldn't work for xgboost? I suppose because the attributes are not C extension types, but rather pure C types?

So without special treatment, xgboost goes through ObjectNode. Since it has __getstate__ and __setstate__, those are being used. After we call the latter, there is some kind of validation going on:

https://github.com/dmlc/xgboost/blob/8824b40961c2f381fcd7a9a6af1697cbf20b608f/python-package/xgboost/core.py#L1674-L1684

This leads us to the C++ part of the code base, which I honestly didn't even want to inspect.

My thinking is that trying to make that approach work is too error prone: we might get something to run by tinkering with the attributes, but we couldn't be sure that it would work with other parameters. However, the save_model way is officially recommended, is (presumably) well tested, and is stable across versions, so it is more robust for users than directly storing attributes.
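
For illustration, a minimal sketch of that generic fallback path, using only the pickle-style protocol methods described above (this is not skops internals, and it assumes xgboost keeps exposing __getstate__/__setstate__ this way):

import xgboost

X, y = [[0, 1], [2, 3]], [0, 1]
est = xgboost.XGBClassifier(n_estimators=2).fit(X, y)

# Capture the attribute state, then restore it onto a fresh instance;
# xgboost runs its own validation inside __setstate__.
state = est.__getstate__()
restored = xgboost.XGBClassifier.__new__(xgboost.XGBClassifier)
restored.__setstate__(state)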

I would like us to understand what save/load of xgboost does, before adding this. We should document that somewhere.

Regarding security, I couldn't find anything being mentioned explicitly, but the general advantages are explained in the linked document. As I said, since it's json format, I don't see how that would not be safe.



@adrinjalali
Member

So looking at this again, I don't understand why you think using __setstate__ would be error prone. Somebody could create a custom .skops file and put odd attributes in there, but if the user loads a file where the input to __setstate__ is exactly what we got from __getstate__, then there shouldn't be any issues.

It seems the issue we have is that we don't support bytearray. So maybe a better solution here would be to add support for that, which in turn would fix the issue for xgb without us explicitly having to do an xgb-specific kind of thing.

@BenjaminBossan
Collaborator Author

So looking at this again, I don't understand why you think using __setstate__ would be error prone. Somebody could create a custom .skops file and put odd attributes in there, but if the user loads a file where the input to __setstate__ is exactly what we got from __getstate__, then there shouldn't be any issues.

My interpretation of the xgboost documentation was that their custom format is the most portable. Say they add a private attribute from version X to version Y; it seems like their format would take care of correctly dealing with the missing attribute, whereas something like pickle would fail. What I cannot say with certainty is: If we use the output of their __getstate__ and then call __setstate__, is this portability already being taken care of, or is the "magic" happening somewhere else? Naively, I'd think that if __getstate__ and __setstate__ contain the portability logic, then pickle would work just fine.

I think this problem touches a more general topic when dealing with 3rd party packages: Should we try to implement methods as generically as possible, or should we make use of custom serialization formats if they are offered by a package? In this case, should we use bytearray* or should we use their format explicitly? Or should we decide on a case-by-case basis?

*we should probably support bytes and bytearray no matter what :)

Added support for bytes and bytearray, which is sufficient to dump and load xgboost. The xgboost-specific code was commented out for now.

To serialize bytes in JSON, base64 encoding was used.
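
A minimal sketch of that base64 round trip (illustrative only; the actual key names used in the skops schema are not shown in this thread):

import base64
import json

payload = b"\x00\x01 some binary model blob"

# Dump: bytes -> base64 text, which is safe to embed in JSON.
schema = json.dumps({"content": base64.b64encode(payload).decode("ascii")})

# Load: JSON -> base64 text -> bytes (or bytearray, as xgboost's load_model needs).
restored = base64.b64decode(json.loads(schema)["content"])
assert bytearray(restored) == payload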
@BenjaminBossan
Collaborator Author

@adrinjalali I tested your suggestion to add support for bytearray, and it is indeed sufficient for supporting xgboost. The more general point still stands to be decided.

@adrinjalali
Member

The more general point still stands to be decided.

I agree. But I think for now we can do with only supporting bytes/bytearray, and figure out supporting custom formats of 3rd party libraries later.

One thing I wonder: should we store bytes in a binary file instead? I feel like it'd make save/load much faster, and keep the schema json file cleaner.

@BenjaminBossan
Collaborator Author

I agree. But I think for now we can do with only supporting bytes/bytearray, and figure out supporting custom formats of 3rd party libraries later.

Okay, then let's start with that.

One thing I wonder: should we store bytes in a binary file instead? I feel like it'd make save/load much faster

It'll probably depend. In this case, maybe, but the additional I/O is probably a big overhead if we have many small bytes objects.

@adrinjalali
Member

It'll probably depend. In this case, maybe, but the additional I/O is probably a big overhead if we have many small bytes objects.

I wanted to suggest having a threshold for keeping them in the schema or in a file, but then I realized another thing to think of: if we allow binary data in the json schema, then people can create .skops files with a ton of binary data in there, and we'd read it. That doesn't sound like a good idea to me. I'd be more comfortable reading those only from files. WDYT?

@BenjaminBossan
Collaborator Author

then people can create .skops files with a ton of binary data in there, and we'd read it. That doesn't sound like a good idea to me

I'm not against your proposal; I just want to understand what the concern is. Is it only about readability? You could argue that even if there is a big blob in the json, it's only one line, so it doesn't distract much (it could slow down some editors though).

@adrinjalali
Member

Slowing down an editor is one thing; loading a massive json into memory is another. I'm not sure how people could exploit this, but I kinda feel like keeping them in separate files is cleaner. Imagine people having images as bytes, and we'd be keeping them in the json file all together.

Besides, I don't think there are many cases where we'd encounter small byte arrays. So my guess is that in most cases those byte arrays are large rather than small, and therefore having all of them in files would speed things up on average.

@BenjaminBossan
Collaborator Author

@adrinjalali I made the changes as you suggested, please review again. For good measure, I checked whether adding the bytes support is sufficient for making catboost work, and indeed it is, so I added tests for it as well.

Member

@adrinjalali adrinjalali left a comment


Pretty cool 🔥

trusted: bool | Sequence[str] = False,
) -> None:
super().__init__(state, load_context, trusted)
self.trusted = self._get_trusted(trusted, [])
Member


shouldn't we trust bytes and bytearray?
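
A hedged guess at the requested fix in the __init__ shown above (whether the default trusted list takes type objects or dotted-name strings is not visible in this thread):

# Assumption: the second argument to _get_trusted is the default list of
# trusted types for this node.
self.trusted = self._get_trusted(trusted, [bytes, bytearray])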

assert_params_equal(estimator.get_params(), loaded.get_params())


def check_method_outputs(estimator, X, trusted):
Member


isn't this common between these tests and scikit-learn ones? Would be nice to avoid duplication as much as we can here.

Collaborator Author


It's very close. For my understanding, is the suggestion to put this method into tests/_utils.py and use it in test_persist.py too?

Member


yes.

Comment on lines 152 to 154
These discrepancies occur regardless of skops, so they're a problem in
xgboost itself. We assume that this has no practical consequences and thus
avoid testing these cases.
Member


is there an upstream issue for this on the xgboost side? If not, it'd be nice to open one and link to it here.

Collaborator Author


As I commented, I could open an issue, but ideally someone who knows xgboost well could comment on whether it's really a bug or not. If there is no one around, I can just open one no matter what.

Member


yeah I think the devs on the xgboost repo can decide if it's a bug which they wanna fix or not.

Commit notes:
- Refactor method output testing to be utility function
- Add bytes and bytearray to trusted types
- …jaminBossan/skops into persistence-support-lightgbm-xgboost
@BenjaminBossan
Collaborator Author

@adrinjalali I addressed your comments, please review again.

Regarding the issue of forgetting to add bytes/bytearray to trusted: It's a bit of a sign that our test coverage could be improved, right? As is, those oversights were not caught with the standard set of tests we have.

@adrinjalali
Member

tests failing @BenjaminBossan

@BenjaminBossan
Collaborator Author

@adrinjalali sorry, I made a last-minute change that introduced the bug and have rolled it back. Tests are green now.

Member

@adrinjalali adrinjalali left a comment


Otherwise happy to merge.

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
@adrinjalali adrinjalali changed the title Persistence: Add support for LightGBM and XGBoost ENH Persistence: Add support for LightGBM and XGBoost Dec 14, 2022
@adrinjalali adrinjalali merged commit 8e17cc5 into skops-dev:main Dec 14, 2022
@BenjaminBossan BenjaminBossan deleted the persistence-support-lightgbm-xgboost branch December 14, 2022 12:24