Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC Add secure persistence docs #145

Merged
merged 20 commits into from
Sep 29, 2022
Merged
Show file tree
Hide file tree
Changes from 18 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,10 @@ download or load the model.
and how it should be used. The model card can then be stored as the
``README.md`` file on the Hugging Face Hub, with pre-populated metadata to
help Hub understand the model.
- ``skops.io``: Secure persistence of sklearn estimators and more, without using
``pickle``. Visit `the docs
<https://skops.readthedocs.io/en/latest/persistence.html>`_ for more
information.

Please refer to our `documentation <https://skops.readthedocs.io/en/latest/>`_
on using the library as user, which includes user guides on the above topics as
Expand Down
4 changes: 4 additions & 0 deletions docs/changes.rst
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,10 @@ v0.3
- Add ``private`` as an optional argument to :meth:`.hub_utils.push` to
optionally set the visibility status of a repo when pushing to the hub.
:pr:`130` by `Adrin Jalali`_.
- First release of the skops secure persistence feature (:pr:`128`) by `Adrin
Jalali`_ and `Benjamin Bossan`_. Visit :ref:`persistence` for more
information. This feature is not production ready yet but we're happy to
receive feedback from users.

v0.2
----
Expand Down
1 change: 1 addition & 0 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ User Guide / API Reference
installation
hf_hub
model_card
persistence
modules/classes
changes

Expand Down
6 changes: 5 additions & 1 deletion docs/modules/classes.rst
Original file line number Diff line number Diff line change
Expand Up @@ -8,11 +8,15 @@ This is the class and function reference of skops.

:mod:`skops.hf_hub`: Hugging Face Hub Integration
=================================================

.. automodule:: skops.hub_utils
:members:

:mod:`skops.card`: Model Card Utilities
=======================================
.. automodule:: skops.card
:members:

:mod:`skops.io`: Secure persistence
===================================
.. automodule:: skops.io
:members:
84 changes: 84 additions & 0 deletions docs/persistence.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
.. _persistence:

Secure persistence with skops
=============================

.. warning::

This feature is very early in development, which means the API is
unstable and it is **not secure** at the moment. Therefore, use the same
caution as you would for ``pickle``: Don't load from sources that you
don't trust. In the future, more security will be added.
adrinjalali marked this conversation as resolved.
Show resolved Hide resolved

Skops offers a way to save and load sklearn models without using :mod:`pickle`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this opening would be much stronger if you presented it as an issue with pickle + how skops addresses it. It may not be obvious to a user why they shouldn't use pickle especially if it is so common. Something like:

The pickle module is not secure, but with skops, you can securely save and load sklearn models without using pickle.

The ``pickle`` module is not secure, but with skops, you can securely save and
load sklearn models without using ``pickle``.

``Pickle`` is the standard serialization format for sklearn and for Python in
general. One of the main advantages of ``pickle`` is that it can be used for
almost all Python code but this flexibility also makes it inherently insecure.
This is because loading certain types of objects requires the ability to run
arbitrary code, which can be misused for malicious purposes. For example, an
attacker can use it to steal secrets from your machine or install a virus. As
the Python docs say:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking something like this would be enough?

Suggested change
attacker can use it to steal secrets from your machine or install a virus. As
the Python docs say:
attacker can use it to steal secrets from your machine or install a virus. As
the `Python docs <https://docs.python.org/3/library/pickle.html#module-pickle>`__ say:

This whole paragraph doesn't have any links to pickle.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This results in the exact same link as :mod:`pickle` in the paragraph directly above, but anyway, I added it


.. warning::

The pickle module is not secure. Only unpickle data you trust. It is
possible to construct malicious pickle data which will execute arbitrary
code during unpickling. Never unpickle data that could have come from an
untrusted source, or that could have been tampered with.
adrinjalali marked this conversation as resolved.
Show resolved Hide resolved

In contrast to ``pickle``, the :func:`skops.io.save` and :func:`skops.io.load`
functions cannot be used to save arbitrary Python code, but they bypass
``pickle`` and are thus more secure.

Usage
-----

The code snippet below illustrates how to use :func:`skops.io.save` and
:func:`skops.io.load`:

.. code:: python

from sklearn.linear_model import LogisticRegression
from skops.io import load, save

clf = LogisticRegression(random_state=0, solver="liblinear")
clf.fit(X_train, y_train)
save(clf, "my-logistic-regression.skops")
# ...
loaded = load("my-logistic-regression.skops")
loaded.predict(X_test)

At the moment, we support the vast majority of sklearn estimators. This includes
complex use cases such as :class:`sklearn.pipeline.Pipeline`,
:class:`sklearn.model_selection.GridSearchCV`, classes using Cython code, such
as :class:`sklearn.tree.DecisionTreeClassifier`, and more. If you discover an sklearn
estimator that does not work, please open an issue on the skops `GitHub page
<https://github.com/skops-dev/skops/issues>`_ and let us know.

In contrast to ``pickle``, skops cannot persist arbitrary Python code. This
means if you have custom functions (say, a custom function to be used with
:class:`sklearn.preprocessing.FunctionTransformer`), it will not work. However,
most ``numpy`` and ``scipy`` functions should work. Therefore, you can actually
save built-in functions like``numpy.sqrt``.

Roadmap
-------

Currently, it is still possible to run insecure code when using skops
persistence. For example, it's possible to load a save file that evaluates arbitrary
code using :func:`eval`. However, we have concrete plans on how to mitigate
this, so please stay updated.

On top of trying to support persisting all relevant sklearn objects, we plan on
making persistence extensible for other libraries. As a user, this means that if
you trust a certain library, you will be able to tell skops to load code from
that library. As a library author, there will be a clear path of what needs to
be done to add secure persistence to your library, such that skops can save and
load code from your library.

To follow what features are currently planned, filter for the `"persistence"
label <https://github.com/skops-dev/skops/labels/persistence>`_ in our GitHub
issues.
48 changes: 48 additions & 0 deletions skops/io/_persist.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,30 @@


def save(obj, file):
"""Save an object using the skops persistence format.

Skops aims at providing a secure persistence feature that does not rely on
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to put warnings only to load rather than save, but feel free to ignore if you think it doesn't make sense😅

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I went overboard with warnings, but I really wanted to make sure users don't miss the fact that we're still early in development and know that it's not secure yet. Hopefully, we can fill these gaps soon and then remove the warnings. Until then, I'd rather leave them in.

:mod:`pickle`, which is inherently insecure. For more information, please
visit the :ref:`persistence` documentation.

.. warning::

This feature is very early in development, which means the API is
unstable and it is **not secure** at the moment. Therefore, use the same
caution as you would for ``pickle``: Don't load from sources that you
don't trust. In the future, more security will be added.

Parameters
----------
obj: object
The object to be saved. Usually a scikit-learn compatible model.

file: str
The file name. A zip archive will automatically created. As a matter of
convention, we recommend to use the ".skops" file extension, e.g.
``save(model, "my-model.skops")``.

"""
with tempfile.TemporaryDirectory() as dst:
with open(Path(dst) / "schema.json", "w") as f:
state = get_state(obj, dst)
Expand All @@ -43,6 +67,30 @@ def save(obj, file):


def load(file):
"""Load an object saved with the skops persistence format.

Skops aims at providing a secure persistence feature that does not rely on
:mod:`pickle`, which is inherently insecure. For more information, please
visit the :ref:`persistence` documentation.

.. warning::

This feature is very early in development, which means the API is
unstable and it is **not secure** at the moment. Therefore, use the same
caution as you would for ``pickle``: Don't load from sources that you
don't trust. In the future, more security will be added.

Parameters
----------
file: str
The file name of the object to be loaded.

Returns
-------
instance: object
The loaded object.

"""
with ZipFile(file, "r") as input_zip:
schema = input_zip.read("schema.json")
instance = get_instance(json.loads(schema), input_zip)
Expand Down