
ENH Support more array-like data types for tabular data and list-like data types for text data #179

Merged: 20 commits into skops-dev:main, Jan 20, 2023

Conversation

@anferico (Contributor) commented Oct 9, 2022

This pull request is an attempt at addressing #65. Note that this is my first ever public contribution, so don't be too hard on me 😉

For tabular-classification and tabular-regression tasks, we're currently supporting the following data types:

  • pandas.DataFrame
  • numpy.ndarray

Should this PR get merged, we will support any other data type that can be converted to a 2D numpy.ndarray using np.asarray.


For text-classification and text-regression tasks, we're currently supporting the following data types:

  • List of strings

Should this PR get merged, we will support any iterable of strings.
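For illustration, here is a minimal sketch of inputs that would newly be accepted (the values are made up):

import numpy as np

# tabular: anything np.asarray can turn into a 2D array, e.g. a list of lists
X_tabular = np.asarray([[1.0, 2.0], [3.0, 4.0]])
assert X_tabular.ndim == 2

# text: any iterable of strings, not just a list
X_text = ("a positive review", "a negative review")  # e.g. a tuple
assert all(isinstance(x, str) for x in X_text)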

xref: https://github.com/skops-dev/skops/pull/45/files#r933221139

@anferico anferico changed the title Support more array-like data types for tabular data and list-like data types for text data (addresses issue #65) Support more array-like data types for tabular data and list-like data types for text data Oct 9, 2022
@BenjaminBossan (Collaborator)

Thanks a lot @anferico for taking this.

I have only taken a quick look so far and noticed that you haven't added any unit tests for the changes. Could you please do that? Let us know if you need any help with that.

Also, please add an entry to the changes doc.

@anferico (Contributor, Author)

I haven't added any unit tests for the 2 small helper functions I added, though I've modified existing ones in accordance with the changes I made. Not sure what your policy is here, but if needed I can add unit tests for those as well. I also added some docstrings and modified existing ones.

Everything should be fine since all unit tests pass on my local machine, but let's wait until the pytest workflow runs successfully.

@BenjaminBossan (Collaborator) left a comment

Thank you very much for working on this. To me, this looks already quite good and the overall approach can stay the same. However, I have some small comments, mostly regarding efficiency and some missing tests. Please take a look.

docs/changes.rst (outdated; resolved)
    The numpy.ndarray object obtained by converting data.
    """
    data_array = np.asarray(data)
    if len(data_array.shape) != 2:
Collaborator:

data_array.ndim is more idiomatic.
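For example, the check above could become (the error message is illustrative):

data_array = np.asarray(data)
if data_array.ndim != 2:
    raise ValueError("The data must be convertible to a 2D numpy array.")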

return {f"x{x}": data[:3, x].tolist() for x in range(data.shape[1])}

raise ValueError("The data is not a pandas.DataFrame or a numpy.ndarray.")
data_array = _convert_to_2d_numpy_array(data)
Collaborator:

Here we're first converting all the data to a numpy array, only to throw away everything but the first 3 lines. Is it possible to rearrange the code to first slice, then convert? It would depend on what kind of data is passed here, but I think we can be a bit restrictive here. E.g. I don't think we need to support generators here (since we're not dealing with a text task).
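For example, a sketch (assuming the input supports slicing, which is reasonable once generators are excluded):

data_slice = data[:3]                # slice first...
data_array = np.asarray(data_slice)  # ...then convert only the head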

skops/hub_utils/_hf_hub.py (outdated; resolved)
Comment on lines 268 to 269
# data is not iterable
return False
Collaborator:

Same, can we have a test?

try:
    # needed in case data is an iterator or a generator
    data, data_copy = itertools.tee(data, 2)
    return all(isinstance(x, str) for x in data_copy)
Collaborator:

I'm undecided on this one. Here we're testing all the data, even though we only need the first 3 samples. Of course, we can argue that fitting the model should not work if some rows have strange types, but is it really our job here to verify this?
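If we do want to limit the check to the head, a sketch (islice stops after 3 items, and tee keeps the original iterable usable):

import itertools

data, probe = itertools.tee(data, 2)
head_is_strings = all(isinstance(x, str) for x in itertools.islice(probe, 3))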

@adrinjalali adrinjalali self-requested a review October 12, 2022 09:37
@anferico (Contributor, Author)

@BenjaminBossan I've tried addressing your comments, see if it's good now. One minor detail: I renamed the function _is_iterable_of_strings to _is_sequence_of_strings because I'm no longer expecting generic iterables as inputs to that function, which led me to remove the logic that prevents iterators and generators from being exhausted when iterated over. Without that logic, it just felt wrong to call it _is_iterable_of_strings, so I went for _is_sequence_of_strings.

@anferico anferico requested review from BenjaminBossan and removed request for adrinjalali October 13, 2022 19:23
@BenjaminBossan (Collaborator) left a comment

Thanks a lot for making the changes, the feature is almost there. I still have a few comments left, some are just suggestions that could be tackled later. Please also update with the main branch, as there is a merge commit.

Comment on lines 134 to 137

data_slice = data[:3]
data_slice_array = _convert_to_2d_numpy_array(data_slice)
return {
    f"x{x}": data_slice_array[:3, x].tolist()
Collaborator:

Now the data is sliced twice: once in data[:3] and again in data_slice_array[:3, x].
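Since data_slice already holds at most 3 rows, the second slice could simply be dropped, e.g.:

return {
    f"x{x}": data_slice_array[:, x].tolist()
    for x in range(data_slice_array.shape[1])
}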

],
)
def test_head(data, n, expected_output):
    assert _head(data, n) == expected_output
Collaborator:

I think it would be nice to additionally test:

  • a numpy array of strings
  • an empty container
  • for generators, that they're not consumed (could be a separate test)

],
)
def test_is_sequence_of_strings(data, is_sequence_of_strings):
    assert _is_sequence_of_strings(data) == is_sequence_of_strings
Collaborator:

Could you please add an additional test for numpy arrays of strings and for empty containers?

Comment on lines 241 to 249

error_message = "The data needs to be an iterable of strings."
try:
    data_head = _head(data, n=3)
    if _is_sequence_of_strings(data_head):
        config["sklearn"]["example_input"] = {"data": data_head}
    else:
        raise ValueError(error_message)
except TypeError as e:
    raise ValueError(error_message) from e
Collaborator:

This part is getting rather complex and distracts from the actual work being done by _create_config. I wonder if it would be possible to factor this code into _get_example_input, so that this whole block here could be simplified to:

    config["sklearn"]["example_input"] = _get_example_input(data)
    if "tabular" in task:
        config["sklearn"]["columns"] = _get_column_names(data)

If that's not really possible, just leave it as is.

Contributor Author:

How about something like this?

config["sklearn"]["example_input"] = _get_example_input(data, task)
if "tabular" in task:
    config["sklearn"]["columns"] = _get_column_names(data)

Cause this would make it easier for me, I think. Alternatively, I'm thinking of something like:

if "tabular" in task:
    config["sklearn"]["example_input"] = _get_example_input_from_tabular_data(data)
    config["sklearn"]["columns"] = _get_column_names(data)
elif "text" in task:
    config["sklearn"]["example_input"] = _get_example_input_from_text_data(data)

although the latter would imply renaming an existing function

Collaborator:

config["sklearn"]["example_input"] = _get_example_input(data, task)

Looks good to me.


dump_json(Path(dst) / "config.json", config)


def _head(data: Iterable, n: int):
Collaborator:

At some point, we will probably have to factor out functions like _validate_folder, _head, _is_sequence_of_strings etc. to a utils module. It doesn't need to be in this PR though.

@adrinjalali (Member) left a comment

Thanks for the PR @anferico

docs/changes.rst (outdated)
@@ -11,6 +11,7 @@ skops Changelog

v0.3
----
- Support more array-like data types for tabular data and list-like data types for text data. :pr:`179` by `Francesco Cariaggi`_.
Member:

line too long (limit to 79 chars)

@@ -73,6 +74,34 @@ def _validate_folder(path: Union[str, Path]) -> None:
        raise TypeError(f"Model file {model_path} does not exist.")


def _convert_to_2d_numpy_array(data):
Member:

we should use sklearn.utils.validation.check_array instead of writing our own, unless we handle more input types.
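For example, a sketch of that substitution (check_array validates array-likes and converts them to a 2D numpy array; dtype=None preserves the input dtype instead of forcing it to be numeric):

from sklearn.utils.validation import check_array

data_array = check_array(data, dtype=None, ensure_2d=True)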

Comment on lines 114 to 116
The input needs to be anything that can be converted to a 2D
``numpy.ndarray``, including a ``pandas.DataFrame``. The first 3 rows
are used as example input.
Member:

we shouldn't just convert to numpy, for instance, if the input is xarray, we can extract column names. The same with cuDF.

Contributor Author:

Do you have a specific list of data types that you'd like to support? Cause there are countless variants of numpy arrays and pandas dataframes, but I don't think we can find a general solution that works for each of them

Member:

I don't have a list, but I'm not sure if a silent conversion is a good idea. It might be a better idea to fail on containers which we don't understand, and slowly add them as requests come, WDYT?

Contributor Author:

Ok, so if I understood correctly, you'd rather fail on, say, cuDF dataframes than silently convert them to NumPy arrays, because you don't want to risk throwing away information about the column names. Is that right?

Member:

Yes. And the reason I don't want to miss that information is the following:

We use these feature names to figure out how to pass the information to the model on the inference side. And that API receives data in this format: {'colA': [list of values], ...}. If we lose the feature name information here, the user won't know how to send data to the inference backend and the whole inference side of things breaks. Since saving feature names here serves pretty much no purpose other than making the inference backend work, it'd be a pity to silently convert to numpy arrays and lose that information.
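For illustration, a payload in that format (the column names are hypothetical):

{
    "sepal_length": [5.1, 4.9, 4.7],
    "sepal_width": [3.5, 3.0, 3.2]
}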

So I think it's better to fail on things we don't understand, and explicitly add the more popular ones. WDYT?

Contributor Author:

Makes total sense to me. However, this means we need to explicitly check the type of the inputs we get, right? So, basically, right now we're supporting only:

  • DataFrames (for which we can also extract column names)
  • np.arrays
  • [list|tuple] of [list|tuple|np.array] (can be silently converted to np.arrays using sklearn's check_array)

Anything else we should add for now?

Member:

I think this would be a good start. I'm curious to see what people are gonna use. We should ideally raise for all other types or at least raise a warning.
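For example, a sketch of that fail-fast check (the type list comes from the bullets above; the function name and message are illustrative):

import numpy as np
import pandas as pd

def _check_supported(data):
    # raise on containers we don't understand instead of silently converting
    if not isinstance(data, (pd.DataFrame, np.ndarray, list, tuple)):
        raise ValueError(
            f"Unsupported data type: {type(data).__name__}. Expected a "
            "pandas.DataFrame, numpy.ndarray, list, or tuple."
        )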

    return list(itertools.islice(data_copy, n))


def _is_sequence_of_strings(data):
Member:

to me it seems better to remove this method and have it in the function where it's called.


dump_json(Path(dst) / "config.json", config)


def _head(data: Iterable, n: int):
Member:

same here, not sure if we need a method for this.

@BenjaminBossan (Collaborator)

@anferico are you still working on this?

@anferico (Contributor, Author) commented Nov 4, 2022

@anferico are you still working on this?

I am, it's just that I haven't had much free time recently. I'm happy to continue working on this, but if it's really high priority please feel free to let someone else take over. Sorry about that

@BenjaminBossan (Collaborator)

Hey, no worries, take the time you need. I just asked because sometimes people just forget or are waiting on some input from the maintainers.

@anferico (Contributor, Author) commented Dec 8, 2022

@BenjaminBossan I think I've addressed all the points above, however now I'm facing another problem.
I noticed that when running the test_get_column_names_pandas_not_installed test, I run into a weird recursion error. Obviously, as a result, the test fails.
Here's the command I ran:

python -m pytest -s -v --cov-report=xml skops/hub_utils/tests/test_hf_hub.py::test_get_column_names_pandas_not_installed

and here is part of the output I get (see log_test.txt for the full output):

  .
  .
  .
  File "/home/anfri/MyProjects/skops/skops/conftest.py", line 13, in mock_import
  File "/home/anfri/miniconda3/envs/skops/lib/python3.10/unittest/mock.py", line 1103, in __call__
  File "/home/anfri/miniconda3/envs/skops/lib/python3.10/unittest/mock.py", line 1111, in _increment_mock_call
  File "/home/anfri/MyProjects/skops/skops/conftest.py", line 13, in mock_import
  File "/home/anfri/miniconda3/envs/skops/lib/python3.10/unittest/mock.py", line 1103, in __call__
  File "/home/anfri/miniconda3/envs/skops/lib/python3.10/unittest/mock.py", line 1111, in _increment_mock_call
RecursionError: maximum recursion depth exceeded

I'm not sure what's happening here, because I haven't touched conftest.py. My suspicion is that the line from sklearn.utils import check_array that I added to _hf_hub.py could be the culprit: there are some pandas imports in the source file containing that function, so patching the import builtin to raise an ImportError when trying to import pandas could be the problem, but I'm really not sure.

@BenjaminBossan (Collaborator)

I'm not sure what's happening here, because I haven't touched conftest.py. My suspicion is that the line from sklearn.utils import check_array that I added to _hf_hub.py could be the culprit: there are some pandas imports in the source file containing that function, so patching the import builtin to raise an ImportError when trying to import pandas could be the problem, but I'm really not sure.

It seems likely that this is the explanation, but I'm not sure why that would lead to an infinite recursion. I'd have to dig deeper. @adrinjalali any ideas?

Meanwhile, since some time has passed since this PR was opened, the changes.rst has been updated. You would need to move your entry to the current version. Could you please do that?

@adrinjalali (Member)

No idea about the infinite recursion, but looking at this, two things come to my mind:

  • it might be a good idea for us to have a CI job where we have minimal dependencies installed, instead of patching load_module.
  • the same issues as in "Added support for Structured Arrays" (#211) would apply here: we should check whether what we do would work on the inference API side.

It might be a good idea to reduce the scope of this PR to list of strings only, to make it easier to merge and tackle other things in a separate PR.

@BenjaminBossan (Collaborator)

@anferico I figured out the recursion error. Honestly, I'm a bit surprised that it didn't occur previously, but patching can be tricky. Please change the pandas_not_installed fixture like so:

@pytest.fixture
def pandas_not_installed():
    # patch import so that it raises an ImportError when trying to import
    # pandas. This works because pandas is only imported lazily.
    orig_import = __import__
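    # capturing the original __import__ before patching is the key fix: if
    # mock_import looked up builtins.__import__ after patching, it would
    # call the mock itself and recurse forever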

    def mock_import(name, *args, **kwargs):
        if name == "pandas":
            raise ImportError
        return orig_import(name, *args, **kwargs)

    with patch("builtins.__import__", side_effect=mock_import):
        yield

@BenjaminBossan (Collaborator)

  • it might be a good idea for us to have a CI job where we have minimal dependencies installed, instead of patching load_module.

You mean a CI job where, for instance, pandas is not installed? Yes, that could be useful, but I still like to have a test that patches import, so that I can e.g. quickly test the functionality locally without the need to juggle multiple environments.

The extra CI environment would come in handy in case the test itself is buggy or incomplete. However, having such a test environment would probably require adjusting a bunch of other tests too that currently assume all test libraries are installed.

Would it be possible to add this in a separate PR?

@adrinjalali (Member) commented Dec 9, 2022

Would it be possible to add this in a separate PR?

what does "this" here refer to?

I can e.g. quickly test the functionality locally without the need to juggle multiple environments.

Sure, I personally don't bother and let the CI do that for me.

@BenjaminBossan (Collaborator)

what does "this" here refer to?

I mean checking if what we do would work on the inference API side.

@adrinjalali (Member)

I mean checking if what we do would work on the inference API side.

Yes and no 😁 we should at least check that if we train a model with these data types, scikit-learn understands them (in terms of feature names etc.), and that we can predict using these types without getting errors/warnings from sklearn.

Right now sklearn supports feature names mostly through pandas only.
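For illustration, a quick check of that kind (a sketch; any estimator would do):

import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({"a": [0.0, 1.0, 0.5, 0.2], "b": [1.0, 0.0, 0.5, 0.9]})
y = [0, 1, 0, 1]
clf = LogisticRegression().fit(X, y)
print(clf.feature_names_in_)  # ['a' 'b'] -- names are tracked for DataFrames
clf.predict(X)  # same columns at predict time: no feature-name warning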

@anferico (Contributor, Author) commented Jan 8, 2023

@BenjaminBossan thanks, the recursion error just got fixed. I also fixed the changes.rst file.

A couple of comments regarding my latest commit:

  • I removed support for generators as valid containers for text-based tasks. The reason is that I couldn't find a way to peek at a generator without consuming it. Let me know if this is ok
  • The function get_example_input_from_text_data does not fail on empty containers. Is it desirable or not?

@BenjaminBossan (Collaborator) left a comment

Overall, I'm quite happy with the PR. I think it has developed quite well and the code should be easy to maintain.

I have a couple of comments which should hopefully be easy to address, please take a look.

I removed support for generators as valid containers for text-based tasks. The reason is that I couldn't find a way to peek at a generator without consuming it. Let me know if this is ok

Yes, it's okay. We can add them later if we want to. It is possible to peek at the head of a generator without consuming it completely, but of course the first few elements would need to be loaded at that point. One way to get there is by using tee.
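For example, a sketch:

import itertools

def peek(iterable, n=3):
    # consume the head from one tee branch; the returned branch still yields
    # every item (the peeked ones are buffered in memory)
    iterable, probe = itertools.tee(iterable, 2)
    head = list(itertools.islice(probe, n))
    return head, iterable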

The function get_example_input_from_text_data does not fail on empty containers. Is it desirable or not?

Good question. I think it's fine for it not to raise an error. Maybe we can warn, not sure. @skops-dev/maintainers WDYT?

"""

def _head(data, n):
def is_subscriptable(data):
Collaborator:

I don't think it's necessary to have an extra one-line local function defined for this purpose. This creates overhead when it's not necessary and can make the code harder to read, because I cannot simply read from top to bottom but have to jump around.

My guess is that you do this for readability, so that you can later do:

if is_subscriptable(data):
    return data[:n]

instead of

if hasattr(data, "__getitem__"):
    return data[:n]

which can be a bit more difficult to understand. However, the same effect can be achieved by assigning an extra variable:

is_subscriptable = hasattr(data, "__getitem__")
if is_subscriptable:
    return data[:n]

Personally, I would also be happy with the 2nd case, but the 3rd one is fine.

For the other locally defined functions, I think a similar argument can be made, though it's not quite as clear cut. I leave it up to you.

Contributor Author:

Makes sense, thanks for the feedback. If you don't mind, I would change your code slightly so that the sentence also has a subject, namely:

is_data_subscriptable = hasattr(data, "__getitem__")
if is_data_subscriptable:
    return data[:n]

Collaborator:

That's perfectly fine.

def _is_sequence_of_strings(data):
    return not isinstance(data, str) and all(isinstance(x, str) for x in data)

error_message = "The data needs to be a sequence of strings."
Collaborator:

There isn't really a need to assign the error message here. It is better to move it into the except part below.

Contributor Author:

The reason I assigned the error message there is because otherwise I would have to duplicate the same string in the else and except branches, like so:

try:
    data_head = _head(data, n=3)
    if _is_sequence_of_strings(data_head):
        return {"data": data_head}
    else:
        raise ValueError("The data needs to be a sequence of strings.")
except TypeError as e:
    raise ValueError("The data needs to be a sequence of strings.") from e

Another way to avoid this would be to raise a TypeError inside the else branch, although that would feel pretty hacky.
Do you think it's fine to duplicate the error message?

Collaborator:

Sorry, I missed that it's used in two places.

@@ -195,6 +257,7 @@ def _create_config(
- ``"pickle"`` if the extension is one of ``{".pickle", ".pkl", ".joblib"}``
- ``"skops"`` if the extension is ``".skops"``


Collaborator:

Suggested change

    assert len(examples) == 10
    assert len(examples["column0"]) == 3


def test_get_example_input_from_text_data():
Collaborator:

I think this test could also be parametrized. Furthermore, could you please add a test for tuples? I think they are much more likely to be used than sets. Also, how about one example with only 1 element?

@anferico (Contributor, Author)

@BenjaminBossan I've addressed the latest comments, let me know if this can be merged.

@BenjaminBossan (Collaborator) left a comment

Apart from the one comment below, this LGTM. Before merging, I would just wait to see if @adrinjalali also wants to review or is fine with going ahead.

docs/changes.rst (outdated)
@@ -26,6 +26,8 @@ v0.4
section/New section": "content"})`` to add "content" a new subsection called
"New section" to an existing section called "Existing section". :pr:`203` by
`Benjamin Bossan`_.
- Support more array-like data types for tabular data and list-like data types
Collaborator:

I didn't notice earlier, but since we released v0.4 in December, this should be added to a new section "v0.5".

@adrinjalali (Member) left a comment

Nice! Other than the changelog issue, LGTM.

@anferico (Contributor, Author)

Done, thanks again for your support!

@adrinjalali adrinjalali changed the title Support more array-like data types for tabular data and list-like data types for text data ENH Support more array-like data types for tabular data and list-like data types for text data Jan 20, 2023
@adrinjalali adrinjalali merged commit 6d35b5c into skops-dev:main Jan 20, 2023
@anferico anferico deleted the support-more-data-types branch January 20, 2023 09:48