
ENH Support more array-like data types for tabular data and list-like data types for text data #179

Merged: 20 commits into skops-dev:main, Jan 20, 2023

Conversation

@anferico (Contributor) commented Oct 9, 2022

This pull request is an attempt at addressing #65. Note that this is my first ever public contribution, so don't be too hard on me 😉

For tabular-classification and tabular-regression tasks, we're currently supporting the following data types:

  • pandas.DataFrame
  • numpy.ndarray

Should this PR get merged, we will support any other data type that can be converted to a 2D numpy.ndarray using np.asarray.


For text-classification and text-regression tasks, we're currently supporting the following data types:

  • List of strings

Should this PR get merged, we will support any iterable of strings.
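For illustration, here is a minimal sketch of inputs that would newly be accepted (the values are made up):

import numpy as np

# tabular: anything np.asarray can turn into a 2D array, e.g. a list of lists
X_tabular = np.asarray([[1.0, 2.0], [3.0, 4.0]])
assert X_tabular.ndim == 2

# text: any iterable of strings, not just a list
X_text = ("a positive review", "a negative review")  # e.g. a tuple
assert all(isinstance(x, str) for x in X_text)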

xref: https://github.com/skops-dev/skops/pull/45/files#r933221139

@anferico anferico changed the title Support more array-like data types for tabular data and list-like data types for text data (addresses issue #65) Support more array-like data types for tabular data and list-like data types for text data Oct 9, 2022
@BenjaminBossan (Collaborator)

Thanks a lot @anferico for taking this.

I have only taken a quick look so far and noticed that you haven't added any unit tests for the changes. Could you please do that? Let us know if you need any help with that.

Also, please add an entry to the changes doc.

@anferico (Contributor, Author)

I haven't added any unit tests for the 2 small helper functions I added, though I've modified existing ones in accordance with the changes I made. Not sure what your policy is here, but if needed I can add unit tests for those as well. I also added some docstrings and modified existing ones.

Everything should be fine since all unit tests pass on my local machine, but let's wait until the pytest workflow runs successfully.

@BenjaminBossan (Collaborator) left a comment

Thank you very much for working on this. To me, this looks already quite good and the overall approach can stay the same. However, I have some small comments, mostly regarding efficiency and some missing tests. Please take a look.

docs/changes.rst (outdated; resolved)
    The numpy.ndarray object obtained by converting data.
    """
    data_array = np.asarray(data)
    if len(data_array.shape) != 2:
Collaborator:

data_array.ndim is more idiomatic.
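For example, the check above could become (the error message is illustrative):

data_array = np.asarray(data)
if data_array.ndim != 2:
    raise ValueError("The data must be convertible to a 2D numpy array.")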

return {f"x{x}": data[:3, x].tolist() for x in range(data.shape[1])}

raise ValueError("The data is not a pandas.DataFrame or a numpy.ndarray.")
data_array = _convert_to_2d_numpy_array(data)
Collaborator:

Here we're first converting all the data to a numpy array, only to throw away everything but the first 3 lines. Is it possible to rearrange the code to first slice, then convert? It would depend on what kind of data is passed here, but I think we can be a bit restrictive here. E.g. I don't think we need to support generators here (since we're not dealing with a text task).
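For example, a sketch (assuming the input supports slicing, which is reasonable once generators are excluded):

data_slice = data[:3]                # slice first...
data_array = np.asarray(data_slice)  # ...then convert only the head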

skops/hub_utils/_hf_hub.py (outdated; resolved)
Comment on lines 268 to 269
# data is not iterable
return False
Collaborator:

Same, can we have a test?

try:
    # needed in case data is an iterator or a generator
    data, data_copy = itertools.tee(data, 2)
    return all(isinstance(x, str) for x in data_copy)
Collaborator:

I'm undecided on this one. Here we're testing all the data, even though we only need the first 3 samples. Of course, we can argue that fitting the model should not work if some rows have strange types, but is it really our job here to verify this?
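If we do want to limit the check to the head, a sketch (islice stops after 3 items, and tee keeps the original iterable usable):

import itertools

data, probe = itertools.tee(data, 2)
head_is_strings = all(isinstance(x, str) for x in itertools.islice(probe, 3))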

@adrinjalali adrinjalali self-requested a review October 12, 2022 09:37
@anferico (Contributor, Author)

@BenjaminBossan I've tried addressing your comments, see if it's good now. One minor detail: I renamed the function _is_iterable_of_strings to _is_sequence_of_strings because I'm no longer expecting generic iterables as inputs to that function, which led me to remove the logic that prevents iterators and generators from being exhausted when iterated over. Without that logic, it just felt wrong to call it _is_iterable_of_strings, so I went for _is_sequence_of_strings.

@anferico anferico requested review from BenjaminBossan and removed request for adrinjalali October 13, 2022 19:23
@BenjaminBossan (Collaborator) left a comment

Thanks a lot for making the changes, the feature is almost there. I still have a few comments left, some are just suggestions that could be tackled later. Please also update with the main branch, as there is a merge commit.

Comment on lines 134 to 137

data_slice = data[:3]
data_slice_array = _convert_to_2d_numpy_array(data_slice)
return {
    f"x{x}": data_slice_array[:3, x].tolist()
Collaborator:

Now the data is sliced twice: once in data[:3] and again in data_slice_array[:3, x].
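Since data_slice already holds at most 3 rows, the second slice could simply be dropped, e.g.:

return {
    f"x{x}": data_slice_array[:, x].tolist()
    for x in range(data_slice_array.shape[1])
}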

],
)
def test_head(data, n, expected_output):
    assert _head(data, n) == expected_output
Collaborator:

I think it would be nice to additionally test:

  • a numpy array of strings
  • an empty container
  • for generators, that they're not consumed (could be a separate test)

],
)
def test_is_sequence_of_strings(data, is_sequence_of_strings):
    assert _is_sequence_of_strings(data) == is_sequence_of_strings
Collaborator:

Could you please add an additional test for numpy arrays of strings and for empty containers?

Comment on lines 241 to 249

error_message = "The data needs to be an iterable of strings."
try:
    data_head = _head(data, n=3)
    if _is_sequence_of_strings(data_head):
        config["sklearn"]["example_input"] = {"data": data_head}
    else:
        raise ValueError(error_message)
except TypeError as e:
    raise ValueError(error_message) from e
Collaborator:

This part is getting rather complex and distracts from the actual work being done by _create_config. I wonder if it would be possible to factor this code into _get_example_input, so that this whole block here could be simplified to:

    config["sklearn"]["example_input"] = _get_example_input(data)
    if "tabular" in task:
        config["sklearn"]["columns"] = _get_column_names(data)

If that's not really possible, just leave it as is.

Contributor Author:

How about something like this?

config["sklearn"]["example_input"] = _get_example_input(data, task)
if "tabular" in task:
    config["sklearn"]["columns"] = _get_column_names(data)

Cause this would make it easier for me, I think. Alternatively, I'm thinking of something like:

if "tabular" in task:
    config["sklearn"]["example_input"] = _get_example_input_from_tabular_data(data)
    config["sklearn"]["columns"] = _get_column_names(data)
elif "text" in task:
    config["sklearn"]["example_input"] = _get_example_input_from_text_data(data)

although the latter would imply renaming an existing function

Collaborator:

config["sklearn"]["example_input"] = _get_example_input(data, task)

Looks good to me.


dump_json(Path(dst) / "config.json", config)


def _head(data: Iterable, n: int):
Collaborator:

At some point, we will probably have to factor out functions like _validate_folder, _head, _is_sequence_of_strings etc. to a utils module. It doesn't need to be in this PR though.

@adrinjalali (Member) left a comment

Thanks for the PR @anferico

docs/changes.rst (outdated)
@@ -11,6 +11,7 @@ skops Changelog

v0.3
----
- Support more array-like data types for tabular data and list-like data types for text data. :pr:`179` by `Francesco Cariaggi`_.
Member:

line too long (limit to 79 chars)

@@ -73,6 +74,34 @@ def _validate_folder(path: Union[str, Path]) -> None:
        raise TypeError(f"Model file {model_path} does not exist.")


def _convert_to_2d_numpy_array(data):
Member:

we should use sklearn.utils.validation.check_array instead of writing our own, unless we handle more input types.
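For example, a sketch of that substitution (check_array validates array-likes and converts them to a 2D numpy array; dtype=None preserves the input dtype instead of forcing it to be numeric):

from sklearn.utils.validation import check_array

data_array = check_array(data, dtype=None, ensure_2d=True)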

Comment on lines 114 to 116
The input needs to be anything that can be converted to a 2D
``numpy.ndarray``, including a ``pandas.DataFrame``. The first 3 rows
are used as example input.
Member:

we shouldn't just convert to numpy, for instance, if the input is xarray, we can extract column names. The same with cuDF.

Contributor Author:

Do you have a specific list of data types that you'd like to support? Cause there are countless variants of numpy arrays and pandas dataframes, but I don't think we can find a general solution that works for each of them

Member:

I don't have a list, but I'm not sure if a silent conversion is a good idea. It might be a better idea to fail on containers which we don't understand, and slowly add them as requests come, WDYT?

Contributor Author:

Ok, so if I understood correctly, you'd rather fail on, say, cuDF dataframes than silently convert them to NumPy arrays, because you don't want to risk throwing away information about the column names. Is that right?

Member:

Yes. And the reason I don't want to miss that information is the following:

We use these feature names to figure out how to pass the information to the model on the inference side. And that API receives data in this format: {'colA': [list of values], ...}. If we lose the feature name information here, the user won't know how to send data to the inference backend and the whole inference side of things breaks. Since saving feature names here serves pretty much no purpose other than making the inference backend work, it'd be a pity to silently convert to numpy arrays and lose that information.
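For illustration, a payload in that format (the column names are hypothetical):

{
    "sepal_length": [5.1, 4.9, 4.7],
    "sepal_width": [3.5, 3.0, 3.2]
}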

So I think it's better to fail on things we don't understand, and explicitly add the more popular ones. WDYT?

Contributor Author:

Makes total sense to me. However, this means we need to explicitly check the type of the inputs we get, right? So, basically, right now we're supporting only:

  • DataFrames (for which we can also extract column names)
  • np.arrays
  • [list|tuple] of [list|tuple|np.array] (can be silently converted to np.arrays using sklearn's check_array)

Anything else we should add for now?

Member:

I think this would be a good start. I'm curious to see what people are gonna use. We should ideally raise for all other types or at least raise a warning.
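For example, a sketch of that fail-fast check (the type list comes from the bullets above; the function name and message are illustrative):

import numpy as np
import pandas as pd

def _check_supported(data):
    # raise on containers we don't understand instead of silently converting
    if not isinstance(data, (pd.DataFrame, np.ndarray, list, tuple)):
        raise ValueError(
            f"Unsupported data type: {type(data).__name__}. Expected a "
            "pandas.DataFrame, numpy.ndarray, list, or tuple."
        )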

    return list(itertools.islice(data_copy, n))


def _is_sequence_of_strings(data):
Member:

to me it seems better to remove this method and have it in the function where it's called.


dump_json(Path(dst) / "config.json", config)


def _head(data: Iterable, n: int):
Member:

same here, not sure if we need a method for this.

@BenjaminBossan (Collaborator)

@anferico are you still working on this?

@anferico (Contributor, Author) commented Nov 4, 2022

@anferico are you still working on this?

I am, it's just that I haven't had much free time recently. I'm happy to continue working on this, but if it's really high priority please feel free to let someone else take over. Sorry about that

@BenjaminBossan (Collaborator)

Hey, no worries, take the time you need. I just asked because sometimes people just forget or are waiting on some input from the maintainers.

@anferico (Contributor, Author) commented Dec 8, 2022

@BenjaminBossan I think I've addressed all the points above, however now I'm facing another problem.
I noticed that when running the test_get_column_names_pandas_not_installed test, I run into a weird recursion error. Obviously, as a result, the test fails.
Here's the command I ran:

python -m pytest -s -v --cov-report=xml skops/hub_utils/tests/test_hf_hub.py::test_get_column_names_pandas_not_installed

and here is part of the output I get (see log_test.txt for the full output):

  .
  .
  .
  File "/home/anfri/MyProjects/skops/skops/conftest.py", line 13, in mock_import
  File "/home/anfri/miniconda3/envs/skops/lib/python3.10/unittest/mock.py", line 1103, in __call__
  File "/home/anfri/miniconda3/envs/skops/lib/python3.10/unittest/mock.py", line 1111, in _increment_mock_call
  File "/home/anfri/MyProjects/skops/skops/conftest.py", line 13, in mock_import
  File "/home/anfri/miniconda3/envs/skops/lib/python3.10/unittest/mock.py", line 1103, in __call__
  File "/home/anfri/miniconda3/envs/skops/lib/python3.10/unittest/mock.py", line 1111, in _increment_mock_call
RecursionError: maximum recursion depth exceeded

I'm not sure what's happening here, because I haven't touched conftest.py. My suspicion is that the line from sklearn.utils import check_array that I added to _hf_hub.py could be the culprit: there are some pandas imports in the source file containing that function, so patching the import builtin to raise an ImportError when trying to import pandas could be the problem, but I'm really not sure.

@BenjaminBossan (Collaborator)

I'm not sure what's happening here, because I haven't touched conftest.py. My suspicion is that the line from sklearn.utils import check_array that I added to _hf_hub.py could be the culprit: there are some pandas imports in the source file containing that function, so patching the import builtin to raise an ImportError when trying to import pandas could be the problem, but I'm really not sure.

It seems likely that this is the explanation, but I'm not sure why that would lead to an infinite recursion. I'd have to dig deeper. @adrinjalali any ideas?

Meanwhile, since some time has passed since this PR was opened, the changes.rst has been updated. You would need to move your entry to the current version. Could you please do that?

@adrinjalali (Member)

No idea about the infinite recursion, but looking at this, two things come to my mind:

  • it might be a good idea for us to have a CI job where we have minimal dependencies installed, instead of patching load_module.
  • the same issues as in "Added support for Structured Arrays" (#211) would apply here: we should check whether what we do would work on the inference API side.

It might be a good idea to reduce the scope of this PR to list of strings only, to make it easier to merge and tackle other things in a separate PR.

@BenjaminBossan (Collaborator)

@anferico I figured out the recursion error. Honestly, I'm a bit surprised that it didn't occur previously, but patching can be tricky. Please change the pandas_not_installed fixture like so:

@pytest.fixture
def pandas_not_installed():
    # patch import so that it raises an ImportError when trying to import
    # pandas. This works because pandas is only imported lazily.
    orig_import = __import__
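    # capturing the original __import__ before patching is the key fix: if
    # mock_import looked up builtins.__import__ after patching, it would
    # call the mock itself and recurse forever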

    def mock_import(name, *args, **kwargs):
        if name == "pandas":
            raise ImportError
        return orig_import(name, *args, **kwargs)

    with patch("builtins.__import__", side_effect=mock_import):
        yield

@BenjaminBossan (Collaborator)

  • it might be a good idea for us to have a CI job where we have minimal dependencies installed, instead of patching load_module.

You mean a CI job where, for instance, pandas is not installed? Yes, that could be useful, but I still like to have a test that patches import, so that I can e.g. quickly test the functionality locally without the need to juggle multiple environments.

The extra CI environment would come in handy in case the test itself is buggy or incomplete. However, having such a test environment would probably require adjusting a bunch of other tests too that currently assume all test libraries are installed.

Would it be possible to add this in a separate PR?

@adrinjalali (Member) commented Dec 9, 2022

Would it be possible to add this in a separate PR?

what does "this" here refer to?

I can e.g. quickly test the functionality locally without the need to juggle multiple environments.

Sure, I personally don't bother and let the CI do that for me.

@BenjaminBossan (Collaborator)

what does "this" here refer to?

I mean checking if what we do would work on the inference API side.

@adrinjalali (Member)

I mean checking if what we do would work on the inference API side.

Yes and no 😁 we should at least check that if we train a model with these data types, scikit-learn understands them (in terms of feature names etc.), and that we can predict using these types without getting errors/warnings from sklearn.

Right now sklearn supports feature names mostly through pandas only.
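For illustration, a quick check of that kind (a sketch; any estimator would do):

import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({"a": [0.0, 1.0, 0.5, 0.2], "b": [1.0, 0.0, 0.5, 0.9]})
y = [0, 1, 0, 1]
clf = LogisticRegression().fit(X, y)
print(clf.feature_names_in_)  # ['a' 'b'] -- names are tracked for DataFrames
clf.predict(X)  # same columns at predict time: no feature-name warning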

@anferico (Contributor, Author) commented Jan 8, 2023

@BenjaminBossan thanks, the recursion error just got fixed. I also fixed the changes.rst file.

A couple of comments regarding my latest commit:

  • I removed support for generators as valid containers for text-based tasks. The reason is that I couldn't find a way to peek at a generator without consuming it. Let me know if this is ok
  • The function get_example_input_from_text_data does not fail on empty containers. Is it desirable or not?

@BenjaminBossan (Collaborator) left a comment

Overall, I'm quite happy with the PR. I think it has developed quite well and the code should be easy to maintain.

I have a couple of comments which should hopefully be easy to address, please take a look.

I removed support for generators as valid containers for text-based tasks. The reason is that I couldn't find a way to peek at a generator without consuming it. Let me know if this is ok

Yes, it's okay. We can add them later if we want to. It is possible to peek at the head of a generator without consuming it completely, but of course the first few elements would need to be loaded at that point. One way to get there is by using tee.
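For example, a sketch:

import itertools

def peek(iterable, n=3):
    # consume the head from one tee branch; the returned branch still yields
    # every item (the peeked ones are buffered in memory)
    iterable, probe = itertools.tee(iterable, 2)
    head = list(itertools.islice(probe, n))
    return head, iterable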

The function get_example_input_from_text_data does not fail on empty containers. Is it desirable or not?

Good question. I think it's fine for it not to raise an error. Maybe we can warn, not sure. @skops-dev/maintainers WDYT?

"""

def _head(data, n):
def is_subscriptable(data):
Collaborator:

I don't think it's necessary to have an extra one-line local function defined for this purpose. This creates overhead when it's not necessary and can make the code harder to read, because I cannot simply read from top to bottom but have to jump around.

My guess is that you do this for readability, so that you can later do:

if is_subscriptable(data):
    return data[:n]

instead of

if hasattr(data, "__getitem__"):
    return data[:n]

which can be a bit more difficult to understand. However, the same effect can be achieved by assigning an extra variable:

is_subscriptable = hasattr(data, "__getitem__")
if is_subscriptable:
    return data[:n]

Personally, I would also be happy with the 2nd case, but the 3rd one is fine.

For the other locally defined functions, I think a similar argument can be made, though it's not quite as clear cut. I leave it up to you.

Contributor Author:

Makes sense, thanks for the feedback. If you don't mind, I would change your code slightly so that the sentence also has a subject, namely:

is_data_subscriptable = hasattr(data, "__getitem__")
if is_data_subscriptable:
    return data[:n]

Collaborator:

That's perfectly fine.

def _is_sequence_of_strings(data):
    return not isinstance(data, str) and all(isinstance(x, str) for x in data)

error_message = "The data needs to be a sequence of strings."
Collaborator:

There isn't really a need to assign the error message here. It is better to move it into the except part below.

Contributor Author:

The reason I assigned the error message there is because otherwise I would have to duplicate the same string in the else and except branches, like so:

try:
    data_head = _head(data, n=3)
    if _is_sequence_of_strings(data_head):
        return {"data": data_head}
    else:
        raise ValueError("The data needs to be a sequence of strings.")
except TypeError as e:
    raise ValueError("The data needs to be a sequence of strings.") from e

Another way to avoid this would be to raise a TypeError inside the else branch, although that would feel pretty hacky.
Do you think it's fine to duplicate the error message?

Collaborator:

Sorry, I missed that it's used in two places.

@@ -195,6 +257,7 @@ def _create_config(
- ``"pickle"`` if the extension is one of ``{".pickle", ".pkl", ".joblib"}``
- ``"skops"`` if the extension is ``".skops"``


Collaborator:

Suggested change

    assert len(examples) == 10
    assert len(examples["column0"]) == 3


def test_get_example_input_from_text_data():
Collaborator:

I think this test could also be parametrized. Furthermore, could you please add a test for tuples? I think they are much more likely to be used than sets. Also, how about one example with only 1 element?

@anferico (Contributor, Author)

@BenjaminBossan I've addressed the latest comments, let me know if this can be merged.

@BenjaminBossan (Collaborator) left a comment

Apart from the one comment below, this LGTM. Before merging, I would just wait to see if @adrinjalali also wants to review or is fine with going ahead.

docs/changes.rst (outdated)
@@ -26,6 +26,8 @@ v0.4
section/New section": "content"})`` to add "content" a new subsection called
"New section" to an existing section called "Existing section". :pr:`203` by
`Benjamin Bossan`_.
- Support more array-like data types for tabular data and list-like data types
Collaborator:

I didn't notice earlier, but since we released v0.4 in December, this should be added to a new section "v0.5".

@adrinjalali (Member) left a comment

Nice! Other than the changelog issue, LGTM.

@anferico (Contributor, Author)

Done, thanks again for your support!

@adrinjalali adrinjalali changed the title Support more array-like data types for tabular data and list-like data types for text data ENH Support more array-like data types for tabular data and list-like data types for text data Jan 20, 2023
@adrinjalali adrinjalali merged commit 6d35b5c into skops-dev:main Jan 20, 2023
@anferico anferico deleted the support-more-data-types branch January 20, 2023 09:48