ENH Support more array-like data types for tabular data and list-like data types for text data #179
Conversation
Thanks a lot @anferico for taking this. I have only taken a quick look so far and noticed that you haven't added any unit tests for the changes. Could you please do that? Let us know if you need any help with that. Also, please add an entry to the changes doc.
I haven't added any unit tests for the 2 small helper functions I added, though I've modified existing ones in accordance with the changes I made. Not sure what your policy is here, but if needed I can add unit tests for those as well. I also added some docstrings and modified existing ones. Everything should be fine since all unit tests were passing on my local machine, but let's wait until the pytest workflow runs successfully.
Thank you very much for working on this. To me, this already looks quite good and the overall approach can stay the same. However, I have some small comments, mostly regarding efficiency and some missing tests. Please take a look.
skops/hub_utils/_hf_hub.py

```python
        The numpy.ndarray object obtained by converting data.
    """
    data_array = np.asarray(data)
    if len(data_array.shape) != 2:
```
`data_array.ndim` is more idiomatic.
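To illustrate the suggestion (a small sketch, not code from the PR): both spellings check the same thing, but `.ndim` states the intent directly.

```python
import numpy as np

data_array = np.asarray([[1, 2, 3], [4, 5, 6]])
# len(shape) and ndim always agree; ndim reads as "number of dimensions"
assert data_array.ndim == 2
assert len(data_array.shape) == data_array.ndim
```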
skops/hub_utils/_hf_hub.py

```python
    return {f"x{x}": data[:3, x].tolist() for x in range(data.shape[1])}

    raise ValueError("The data is not a pandas.DataFrame or a numpy.ndarray.")
    data_array = _convert_to_2d_numpy_array(data)
```
Here we're first converting all the data to a numpy array, only to throw away everything but the first 3 lines. Is it possible to rearrange the code to first slice, then convert? It would depend on what kind of data is passed here, but I think we can be a bit restrictive here. E.g. I don't think we need to support generators here (since we're not dealing with a text task).
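A minimal sketch of the suggested rearrangement, assuming the data supports slicing (the helper name `example_rows` is hypothetical, not the PR's code):

```python
import numpy as np

def example_rows(data, n=3):
    # slice first, then convert: np.asarray only materializes the first
    # n rows instead of copying the whole dataset into an array
    return np.asarray(data[:n])

# only a (3, 2) array is created, not a (1000, 2) one
small = example_rows([[i, i + 1] for i in range(1000)])
```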
skops/hub_utils/_hf_hub.py

```python
        # data is not iterable
        return False
```
Same, can we have a test?
skops/hub_utils/_hf_hub.py

```python
    try:
        # needed in case data is an iterator or a generator
        data, data_copy = itertools.tee(data, 2)
        return all(isinstance(x, str) for x in data_copy)
```
I'm undecided on this one. Here we're testing all the data, even though we only need the first 3 samples. Of course, we can argue that fitting the model should not work if some rows have strange types, but is it really our job here to verify this?
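For comparison, checking only the first few samples could look like this (hypothetical helper name; a sketch, not the PR's implementation):

```python
import itertools

def first_n_are_strings(data, n=3):
    # inspect only the first n samples instead of iterating over
    # (a copy of) the entire input
    head = list(itertools.islice(iter(data), n))
    return all(isinstance(x, str) for x in head)

first_n_are_strings(["a", "b", "c", 4])  # True: the int is never inspected
first_n_are_strings([1, "b", "c"])       # False
```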
@BenjaminBossan I've tried addressing your comments, see if it's good now. One minor detail: I renamed the function
Thanks a lot for making the changes, the feature is almost there. I still have a few comments left, some are just suggestions that could be tackled later. Please also update with the main branch, as there is a merge commit.
skops/hub_utils/_hf_hub.py

```python
    data_slice = data[:3]
    data_slice_array = _convert_to_2d_numpy_array(data_slice)
    return {
        f"x{x}": data_slice_array[:3, x].tolist()
```
Now the data is sliced twice.
skops/hub_utils/tests/test_hf_hub.py

```python
    ],
)
def test_head(data, n, expected_output):
    assert _head(data, n) == expected_output
```
I think it would be nice to additionally test:
- a numpy array of strings
- an empty container
- for generators, that they're not consumed (could be a separate test)
skops/hub_utils/tests/test_hf_hub.py

```python
    ],
)
def test_is_sequence_of_strings(data, is_sequence_of_strings):
    assert _is_sequence_of_strings(data) == is_sequence_of_strings
```
Could you please add an additional test for numpy arrays of strings and for empty containers?
skops/hub_utils/_hf_hub.py

```python
        error_message = "The data needs to be an iterable of strings."
        try:
            data_head = _head(data, n=3)
            if _is_sequence_of_strings(data_head):
                config["sklearn"]["example_input"] = {"data": data_head}
            else:
                raise ValueError(error_message)
        except TypeError as e:
            raise ValueError(error_message) from e
```
This part is getting rather complex and distracts from the actual work being done by `_create_config`. I wonder if it would be possible to factor this code into `_get_example_input`, so that this whole block here could be simplified to:

```python
config["sklearn"]["example_input"] = _get_example_input(data)
if "tabular" in task:
    config["sklearn"]["columns"] = _get_column_names(data)
```

If that's not really possible, just leave it as is.
How about something like this?

```python
config["sklearn"]["example_input"] = _get_example_input(data, task)
if "tabular" in task:
    config["sklearn"]["columns"] = _get_column_names(data)
```

Cause this would make it easier for me, I think. Alternatively, I'm thinking of something like:

```python
if "tabular" in task:
    config["sklearn"]["example_input"] = _get_example_input_from_tabular_data(data)
    config["sklearn"]["columns"] = _get_column_names(data)
elif "text" in task:
    config["sklearn"]["example_input"] = _get_example_input_from_text_data(data)
```

although the latter would imply renaming an existing function.
```python
config["sklearn"]["example_input"] = _get_example_input(data, task)
```

Looks good to me.
skops/hub_utils/_hf_hub.py

```python
    dump_json(Path(dst) / "config.json", config)


def _head(data: Iterable, n: int):
```
At some point, we will probably have to factor out functions like `_validate_folder`, `_head`, `_is_sequence_of_strings` etc. to a utils module. It doesn't need to be in this PR though.
Thanks for the PR @anferico
docs/changes.rst

```
@@ -11,6 +11,7 @@ skops Changelog

 v0.3
 ----
 - Support more array-like data types for tabular data and list-like data types for text data. :pr:`179` by `Francesco Cariaggi`_.
```
line too long (limit to 79 chars)
skops/hub_utils/_hf_hub.py

```
@@ -73,6 +74,34 @@ def _validate_folder(path: Union[str, Path]) -> None:
         raise TypeError(f"Model file {model_path} does not exist.")


 def _convert_to_2d_numpy_array(data):
```
We should use `sklearn.utils.validation.check_array` instead of writing our own, unless we handle more input types.
skops/hub_utils/_hf_hub.py

```python
    The input needs to be anything that can be converted to a 2D
    ``numpy.ndarray``, including a ``pandas.DataFrame``. The first 3 rows
    are used as example input.
```
We shouldn't just convert to numpy: for instance, if the input is xarray, we can extract column names. The same goes for cuDF.
Do you have a specific list of data types that you'd like to support? Cause there are countless variants of numpy arrays and pandas dataframes, but I don't think we can find a general solution that works for each of them
I don't have a list, but I'm not sure if a silent conversion is a good idea. It might be a better idea to fail on containers which we don't understand, and slowly add them as requests come, WDYT?
Ok, so if I understood correctly, you'd rather fail on, say, cuDF dataframes than silently convert them to NumPy arrays, because you don't want to risk throwing away information about the column names. Is that right?
Yes. And the reason I don't want to miss that information is the following:
We use these feature names to figure out how to pass the information to the model on the inference side. And that API receives data in this format: `{'colA': [list of values], ...}`. If we lose the feature name information here, the user wouldn't know how to send data to the inference backend and the whole inference side of things breaks. Since saving feature names here serves pretty much no other purpose than making the inference backend work, it'd be a pity to silently convert them to numpy arrays and lose that information.
So I think it's better to fail on things we don't understand, and explicitly add the more popular ones. WDYT?
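Concretely, the inference payload format described above maps each feature name to a list of that column's values (the names and numbers below are made up for illustration):

```python
# hypothetical payload for the inference API's {'colA': [values], ...} format
columns = ["age", "income"]
rows = [[25, 40000], [37, 52000], [52, 61000]]
payload = {col: [row[i] for row in rows] for i, col in enumerate(columns)}
# payload == {"age": [25, 37, 52], "income": [40000, 52000, 61000]}
```

Without the original column names, a caller has no way to build these keys.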
Makes total sense to me, however this means we need to explicitly check the type of the inputs we get, right? So, basically, right now we're supporting only:

- `DataFrame`s (for which we can also extract column names)
- `np.array`s
- [list|tuple] of [list|tuple|np.array] (can be silently converted to `np.array`s using sklearn's `check_array`)

Anything else we should add for now?
I think this would be a good start. I'm curious to see what people are gonna use. We should ideally raise for all other types or at least raise a warning.
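A sketch of what that explicit, fail-fast dispatch could look like (hypothetical helper; the PR's actual `_get_column_names` may differ):

```python
import numpy as np

def get_column_names(data):
    # hypothetical dispatch over the agreed-on types; anything else raises
    # instead of being silently converted (which would lose column names)
    try:
        import pandas as pd  # imported lazily, pandas may not be installed
        if isinstance(data, pd.DataFrame):
            return list(data.columns)
    except ImportError:
        pass
    if isinstance(data, (np.ndarray, list, tuple)):
        # fall back to generated names, matching the f"x{i}" scheme above
        n_cols = np.asarray(data[:1]).shape[1]
        return [f"x{i}" for i in range(n_cols)]
    raise ValueError(f"Unsupported data type: {type(data)}")
```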
skops/hub_utils/_hf_hub.py

```python
        return list(itertools.islice(data_copy, n))


def _is_sequence_of_strings(data):
```
to me it seems better to remove this method and have it in the function where it's called.
skops/hub_utils/_hf_hub.py

```python
    dump_json(Path(dst) / "config.json", config)


def _head(data: Iterable, n: int):
```
same here, not sure if we need a method for this.
@anferico are you still working on this?
I am, it's just that I haven't had much free time recently. I'm happy to continue working on this, but if it's really high priority please feel free to let someone else take over. Sorry about that.
Hey, no worries, take the time you need. I just asked because sometimes people just forget or are waiting on some input from the maintainers.
@BenjaminBossan I think I've addressed all the points above, however now I'm facing another problem. I'm running `python -m pytest -s -v --cov-report=xml skops/hub_utils/tests/test_hf_hub.py::test_get_column_names_pandas_not_installed` and here is part of the output I get (see log_test.txt for the full output):

I'm not sure what's happening here, because I haven't touched
It seems likely that this is the explanation, but I'm not sure why that would lead to an infinite recursion. I'd have to dig deeper. @adrinjalali any ideas? Meanwhile, since some time has passed since this PR was opened, the
No idea about the infinite recursion, but looking at this, two things come to my mind:

It might be a good idea to reduce the scope of this PR to list of strings only, to make it easier to merge and tackle other things in a separate PR.
@anferico I figured out the recursion error. Honestly, I'm a bit surprised that it didn't occur previously, but patching can be tricky. Please change the fixture as follows:

```python
@pytest.fixture
def pandas_not_installed():
    # patch import so that it raises an ImportError when trying to import
    # pandas. This works because pandas is only imported lazily.
    orig_import = __import__

    def mock_import(name, *args, **kwargs):
        if name == "pandas":
            raise ImportError
        return orig_import(name, *args, **kwargs)

    with patch("builtins.__import__", side_effect=mock_import):
        yield
```
You mean a CI job where, for instance, pandas is not installed? Yes, that could be useful, but I still like to have a test that patches `__import__`. The extra CI environment would come in handy in case the test itself is buggy or incomplete. However, having such a test environment would probably require adjusting a bunch of other tests too that currently assume all
Would it be possible to add this in a separate PR?
what does "this" here refer to?
Sure, I personally don't bother and let the CI do that for me.
I mean checking if what we do would work on the inference API side.
Yes and no 😁 we should at least check that, if we train a model with these data types, scikit-learn understands them (in terms of feature names etc.), and that we can predict using these types without getting errors/warnings from sklearn. Right now sklearn supports feature names mostly through pandas only.
@BenjaminBossan thanks, the recursion error just got fixed. I also fixed the

A couple of comments regarding my latest commit:
Overall, I'm quite happy with the PR. I think it has developed quite well and the code should be easy to maintain.
I have a couple of comments which should hopefully be easy to address, please take a look.
> I removed support for generators as valid containers for text-based tasks. The reason is that I couldn't find a way to peek at a generator without consuming it. Let me know if this is ok
Yes, it's okay. We can add them later if we want to. It is possible to peek at the head of a generator without consuming it completely, but of course the first few elements would need to be loaded at that point. One way to get there is by using `tee`.
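A small sketch of that `tee`-based peeking (the helper name `peek` is hypothetical): the first n elements are loaded eagerly, but the returned iterator still replays the full stream.

```python
import itertools

def peek(iterable, n=3):
    # split the stream in two; consuming the head branch does not
    # consume the branch handed back to the caller
    head_branch, rest = itertools.tee(iter(iterable))
    return list(itertools.islice(head_branch, n)), rest

head, rest = peek(x * x for x in range(5))
# head == [0, 1, 4]; list(rest) == [0, 1, 4, 9, 16]
```

Note that `tee` buffers the peeked elements internally, so this trades a little memory for not losing them.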
> The function `get_example_input_from_text_data` does not fail on empty containers. Is it desirable or not?
Good question. I think it's fine for it not to raise an error. Maybe we can warn, not sure. @skops-dev/maintainers WDYT?
skops/hub_utils/_hf_hub.py

```python
    """


def _head(data, n):
    def is_subscriptable(data):
```
I don't think it's necessary to have an extra one-line local function defined for this purpose. This creates overhead when it's not necessary and can make the code harder to read, because I cannot simply read from top to bottom but have to jump around.

My guess is that you do this for readability, so that you can later do:

```python
if is_subscriptable(data):
    return data[:n]
```

instead of

```python
if hasattr(data, "__getitem__"):
    return data[:n]
```

which can be a bit more difficult to understand. However, the same effect can be achieved by assigning an extra variable:

```python
is_subscriptable = hasattr(data, "__getitem__")
if is_subscriptable:
    return data[:n]
```

Personally, I would also be happy with the 2nd case, but the 3rd one is fine.

For the other locally defined functions, I think a similar argument can be made, though it's not quite as clear cut. I leave it up to you.
Makes sense, thanks for the feedback. If you don't mind, I would change your code slightly so that the sentence also has a subject, namely:

```python
is_data_subscriptable = hasattr(data, "__getitem__")
if is_data_subscriptable:
    return data[:n]
```
That's perfectly fine.
```python
def _is_sequence_of_strings(data):
    return not isinstance(data, str) and all(isinstance(x, str) for x in data)

    error_message = "The data needs to be a sequence of strings."
```
There isn't really a need to assign the error message here. It is better to move it into the `except` part below.
The reason I assigned the error message there is that otherwise I would have to duplicate the same string in the `else` and `except` branches, like so:

```python
try:
    data_head = _head(data, n=3)
    if _is_sequence_of_strings(data_head):
        return {"data": data_head}
    else:
        raise ValueError("The data needs to be a sequence of strings.")
except TypeError as e:
    raise ValueError("The data needs to be a sequence of strings.") from e
```

Another way to avoid this would be to raise a `TypeError` inside the `else` branch, although that would feel pretty hacky. Do you think it's fine to duplicate the error message?
Sorry, I had missed that it's used in two places.
skops/hub_utils/_hf_hub.py

```
@@ -195,6 +257,7 @@ def _create_config(
     - ``"pickle"`` if the extension is one of ``{".pickle", ".pkl", ".joblib"}``
     - ``"skops"`` if the extension is ``".skops"``
```
skops/hub_utils/tests/test_hf_hub.py

```python
    assert len(examples) == 10
    assert len(examples["column0"]) == 3


def test_get_example_input_from_text_data():
```
I think this test could also be parametrized. Furthermore, could you please add a test for tuples? I think they are much more likely to be used than sets. Also, how about one example with only 1 element?
@BenjaminBossan I've addressed the latest comments, let me know if this can be merged.
Apart from the one comment below, this LGTM. Before merging, I would just wait to see if @adrinjalali also wants to review or is fine with going ahead.
docs/changes.rst

```
@@ -26,6 +26,8 @@ v0.4
   section/New section": "content"})`` to add "content" a new subsection called
   "New section" to an existing section called "Existing section". :pr:`203` by
   `Benjamin Bossan`_.
 - Support more array-like data types for tabular data and list-like data types
```
I didn't notice earlier, but since we released v0.4 in December, this should be added to a new section "v0.5".
Nice! Other than the changelog issue, LGTM.
Done, thanks again for your support!
This pull request is an attempt towards addressing #65. Note that this is my first ever public contribution, so don't be too hard on me 😉

For `tabular-classification` and `tabular-regression` tasks, we're currently supporting the following data types:

Should this PR get merged, we will support any other data type that can be converted to a 2D `numpy.ndarray` using `np.asarray`.

For `text-classification` and `text-regression` tasks, we're currently supporting the following data types:

Should this PR get merged, we will support any iterable of strings.
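A quick illustration of what the two acceptance rules amount to (illustrative data only, not code from the PR):

```python
import numpy as np

# tabular tasks: anything np.asarray can turn into a 2D array qualifies
for tabular in ([[1, 2], [3, 4]], ((1, 2), (3, 4)), np.arange(4).reshape(2, 2)):
    assert np.asarray(tabular).ndim == 2

# text tasks: any iterable of strings qualifies
texts = ("a short review", "another one", "a third")
assert all(isinstance(t, str) for t in texts)
```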
xref: https://github.com/skops-dev/skops/pull/45/files#r933221139