ENH add examples and dtypes to CardData and config.json #45

adrinjalali · 2022-07-20T15:22:51Z

This PR adds functionality to store an example input and the input columns and their dtypes, so that the backend can better understand how to pass data to the model.

adrinjalali · 2022-07-20T15:23:29Z

@merveenoyan you can see the kind of change I was suggesting on the model card side in this PR.

BenjaminBossan

Good addition, thanks.

I have a few minor comments and questions, please take a look. Apart from that, could you please add unit tests to check if the dtypes are correctly determined and stored?

merveenoyan · 2022-07-21T11:10:23Z

skops/card/_model_card.py

+    # Load the model from the existing directory.
+    config = get_config(path)
+    model_path = Path(path) / config["sklearn"]["model"]["file"]
+    with open(model_path, "rb") as f:


This limits the user, IMO.

adrinjalali

Some conversations are still open, need to address those before I add tests.

adrinjalali · 2022-07-22T11:46:17Z

skops/card/_model_card.py

+    # Load the model from the existing directory.
+    config = get_config(path)
+    model_path = Path(path) / config["sklearn"]["model"]["file"]
+    with open(model_path, "rb") as f:


I'm not sure how it's limiting the user though, can you explain?

adrinjalali

I'm still not clear what you mean here @merveenoyan : https://github.com/skops-dev/skops/pull/45/files#r926549925

adrinjalali

The tests fail because persisting dtypes as strings is not trivial. I'm thinking of removing them from this PR and marking it as future work. WDYT @BenjaminBossan ?

adrinjalali · 2022-07-25T12:42:38Z

skops/hub_utils/_hf_hub.py

+
+        if isinstance(data, pd.DataFrame):
+            return {x: data[x][:3].to_list() for x in data.columns}
+    except ImportError:


See #53 for discussion on pandas dependency.
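For illustration, here is a minimal sketch of how an example input and the column names could end up in config.json. It uses a plain dict of lists as a stand-in for a DataFrame so that it runs without pandas; the helper names `get_example_input` and `get_column_names` are hypothetical and only mirror the private helpers referenced in the diff, not their actual implementation.

```python
import json


def get_example_input(table, n_rows=3):
    """Keep the first n_rows values of every column (hypothetical stand-in)."""
    return {name: list(values[:n_rows]) for name, values in table.items()}


def get_column_names(table):
    """Return the column names in their original order."""
    return list(table.keys())


# A dict of lists standing in for a small tabular dataset.
table = {
    "sepal length": [5.1, 4.9, 4.7, 4.6],
    "sepal width": [3.5, 3.0, 3.2, 3.1],
}

config = {"sklearn": {}}
config["sklearn"]["example_input"] = get_example_input(table)
config["sklearn"]["columns"] = get_column_names(table)

# Everything stored is a plain JSON type, so it serializes directly.
config_json = json.dumps(config, indent=2)
```

With a real DataFrame, the equivalent of the first helper is the dict comprehension shown in the diff above.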

BenjaminBossan · 2022-07-25T13:58:31Z

> The tests fail because persisting dtypes as strings is not trivial. I'm thinking of removing them from this PR and marking it as future work.

Let me try to understand the problem: we would like to encode the numpy dtype in JSON. The only sensible JSON type would be a string. Therefore, we need a mapping from dtype to str and vice versa. Can we use this collection? I checked and numpy dtypes are hashable, so we could create a dict.

adrinjalali · 2022-07-25T14:03:41Z

Yes, that would be an idea, but that list doesn't include the string types, and it also doesn't include the little/big-endian info. The dtype object actually includes quite a bit of information: https://numpy.org/doc/stable/reference/arrays.dtypes.html

I was motivated to include this partly because we need to have a list with the order of the columns, and I thought I could as well include the dtypes.

But now that I see the complexity, I'd rather include only the column names here and move dtype support to another PR, which would itself be quite sizable.
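To make the difficulty concrete, here is a small sketch of one possible dtype-to-string round trip, assuming only simple (non-structured) numpy dtypes. It uses the `str` attribute of `numpy.dtype`, which carries byte order and item size. This is an illustration only, not what this PR implements, and it does not cover structured dtypes or pandas extension dtypes; the helper names are hypothetical.

```python
import json

import numpy as np


def dtype_to_str(dtype):
    """Encode a simple numpy dtype as a string such as '<f8' or '>i4'."""
    return np.dtype(dtype).str


def str_to_dtype(encoded):
    """Rebuild the dtype from the string produced by dtype_to_str."""
    return np.dtype(encoded)


dtypes = [
    np.dtype("float64"),
    np.dtype(">i4"),  # explicit big-endian 32-bit integer
    np.dtype("U10"),  # fixed-width unicode string
    np.dtype("bool"),
]

# Strings are the only JSON type that can sensibly carry this information.
encoded = json.dumps([dtype_to_str(d) for d in dtypes])
decoded = [str_to_dtype(s) for s in json.loads(encoded)]
```

Note that the object dtype would round-trip as well, but it says nothing about what the objects actually are, which is part of why a complete solution is sizable.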

adrinjalali · 2022-07-25T14:08:18Z

We need the ordered list of column names because on the inference side we get a dict whose columns are not necessarily in the same order as in the original data.
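As a rough sketch of that point, assuming the inference side receives a dict mapping column names to lists of values: the stored column list can be used to rebuild rows in the training order. The function name `order_rows` and the payload shape are assumptions made for illustration, not the api-inference-community code.

```python
def order_rows(payload, columns):
    """Turn a column-oriented dict into rows that follow the stored column order."""
    missing = [name for name in columns if name not in payload]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    lengths = {len(payload[name]) for name in columns}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of values")

    n_rows = lengths.pop() if lengths else 0
    return [[payload[name][i] for name in columns] for i in range(n_rows)]


# Column order as it would be stored in config.json.
columns = ["a", "b", "c"]
# The request arrives with keys in a different order.
payload = {"c": [3, 30], "a": [1, 10], "b": [2, 20]}

rows = order_rows(payload, columns)
```

Without the stored order, the model could receive the values of "c" where it expects "a", and nothing would fail loudly.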

BenjaminBossan · 2022-07-25T14:22:55Z

> Yes, that would be an idea, but that list doesn't include the string types, and it also doesn't include the little/big endian info. the dtype object actually includes quite a bit of information

Got it, then let's add this later. For my understanding, how are dtypes selected right now?

Looking forward, let's assume we add dtypes in a later step after the official release, would the current implementation ensure that the current config.json files would still work?

Note: I think some docstrings still refer to the dtype being stored, please check.

adrinjalali · 2022-07-25T16:43:22Z

> Got it, then let's add this later. For my understanding, how are dtypes selected right now?

What do you mean by selected? Where?

> Looking forward, let's assume we add dtypes in a later step after the official release, would the current implementation ensure that the current config.json files would still work?

Yes.

BenjaminBossan · 2022-07-26T07:59:27Z

> What do you mean by selected? Where?

I meant by the backend, referring to the original description.

> This PR adds functionality to store an example input and the input columns and their dtypes, so that the backend can better understand how to pass data to the model.

So for now, any possible dtype conversion has to be handled by the user and we defer providing functionality around it to a future PR. Is this understanding correct?

> Yes.

Okay, great.

adrinjalali

I think this is ready for a final review.

Once this is merged, we need to fix things up on the api-inference-community side.

adrinjalali · 2022-07-26T10:37:01Z

skops/card/_model_card.py

+    path,
+    card_data=None,


Are we okay with the change introduced here?

@merveenoyan ?

adrinjalali · 2022-07-26T10:44:04Z

> I meant by the backend, referring to the original description.

This is where input curation is happening: https://github.com/adrinjalali/api-inference-community/blob/7051256b61f82592216af5b4c7b11796a2fb7218/docker_images/sklearn/app/pipelines/tabular_classification.py#L45

That piece of code would need to use the info we're providing here, otherwise it's kinda giving the model very unreliable data.

> So for now, any possible dtype conversion has to be handled by the user and we defer providing functionality around it to a future PR. Is this understanding correct?

Yes, or for now we'll see how it works.

BenjaminBossan · 2022-07-26T11:32:53Z

> That piece of code would need to use the info we're providing here, otherwise it's kinda giving the model very unreliable data.

Yes, makes sense, thanks for the pointer.

adrinjalali · 2022-07-26T15:20:33Z

Due to merge conflicts, #37 should be merged before this one.

adrinjalali · 2022-07-28T13:17:14Z

I've removed the changes to the model card here, and will open another PR specifically for that, to make this PR smaller.

This is ready for a review @BenjaminBossan

BenjaminBossan · 2022-07-28T14:41:00Z

@adrinjalali The codecov check fails. Should we ignore it, fix it, or remove the check entirely?

adrinjalali · 2022-07-28T14:48:28Z

I wouldn't remove it. Of the two checks, the patch one is more important than the project one. It fails because there are certain parts, like _min_dependencies.py, which are not tested. I find it quite useful, but I see it as a guideline rather than something which has to be green before we merge. I actually use the patch report to make sure new code is covered as much as possible.

adrinjalali · 2022-07-28T14:59:13Z

Already failing due to mypy, don't like this. Gonna check tomorrow.

BenjaminBossan · 2022-07-28T15:36:26Z

> Already failing due to mypy, don't like this. Gonna check tomorrow.

Ah, that stings, but it's pretty harmless, probably failing due to missing imports after merging. I changed some instances of List[...] to list[...] because it's nicer IMO, but it does require the __future__ import for Python 3.7.

BenjaminBossan

LGTM overall.

I have a few comments which are more of an aesthetic nature, and one about what kind of data we want to allow for text tasks, where the current implementation might be too restrictive.

BenjaminBossan · 2022-07-29T12:33:16Z

skops/hub_utils/tests/test_hf_hub.py

 from skops.hub_utils.tests.common import HF_HUB_TOKEN
 from skops.utils.fixes import metadata, path_unlink

+iris = load_iris(as_frame=True, return_X_y=False)


How about putting this into a fixture? (Would require splitting up test_create_config as suggested below)

Data was in a fixture, but we can't put fixtures in pytest.mark.parametrize, so I moved it to a variable. I like it this way, easy to follow/understand as well. Fixtures can make the code look too much like magic sometimes.

It is possible to achieve this effect using the request fixture but it's a bit difficult to understand.

Yes, I saw that; I'd rather keep things simple.

skops/hub_utils/_hf_hub.py

BenjaminBossan · 2022-07-29T12:38:28Z

skops/hub_utils/tests/test_hf_hub.py

+        ),
+    ],
+)
+def test_create_config(data, task, expected_config):


Probably a matter of taste, but I had a bit of a hard time parsing this test. It would have probably been easier if it was broken down into 2 or 3 tests: one for tabular and one for text classification. The expected_config could be put into a fixture if it would otherwise have to be duplicated.

There are going to be more configs, so I'd rather not put it in a fixture. I also don't really like repeating tests; I'd rather parametrize tests as much as we can, which makes it easier to add new ones.

skops/hub_utils/tests/test_hf_hub.py

BenjaminBossan · 2022-07-29T12:47:54Z

skops/hub_utils/_hf_hub.py

+        config["sklearn"]["example_input"] = _get_example_input(data)
+        config["sklearn"]["columns"] = _get_column_names(data)
+    elif "text" in task:
+        if isinstance(data, list) and all(isinstance(x, str) for x in data):


According to CountVectorizer et al., the input data needs to be an "iterable", and the docstring here says "array-like". Therefore, a user may pass, for instance, a numpy array, and it should work fine. Do we still want to constrain the type to be a list here?

Say we don't require lists: the code below assumes that data is a Sequence (i.e. can be indexed). If we don't want to make that assumption, we could go for list(itertools.islice(data, 3)).

Yes, I know we're being very restrictive with the types we accept here. We also don't accept all array-like things accepted by sklearn. Adding support for those is nice, and necessary, but I was hoping to have that in a future PR and have something basic here, which doesn't necessarily limit users. They can still pass a list to this method even if their data is originally in a Series form.

Opened #65 to track this.
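As a sketch of the itertools.islice suggestion above: taking the first three items this way works for any iterable, including generators and tuples, whereas slicing with data[:3] needs an indexable sequence. The helper name `first_examples` and its validation are hypothetical and only illustrate the idea; they are not the code merged in this PR.

```python
from itertools import islice


def first_examples(data, n=3):
    """Return up to n leading items of any iterable of strings."""
    if isinstance(data, str):
        # A bare string is iterable too, but would yield single characters.
        raise TypeError("expected an iterable of strings, not a single string")

    examples = list(islice(data, n))
    if not all(isinstance(x, str) for x in examples):
        raise ValueError("all examples must be strings")
    return examples


from_list = first_examples(["a", "b", "c", "d"])
from_tuple = first_examples(("only", "two"))
from_generator = first_examples(f"doc {i}" for i in range(10))
```

One trade-off of this approach is that a one-shot iterator is partially consumed by the call, so the caller should not expect to reuse it afterwards.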

adrinjalali · 2022-08-01T07:57:57Z

Ping @BenjaminBossan , can we merge this now?

ENH add examples and dtypes to CardData and config.json

0bbaf42

adrinjalali requested review from merveenoyan and BenjaminBossan July 20, 2022 15:43

BenjaminBossan reviewed Jul 21, 2022

merveenoyan reviewed Jul 21, 2022

adrinjalali mentioned this pull request Jul 22, 2022

ENH Refactor Model Card #37

Merged

adrinjalali added 2 commits July 22, 2022 14:07

address some feedback

65b191c

Merge remote-tracking branch 'upstream/main' into dtypes

c7a742a

adrinjalali commented Jul 22, 2022

adrinjalali commented Jul 25, 2022

adrinjalali added 7 commits July 25, 2022 12:39

address more comments

475bc43

Merge remote-tracking branch 'upstream/main' into dtypes

c1f3a10

improve tests

d85a7f4

fix model card tests

01c507b

add typing_extensions to python 3.7

4b6b9c7

add comment

a9bc358

fix model card example

95f47a2

adrinjalali mentioned this pull request Jul 25, 2022

Pandas: hard or soft dependency #53

Closed

add dtype and get_example tests

a5c02f6

adrinjalali commented Jul 25, 2022

remove dtypes

b20b939

adrinjalali marked this pull request as ready for review July 25, 2022 14:08

adrinjalali added 3 commits July 26, 2022 12:29

improve coverage

080af4e

Merge remote-tracking branch 'upstream/main' into dtypes

0694caf

remove dtype from docstring

a35f8a8

adrinjalali commented Jul 26, 2022

adrinjalali mentioned this pull request Jul 26, 2022

MNT make init atomic #60

Merged

adrinjalali added 2 commits July 28, 2022 15:08

Merge remote-tracking branch 'upstream/main' into dtypes

240211e

fix sphinx import

1d63e35

fix plot_model_card init call

cee60eb

Merge branch 'main' into dtypes

85713c4

make mypy shut up

b056cb8

adrinjalali requested a review from BenjaminBossan July 29, 2022 12:18

BenjaminBossan reviewed Jul 29, 2022

adrinjalali mentioned this pull request Jul 29, 2022

Support more array-like and list-like string-like data #65

Closed

rename test

2c174c3

BenjaminBossan approved these changes Aug 1, 2022

BenjaminBossan merged commit cebe9be into skops-dev:main Aug 1, 2022

adrinjalali deleted the dtypes branch August 1, 2022 08:40
