Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH add examples and dtypes to CardData and config.json #45

Merged
merged 21 commits into from
Aug 1, 2022

Conversation

adrinjalali
Copy link
Member

This PR adds functionality to store an example input and the input columns and their dtypes, so that the backend can better understand how to pass data to the model.

@adrinjalali
Copy link
Member Author

adrinjalali commented Jul 20, 2022

@merveenoyan you can see the kind of change I was suggesting on the model card side in this PR.

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good addition, thanks.

I have a few minor comments and questions, please take a look. Apart from that, could you please add unit tests to check if the dtypes are correctly determined and stored?

skops/card/_model_card.py Outdated Show resolved Hide resolved
skops/hub_utils/_hf_hub.py Show resolved Hide resolved
skops/hub_utils/_hf_hub.py Outdated Show resolved Hide resolved
skops/hub_utils/_hf_hub.py Outdated Show resolved Hide resolved
skops/hub_utils/_hf_hub.py Outdated Show resolved Hide resolved
skops/hub_utils/_hf_hub.py Outdated Show resolved Hide resolved
skops/hub_utils/_hf_hub.py Outdated Show resolved Hide resolved
skops/hub_utils/_hf_hub.py Show resolved Hide resolved
# Load the model from the existing directory.
config = get_config(path)
model_path = Path(path) / config["sklearn"]["model"]["file"]
with open(model_path, "rb") as f:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this limits the user imo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how it's limiting the user though, can you explain?

Copy link
Member Author

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some conversations are still open, need to address those before I add tests.

examples/plot_hf_hub.py Show resolved Hide resolved
skops/card/_model_card.py Outdated Show resolved Hide resolved
# Load the model from the existing directory.
config = get_config(path)
model_path = Path(path) / config["sklearn"]["model"]["file"]
with open(model_path, "rb") as f:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how it's limiting the user though, can you explain?

skops/card/_model_card.py Outdated Show resolved Hide resolved
skops/hub_utils/_hf_hub.py Show resolved Hide resolved
skops/hub_utils/_hf_hub.py Show resolved Hide resolved
skops/hub_utils/_hf_hub.py Outdated Show resolved Hide resolved
skops/hub_utils/_hf_hub.py Outdated Show resolved Hide resolved
skops/hub_utils/_hf_hub.py Outdated Show resolved Hide resolved
skops/hub_utils/_hf_hub.py Show resolved Hide resolved
Copy link
Member Author

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skops/card/_model_card.py Outdated Show resolved Hide resolved
skops/hub_utils/_hf_hub.py Show resolved Hide resolved
skops/hub_utils/_hf_hub.py Show resolved Hide resolved
Copy link
Member Author

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests fail because persisting dtypes as strings is not trivial. I'm thinking of removing them from this PR and marking it as future work. WDYT @BenjaminBossan ?


if isinstance(data, pd.DataFrame):
return {x: data[x][:3].to_list() for x in data.columns}
except ImportError:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #53 for discussion on pandas dependency.

@BenjaminBossan
Copy link
Collaborator

The tests fail because persisting dtypes as strings is not trivial. I'm thinking of removing them from this PR and marking it as future work.

Let me try to understand the problem: We would like to encode the numpy dtype in a JSON. The only sensible JSON type would be string. Therefore, we need a mapping from dtype to str and vice versa. Can we use this collection? I checked and numpy dtypes are hashable, so we could create a dict.

@adrinjalali
Copy link
Member Author

Yes, that would be an idea, but that list doesn't include the string types, and it also doesn't include the little/big endian info. the dtype object actually includes quite a bit of information: https://numpy.org/doc/stable/reference/arrays.dtypes.html

I was motivated to include this partly because we need to have a list with the order of the columns, and I thought I could as well include the dtypes.

But now with the complexity I see, I rather only include the columns names here, and move the dtype support to another PR, which itself would be quite a sizable PR.

@adrinjalali
Copy link
Member Author

  • we need the list of column names with the order because on the inference side we get a dict which is not necessarily with columns in the same order as the original data.

@adrinjalali adrinjalali marked this pull request as ready for review July 25, 2022 14:08
@BenjaminBossan
Copy link
Collaborator

Yes, that would be an idea, but that list doesn't include the string types, and it also doesn't include the little/big endian info. the dtype object actually includes quite a bit of information

Got it, then let's add this later. For my understanding, how are dtypes selected right now?

Looking forward, let's assume we add dtypes in a later step after the official release, would the current implementation ensure that the current config.json files would still work?

Note: I think some docstrings still refer to the dtype being stored, please check.

@adrinjalali
Copy link
Member Author

Got it, then let's add this later. For my understanding, how are dtypes selected right now?

What do you mean by selected? Where?

Looking forward, let's assume we add dtypes in a later step after the official release, would the current implementation ensure that the current config.json files would still work?

Yes.

@BenjaminBossan
Copy link
Collaborator

What do you mean by selected? Where?

I meant by the backend, referring to the original description.

This PR adds functionality to store an example input and the input columns and their dtypes, so that the backend can better understand how to pass data to the model.

So for now, any possible dtype conversion has to be handled by the user and we defer providing functionality around it to a future PR. Is this understanding correct?

Yes.

Okay, great.

Copy link
Member Author

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ready for a final review.

Once this is merged, we need to fix things up on the api-inference-community side.

Comment on lines 34 to 35
path,
card_data=None,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we okay with the change introduced here?

@merveenoyan ?

@adrinjalali
Copy link
Member Author

I meant by the backend, referring to the original description.

This is where input curation is happening: https://github.com/adrinjalali/api-inference-community/blob/7051256b61f82592216af5b4c7b11796a2fb7218/docker_images/sklearn/app/pipelines/tabular_classification.py#L45

That piece of code would need to use the info we're providing here, otherwise it's kinda giving the model very unreliable data.

So for now, any possible dtype conversion has to be handled by the user and we defer providing functionality around it to a future PR. Is this understanding correct?

yes, or for now we'll see how it works.

@BenjaminBossan
Copy link
Collaborator

That piece of code would need to use the info we're providing here, otherwise it's kinda giving the model very unreliable data.

Yes, makes sense, thanks for the pointer.

@adrinjalali
Copy link
Member Author

Due to merge conflicts, #37 should be merged before this one.

@adrinjalali adrinjalali mentioned this pull request Jul 26, 2022
@adrinjalali
Copy link
Member Author

I've removed the changes to the model card here, and will open another PR specifically for that, to make this PR smaller.

This is ready for a review @BenjaminBossan

@BenjaminBossan
Copy link
Collaborator

@adrinjalali The codecov check fails. Should we ignore it, fix it, or remove the check entirely?

@adrinjalali
Copy link
Member Author

I wouldn't remove it, and the one which is more important is the patch rather than project. It fails because there are certain parts, like _min_dependencies.py which are not tested. I find it quite useful, but I see it as a guideline rather than something which has to be green before we merge. I actually use the patch report to make sure new code is covered as much as we can.

@adrinjalali
Copy link
Member Author

Already failing due to mypy, don't like this. Gonna check tomorrow.

@BenjaminBossan
Copy link
Collaborator

BenjaminBossan commented Jul 28, 2022

Already failing due to mypy, don't like this. Gonna check tomorrow.

Ah, that stings, but it's pretty harmless, probably failing due to missing imports after merging. I changed some instances of List[...] to list[...] because it's nicer IMO, but it does require the __future__ import for Python 3.7.

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall.

I have a few comments which are more of aesthetic nature, and one about what kind of data we want to allow for text tasks, where the current implementation might be too restrictive.

from skops.hub_utils.tests.common import HF_HUB_TOKEN
from skops.utils.fixes import metadata, path_unlink

iris = load_iris(as_frame=True, return_X_y=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about putting this into a fixture? (Would require splitting up test_create_config as suggested below)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data was in a fixture, but we can't put fixtures in pytest.mark.parametrize, so I moved it to a variable. I like it this way, easy to follow/understand as well. Fixtures can make the code look too much like magic sometimes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to achieve this effect using the request fixture but it's a bit difficult to understand.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I saw that, rather have things simple.

skops/hub_utils/_hf_hub.py Show resolved Hide resolved
),
],
)
def test_create_config(data, task, expected_config):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a matter of taste, but I had a bit of a hard time parsing this test. It would have probably been easier if it was broken down into 2 or 3 tests: one for tabular and one for text classification. The expected_config could be put into a fixture if it would otherwise have to be duplicated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are going to be more configs, so rather not put it in a fixture. Also don't really like repeating tests, I rather parametrize tests as much as we can, makes it easier to add new tests.

skops/hub_utils/tests/test_hf_hub.py Outdated Show resolved Hide resolved
config["sklearn"]["example_input"] = _get_example_input(data)
config["sklearn"]["columns"] = _get_column_names(data)
elif "text" in task:
if isinstance(data, list) and all(isinstance(x, str) for x in data):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to CountVectorizer et al, the input data needs to be an "iterable", and the docstring here says "array-like". Therefore, a user may pass, for instance, a numpy array and should work fine. Do we still want to constrain the type to be a list here?

Say we don't require lists, the code below assumes that data is Sequence (can be index). If we don't want to make that assumption, we could go for list(itertools.islice(data, 3))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I know we're being very restrictive with the types we accept here. We also don't accept all array-like things accepted by sklearn. Adding support for those is nice, and necessary, but I was hoping to have that in a future PR and have something basic here, which doesn't necessarily limit users. They can still pass a list to this method even if their data is originally in a Series form.

Opened #65 to track this.

@adrinjalali
Copy link
Member Author

Ping @BenjaminBossan , can we merge this now?

@BenjaminBossan BenjaminBossan merged commit cebe9be into skops-dev:main Aug 1, 2022
@adrinjalali adrinjalali deleted the dtypes branch August 1, 2022 08:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants