ENH add examples and dtypes to CardData and config.json #45
Conversation
@merveenoyan you can see the kind of change I was suggesting on the model card side in this PR.
Good addition, thanks.
I have a few minor comments and questions, please take a look. Apart from that, could you please add unit tests to check if the dtypes are correctly determined and stored?
skops/card/_model_card.py
Outdated
# Load the model from the existing directory.
config = get_config(path)
model_path = Path(path) / config["sklearn"]["model"]["file"]
with open(model_path, "rb") as f:
This limits the user, IMO.
I'm not sure how it's limiting the user though, can you explain?
Some conversations are still open, need to address those before I add tests.
I'm still not clear what you mean here @merveenoyan : https://github.com/skops-dev/skops/pull/45/files#r926549925
The tests fail because persisting dtypes as strings is not trivial. I'm thinking of removing them from this PR and marking it as future work. WDYT @BenjaminBossan ?
    if isinstance(data, pd.DataFrame):
        return {x: data[x][:3].to_list() for x in data.columns}
except ImportError:
See #53 for discussion on pandas dependency.
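The extraction quoted above takes the first three rows of each DataFrame column as the example input. A standalone sketch of that behavior (the column names and values here are illustrative, not from the PR):

```python
import pandas as pd

# Build a small frame and extract the first three rows per column,
# mirroring the dict comprehension quoted in the diff above.
df = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9, 4.7, 4.6],
    "petal width (cm)": [0.2, 0.2, 0.2, 0.2],
})
example = {col: df[col][:3].to_list() for col in df.columns}
```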
Let me try to understand the problem: We would like to encode the numpy dtype in a JSON. The only sensible JSON type would be string. Therefore, we need a mapping from dtype to str and vice versa. Can we use this collection? I checked and numpy dtypes are hashable, so we could create a dict.
Yes, that would be an idea, but that list doesn't include the string types, and it also doesn't include the little/big endian info. The dtype object actually includes quite a bit of information: https://numpy.org/doc/stable/reference/arrays.dtypes.html I was motivated to include this partly because we need to have a list with the order of the columns, and I thought I could as well include the dtypes. But now with the complexity I see, I'd rather only include the column names here, and move the dtype support to another PR, which itself would be quite a sizable PR.
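For context on the round-trip being discussed: numpy dtypes can be serialized through their string form, and the `.str` attribute is the variant that keeps the byte-order character. A minimal sketch (not skops code):

```python
import numpy as np

# dt.str keeps the byte-order marker ('<' little-endian, '>' big-endian,
# '|' not applicable), while the plain type name drops it -- this is the
# endianness information mentioned in the comment above.
dt = np.dtype(np.float64)
encoded = dt.str            # e.g. '<f8' on little-endian platforms
decoded = np.dtype(encoded) # round-trips back to the same dtype
```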
Got it, then let's add this later. For my understanding, how are dtypes selected right now? Looking forward, let's assume we add dtypes in a later step after the official release, would the current implementation ensure that the current config.json files would still work? Note: I think some docstrings still refer to the dtype being stored, please check.
What do you mean by selected? Where?
Yes.
I meant by the backend, referring to the original description.
So for now, any possible dtype conversion has to be handled by the user and we defer providing functionality around it to a future PR. Is this understanding correct?
Okay, great.
I think this is ready for a final review.
Once this is merged, we need to fix things up on the api-inference-community side.
skops/card/_model_card.py
Outdated
path,
card_data=None,
Are we okay with the change introduced here?
This is where input curation is happening: https://github.com/adrinjalali/api-inference-community/blob/7051256b61f82592216af5b4c7b11796a2fb7218/docker_images/sklearn/app/pipelines/tabular_classification.py#L45 That piece of code would need to use the info we're providing here, otherwise it's kinda giving the model very unreliable data.
Yes, or for now we'll see how it works.
Yes, makes sense, thanks for the pointer.
Due to merge conflicts, #37 should be merged before this one.
I've removed the changes to the model card here, and will open another PR specifically for that, to make this PR smaller. This is ready for a review @BenjaminBossan
@adrinjalali The codecov check fails. Should we ignore it, fix it, or remove the check entirely?
I wouldn't remove it, and the one which is more important is the
Already failing due to mypy, don't like this. Gonna check tomorrow.
Ah, that stings, but it's pretty harmless, probably failing due to missing imports after merging. I changed some instances of
LGTM overall.
I have a few comments which are more of an aesthetic nature, and one about what kind of data we want to allow for text tasks, where the current implementation might be too restrictive.
from skops.hub_utils.tests.common import HF_HUB_TOKEN
from skops.utils.fixes import metadata, path_unlink

iris = load_iris(as_frame=True, return_X_y=False)
How about putting this into a fixture? (Would require splitting up test_create_config as suggested below.)
Data was in a fixture, but we can't put fixtures in pytest.mark.parametrize, so I moved it to a variable. I like it this way; it's easy to follow/understand as well. Fixtures can make the code look too much like magic sometimes.
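The approach described here, with module-level data variables fed directly into parametrize, might look roughly like this (the data values and task names are illustrative, not the actual test contents):

```python
import pytest

# Module-level data stands in for what used to be a fixture, so it can
# be referenced inside pytest.mark.parametrize.
IRIS_LIKE = {"columns": ["sepal length (cm)", "petal width (cm)"]}
TEXT_LIKE = ["some text", "more text", "even more text"]

@pytest.mark.parametrize(
    "data, task",
    [
        (IRIS_LIKE, "tabular-classification"),
        (TEXT_LIKE, "text-classification"),
    ],
)
def test_create_config(data, task):
    # Placeholder body: a real test would build and check the config.
    assert data is not None
```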
It is possible to achieve this effect using the request fixture, but it's a bit difficult to understand.
Yes, I saw that; I'd rather have things simple.
        ),
    ],
)
def test_create_config(data, task, expected_config):
Probably a matter of taste, but I had a bit of a hard time parsing this test. It would have probably been easier if it was broken down into 2 or 3 tests: one for tabular and one for text classification. The expected_config could be put into a fixture if it would otherwise have to be duplicated.
There are going to be more configs, so I'd rather not put them in a fixture. Also, I don't really like repeating tests; I'd rather parametrize tests as much as we can, which makes it easier to add new tests.
    config["sklearn"]["example_input"] = _get_example_input(data)
    config["sklearn"]["columns"] = _get_column_names(data)
elif "text" in task:
    if isinstance(data, list) and all(isinstance(x, str) for x in data):
According to CountVectorizer et al., the input data needs to be an "iterable", and the docstring here says "array-like". Therefore, a user may pass, for instance, a numpy array, and it should work fine. Do we still want to constrain the type to be a list here? Say we don't require lists: the code below assumes that data is a Sequence (can be indexed). If we don't want to make that assumption, we could go for list(itertools.islice(data, 3)).
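The islice suggestion works for any iterable, including generators that don't support indexing. A minimal sketch of that alternative (the helper name is hypothetical):

```python
import itertools

def head(data, n=3):
    # Unlike data[:n], this works for any iterable, not just sequences
    # that support slicing/indexing (e.g. generators, sets).
    return list(itertools.islice(data, n))
```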
Yes, I know we're being very restrictive with the types we accept here. We also don't accept all array-like things accepted by sklearn. Adding support for those is nice, and necessary, but I was hoping to have that in a future PR and have something basic here, which doesn't necessarily limit users. They can still pass a list to this method even if their data is originally in a Series form.
Opened #65 to track this.
Ping @BenjaminBossan , can we merge this now?
This PR adds functionality to store an example input and the input columns and their dtypes, so that the backend can better understand how to pass data to the model.
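Putting the pieces together, the config layout implied by the keys used throughout this PR might look roughly like this. This is a hedged sketch inferred from the quoted diff hunks; the model file name and column names are hypothetical, and dtypes are deliberately absent since they were deferred to a later PR.

```python
import json

# Key layout inferred from config["sklearn"]["model"]["file"],
# config["sklearn"]["example_input"], and config["sklearn"]["columns"]
# as seen in the diffs above; all concrete values are illustrative.
config = {
    "sklearn": {
        "model": {"file": "model.pickle"},
        "example_input": {"sepal length (cm)": [5.1, 4.9, 4.7]},
        "columns": ["sepal length (cm)"],
    }
}
serialized = json.dumps(config, indent=2)
```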