-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate minimal README.md file in repository initialization (issue #113) #207
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @jucamohedano
skops/hub_utils/_hf_hub.py
Outdated
@@ -213,6 +213,83 @@ def recursively_default_dict() -> MutableMapping: | |||
dump_json(Path(dst) / "config.json", config) | |||
|
|||
|
|||
def _create_readme( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this function.
We can probably simply add such two lines to the init
:
model_card = card.Card(model, metadata=card.metadata_from_config(Path(local_repo)))
model_card.save(Path(local_repo) / "README.md")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That simplifies it a lot! I applied the change :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot, I left a very general feedback 🙂
skops/hub_utils/tests/test_hf_hub.py
Outdated
data=iris.data, | ||
) | ||
_validate_folder(path=dir_path) | ||
assert os.path.isfile(Path(dir_path) / "README.md") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should simply add it under test_init
to deduplicate (essentially everything before this step is same and this behavior will be a part of init anyway)? WDYT? @adrinjalali @jucamohedano
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, this specific one can be in test_init
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. Made the change.
skops/hub_utils/tests/test_hf_hub.py
Outdated
t1 = os.path.getmtime(Path(dir_path) / "README.md") | ||
|
||
# compare the times at which the files were last modified | ||
assert t0 != t1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of this we can render the content of model card and compare whether they're same or not, WDYT? @adrinjalali @jucamohedano
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy either way!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to make sure content is modified so it would be nice to have the former.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the change :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jucamohedano , it's almost ready to be merged :)
skops/hub_utils/tests/test_hf_hub.py
Outdated
@@ -395,7 +429,6 @@ def repo_path_for_inference(): | |||
|
|||
|
|||
@pytest.mark.network | |||
@pytest.mark.inference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is probably removed by mistake
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I removed that somehow by mistake. I added it back.
skops/hub_utils/tests/test_hf_hub.py
Outdated
# test inference backend for classifier and regressor models. This test can | ||
# take a lot of time and be flaky. | ||
# test inference backend for classifier and regressor models. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again! I added it back :)
@BenjaminBossan any idea why the CI fails here? It's really odd to me. |
I didn't follow this PR along. Is the expectation here that the error should be caught by the |
Oh I see where the issue comes from. The error is raised inside this test: So far, we have allowed people to pass an empty model file, we just raise a warning if they do so. Now, this PR tries to actually load the model file and do things with it, which means it expects it to be a valid model file. So, @BenjaminBossan , the question is: do we allow people to init a repo with an invalid model file? I'm leaning towards raising if the model file doesn't contain a valid model. WDYT? |
I would be okay with that, don't really see a use case where you'd start out without a model. |
#214 should fix the issue. |
@jucamohedano note that I've merge with latest |
@jucamohedano would you be able to debug the issue? |
One thing to add. In my last commit I have added a fix to the file reading logic in |
As a user, I would be surprised to find out that Overall, this has become much more complicated than I initially expected. I wondered if we could just use a dummy model here, since the model card needs to be later edited by the user anyway, but I think this could also be confusing. So in sum, I'm not sure what the best way is. @adrinjalali what would you suggest? |
These are all good points. It makes me think that probably the best way is to get a model object and the path to where it's supposed to have been saved, and not load the model in |
I agree with the suggestion from @adrinjalali, otherwise it looks like we would have to hard code the |
Yes, that might be the better solution. I'm not 100% sure if there might not be new issues with that approach, but it's worth trying. |
Maybe we run into some issue, but we'll see. Shall we keep working on this PR or a new open one? |
Do whatever you feel more comfortable with. |
I have written an implementation to test in my local setup where I have introduced a new argument to the I will remove unnecessary changes from the diff in future commits! |
I would prefer for the change to not be backwards incompatible, i.e. all existing tests should still pass. I have a suggestion, not sure if there is something I missed that would invalidate it, let me know: The |
That doesn't sound too bad to me, at some point we might want to deprecate passing a path to the model maybe. |
I totally agree with the implementation and that's why I decided to go ahead and implement it. |
Hi, I'm not sure what to do about the failing example in the docs build. Since
|
Maybe I misunderstand your code, but it seems like it doesn't do what we discussed, resulting in the error? IIUC,
|
…f model is an estimator to create README
Thank you for pointing that out @BenjaminBossan, I realised the code wasn't doing what was discussed. I reverted back to the behaviour of creating a README if model is a str/Path, otherwise handle it if model is an estimator object. I believe the code should be doing that now. However, docs still fail to build on the example Mistakenly closed the PR while scrolling. |
Thanks for the updates @jucamohedano. This is not a full on review, as I think there are still some bigger changes needed. Most notably, I feel like
Maybe I misunderstand, but shouldn't it be the other way round? What I mean is that the README.md created by the example is the one that contains all the important information and should thus take precedence over the README.md that was automatically created. |
I will continue working on the issue starting next week and give some feedback on my work. Thank you for the feedback! |
Hey! Thanks for the suggestion about
Thanks for making it clear. The call to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for providing updates to bring this PR closer to be merged. I have made a few comments and had some questions for clarification, please take a look.
We could address this problem in hub_utils.add_files to account for the existing file created by the call to hub_utils.init.
add_files
already has an option, exist_ok
, which can be set to True
to allow overriding.
model = _load_model(model, trusted=True) | ||
model_card = card.Card(model, metadata=card.metadata_from_config(dst)) | ||
model_card.save(dst / "README.md") | ||
elif isinstance(model, BaseEstimator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need no check for BaseEstimator
here, so this can just be else
and the error below can be removed. There could be valid models here that don't inherit from BaseEstimator
. It is the user's responsibility to provide an sklearn-compatible model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, got it!
if model_format == "auto": | ||
model_format = "skops" | ||
elif model_format in ["pkl", "pickle", "joblib"]: | ||
model_format = "pickle" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need an else
clause below because if the user passes model_format="skosp"
or another typo, that would just be accepted as is, even though it is invalid. Then further below, the model would not be saved without any indication that something went wrong.
Essentially, the model format checking logic should be the same as inside _create_config
. And since checking the model format is new performed in init
, we should also be able to remove it from _create_config
completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, I can remove that logic from _create_config
and have it just in init
. I actually started this PR by looking at _create_config
. Let me know if you would like to make this change now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that change should be good.
|
||
version = metadata.version("scikit-learn") | ||
_, model_format = config_json | ||
# joblib type falls unders auto format, explicityly set to auto |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# joblib type falls unders auto format, explicityly set to auto | |
# joblib type falls under auto format, explicitly set to auto |
version = metadata.version("scikit-learn") | ||
_, model_format = config_json | ||
# joblib type falls unders auto format, explicityly set to auto | ||
# because we can't repeat key "auto" in CONFIG dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't understand that, could you please explain further?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, that comment isn't self-explainable. The dictionary CONFIG
in test_hf_hub
contains 3 types of models. I wanted to test for a model with name model.joblib
, so I added that model to the dictionary. If I recall correctly, adding the key:value pair for the joblib model made other tests fail. But I could fix this by changing the joblib
model type auto
and still have the name model.joblib
to test for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I guess it depends on why the other tests fail. If this uncovers a bug with existing code, that would be good to know. Otherwise, I think it's okay to have a separate test for joblib and not changing CONFIG
.
model_card = RepoCard.load(Path(dir_path) / "README.md") | ||
model_card.data.license | ||
|
||
# override existent modelcard created by init with license attribute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my better understanding, is this testing some new behavior added by this PR or some general behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is not a new behavior added by this PR. Now that you call this out I see that maybe we can make this test simpler by checking that the README.md exists after the call to init
and that's it? I think that at the time of writing the test, I decided to update the README.md as a way of checking that it exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can make this test simpler by checking that the README.md exists after the call to init and that's it?
Yes, that sounds like it is sufficient as a test.
This PR references issue #113
Implemented the function
_create_readme
which is called inhub_utils.init
. It creates a minimal README.md file. The input to this function is the same as the input to_create_config
function.At the moment, the function
_create_readme
only supports atabular
task for the widget. I haven't added thetext
task since other functions such asmetadata_from_config
in_model_card.py
don't give support for thetext
task yet. Let me know if this can be added otherwise.I have manually tested the initialization of a repository with the example script
plot_hf_hub.py
and it seemed to work as expected.