Generate minimal README.md file in repository initialization (issue #113) #207

Open — wants to merge 40 commits into base: main
Conversation

@jucamohedano (Contributor)

This PR references issue #113

Implemented the function _create_readme, which is called in hub_utils.init and creates a minimal README.md file. Its input is the same as that of the _create_config function.
At the moment, _create_readme only supports a tabular task for the widget. I haven't added the text task, since other functions such as metadata_from_config in _model_card.py don't support the text task yet. Let me know if it should be added anyway.

I have manually tested the initialization of a repository with the example script plot_hf_hub.py and it seemed to work as expected.

@adrinjalali (Member) left a comment

Thanks for the PR @jucamohedano

@@ -213,6 +213,83 @@ def recursively_default_dict() -> MutableMapping:
dump_json(Path(dst) / "config.json", config)


def _create_readme(
Member:

I don't think we need this function.

We can probably simply add these two lines to init:

model_card = card.Card(model, metadata=card.metadata_from_config(Path(local_repo)))
model_card.save(Path(local_repo) / "README.md")

Contributor Author:

That simplifies it a lot! I applied the change :)

@merveenoyan (Collaborator) left a comment

Thanks a lot, I left a very general feedback 🙂

skops/hub_utils/tests/test_hf_hub.py
data=iris.data,
)
_validate_folder(path=dir_path)
assert os.path.isfile(Path(dir_path) / "README.md")
Collaborator:

I wonder if we should simply add it under test_init to deduplicate (essentially everything before this step is the same, and this behavior will be part of init anyway)? WDYT? @adrinjalali @jucamohedano

Member:

That's true, this specific one can be in test_init.

Contributor Author:

I agree. Made the change.

t1 = os.path.getmtime(Path(dir_path) / "README.md")

# compare the times at which the files were last modified
assert t0 != t1
Collaborator:

I think instead of this we can render the content of the model card and compare whether it's the same or not, WDYT? @adrinjalali @jucamohedano

Member:

I'm happy either way!

Collaborator:

We want to make sure content is modified so it would be nice to have the former.
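The change the reviewers settled on can be sketched roughly like this. This is a hedged stand-in, not the actual skops test: `render_card` is a hypothetical helper that simply reads the file, whereas the real test would render via skops' card API. It compares rendered content before and after a second initialization, instead of comparing file modification times.

```python
# Hedged sketch: compare rendered model card content, not mtimes.
from pathlib import Path
import tempfile

def render_card(path: Path) -> str:
    # Stand-in for rendering a model card; the real test would use skops' card API.
    return path.read_text()

with tempfile.TemporaryDirectory() as dir_path:
    readme = Path(dir_path) / "README.md"
    readme.write_text("# model card v1\n")
    content_before = render_card(readme)
    # Simulate init() overwriting the card on a second call.
    readme.write_text("# model card v2\n")
    content_after = render_card(readme)

assert content_before != content_after  # content, not timestamps, proves modification
```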

Contributor Author:

I made the change :)

@adrinjalali (Member) left a comment

Thanks @jucamohedano , it's almost ready to be merged :)

@@ -395,7 +429,6 @@ def repo_path_for_inference():


@pytest.mark.network
@pytest.mark.inference
Member:

this is probably removed by mistake

Contributor Author:

Thank you! I removed that somehow by mistake. I added it back.

Comment on lines 418 to 451
# test inference backend for classifier and regressor models. This test can
# take a lot of time and be flaky.
# test inference backend for classifier and regressor models.
Member:

this change as well

Contributor Author:

Thanks again! I added it back :)

@adrinjalali (Member)

@BenjaminBossan any idea why the CI fails here? It's really odd to me.

@BenjaminBossan (Collaborator)

any idea why the CI fails here? It's really odd to me.

I didn't follow this PR along. Is the expectation here that the error should be caught by the try?

@adrinjalali (Member)

Oh I see where the issue comes from. The error is raised inside this test: test_init_empty_model_file_warns

So far, we have allowed people to pass an empty model file, we just raise a warning if they do so.

Now, this PR tries to actually load the model file and do things with it, which means it expects it to be a valid model file.

So, @BenjaminBossan , the question is: do we allow people to init a repo with an invalid model file? I'm leaning towards raising if the model file doesn't contain a valid model. WDYT?

@BenjaminBossan (Collaborator)

do we allow people to init a repo with an invalid model file? I'm leaning towards raising if the model file doesn't contain a valid model. WDYT?

I would be okay with that, don't really see a use case where you'd start out without a model.
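The behavior agreed on here — raise instead of warn when the model file does not contain a valid model — could look roughly like the following. This is a hedged sketch: `load_model_or_raise` is a hypothetical helper, not the skops API (the actual fix landed in #214).

```python
# Hedged sketch: reject an empty/invalid model file at init time.
from pathlib import Path
import pickle
import tempfile

def load_model_or_raise(model_path: Path):
    if model_path.stat().st_size == 0:
        raise ValueError(
            f"Model file '{model_path.name}' is empty; a valid model is required."
        )
    with open(model_path, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as d:
    empty_file = Path(d) / "model.pkl"
    empty_file.touch()  # an empty file, as in test_init_empty_model_file_warns
    try:
        load_model_or_raise(empty_file)
        raised = False
    except ValueError:
        raised = True
```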

@adrinjalali (Member)

#214 should fix the issue.

@adrinjalali (Member)

@jucamohedano note that I've merged the latest main; you'd need to do a git pull before applying more changes, so that this change is applied locally on your machine.

@adrinjalali (Member)

@jucamohedano would you be able to debug the issue?

@jucamohedano (Contributor Author)

One thing to add: in my last commit I fixed the file-reading logic in init so that the plot_tabular_regression.py example runs without errors; otherwise it throws an untrusted-type error for numpy.float64. However, I'm questioning whether the trusted arg should be added where I put it in the last commit, or only when we know for sure that model_format == 'skops'. If we go with the latter, then we have to modify the example to specify that the model format is skops. Please let me know if my explanation is clear enough.

@BenjaminBossan (Collaborator) commented May 2, 2023

However, I'm questioning whether the trusted arg should be added where I have added it in the last commit, or maybe it should be added if we know for sure that the model_format=='skops'. If we go with the latter, then we have to modify the example to specify that the model format is skops. Please let me know if my explanation is clear enough.

As a user, I would be surprised to find out that model_format="skops" does not auto-trust all types but model_format="auto" does. It should be one or the other. Presumably, we need to always trust everything, otherwise running init would not work for certain models if it uses the skops format. At the same time, trusting all types is something we want to discourage, after all, this is not much better than using pickle. We could add a trusted argument to init, but the function is already quite big as is and at first glance, it's not apparent why it would need that argument.

Overall, this has become much more complicated than I initially expected. I wondered if we could just use a dummy model here, since the model card needs to be later edited by the user anyway, but I think this could also be confusing. So in sum, I'm not sure what the best way is. @adrinjalali what would you suggest?

@adrinjalali (Member)

These are all good points. It makes me think that probably the best way is to take a model object plus the path to where it's supposed to have been saved, and not load the model in init after all. This makes the user responsible for properly loading the model before giving it to init.

@jucamohedano (Contributor Author)

I agree with the suggestion from @adrinjalali, otherwise it looks like we would have to hard code the model_format arg. What do you think about his suggestion @BenjaminBossan ?

@BenjaminBossan (Collaborator)

Yes, that might be the better solution. I'm not 100% sure if there might not be new issues with that approach, but it's worth trying.

@jucamohedano (Contributor Author)

Maybe we run into some issue, but we'll see. Shall we keep working on this PR or open a new one?

@BenjaminBossan (Collaborator)

Maybe we run into some issue, but we'll see. Shall we keep working on this PR or open a new one?

Do whatever you feel more comfortable with.

@jucamohedano (Contributor Author)

I have written an implementation to test in my local setup, where I introduced a new argument to the _hf_hub.init function: a model object. However, as expected, many tests fail. Maybe we could use the model object if it is given, and load the model otherwise. If the latter raises an error, we let the user know, so they can fix incompatibilities with the saved model object they're trying to load. WDYT?

I will remove unnecessary changes from the diff in future commits!

@BenjaminBossan (Collaborator)

I have written an implementation to test in my local setup, where I introduced a new argument to the _hf_hub.init function: a model object. However, as expected, many tests fail. Maybe we could use the model object if it is given, and load the model otherwise. If the latter raises an error, we let the user know, so they can fix incompatibilities with the saved model object they're trying to load. WDYT?

I would prefer for the change to not be backwards incompatible, i.e. all existing tests should still pass. I have a suggestion, not sure if there is something I missed that would invalidate it, let me know:

The model argument to init will, in the future, also allow an sklearn estimator to be passed. If model is a str or Path, nothing changes, everything stays the same. If model is an estimator, use it to create a README.md (if there isn't already one). Also, save the model in the destination directory using a generic name, like "model.pkl" or "model.skops" (depending on the model_format argument). WDYT about that?
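The backwards-compatible API suggested above can be sketched as follows. This is a hedged illustration, not the final skops implementation: the helper name `resolve_model` is hypothetical, and pickling a plain dict stands in for a fitted sklearn estimator.

```python
# Hedged sketch: `model` may be a path (current behavior) or an in-memory
# estimator (new behavior), saved under a generic name in the destination.
from pathlib import Path
import pickle
import tempfile

def resolve_model(model, dst: Path, model_format: str = "pickle") -> Path:
    """Return the path of the model file that init should record."""
    if isinstance(model, (str, Path)):
        # str/Path: nothing changes, the caller already saved the model.
        return Path(model)
    # Estimator: save it under a generic name depending on model_format.
    suffix = "skops" if model_format == "skops" else "pkl"
    target = dst / f"model.{suffix}"
    with open(target, "wb") as f:
        pickle.dump(model, f)
    return target

with tempfile.TemporaryDirectory() as d:
    dst = Path(d)
    # Any picklable object stands in for a fitted sklearn estimator here.
    saved = resolve_model({"stand-in": "estimator"}, dst)
    unchanged = resolve_model("existing/model.pkl", dst)
```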

@adrinjalali (Member)

The model argument to init will, in the future, also allow an sklearn estimator to be passed. If model is a str or Path, nothing changes, everything stays the same. If model is an estimator, use it to create a README.md (if there isn't already one)

That doesn't sound too bad to me; at some point we might even want to deprecate passing a path to the model.

@jucamohedano (Contributor Author)

I totally agree with the implementation and that's why I decided to go ahead and implement it.
Docs build is currently failing. I will go over that later today and fix it!

@jucamohedano (Contributor Author)

Hi, I'm not sure what to do about the failing example in the docs build. Since hub_utils.init now creates a README.md file by default when creating the repository, hub_utils.add_files fails to execute because an existing README.md is already there. I would assume it's safe to overwrite it?
This is the error from the failed docs build:

Unexpected failing examples:
/home/docs/checkouts/readthedocs.org/user_builds/skops/checkouts/207/examples/plot_california_housing.py failed leaving traceback:
Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/skops/checkouts/207/examples/plot_california_housing.py", line 1852, in <module>
    hub_utils.add_files(
  File "/home/docs/checkouts/readthedocs.org/user_builds/skops/checkouts/207/skops/hub_utils/_hf_hub.py", line 531, in add_files
    raise FileExistsError(msg)
FileExistsError: File 'README.md' already found at '/tmp/tmpwmea7ezv'.

@BenjaminBossan (Collaborator)

Maybe I misunderstand your code, but it seems like it doesn't do what we discussed, resulting in the error? IIUC, init will always create a README.md. My suggestion was

If model is a str or Path, nothing changes, everything stays the same. If model is an estimator, use it to create a README.md (if there isn't already one).

@jucamohedano reopened this Jun 2, 2023
@jucamohedano (Contributor Author)

Thank you for pointing that out @BenjaminBossan, I realised the code wasn't doing what was discussed. I reverted to the behaviour of creating a README if model is a str/Path, and handling it separately if model is an estimator object. I believe the code does that now. However, the docs still fail to build on the example plot_california_housing.py. I actually think the example needs to be updated to account for the creation of a README at repository initialization, and to avoid adding the README file manually. WDYT?

Mistakenly closed the PR while scrolling.

@BenjaminBossan (Collaborator)

Thanks for the updates @jucamohedano.

This is not a full-on review, as I think some bigger changes are still needed. Most notably, I feel that init has become too complicated; it is hard to track what is going on. Let's try to simplify it. A good start is to take a look at _load_model, which I believe you can re-use for this purpose.

I actually think that the example needs to be updated to account for the creation of a README at the initialization of the repository, and avoid adding the README file manually to the repository. WDYT?

Maybe I misunderstand, but shouldn't it be the other way round? What I mean is that the README.md created by the example is the one that contains all the important information and should thus take precedence over the README.md that was automatically created.

@jucamohedano (Contributor Author)

I will continue working on the issue starting next week and give some feedback on my work. Thank you for the feedback!

@jucamohedano (Contributor Author)

Hey! Thanks for the suggestion about _load_model; it can simplify the code, as long as we can trust the types within the model file being loaded.

Maybe I misunderstand, but shouldn't it be the other way round? What I mean is that the README.md created by the example is the one that contains all the important information and should thus take precedence over the README.md that was automatically created.

Thanks for making it clear. The call to hub_utils.add_files in the example is the one that throws the error about an already existing README.md in the repository. We could address this problem in hub_utils.add_files to account for the existing file created by the call to hub_utils.init.

@BenjaminBossan (Collaborator) left a comment

Thanks for providing updates to bring this PR closer to being merged. I have made a few comments and had some questions for clarification; please take a look.

We could address this problem in hub_utils.add_files to account for the existing file created by the call to hub_utils.init.

add_files already has an option, exist_ok, which can be set to True to allow overriding.
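The `exist_ok` behavior referred to above can be illustrated with a hedged stand-in. This mimics the check described in the conversation; it is not the skops implementation of `add_files`, and `add_file` is a hypothetical name.

```python
# Hedged stand-in: refuse to overwrite an existing file unless exist_ok=True.
from pathlib import Path
import shutil
import tempfile

def add_file(src: Path, dst_dir: Path, exist_ok: bool = False) -> None:
    target = dst_dir / src.name
    if target.exists() and not exist_ok:
        raise FileExistsError(f"File '{src.name}' already found at '{dst_dir}'.")
    shutil.copy(src, target)

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    repo = root / "repo"
    repo.mkdir()
    curated = root / "README.md"
    curated.write_text("curated model card\n")
    # Simulate the README.md that init created automatically.
    (repo / "README.md").write_text("auto-generated by init\n")
    try:
        add_file(curated, repo)  # default: refuses to overwrite
        raised = False
    except FileExistsError:
        raised = True
    add_file(curated, repo, exist_ok=True)  # explicit opt-in: overwrites
    final_content = (repo / "README.md").read_text()
```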

model = _load_model(model, trusted=True)
model_card = card.Card(model, metadata=card.metadata_from_config(dst))
model_card.save(dst / "README.md")
elif isinstance(model, BaseEstimator):
Collaborator:

I don't think we need a check for BaseEstimator here, so this can just be else, and the error below can be removed. There could be valid models here that don't inherit from BaseEstimator. It is the user's responsibility to provide an sklearn-compatible model.

Contributor Author:

Okay, got it!

if model_format == "auto":
model_format = "skops"
elif model_format in ["pkl", "pickle", "joblib"]:
model_format = "pickle"
Collaborator:

We need an else clause below, because if the user passes model_format="skosp" or another typo, it would just be accepted as is, even though it is invalid. Then, further below, the model would not be saved, without any indication that something went wrong.

Essentially, the model-format checking logic should be the same as inside _create_config. And since checking the model format is now performed in init, we should also be able to remove it from _create_config completely.

Contributor Author:

If so, I can remove that logic from _create_config and keep it just in init. I actually started this PR by looking at _create_config. Let me know if you would like me to make this change now.

Collaborator:

Yes, I think that change should be good.
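The validation discussed in this thread can be sketched as a small normalizer that mirrors the branches in the diff above and adds the missing else clause. This is a hedged sketch; the function name is illustrative, not skops API.

```python
# Hedged sketch: normalize known model formats, raise on anything else.
def normalize_model_format(model_format: str) -> str:
    if model_format in ("auto", "skops"):
        # Mirrors the diff above, where "auto" is mapped to "skops".
        return "skops"
    if model_format in ("pkl", "pickle", "joblib"):
        return "pickle"
    raise ValueError(f"Invalid model format: {model_format!r}")

assert normalize_model_format("joblib") == "pickle"
try:
    normalize_model_format("skosp")  # the typo from the review comment
    raised = False
except ValueError:
    raised = True
```

With the else clause in place, a typo fails loudly at init time instead of silently skipping the model save further below.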


version = metadata.version("scikit-learn")
_, model_format = config_json
# joblib type falls unders auto format, explicityly set to auto
Collaborator:

Suggested change
# joblib type falls unders auto format, explicityly set to auto
# joblib type falls under auto format, explicitly set to auto

version = metadata.version("scikit-learn")
_, model_format = config_json
# joblib type falls unders auto format, explicityly set to auto
# because we can't repeat key "auto" in CONFIG dict
Collaborator:

Sorry, I don't understand that, could you please explain further?

Contributor Author:

Sorry, that comment isn't self-explanatory. The CONFIG dictionary in test_hf_hub contains 3 types of models. I wanted to test for a model with the name model.joblib, so I added that model to the dictionary. If I recall correctly, adding the key:value pair for the joblib model made other tests fail. But I could fix this by changing the joblib model type to auto and still keep the name model.joblib to test for.

Collaborator:

Hmm, I guess it depends on why the other tests fail. If this uncovers a bug in existing code, that would be good to know. Otherwise, I think it's okay to have a separate test for joblib and not change CONFIG.

model_card = RepoCard.load(Path(dir_path) / "README.md")
model_card.data.license

# override existent modelcard created by init with license attribute
Collaborator:

For my better understanding, is this testing some new behavior added by this PR or some general behavior?

Contributor Author:

No, this is not new behavior added by this PR. Now that you call this out, I see that maybe we can make this test simpler by checking that the README.md exists after the call to init, and that's it? I think at the time of writing the test, I decided to update the README.md as a way of checking that it exists.

Collaborator:

we can make this test simpler by checking that the README.md exists after the call to init and that's it?

Yes, that sounds like it is sufficient as a test.
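The simplified test agreed on here can be sketched as follows. This is a hedged stand-in: `fake_init` is a hypothetical substitute for `hub_utils.init`, used only to show the shape of the assertion.

```python
# Hedged sketch: just assert that README.md exists after init.
from pathlib import Path
import tempfile

def fake_init(dst: Path) -> None:
    # Stand-in for hub_utils.init, which now also writes a minimal model card.
    dst.mkdir(parents=True, exist_ok=True)
    (dst / "README.md").write_text("---\nlibrary_name: sklearn\n---\n")

with tempfile.TemporaryDirectory() as d:
    repo = Path(d) / "repo"
    fake_init(repo)
    readme_exists = (repo / "README.md").is_file()

assert readme_exists
```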
