Generate minimal README.md file in repository initialization (issue #113) #207

Open — wants to merge 40 commits into base: main
Conversation

@jucamohedano (Contributor)

This PR references issue #113

Implemented the function _create_readme, which is called in hub_utils.init and creates a minimal README.md file. Its input is the same as that of the _create_config function.
At the moment, _create_readme only supports a tabular task for the widget. I haven't added the text task, since other functions such as metadata_from_config in _model_card.py don't support the text task yet. Let me know if it should be added anyway.

I have manually tested the initialization of a repository with the example script plot_hf_hub.py and it seemed to work as expected.

@adrinjalali (Member) left a comment

Thanks for the PR @jucamohedano

@@ -213,6 +213,83 @@ def recursively_default_dict() -> MutableMapping:
dump_json(Path(dst) / "config.json", config)


def _create_readme(
Member:

I don't think we need this function.

We can probably simply add these two lines to init:

model_card = card.Card(model, metadata=card.metadata_from_config(Path(local_repo)))
model_card.save(Path(local_repo) / "README.md")

Contributor Author:

That simplifies it a lot! I applied the change :)

@merveenoyan (Collaborator) left a comment

Thanks a lot, I left a very general feedback 🙂

skops/hub_utils/tests/test_hf_hub.py
data=iris.data,
)
_validate_folder(path=dir_path)
assert os.path.isfile(Path(dir_path) / "README.md")
Collaborator:

I wonder if we should simply add it under test_init to deduplicate (essentially everything before this step is the same, and this behavior will be part of init anyway)? WDYT? @adrinjalali @jucamohedano

Member:

That's true, this specific one can be in test_init.

Contributor Author:

I agree. Made the change.

t1 = os.path.getmtime(Path(dir_path) / "README.md")

# compare the times at which the files were last modified
assert t0 != t1
Collaborator:

I think instead of this we can render the content of the model card and compare whether it's the same or not, WDYT? @adrinjalali @jucamohedano

Member:

I'm happy either way!

Collaborator:

We want to make sure content is modified so it would be nice to have the former.
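The change the reviewers settled on can be sketched roughly like this. This is a hedged stand-in, not the actual skops test: `render_card` is a hypothetical helper that simply reads the file, whereas the real test would render via skops' card API. It compares rendered content before and after a second initialization, instead of comparing file modification times.

```python
# Hedged sketch: compare rendered model card content, not mtimes.
from pathlib import Path
import tempfile

def render_card(path: Path) -> str:
    # Stand-in for rendering a model card; the real test would use skops' card API.
    return path.read_text()

with tempfile.TemporaryDirectory() as dir_path:
    readme = Path(dir_path) / "README.md"
    readme.write_text("# model card v1\n")
    content_before = render_card(readme)
    # Simulate init() overwriting the card on a second call.
    readme.write_text("# model card v2\n")
    content_after = render_card(readme)

assert content_before != content_after  # content, not timestamps, proves modification
```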

Contributor Author:

I made the change :)

@adrinjalali (Member) left a comment

Thanks @jucamohedano , it's almost ready to be merged :)

@@ -395,7 +429,6 @@ def repo_path_for_inference():


@pytest.mark.network
@pytest.mark.inference
Member:

this is probably removed by mistake

Contributor Author:

Thank you! I removed that somehow by mistake. I added it back.

Comment on lines 418 to 451
# test inference backend for classifier and regressor models. This test can
# take a lot of time and be flaky.
# test inference backend for classifier and regressor models.
Member:

this change as well

Contributor Author:

Thanks again! I added it back :)

@adrinjalali (Member)

@BenjaminBossan any idea why the CI fails here? It's really odd to me.

@BenjaminBossan (Collaborator)

any idea why the CI fails here? It's really odd to me.

I didn't follow this PR along. Is the expectation here that the error should be caught by the try?

@adrinjalali (Member)

Oh I see where the issue comes from. The error is raised inside this test: test_init_empty_model_file_warns

So far, we have allowed people to pass an empty model file, we just raise a warning if they do so.

Now, this PR tries to actually load the model file and do things with it, which means it expects it to be a valid model file.

So, @BenjaminBossan , the question is: do we allow people to init a repo with an invalid model file? I'm leaning towards raising if the model file doesn't contain a valid model. WDYT?

@BenjaminBossan (Collaborator)

do we allow people to init a repo with an invalid model file? I'm leaning towards raising if the model file doesn't contain a valid model. WDYT?

I would be okay with that, don't really see a use case where you'd start out without a model.
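The behavior agreed on here — raise instead of warn when the model file does not contain a valid model — could look roughly like the following. This is a hedged sketch: `load_model_or_raise` is a hypothetical helper, not the skops API (the actual fix landed in #214).

```python
# Hedged sketch: reject an empty/invalid model file at init time.
from pathlib import Path
import pickle
import tempfile

def load_model_or_raise(model_path: Path):
    if model_path.stat().st_size == 0:
        raise ValueError(
            f"Model file '{model_path.name}' is empty; a valid model is required."
        )
    with open(model_path, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as d:
    empty_file = Path(d) / "model.pkl"
    empty_file.touch()  # an empty file, as in test_init_empty_model_file_warns
    try:
        load_model_or_raise(empty_file)
        raised = False
    except ValueError:
        raised = True
```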

@adrinjalali (Member)

#214 should fix the issue.

@adrinjalali (Member)

@jucamohedano note that I've merged the latest main; you'd need to do a git pull before applying more changes, so that this change is applied locally on your machine.

@adrinjalali (Member)

@jucamohedano would you be able to debug the issue?

@jucamohedano (Contributor Author)

One thing to add: in my last commit I fixed the file-reading logic in init so that the plot_tabular_regression.py example runs without errors; otherwise it throws an untrusted-type error for numpy.float64. However, I'm questioning whether the trusted arg should be added where I put it in the last commit, or only when we know for sure that model_format == 'skops'. If we go with the latter, then we have to modify the example to specify that the model format is skops. Please let me know if my explanation is clear enough.

@BenjaminBossan (Collaborator) commented May 2, 2023

However, I'm questioning whether the trusted arg should be added where I have added it in the last commit, or maybe it should be added if we know for sure that the model_format=='skops'. If we go with the latter, then we have to modify the example to specify that the model format is skops. Please let me know if my explanation is clear enough.

As a user, I would be surprised to find out that model_format="skops" does not auto-trust all types but model_format="auto" does. It should be one or the other. Presumably, we need to always trust everything, otherwise running init would not work for certain models if it uses the skops format. At the same time, trusting all types is something we want to discourage, after all, this is not much better than using pickle. We could add a trusted argument to init, but the function is already quite big as is and at first glance, it's not apparent why it would need that argument.

Overall, this has become much more complicated than I initially expected. I wondered if we could just use a dummy model here, since the model card needs to be later edited by the user anyway, but I think this could also be confusing. So in sum, I'm not sure what the best way is. @adrinjalali what would you suggest?

@adrinjalali (Member)

These are all good points. It makes me think that probably the best way is to take a model object plus the path to where it's supposed to have been saved, and not load the model in init after all. This makes the user responsible for properly loading the model before giving it to init.

@jucamohedano (Contributor Author)

I agree with the suggestion from @adrinjalali, otherwise it looks like we would have to hard code the model_format arg. What do you think about his suggestion @BenjaminBossan ?

@BenjaminBossan (Collaborator)

Yes, that might be the better solution. I'm not 100% sure if there might not be new issues with that approach, but it's worth trying.

@jucamohedano (Contributor Author)

Maybe we run into some issue, but we'll see. Shall we keep working on this PR or open a new one?

@BenjaminBossan (Collaborator)

Maybe we run into some issue, but we'll see. Shall we keep working on this PR or open a new one?

Do whatever you feel more comfortable with.

@jucamohedano (Contributor Author)

I have written an implementation to test in my local setup, where I introduced a new argument to the _hf_hub.init function: a model object. However, as expected, many tests fail. Maybe we could use the model object if it is given, and load the model otherwise. If the latter raises an error, we let the user know, so they can fix incompatibilities with the saved model object they're trying to load. WDYT?

I will remove unnecessary changes from the diff in future commits!

@BenjaminBossan (Collaborator)

I have written an implementation to test in my local setup, where I introduced a new argument to the _hf_hub.init function: a model object. However, as expected, many tests fail. Maybe we could use the model object if it is given, and load the model otherwise. If the latter raises an error, we let the user know, so they can fix incompatibilities with the saved model object they're trying to load. WDYT?

I would prefer for the change to not be backwards incompatible, i.e. all existing tests should still pass. I have a suggestion, not sure if there is something I missed that would invalidate it, let me know:

The model argument to init will, in the future, also allow an sklearn estimator to be passed. If model is a str or Path, nothing changes, everything stays the same. If model is an estimator, use it to create a README.md (if there isn't already one). Also, save the model in the destination directory using a generic name, like "model.pkl" or "model.skops" (depending on the model_format argument). WDYT about that?
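The backwards-compatible API suggested above can be sketched as follows. This is a hedged illustration, not the final skops implementation: the helper name `resolve_model` is hypothetical, and pickling a plain dict stands in for a fitted sklearn estimator.

```python
# Hedged sketch: `model` may be a path (current behavior) or an in-memory
# estimator (new behavior), saved under a generic name in the destination.
from pathlib import Path
import pickle
import tempfile

def resolve_model(model, dst: Path, model_format: str = "pickle") -> Path:
    """Return the path of the model file that init should record."""
    if isinstance(model, (str, Path)):
        # str/Path: nothing changes, the caller already saved the model.
        return Path(model)
    # Estimator: save it under a generic name depending on model_format.
    suffix = "skops" if model_format == "skops" else "pkl"
    target = dst / f"model.{suffix}"
    with open(target, "wb") as f:
        pickle.dump(model, f)
    return target

with tempfile.TemporaryDirectory() as d:
    dst = Path(d)
    # Any picklable object stands in for a fitted sklearn estimator here.
    saved = resolve_model({"stand-in": "estimator"}, dst)
    unchanged = resolve_model("existing/model.pkl", dst)
```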

@adrinjalali (Member)

The model argument to init will, in the future, also allow an sklearn estimator to be passed. If model is a str or Path, nothing changes, everything stays the same. If model is an estimator, use it to create a README.md (if there isn't already one)

That doesn't sound too bad to me; at some point we might even want to deprecate passing a path to the model.

@jucamohedano (Contributor Author)

I totally agree with the implementation and that's why I decided to go ahead and implement it.
Docs build is currently failing. I will go over that later today and fix it!

@jucamohedano (Contributor Author)

Hi, I'm not sure what to do about the failing example in the docs build. Since hub_utils.init now creates a README.md file by default when creating the repository, hub_utils.add_files fails to execute because an existing README.md is already there. I would assume it's safe to overwrite it?
This is the error from the failed docs build:

Unexpected failing examples:
/home/docs/checkouts/readthedocs.org/user_builds/skops/checkouts/207/examples/plot_california_housing.py failed leaving traceback:
Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/skops/checkouts/207/examples/plot_california_housing.py", line 1852, in <module>
    hub_utils.add_files(
  File "/home/docs/checkouts/readthedocs.org/user_builds/skops/checkouts/207/skops/hub_utils/_hf_hub.py", line 531, in add_files
    raise FileExistsError(msg)
FileExistsError: File 'README.md' already found at '/tmp/tmpwmea7ezv'.

@BenjaminBossan (Collaborator)

Maybe I misunderstand your code, but it seems like it doesn't do what we discussed, resulting in the error? IIUC, init will always create a README.md. My suggestion was

If model is a str or Path, nothing changes, everything stays the same. If model is an estimator, use it to create a README.md (if there isn't already one).

@jucamohedano reopened this Jun 2, 2023
@jucamohedano (Contributor Author)

Thank you for pointing that out @BenjaminBossan, I realised the code wasn't doing what was discussed. I reverted to the behaviour of creating a README if model is a str/Path, and handling it separately if model is an estimator object. I believe the code does that now. However, the docs still fail to build on the example plot_california_housing.py. I actually think the example needs to be updated to account for the creation of a README at repository initialization, and to avoid adding the README file manually. WDYT?

Mistakenly closed the PR while scrolling.

@BenjaminBossan (Collaborator)

Thanks for the updates @jucamohedano.

This is not a full-on review, as I think some bigger changes are still needed. Most notably, I feel that init has become too complicated; it is hard to track what is going on. Let's try to simplify it. A good start is to take a look at _load_model, which I believe you can re-use for this purpose.

I actually think that the example needs to be updated to account for the creation of a README at the initialization of the repository, and avoid adding the README file manually to the repository. WDYT?

Maybe I misunderstand, but shouldn't it be the other way round? What I mean is that the README.md created by the example is the one that contains all the important information and should thus take precedence over the README.md that was automatically created.

@jucamohedano (Contributor Author)

I will continue working on the issue starting next week and give some feedback on my work. Thank you for the feedback!

@jucamohedano (Contributor Author)

Hey! Thanks for the suggestion about _load_model; it can simplify the code, as long as we can trust the types within the model file being loaded.

Maybe I misunderstand, but shouldn't it be the other way round? What I mean is that the README.md created by the example is the one that contains all the important information and should thus take precedence over the README.md that was automatically created.

Thanks for making it clear. The call to hub_utils.add_files in the example is the one that throws the error about an already existing README.md in the repository. We could address this problem in hub_utils.add_files to account for the existing file created by the call to hub_utils.init.

@BenjaminBossan (Collaborator) left a comment

Thanks for providing updates to bring this PR closer to being merged. I have made a few comments and had some questions for clarification; please take a look.

We could address this problem in hub_utils.add_files to account for the existing file created by the call to hub_utils.init.

add_files already has an option, exist_ok, which can be set to True to allow overriding.
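The `exist_ok` behavior referred to above can be illustrated with a hedged stand-in. This mimics the check described in the conversation; it is not the skops implementation of `add_files`, and `add_file` is a hypothetical name.

```python
# Hedged stand-in: refuse to overwrite an existing file unless exist_ok=True.
from pathlib import Path
import shutil
import tempfile

def add_file(src: Path, dst_dir: Path, exist_ok: bool = False) -> None:
    target = dst_dir / src.name
    if target.exists() and not exist_ok:
        raise FileExistsError(f"File '{src.name}' already found at '{dst_dir}'.")
    shutil.copy(src, target)

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    repo = root / "repo"
    repo.mkdir()
    curated = root / "README.md"
    curated.write_text("curated model card\n")
    # Simulate the README.md that init created automatically.
    (repo / "README.md").write_text("auto-generated by init\n")
    try:
        add_file(curated, repo)  # default: refuses to overwrite
        raised = False
    except FileExistsError:
        raised = True
    add_file(curated, repo, exist_ok=True)  # explicit opt-in: overwrites
    final_content = (repo / "README.md").read_text()
```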

model = _load_model(model, trusted=True)
model_card = card.Card(model, metadata=card.metadata_from_config(dst))
model_card.save(dst / "README.md")
elif isinstance(model, BaseEstimator):
Collaborator:

I don't think we need a check for BaseEstimator here, so this can just be else, and the error below can be removed. There could be valid models here that don't inherit from BaseEstimator. It is the user's responsibility to provide an sklearn-compatible model.

Contributor Author:

Okay, got it!

if model_format == "auto":
model_format = "skops"
elif model_format in ["pkl", "pickle", "joblib"]:
model_format = "pickle"
Collaborator:

We need an else clause below, because if the user passes model_format="skosp" or another typo, it would just be accepted as is, even though it is invalid. Then, further below, the model would not be saved, without any indication that something went wrong.

Essentially, the model-format checking logic should be the same as inside _create_config. And since checking the model format is now performed in init, we should also be able to remove it from _create_config completely.

Contributor Author:

If so, I can remove that logic from _create_config and keep it just in init. I actually started this PR by looking at _create_config. Let me know if you would like me to make this change now.

Collaborator:

Yes, I think that change should be good.
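The validation discussed in this thread can be sketched as a small normalizer that mirrors the branches in the diff above and adds the missing else clause. This is a hedged sketch; the function name is illustrative, not skops API.

```python
# Hedged sketch: normalize known model formats, raise on anything else.
def normalize_model_format(model_format: str) -> str:
    if model_format in ("auto", "skops"):
        # Mirrors the diff above, where "auto" is mapped to "skops".
        return "skops"
    if model_format in ("pkl", "pickle", "joblib"):
        return "pickle"
    raise ValueError(f"Invalid model format: {model_format!r}")

assert normalize_model_format("joblib") == "pickle"
try:
    normalize_model_format("skosp")  # the typo from the review comment
    raised = False
except ValueError:
    raised = True
```

With the else clause in place, a typo fails loudly at init time instead of silently skipping the model save further below.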


version = metadata.version("scikit-learn")
_, model_format = config_json
# joblib type falls unders auto format, explicityly set to auto
Collaborator:

Suggested change
# joblib type falls unders auto format, explicityly set to auto
# joblib type falls under auto format, explicitly set to auto

version = metadata.version("scikit-learn")
_, model_format = config_json
# joblib type falls unders auto format, explicityly set to auto
# because we can't repeat key "auto" in CONFIG dict
Collaborator:

Sorry, I don't understand that, could you please explain further?

Contributor Author:

Sorry, that comment isn't self-explanatory. The CONFIG dictionary in test_hf_hub contains 3 types of models. I wanted to test for a model with the name model.joblib, so I added that model to the dictionary. If I recall correctly, adding the key:value pair for the joblib model made other tests fail. But I could fix this by changing the joblib model type to auto and still keep the name model.joblib to test for.

Collaborator:

Hmm, I guess it depends on why the other tests fail. If this uncovers a bug in existing code, that would be good to know. Otherwise, I think it's okay to have a separate test for joblib and not change CONFIG.

model_card = RepoCard.load(Path(dir_path) / "README.md")
model_card.data.license

# override existent modelcard created by init with license attribute
Collaborator:

For my better understanding, is this testing some new behavior added by this PR or some general behavior?

Contributor Author:

No, this is not new behavior added by this PR. Now that you call this out, I see that maybe we can make this test simpler by checking that the README.md exists after the call to init, and that's it? I think at the time of writing the test, I decided to update the README.md as a way of checking that it exists.

Collaborator:

we can make this test simpler by checking that the README.md exists after the call to init and that's it?

Yes, that sounds like it is sufficient as a test.
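The simplified test agreed on here can be sketched as follows. This is a hedged stand-in: `fake_init` is a hypothetical substitute for `hub_utils.init`, used only to show the shape of the assertion.

```python
# Hedged sketch: just assert that README.md exists after init.
from pathlib import Path
import tempfile

def fake_init(dst: Path) -> None:
    # Stand-in for hub_utils.init, which now also writes a minimal model card.
    dst.mkdir(parents=True, exist_ok=True)
    (dst / "README.md").write_text("---\nlibrary_name: sklearn\n---\n")

with tempfile.TemporaryDirectory() as d:
    repo = Path(d) / "repo"
    fake_init(repo)
    readme_exists = (repo / "README.md").is_file()

assert readme_exists
```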
