
Cache model loading in model card #243

Open
BenjaminBossan opened this issue Dec 6, 2022 · 17 comments
Labels
help wanted, model cards

Comments

@BenjaminBossan
Collaborator

Follow up to #96 and #205.

The _load_model function

import zipfile
from pathlib import Path
from typing import Any

import joblib

from skops.io import load


def _load_model(model: Any, trusted=False) -> Any:
    """Return a model instance.

    Loads the model if provided a file path; if already a model instance,
    returns it unmodified.

    Parameters
    ----------
    model : pathlib.Path, str, or sklearn estimator
        Path/str or the actual model instance. If a Path or str, loads the
        model.

    trusted : bool, default=False
        Passed to :func:`skops.io.load` if the model is a file path and it's
        a `skops` file.

    Returns
    -------
    model : object
        Model instance.
    """
    if not isinstance(model, (Path, str)):
        return model

    model_path = Path(model)
    if not model_path.exists():
        raise FileNotFoundError(f"File is not present: {model_path}")

    try:
        if zipfile.is_zipfile(model_path):
            model = load(model_path, trusted=trusted)
        else:
            model = joblib.load(model_path)
    except Exception as ex:
        msg = f'An "{type(ex).__name__}" occurred during model loading.'
        raise RuntimeError(msg) from ex

    return model

is currently called each time the model of a card is accessed, e.g. when its repr is called. When the model is a path, it has to be loaded from disk each time, which can be expensive. Therefore, we would like to add a cache to _load_model.

A simple functools.{lru_cache,cache} is, however, not sufficient: the argument, in this case the model path, could stay the same while the model file on disk is overwritten with a new model. The cache key would need to be something else, like the md5 sum of the file that the path points to.
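
For illustration, a minimal sketch of what a content-keyed cache could look like (_file_digest, _load_model_cached, and the module-level dict are hypothetical names, not existing skops API):

import hashlib
from pathlib import Path

_model_cache: dict = {}


def _file_digest(path: Path) -> str:
    # Hash the file in chunks so that large models don't need to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _load_model_cached(model, trusted=False):
    if not isinstance(model, (Path, str)):
        return model
    # Key the cache on the file content, so overwriting the file with a new
    # model invalidates the old entry.
    key = _file_digest(Path(model))
    if key not in _model_cache:
        _model_cache[key] = _load_model(model, trusted=trusted)
    return _model_cache[key]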

@adrinjalali added the help wanted label Dec 6, 2022
@jucamohedano
Contributor

Hi! I would like to help here. If we were to use the md5 sum of the file that the path points to, would we then have to calculate the md5 sum every time _load_model is called?

@BenjaminBossan
Collaborator Author

would we then have to calculate the md5 sum every time _load_model is called?

Each time the function is called with a str or Path argument, yes.

I would assume that calculating the md5 should be really fast, but ideally we would measure that before changing the code.

@jucamohedano
Contributor

jucamohedano commented Dec 8, 2022

Actually, it's better to compute the md5 sum of the model file than to load the model file every time anyway. I could start a PR on this issue.

@BenjaminBossan
Collaborator Author

Actually, it's better to compute the md5 sum of the model file than to load the model file every time anyway. I could start a PR on this issue.

Well, if loading the file takes 100 ms and calculating the md5 takes 99 ms, arguably it's better to just always load the file. That's why we should measure, just to be sure.

Instead of md5, we could also consider storing and checking the time the file was last modified, using os.path.getmtime. Maybe there is some precedent for which approach is best.
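
Just as a sketch, the getmtime variant could look something like this (again, the names here are placeholders):

import os
from pathlib import Path

_model_cache: dict = {}


def _load_model_cached(model, trusted=False):
    if not isinstance(model, (Path, str)):
        return model
    # Key the cache on the path plus its last modification time, so rewriting
    # the file invalidates the entry.
    key = (str(model), os.path.getmtime(model))
    if key not in _model_cache:
        _model_cache[key] = _load_model(model, trusted=trusted)
    return _model_cache[key]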

@jucamohedano
Contributor

Okay, I understand. Although I think getmtime could make us reload the model file because its modification time changed while its content stayed the same; in that case, we could compute the hash of the file to double-check. But again, that assumes hashing with md5 is very fast.

Maybe we can use some metadata of the model to check whether it was changed or not.

@adrinjalali
Member

The idea is not to "never reload a file twice"; the idea is to "not reload the same content in most cases". So if a solution avoids reloading the same file in 95% of cases, that's a good enough solution, especially if it keeps things simple.

@jucamohedano
Contributor

Thanks for clarifying. I think I could do some testing on the loading times and the md5 hash of a model file, and report back on how they compare.

@adrinjalali
Member

Sure.

@jucamohedano
Contributor

Sorry, I haven't yet posted the results I said I would test. I will post them in the next 2-3 days :)

@jucamohedano
Contributor

Hey! I have some results to share with you.

I have tested the loading times and the time taken to compute the md5sum of each model's pickle file. I used two models: the first, from examples/plot_model_card.py, uses sklearn.ensemble.HistGradientBoostingClassifier, and the other is a sklearn.ensemble.ExtraTreesClassifier. Here are the results:

Example 1 - sklearn.ensemble.HistGradientBoostingClassifier
Model size = 228K

time md5sum /tmp/skops-_gor4zib/skops-4e68jjra.pkl
3e58bbfade12f97a9bf1e989ed8ab151  /tmp/skops-_gor4zib/skops-4e68jjra.pkl
real	0m0.003s

Time of _load_model()
0.006422758102416992s

Example 2 - sklearn.ensemble.ExtraTreesClassifier with n_estimators=10000 - number of trees in the forest
Model size = 75M

time md5sum /tmp/skops-gdzd4gj2/skops-456h4_5e.pkl
ada4e1ea1fda4a72cfa52ebf6458c557  /tmp/skops-gdzd4gj2/skops-456h4_5e.pkl
real	0m0.097s

Time of _load_model()
0.9678699970245361s

Example 3 - sklearn.ensemble.ExtraTreesClassifier with n_estimators=100000 - number of trees in the forest
Model size = 749M

time md5sum /tmp/skops-8k741m3s/skops-x3d6hy5w.pkl 
e693e402475155f07ae13a56dbb82b12  /tmp/skops-8k741m3s/skops-x3d6hy5w.pkl
real	0m0.944s

Time of _load_model()
8.120674848556519s

@BenjaminBossan
Collaborator Author

Thanks a lot for running the tests. A first conclusion would be that for small models, calling the hashing function is about as slow as loading the model; only for big models would we notice a difference. But that difference can be quite significant, as your last example shows.

For this test, could you please check what the times would be when calling the hashing functions from Python using hashlib? I would expect that to add a bit of overhead. Also, just in case we require a more secure hashing function, could you please check both md5 and, say, sha256? And maybe run the tests a few times using timeit (or %timeit in case you're using ipython/jupyter).
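
Such a benchmark could look roughly like this (hash_file and the paths are placeholders, not a prescribed snippet):

import hashlib
import timeit


def hash_file(path, algorithm):
    # Hash the file in chunks to keep memory usage low for large models.
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


paths = ["model.pkl"]  # replace with the pickled model files to test
for path in paths:
    for algorithm in ("md5", "sha256"):
        # Average over 10 runs to smooth out fluctuations.
        duration = timeit.timeit(lambda: hash_file(path, algorithm), number=10) / 10
        print(f"{path} {algorithm} time={duration}s")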

@jucamohedano
Contributor

I measured the times with the timeit python module. Here are three sets of results:

Test 1

Model 1 (228K) /tmp/skops-_gor4zib/skops-4e68jjra.pkl md5 time=0.0003169089904986322s
Model 2 (75M) /tmp/skops-gdzd4gj2/skops-456h4_5e.pkl md5 time=0.11238793900702149s
Model 3 (749M) /tmp/skops-8k741m3s/skops-x3d6hy5w.pkl md5 time=1.088716946018394s
Model 1 (228K) /tmp/skops-_gor4zib/skops-4e68jjra.pkl sha256 time=0.00016041501658037305s
Model 2 (75M) /tmp/skops-gdzd4gj2/skops-456h4_5e.pkl sha256 time=0.06882680201670155s
Model 3 (749M) /tmp/skops-8k741m3s/skops-x3d6hy5w.pkl sha256 time=0.6861052309977822s

Test 2

Model 1 (228K) /tmp/skops-_gor4zib/skops-4e68jjra.pkl md5 time=0.0002976770047098398s
Model 2 (75M) /tmp/skops-gdzd4gj2/skops-456h4_5e.pkl md5 time=0.11226405500201508s
Model 3 (749M) /tmp/skops-8k741m3s/skops-x3d6hy5w.pkl md5 time=1.101641189015936s
Model 1 (228K) /tmp/skops-_gor4zib/skops-4e68jjra.pkl sha256 time=0.0003235790063627064s
Model 2 (75M) /tmp/skops-gdzd4gj2/skops-456h4_5e.pkl sha256 time=0.07127582101384178s
Model 3 (749M) /tmp/skops-8k741m3s/skops-x3d6hy5w.pkl sha256 time=0.6806159189436585s

Test 3

Model 1 (228K) /tmp/skops-_gor4zib/skops-4e68jjra.pkl md5 time=0.0009365510195493698s
Model 2 (75M) /tmp/skops-gdzd4gj2/skops-456h4_5e.pkl md5 time=0.1162097500055097s
Model 3 (749M) /tmp/skops-8k741m3s/skops-x3d6hy5w.pkl md5 time=1.1024849910172634s
Model 1 (228K) /tmp/skops-_gor4zib/skops-4e68jjra.pkl sha256 time=0.0003798969555646181s
Model 2 (75M) /tmp/skops-gdzd4gj2/skops-456h4_5e.pkl sha256 time=0.06875280395615846s
Model 3 (749M) /tmp/skops-8k741m3s/skops-x3d6hy5w.pkl sha256 time=0.6962733599939384s

It looks like sha256 is faster than md5 from these tests. Let me know if you want me to share the code snippet that I've used.

@BenjaminBossan
Collaborator Author

So I assume the 3 tested models are the same as in your previous post.

If so, do I interpret the data correctly that calling md5 from Python (2nd comment) on the small model is 10x faster (0.0003 sec) than calling it from the command line (1st comment, 0.003 sec)? That's strange. But for the other models, there is no big difference, so I guess we can ignore this outlier.

Anyway, overall it seems like sha256 is the better choice and is consistently faster than actually loading the model, especially for larger models. Then I think it's safe to say we should go ahead and implement the caching.

@jucamohedano
Contributor

Yes, your assumption is correct.

I didn't spot that outlier; I tested again from the command line and it gives me the same result.

Okay, then I will open a PR to implement the caching! :)

@BenjaminBossan
Collaborator Author

I just learned about cached_property, which was introduced in Python 3.8. Since we dropped 3.7, maybe this could be an option for us.
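
To illustrate, a made-up minimal sketch (not the actual skops Card class):

from functools import cached_property


class Card:
    def __init__(self, model_path):
        self.model_path = model_path

    @cached_property
    def model(self):
        # Loaded once on first access, then cached on the instance; the
        # cache can be cleared with `del card.model`.
        return _load_model(self.model_path)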

@jucamohedano
Contributor

I think that we wouldn't be able to use cached_property here because it's only applicable to methods.

@jucamohedano
Contributor

I opened PR #299, which implements the functionality discussed. However, there's a small problem: I worked on a previous PR, #207, which is supposed to be merged, but I did that work on the main branch of my fork (my mistake). Therefore, I had to branch off from main to implement the cached model loading, and now all my previous commits are attached to #299. I'm sorry if that's an issue.
