
TST Check file size differences #350

Merged

Conversation

BenjaminBossan
Collaborator

Description

We already measure the runtime performance difference, but we don't check the file size difference. This PR adds another, similar script that does exactly that. The checks are run in the same workflow as the runtime performance checks.

The results are reported as the top 10 largest differences, once in terms of absolute differences and once in terms of relative differences. In contrast to the runtime performance check, no error is ever raised, no matter how big the difference is, because it is unclear what difference, if any, would count as unacceptable.

I compressed the pickle and skops files at a high compression level. This is a reasonable choice for the benchmark because we can assume that if file size is a concern, users would choose that option.
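As a rough illustration, here is a minimal sketch of the measurement idea, not the exact script added in this PR. Using gzip for the pickle side is an illustrative choice, and the `compression`/`compresslevel` keywords for `skops.io.dumps` are an assumption about the API:

```python
# Minimal sketch, not the script from this PR. gzip for pickle and the
# compression/compresslevel keywords for skops.io.dumps are assumptions.
import gzip
import pickle
import zipfile

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from skops.io import dumps

X, y = make_classification(random_state=0)
est = GradientBoostingClassifier(random_state=0).fit(X, y)

# pickle, compressed at the highest gzip level
pickle_kb = len(gzip.compress(pickle.dumps(est), compresslevel=9)) / 1024
# skops zip file, also highly compressed
skops_kb = len(dumps(est, compression=zipfile.ZIP_DEFLATED, compresslevel=9)) / 1024

print(f"pickle: {pickle_kb:.1f} KB, skops: {skops_kb:.1f} KB")
```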

Results

Running this on my machine, I get:

Largest absolute difference

                             name  pickle (KB)  skops (KB)    abs_diff
0      GradientBoostingClassifier    28.260742  401.458008  373.197266
1       GradientBoostingRegressor    26.522461  398.620117  372.097656
2            ExtraTreesClassifier    55.287109  204.671875  149.384766
3             ExtraTreesRegressor   155.334961  299.827148  144.492188
4          RandomForestClassifier    21.762695  158.903320  137.140625
5                 IsolationForest    83.681641  218.454102  134.772461
6            RandomTreesEmbedding    66.018555  193.952148  127.933594
7           RandomForestRegressor    96.218750  218.662109  122.443359
8  HistGradientBoostingClassifier    12.216797   95.233398   83.016602
9   HistGradientBoostingRegressor    12.311523   94.998047   82.686523

Largest relative difference

                             name  pickle (KB)  skops (KB)   rel_diff
0       GradientBoostingRegressor    26.522461  398.620117  15.029530
1      GradientBoostingClassifier    28.260742  401.458008  14.205501
2              AdaBoostClassifier     5.845703   68.863281  11.780154
3                       SparsePCA     1.147461   11.383789   9.920851
4                  FactorAnalysis     1.434570   11.860352   8.267529
5                KBinsDiscretizer     1.763672   14.403320   8.166667
6  HistGradientBoostingClassifier    12.216797   95.233398   7.795284
7   HistGradientBoostingRegressor    12.311523   94.998047   7.716189
8                   CategoricalNB     1.371094   10.341797   7.542735
9          RandomForestClassifier    21.762695  158.903320   7.301638
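
For reference, a sketch of how the report columns above could be derived with pandas; the DataFrame here is seeded with two rows from the results, purely for illustration:

```python
import pandas as pd

# Two rows from the results above, for illustration only
df = pd.DataFrame({
    "name": ["GradientBoostingClassifier", "GradientBoostingRegressor"],
    "pickle (KB)": [28.260742, 26.522461],
    "skops (KB)": [401.458008, 398.620117],
})
df["abs_diff"] = df["skops (KB)"] - df["pickle (KB)"]  # absolute difference in KB
df["rel_diff"] = df["skops (KB)"] / df["pickle (KB)"]  # size ratio: skops / pickle

print(df.sort_values("abs_diff", ascending=False).head(10))
print(df.sort_values("rel_diff", ascending=False).head(10))
```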

Out of curiosity, I also sorted by the smallest difference; at the low end, the factor can be as small as skops being 4% larger. The mean relative difference is skops being 137% larger; the median is 92% larger.

The largest differences are clearly found in decision tree ensemble estimators. This is perhaps not surprising, given that these are generally the largest models, with 100 trees by default. Still, perhaps we can come up with a way to store trees more efficiently.

@BenjaminBossan
Collaborator Author

@skops-dev/maintainers ready for review

@adrinjalali
Member

I'm happy to have this, but unlike check_persistence_performance we don't really ever raise here. Are you thinking of having this mostly as something which would inform us of the current state?

@BenjaminBossan
Collaborator Author

> I'm happy to have this, but unlike check_persistence_performance we don't really ever raise here. Are you thinking of having this mostly as something which would inform us of the current state?

We could have this just for checking, but if you have a suggestion for a maximum acceptable difference, I can add it.

@adrinjalali
Member

Maybe check every file is like less than 1MB, like a sanity check?

@BenjaminBossan
Collaborator Author

Done
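
(A hypothetical sketch of that sanity check; the names here are invented for illustration, not the exact code merged:

```python
# Hypothetical sketch of the suggested sanity check: fail loudly if any
# serialized file exceeds 1 MB. `sizes_kb` maps estimator name to file size in KB.
def check_max_size(sizes_kb: dict[str, float], limit_kb: float = 1024.0) -> None:
    for name, size in sizes_kb.items():
        if size > limit_kb:
            raise RuntimeError(f"{name}: {size:.0f} KB exceeds {limit_kb:.0f} KB")
```
)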

@adrinjalali changed the title from "Check file size differences" to "TST Check file size differences" on Apr 25, 2023
@adrinjalali merged commit 0bcde53 into skops-dev:main on Apr 25, 2023
@BenjaminBossan deleted the check-file-size-differences branch on May 3, 2023