New Modeling Backend, BetaGeoModel, and PPD Metric #24
Conversation
Merge new pymc model backend
@ColtAllen Amazing job! I will take a look at it next week (been busy these days) 🚀 !
Thanks, and that's great! I want to get this PR merged and the alpha release published to PyPI before Fri, 17-Jun, because I'm going on vacation afterward and won't be able to resume work on it until I'm back.
```python
def posterior_predictive_deviation(
    obs_freq: npt.ArrayLike,
    gen_freq: npt.ArrayLike,
    obs_rec: npt.ArrayLike,
    gen_rec: npt.ArrayLike,
) -> float:
    """
    In lieu of a traditional posterior predictive check, calculate the
    standardized Wasserstein distance of frequency, weighted by recency.
    """
```
In view?
Can you provide more detail? I'll need to merge this PR today, but I can still answer any questions you have and/or create an issue to implement any proposed changes afterward.
Hey! I had a quick look, as I was on a business trip this week, and I know you want feedback in order to iterate (and go on vacation 😎 !). Here are some comments:
Code
- The `pymc` package is not in the requirements, so I cannot run the tests. For example, when trying `pytest -v tests/test_models.py` I get:

```
ImportError while importing test module '/Users/juanitorduz/Documents/lifetimes/tests/test_models.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
../../opt/anaconda3/envs/lifetimes_dev/lib/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_models.py:12: in <module>
    import arviz as az
```

- Same for other modules like `psutil`.
- I think we need to make sure the tests and model run when installing locally via `pip install -e .` inside a virtual environment. I believe the requirements have to be specified inside the `setup.py` file from the `requirements.txt`, as in https://github.com/pymc-devs/pymc/blob/main/setup.py#L52-L53
- We need to be able to run the tests in CI/CD. I can support with that if we use GitHub Actions.
- In view of the CI/CD, we should also have code style checkers and linters, as the code in this PR does not have a consistent convention. I would suggest black and flake8.
- There are some `.coverage` files which are being pushed into this branch and are not ignored correctly via the `.gitignore`.
- We should add the testing artifacts into a `test/fixtures` folder to be able to run the tests in the CI/CD pipeline (so I guess we would need to remove them from the `.gitignore`).
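The `setup.py` suggestion above could be sketched roughly like this, following the PyMC pattern linked in the comment; note that the file layout and package metadata here are assumptions for illustration, not this repo's actual configuration:

```python
# setup.py (sketch) -- read install requirements from requirements.txt,
# following the linked PyMC pattern. Paths and metadata are assumptions.
from pathlib import Path


def read_requirements(path: str = "requirements.txt") -> list[str]:
    """Parse a requirements file, skipping blank lines and comments."""
    lines = Path(path).read_text().splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.strip().startswith("#")]


# setuptools.setup() would then receive the parsed list, e.g.:
# setup(
#     name="lifetimes",
#     packages=find_packages(),
#     install_requires=read_requirements(),
# )
```

This keeps `requirements.txt` as the single source of truth, so `pip install -e .` inside a virtual environment pulls in `pymc`, `arviz`, and friends automatically.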
Model
- It would be nice to allow the user to modify the priors, for example as in Bambi: https://bambinos.github.io/bambi/main/notebooks/radon_example.html (Ups! I just read this would be part of the second iteration 🙈 !)
- I wonder if we would like to expose the complete `InferenceData` object to the user (instead of just returning the `.to_dict()`), as xarray is a very rich library.
- It would be great to have an example notebook which would serve as documentation on how to run this new generation of models.
I did not have time to go into the details of the math expressions in the methods ... but I will try to come back to them in the upcoming days. I hope this quick feedback is helpful for the development. Keep up the great work! 🚀 !
Thanks @juanitorduz! I've replied to all of your review comments and will be creating work issues for the requested changes, because I need to get this PR merged today before I go on vacation.
Hey! Congrats on the alpha release! I'll try to give a more detailed review and test it 🚀 ! Enjoy your time off!
Thanks @juanitorduz! I'm back from vacation and have reorganized the work tasks into a tentative release roadmap.
@ColtAllen Instead of a JSON, would an InferenceData object (i.e., xarray) not make more sense to persist files? You would then get ArviZ functions for free as well.
Hey @zwelitunyiswa, as of the Beta release the full `InferenceData` object is used.
Overview
The current autograd backend for `lifetimes` is no longer being actively developed, and fits models via maximum-likelihood estimation (MLE). MLE estimates the mode of the posterior distribution (i.e., the peak of the bell curve) for each model parameter. It's fast, but considering only a single parameter value is akin to missing the forest for the trees and can be limiting in a number of ways.

In #6, @juanitorduz provided a link to a notebook he wrote containing variants of `BetaGeoFitter` written in PyMC, the premier Python library for Bayesian statistical modeling. Bayesian methods estimate the full posterior distribution of each parameter and confer many advantages for model tuning and interpretation, so I've adapted Juan's code into a new backend for `lifetimes` and a Beta-Geometric/NBD model class called `BetaGeoModel`. Best of all, PyMC has a very large and engaged developer community, making it unlikely the dependencies for this library will ever be deprecated.

New Feature Description
UML diagram of the new modeling backend:
The new `BaseModel` object contains methods and attributes shared between all models and is also an abstract class, providing a template for future models as well as enforcing API standardization.

In Bayesian modeling, the user specifies prior distributions for model parameters based on their own subjective intuition. Here's the stochastic dependency graph of what I came up with for `BetaGeoModel`:

Parameter posteriors are estimated by the No-U-Turn Sampler (NUTS), a variant of the Hamiltonian Monte Carlo (HMC) sampling algorithm. The following link contains links to research papers and interactive demos of these and similar algorithms in action:
The Markov-chain Monte Carlo Interactive Gallery
For larger datasets (perhaps >30k rows), prior specification becomes trivial because the data will overwhelm the subjectivity of the priors. However, the smaller the dataset, the more important prior selection becomes. These default model priors performed well in testing, but in a future PR I want to give users more options for tunability.
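As a toy illustration of what tunable priors could mean for the four BG/NBD parameters (`a`, `b`, `alpha`, `r`, all of which must be positive), here are hypothetical weakly-informative prior draws in numpy; the actual defaults in this PR are defined in PyMC and may use different distributions and values:

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws = 1_000

# Hypothetical weakly-informative priors for the BG/NBD parameters.
# Gamma priors are one natural choice for positive-only parameters;
# the PR's actual PyMC defaults may differ.
prior_draws = {
    "a": rng.gamma(shape=1.0, scale=1.0, size=n_draws),      # dropout Beta shape
    "b": rng.gamma(shape=1.0, scale=1.0, size=n_draws),      # dropout Beta shape
    "alpha": rng.gamma(shape=2.0, scale=1.0, size=n_draws),  # purchase-rate scale
    "r": rng.gamma(shape=2.0, scale=0.5, size=n_draws),      # purchase-rate shape
}
```

Letting users swap out these distributions (or just their hyperparameters) is the kind of tunability a future PR could expose.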
After a model has been fitted, it will contain an `idata` attribute of type `InferenceData`, which can be plotted and analyzed with the ArviZ library. Here's an example slightly modified from a code excerpt in Juan's notebook:

The existing user API for `lifetimes` remains fully intact; in fact, the predictive methods for `BetaGeoModel` were copy/pasted right over from `BetaGeoFitter`! The coolest thing about a Bayesian approach is that repeated sample draws from the parameter posteriors will build an entire predictive probability distribution for any given customer, enabling prediction intervals and further analysis of customer behavior. Here's another slightly-modified example from Juan's notebook:

Collaborators: The plotting function defined in that code block would be a great addition to the `plotting.py` module in a future PR.

Please note these graphs cannot be recreated yet with the current version of `lt.BetaGeoModel().conditional_probability_alive()` because it only supports point estimates at this time. However, posterior sampling for predictions is a top priority for the next version release of this library.

Instead of saving/loading the full model object via pickle files - which I consider a security risk because anything can be pickled, including malware - model persistence now comes in the form of a stripped-down JSON containing arrays for the posterior parameter distributions and some metadata. During testing, the typical file size of this JSON was about 1.5 MB.
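A minimal sketch of that persistence approach follows; the function and key names here are hypothetical illustrations, not the PR's actual API:

```python
import json

import numpy as np


def save_model_json(path: str, posteriors: dict, metadata: dict) -> None:
    """Persist posterior sample arrays plus metadata as plain JSON,
    avoiding pickle entirely (names and format are hypothetical)."""
    payload = {
        "metadata": metadata,
        "posteriors": {name: np.asarray(arr).tolist() for name, arr in posteriors.items()},
    }
    with open(path, "w") as f:
        json.dump(payload, f)


def load_model_json(path: str):
    """Load the JSON back into numpy arrays and a metadata dict."""
    with open(path) as f:
        payload = json.load(f)
    posteriors = {name: np.array(arr) for name, arr in payload["posteriors"].items()}
    return posteriors, payload["metadata"]
```

Because JSON can only represent plain lists, numbers, and strings, a malicious file can at worst produce bad arrays, never arbitrary code execution, which is the security property motivating the move away from pickle.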
Comparisons to `BetaGeoFitter`
A downside to this new backend is that full posterior estimation requires more time to fit a model than MLE. The tests I ran on the CDNOW dataset took about a minute to complete on 2.4k rows of data, whereas MLE converged in seconds. However, PyMC has experimental support for JAX just-in-time (JIT) compilation and even GPUs, which could speed things up considerably and would be a great future enhancement for `lifetimes`.

The mean values of the parameter posteriors, as well as the predictive outputs they generated, matched those of `BetaGeoFitter` to within 1-3 decimal places when fit to the same CDNOW dataset during testing. For well-behaved benchmarking datasets like CDNOW, MLE will outperform Bayesian estimates. However, real-world data is noisy and constantly influenced by unknown external factors, posing overfitting risks for point estimates of model parameters. Full posterior distributions provide a range of possible parameter values that vary every time the model runs, and repeated runs are loosely analogous to an ML model stack, affording greater resilience and explainability for the irreducible uncertainties of reality.

Posterior Predictive Deviation Metric
Bayesian posterior predictive checks (PPCs) are usually a subjective visual exercise, comparing graphs of stochastic model outputs against observed data. However, an article on Medium inspired me to create a potential metric for PPCs based on the Wasserstein distance, which I've dubbed `posterior_predictive_deviation()`:

Statistical Tests Won't Help You to Compare Distributions

It can be interpreted as the number of standard deviations required to transform the distribution of model outputs into that of the observed data. The Wasserstein distance implementation in scipy only calculates distances between 1D arrays, but does allow optional weight vectors to be specified. During testing I weighted the Wasserstein distance of customer frequency by recency because I wanted to consider both variables in this metric, but this is an uncharted new path for model interpretation and demands further analysis.
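The description above can be sketched with `scipy.stats.wasserstein_distance` and its weight arguments; this is a minimal illustration of the idea, and the exact standardization in the PR's implementation may differ:

```python
import numpy as np
from scipy.stats import wasserstein_distance


def posterior_predictive_deviation(obs_freq, gen_freq, obs_rec, gen_rec) -> float:
    """Wasserstein distance between observed and posterior-generated
    frequency distributions, weighted by recency and standardized by the
    observed frequencies' standard deviation (sketch; the PR's exact
    standardization may differ)."""
    distance = wasserstein_distance(
        obs_freq, gen_freq, u_weights=obs_rec, v_weights=gen_rec
    )
    # Dividing by the std expresses the distance as "number of standard
    # deviations needed to transform the generated distribution into the
    # observed one".
    return distance / np.std(obs_freq)
```

When the generated and observed samples (and their recency weights) are identical, the metric is zero; it grows as the model's posterior predictive frequencies drift away from the data.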
After this PR is merged I intend to proceed with publishing an alpha release to PyPI. This new modeling backend is still incomplete and lacking adequate documentation, but it's a start. Functionality elsewhere in `lifetimes` remains unchanged.