Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parameterization in _distribution_new_customers #430

Closed
wants to merge 2 commits into from

Conversation

wd60622
Copy link
Contributor

@wd60622 wd60622 commented Nov 11, 2023

Address #328

Wondering why there is discrepancy in the generative process in the top of the test class and what is required here. Also, this test doesn't confirm the validity

TODO:

  • Add test to confirm @ColtAllen 's claim. Compare the rate to the scale of the observed data
  • Is it good to add an example and a plot of this to one of the notebooks? visual confirmation of above

📚 Documentation preview 📚: https://pymc-marketing--430.org.readthedocs.build/en/430/

@wd60622 wd60622 marked this pull request as draft November 11, 2023 07:17
@wd60622 wd60622 changed the title Fix parameterization Fix parameterization in _distribution_new_customers Nov 11, 2023
Copy link

codecov bot commented Nov 11, 2023

Codecov Report

Merging #430 (04d2462) into main (e9a854d) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #430   +/-   ##
=======================================
  Coverage   90.83%   90.83%           
=======================================
  Files          21       21           
  Lines        1920     1920           
=======================================
  Hits         1744     1744           
  Misses        176      176           
Files Coverage Δ
pymc_marketing/clv/models/beta_geo.py 99.09% <100.00%> (ø)

@wd60622
Copy link
Contributor Author

wd60622 commented Nov 11, 2023

I still think the parameterization was correct before. For instance, the distribution for new population_purchase_rate is in the same scale of naive empirical estimates (obviously better as dropout is possible).
On the otherhand, the 1 / alpha parameterization is way larger.
I had that population_purchase_rate is the rate / time unit so the previous parametrization makes sense to me

Would it make sense to assert

  1. population_purchase_rate estimate <= (frequency / T).max()
  2. some relation between empirical and rate estimates
Screenshot 2023-11-11 at 08 25 40 Screenshot 2023-11-11 at 08 26 45

@ColtAllen
Copy link
Collaborator

Frequency/recency would make for a better emperical comparison, as recency denotes the time period of the customer's last purchase. However, I've been thinking of this in the reciprocal (recency/frequency) which would be the average number of days between purchases.

The best way to test this in practice (if not for this PR) would be to do a posterior predictive check with these purchase and dropout rate distributions as parametrized:

However, PPCs aren't supported in BetaGeoModel. Doing so would require resolving this issue:

#128

@wd60622
Copy link
Contributor Author

wd60622 commented Nov 11, 2023

Sure, that is better approximation, but I would argue the previous parameterization makes more sense with even that value.

For instance, the scale of the previous parameterization goes from 0 to 4 purchases / unit time while alive whereas the suggested parameterization has that goes from 0 to 30+ / unit time. Both the empirical frequency / recency and frequency / T distributions are in that range.

If we had the model where dropout doesn't happen and keep poisson likelihood (even without the gamma population prior), one of those would be our MLEs (the gamma would regularize) so the range of the previous parameterization makes most sense to me.

Screenshot 2023-11-11 at 13 04 33

Pardon any typos in the graphs 🥲

This is all from the test generated data which has those same parameter assumptions, but those ranges just don't add up for me

@ColtAllen
Copy link
Collaborator

Thinking more on this, I've realized you're right - this model is a poisson process for frequency over time, and flipping the rate parameter won't yield time per frequency.

The parameter can remain as previously defined; no need for changes. I do think time between purchases is more interpretable to the user though, so I'll make a note to take the reciprocal of this function's output for plotting in a future PR.

One last nit-pick - the tests in beta geo use arbitrary parameters rather than generating them from a test dataset, and r=4 is unrealistic. In every dataset I've worked with that parameter hovers around 0.5.

@ColtAllen ColtAllen closed this Nov 11, 2023
@wd60622 wd60622 deleted the fix_parameterization branch November 11, 2023 16:51
@wd60622
Copy link
Contributor Author

wd60622 commented Nov 11, 2023

All good. I was wrapping my head around it too. Between reusing greek letters and gamma parameterizations, its hard to keep it straight.
Thanks for taking a look!

@wd60622
Copy link
Contributor Author

wd60622 commented Nov 11, 2023

One last nit-pick - the tests in beta geo use arbitrary parameters rather than generating them from a test dataset, and r=4 is unrealistic. In every dataset I've worked with that parameter hovers around 0.5.

Might be good for a new issue?

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants