Fix parameterization in `_distribution_new_customers` #430

wd60622 · 2023-11-11T07:06:53Z

Address #328

Wondering why there is discrepancy in the generative process in the top of the test class and what is required here. Also, this test doesn't confirm the validity

TODO:

Add test to confirm @ColtAllen 's claim. Compare the rate to the scale of the observed data
Is it good to add an example and a plot of this to one of the notebooks? visual confirmation of above

📚 Documentation preview 📚: https://pymc-marketing--430.org.readthedocs.build/en/430/

codecov · 2023-11-11T07:31:17Z

Codecov Report

Merging #430 (04d2462) into main (e9a854d) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #430   +/-   ##
=======================================
  Coverage   90.83%   90.83%           
=======================================
  Files          21       21           
  Lines        1920     1920           
=======================================
  Hits         1744     1744           
  Misses        176      176

Files	Coverage Δ
pymc_marketing/clv/models/beta_geo.py	`99.09% <100.00%> (ø)`

wd60622 · 2023-11-11T07:31:27Z

I still think the parameterization was correct before. For instance, the distribution for new population_purchase_rate is in the same scale of naive empirical estimates (obviously better as dropout is possible).
On the otherhand, the 1 / alpha parameterization is way larger.
I had that population_purchase_rate is the rate / time unit so the previous parametrization makes sense to me

Would it make sense to assert

population_purchase_rate estimate <= (frequency / T).max()
some relation between empirical and rate estimates

ColtAllen · 2023-11-11T08:42:45Z

Frequency/recency would make for a better emperical comparison, as recency denotes the time period of the customer's last purchase. However, I've been thinking of this in the reciprocal (recency/frequency) which would be the average number of days between purchases.

The best way to test this in practice (if not for this PR) would be to do a posterior predictive check with these purchase and dropout rate distributions as parametrized:

However, PPCs aren't supported in BetaGeoModel. Doing so would require resolving this issue:

#128

wd60622 · 2023-11-11T12:14:05Z

Sure, that is better approximation, but I would argue the previous parameterization makes more sense with even that value.

For instance, the scale of the previous parameterization goes from 0 to 4 purchases / unit time while alive whereas the suggested parameterization has that goes from 0 to 30+ / unit time. Both the empirical frequency / recency and frequency / T distributions are in that range.

If we had the model where dropout doesn't happen and keep poisson likelihood (even without the gamma population prior), one of those would be our MLEs (the gamma would regularize) so the range of the previous parameterization makes most sense to me.

Pardon any typos in the graphs 🥲

This is all from the test generated data which has those same parameter assumptions, but those ranges just don't add up for me

ColtAllen · 2023-11-11T16:23:07Z

Thinking more on this, I've realized you're right - this model is a poisson process for frequency over time, and flipping the rate parameter won't yield time per frequency.

The parameter can remain as previously defined; no need for changes. I do think time between purchases is more interpretable to the user though, so I'll make a note to take the reciprocal of this function's output for plotting in a future PR.

One last nit-pick - the tests in beta geo use arbitrary parameters rather than generating them from a test dataset, and r=4 is unrealistic. In every dataset I've worked with that parameter hovers around 0.5.

wd60622 · 2023-11-11T17:06:01Z

All good. I was wrapping my head around it too. Between reusing greek letters and gamma parameterizations, its hard to keep it straight.
Thanks for taking a look!

wd60622 · 2023-11-11T17:07:05Z

One last nit-pick - the tests in beta geo use arbitrary parameters rather than generating them from a test dataset, and r=4 is unrealistic. In every dataset I've worked with that parameter hovers around 0.5.

Might be good for a new issue?

wd60622 added 2 commits November 11, 2023 08:05

change to recommended

c542aa7

switch in test too

04d2462

wd60622 marked this pull request as draft November 11, 2023 07:17

wd60622 changed the title ~~Fix parameterization~~ Fix parameterization in _distribution_new_customers Nov 11, 2023

ColtAllen closed this Nov 11, 2023

wd60622 deleted the fix_parameterization branch November 11, 2023 16:51

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix parameterization in `_distribution_new_customers` #430

Fix parameterization in `_distribution_new_customers` #430

wd60622 commented Nov 11, 2023 •

edited

Loading

codecov bot commented Nov 11, 2023 •

edited

Loading

wd60622 commented Nov 11, 2023 •

edited

Loading

ColtAllen commented Nov 11, 2023

wd60622 commented Nov 11, 2023 •

edited

Loading

ColtAllen commented Nov 11, 2023

wd60622 commented Nov 11, 2023

wd60622 commented Nov 11, 2023

Fix parameterization in _distribution_new_customers #430

Fix parameterization in _distribution_new_customers #430

Conversation

wd60622 commented Nov 11, 2023 • edited Loading

codecov bot commented Nov 11, 2023 • edited Loading

Codecov Report

wd60622 commented Nov 11, 2023 • edited Loading

ColtAllen commented Nov 11, 2023

wd60622 commented Nov 11, 2023 • edited Loading

ColtAllen commented Nov 11, 2023

wd60622 commented Nov 11, 2023

wd60622 commented Nov 11, 2023

Fix parameterization in `_distribution_new_customers` #430

Fix parameterization in `_distribution_new_customers` #430

wd60622 commented Nov 11, 2023 •

edited

Loading

codecov bot commented Nov 11, 2023 •

edited

Loading

wd60622 commented Nov 11, 2023 •

edited

Loading

wd60622 commented Nov 11, 2023 •

edited

Loading