CLV Distribution RVs not Model-Specific #128

ColtAllen · 2023-01-10T12:28:15Z

I'm considering using the RVs in the CLV Distributions module to generate synthetic data for testing the Pareto/NBD model. However, after looking at the rng_fn for both classes, I'm concerned the RVs may not be robust across all model types, and the distribution classes could have similar pathologies.

As currently defined, the sim_data method in rng_fn uses a binomial RV within a while loop for the dropout probability. This works well for the Modified BG/NBD model, but I do not see a provision for the BG/NBD assumption that all non-repeat customers are alive with probability 1. The Pareto/NBD model does not use a binomial RV at all; instead, it draws the dropout time from an exponential RV before the while loop.

The data generation functions in Lifetimes/BTYD are a useful reference:

https://github.com/ColtAllen/btyd/blob/main/btyd/generate_data.py
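To make the Pareto/NBD difference concrete, here is a minimal sketch of that variant, written for illustration only. The function name `sim_pareto_nbd`, the dropout-rate parameter `mu`, and passing `rng` explicitly are assumptions, not the existing rng_fn API. The dropout time is drawn once from an exponential RV, and purchases are then simulated only up to the earlier of that time and T.

```python
import numpy as np


def sim_pareto_nbd(lam, mu, T, rng):
    """Simulate (recency, frequency) for one customer (hypothetical helper).

    lam: purchase rate, mu: dropout rate, T: length of the observation period.
    """
    # Dropout time is drawn once, before the purchase loop.
    dropout_time = rng.exponential(scale=1 / mu)
    horizon = min(dropout_time, T)

    t = 0.0  # recency: time of the last observed repeat purchase
    n = 0  # frequency: number of repeat purchases

    wait = rng.exponential(scale=1 / lam)
    while t + wait < horizon:
        t += wait
        n += 1
        wait = rng.exponential(scale=1 / lam)

    return np.array([t, n])


rng = np.random.default_rng(0)
pareto_sample = sim_pareto_nbd(lam=0.5, mu=0.1, T=30.0, rng=rng)
```

No binomial draw appears anywhere: the customer simply stops purchasing once the dropout time has passed.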

ricardoV94 · 2023-01-12T11:10:35Z

We should add those variants to generate data. Do you want to assign this issue to yourself?

ColtAllen · 2023-01-12T14:23:22Z

I consider this a prerequisite for #127, so I'll get started on adding a ParetoNBD distribution sometime next week.

The existing distributions should also be revised to reflect that they were written with the BG/NBD model in mind. I can also look into vectorization for sim_data.

ColtAllen · 2023-01-19T23:51:35Z

I'll be creating a PR for this soon. @larryshamalama I see you made the original commit for the ContContract distribution class. Is there a research citation you can provide for me to add to the docstring?

larryshamalama · 2023-01-27T21:07:17Z

> I'll be creating a PR for this soon. @larryshamalama I see you made the original commit for the ContContract distribution class. Is there a research citation you can provide for me to add to the docstring?

Sorry, this message slipped by my attention. I did not use any research articles since I did not find any when I was writing out the likelihood. Perhaps there is one out there that I'm unaware of. I can write out the likelihood derivation again if it would help.

ColtAllen · 2023-02-24T19:07:26Z

@larryshamalama Let's refactor ContNonContract and ContContract into BetaGeoNBD and BetaGeoNBDAggregate, respectively, so we can close this out:

- Update docstring of ContContract/BetaGeoNBDAggregate.
- If we change the logp in ContContract/BetaGeoNBDAggregate per "Use numerically stable log-likelihood for BetaGeoModel" (#98), we can close that issue as well.
- The T0 param should be removed from both distribution classes because it will left-censor customer data. Functionality to select study start times is a good addition to utils.clv_summary() if you want to create an issue for it.
- The RVs for both classes will have identical sim_data methods, which should be refactored like so:

        # Nested inside rng_fn, so rng and np come from the enclosing scope.
        def sim_data(lam, p, T):
            t = 0  # recency
            n = 0  # frequency

            churn = 0  # BG/NBD assumes all non-repeat customers are active
            wait = rng.exponential(scale=1 / lam)

            # Each purchase inside the observation window is followed by a churn draw.
            while t + wait < T and not churn:
                n += 1
                churn = rng.binomial(n=1, p=p)
                t += wait
                wait = rng.exponential(scale=1 / lam)

            return np.array([t, n])
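
For anyone who wants to run the proposal outside of rng_fn, here is a self-contained sketch of the same loop. The names `sim_bg_nbd` and `sample_population`, the explicit `rng` argument, and the parameter values are illustrative assumptions. The heterogeneity draws follow the usual BG/NBD setup: purchase rates come from a gamma distribution with shape r and rate alpha, and dropout probabilities from a beta distribution with parameters a and b.

```python
import numpy as np


def sim_bg_nbd(lam, p, T, rng):
    """Standalone copy of the sim_data loop above (hypothetical name)."""
    t = 0.0  # recency
    n = 0  # frequency
    churn = False  # customers start out active
    wait = rng.exponential(scale=1 / lam)

    while t + wait < T and not churn:
        n += 1
        t += wait
        churn = bool(rng.binomial(n=1, p=p))
        wait = rng.exponential(scale=1 / lam)

    return np.array([t, n])


def sample_population(size, r, alpha, a, b, T, seed=None):
    """Draw per-customer (lam, p), then simulate each customer in turn."""
    rng = np.random.default_rng(seed)
    lam = rng.gamma(shape=r, scale=1 / alpha, size=size)  # rate alpha -> scale 1/alpha
    p = rng.beta(a, b, size=size)
    return np.array([sim_bg_nbd(l, q, T, rng) for l, q in zip(lam, p)])


# Arbitrary illustrative parameter values, not fitted estimates.
data = sample_population(size=5, r=1.0, alpha=4.0, a=0.8, b=2.5, T=40.0, seed=42)
```

Each row of `data` is one customer's (recency, frequency) pair. The Python-level loop over customers is the part a vectorized sim_data would replace.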

larryshamalama · 2023-02-26T10:56:51Z

Sounds like a good plan, thanks for laying out a bullet-point action plan. I'm away until early March; we can chat once I'm back at work.

larryshamalama · 2023-03-03T10:23:44Z

> @larryshamalama Let's refactor ContNonContract and ContContract into BetaGeoNBD and BetaGeoNBDAggregate, respectively, so we can close this out:

Hi @ColtAllen, I am just getting back to work and browsing the progress that has been made. My understanding is that your focus is, for now, #177 and #176. I can start with #98 and we can see from there. How does that sound?

I initially added ContContract and ContNonContract because I felt they were the primary building blocks for continuous CLVs. In other words, my understanding (at the time) was that models, including BG/NBD, stem from the same or a very similar data-generating process but marginalize over different priors. Admittedly, how useful these distribution classes will be is unclear to me. We can discuss these ideas some time soon. What do you think?

Edit: Re-reading your original comment in opening this issue, I see where you are coming from. I'm still wondering if there's a better way of generalizing model building blocks and making them robust for all (or at least most) model types.

ColtAllen · 2023-03-03T23:42:51Z

> Hi @ColtAllen, I am just getting back to work and browsing the progress that has been made. My understanding is that your focus is, for now, #177 and #176. I can start with #98 and we can see from there. How does that sound?

Sounds great 👍

> Edit: Re-reading your original comment in opening this issue, I see where you are coming from. I'm still wondering if there's a better way of generalizing model building blocks and making them robust for all (or at least most) model types.

My main interest in model-specific distribution blocks is for use within the model like I'm doing in #177, unlocking additional functionality. That said, it could be interesting to test how well the ParetoNBD model converges on data generated from a BG/NBD process, and vice-versa. If there isn't interest in adding an individual-level BG/NBD model, we don't have a means of generating raw transaction data yet, so that could be a better way to repurpose that particular distribution block.

larryshamalama · 2023-04-03T04:21:30Z

Shall we modify the building blocks to be specific to CLV models? E.g. BGNBDRV akin to ParetoNBD. This would entail reworking ContContract and ContNonContract that we currently have.

IIRC, we opted against this because all we needed was the logp method which could be provided via pm.Potential. Adding these as distribution classes would have rng_fns available for use. What do people think?

ColtAllen · 2023-04-22T01:43:35Z

> Shall we modify the building blocks to be specific to CLV models? E.g. BGNBDRV akin to ParetoNBD. This would entail reworking ContContract and ContNonContract that we currently have.
>
> IIRC, we opted against this because all we needed was the logp method which could be provided via pm.Potential. Adding these as distribution classes would have rng_fns available for use. What do people think?

@larryshamalama let's rework ContNonContract into a distribution block for raw transaction data, because the other two blocks generate data in recency/frequency summary format. You can work off of the corresponding lifetimes function here:

https://github.com/ColtAllen/btyd/blob/main/btyd/generate_data.py#L75

I suggest this because, if you recall from our last weekly project meeting, @twiecki wants all lifetimes functionality in this notebook added to pymc-marketing:

https://github.com/ColtAllen/marketing-case-study/blob/main/case-study.ipynb

In time, the notebook itself should be added to the docs as well. The first thing we need is a raw transaction block to generate the synthetic data.

We should create issues for the other lifetimes utility and plotting functions in that notebook as well.
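
To illustrate the difference between the two data formats, here is a rough sketch, not the lifetimes implementation linked above. The names `sim_transactions` and `summarize` are hypothetical. The first function records every repeat-purchase time under the BG/NBD-style process discussed earlier in this thread, and the second collapses those raw times into the recency/frequency summary that the other two blocks generate directly.

```python
import numpy as np


def sim_transactions(lam, p, T, rng):
    """Return the repeat-purchase times in (0, T) for one customer (hypothetical helper)."""
    times = []
    t = 0.0
    churn = False
    wait = rng.exponential(scale=1 / lam)

    while t + wait < T and not churn:
        t += wait
        times.append(t)  # keep the raw timestamp instead of only counting it
        churn = rng.random() < p  # Bernoulli(p) dropout after each purchase
        wait = rng.exponential(scale=1 / lam)

    return np.array(times)


def summarize(times):
    """Collapse raw purchase times into (recency, frequency)."""
    frequency = len(times)
    recency = float(times[-1]) if frequency else 0.0
    return recency, frequency


rng = np.random.default_rng(1)
times = sim_transactions(lam=2.0, p=0.2, T=10.0, rng=rng)
recency, frequency = summarize(times)
```

The summary can always be derived from the raw transactions, but not the other way around, which is why a raw transaction block is the more general starting point for synthetic data.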

ColtAllen added the CLV label Jan 10, 2023

ColtAllen mentioned this issue Jan 11, 2023

Datasets Module #104

Closed

ColtAllen self-assigned this Jan 12, 2023

ColtAllen mentioned this issue Jan 19, 2023

Consider using "penalized" priors by default in CLV models #99

Closed

ColtAllen mentioned this issue Jan 24, 2023

Add Pareto/NBD Distribution Class #131

Merged

ColtAllen mentioned this issue Feb 24, 2023

Add ParetoNBDModel #177

Closed

6 tasks

larryshamalama mentioned this issue Mar 9, 2023

Add BetaGeoBetaBinomModel #188

Closed

ColtAllen removed their assignment Mar 23, 2023

larryshamalama mentioned this issue Apr 4, 2023

Vectorize rng_fn in CLV distribution classes #230

Merged

larryshamalama mentioned this issue Apr 12, 2023

Add notebooks to fill the basic CLV grid #25

Open

4 tasks

ColtAllen mentioned this issue Jul 16, 2023

Adapt Evaluation Plots from lifetimes #326

Open

10 tasks

ColtAllen mentioned this issue Nov 11, 2023

Fix parameterization in _distribution_new_customers #430

Closed

ColtAllen mentioned this issue May 21, 2024

CLV API Standardization #527

Open

6 tasks
