CLV Distribution RVs not Model-Specific #128

Open
ColtAllen opened this issue Jan 10, 2023 · 10 comments

@ColtAllen
Collaborator

I'm considering using the RVs in the CLV Distributions module to generate synthetic data for testing the Pareto/NBD model. However, after looking at the rng_fn for both classes, I'm concerned the RVs may not be robust across all model types, and the distribution classes could have similar pathologies.

As currently defined, the sim_data method in rng_fn draws a binomial RV inside a while loop to decide dropout. This works well for the Modified BG/NBD model, but I do not see a provision for the BG/NBD assumption that all non-repeat customers are alive with probability 1. The Pareto/NBD model does not use a binomial RV at all; instead, it uses an exponential RV to draw the dropout time before the while loop.

The data generation functions in Lifetimes/BTYD are a useful reference:

https://github.com/ColtAllen/btyd/blob/main/btyd/generate_data.py
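A minimal sketch of the three dropout assumptions described above, for one customer at a time. The function names, the `modified` flag, and the `mu` dropout-rate parameter are hypothetical and chosen for illustration; this is not the code in the CLV Distributions module or in btyd.

```python
import numpy as np


def sim_bg_nbd(rng, lam, p, T, modified=False):
    """Return [recency, frequency] for one customer (illustrative only).

    modified=False: BG/NBD, dropout can only happen after a repeat purchase.
    modified=True:  Modified BG/NBD, a dropout coin is also flipped at time 0.
    """
    t, n = 0.0, 0
    # BG/NBD: non-repeat customers are alive with probability 1
    churn = bool(rng.binomial(n=1, p=p)) if modified else False
    wait = rng.exponential(scale=1 / lam)
    while t + wait < T and not churn:
        n += 1
        t += wait
        churn = bool(rng.binomial(n=1, p=p))
        wait = rng.exponential(scale=1 / lam)
    return np.array([t, n])


def sim_pareto_nbd(rng, lam, mu, T):
    """Pareto/NBD-style draw: the dropout time is sampled once, before the loop."""
    t, n = 0.0, 0
    # Purchases can only occur before both the dropout time and the end of the study
    horizon = min(T, rng.exponential(scale=1 / mu))
    wait = rng.exponential(scale=1 / lam)
    while t + wait < horizon:
        n += 1
        t += wait
        wait = rng.exponential(scale=1 / lam)
    return np.array([t, n])


rng = np.random.default_rng(42)
print(sim_bg_nbd(rng, lam=1.5, p=0.3, T=10.0))
print(sim_pareto_nbd(rng, lam=1.5, mu=0.2, T=10.0))
```

The only structural difference between the two BG/NBD variants is the initial value of `churn`, while the Pareto/NBD variant replaces the per-purchase coin flip with a single exponential draw.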

@ricardoV94
Contributor

We should add those variants to generate data. Do you want to assign this issue to yourself?

@ColtAllen ColtAllen self-assigned this Jan 12, 2023
@ColtAllen
Collaborator Author

I consider this a prerequisite for #127, so I'll get started on adding a ParetoNBD distribution sometime next week.

The existing distributions should also be revised to reflect that they were written with the BG/NBD model in mind. I can also look into vectorizing sim_data.
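One possible shape for a vectorized sim_data, as a rough sketch rather than a proposal for the final implementation. It assumes the BG/NBD logic (exponential waits between purchases, a Bernoulli churn draw after each purchase) and advances all customers together with a boolean mask; the name `sim_data_vectorized` and the explicit `rng` argument are hypothetical.

```python
import numpy as np


def sim_data_vectorized(rng, lam, p, T):
    """Simulate recency/frequency for many customers at once (illustrative only).

    lam, p and T may be scalars or arrays that broadcast against each other.
    Returns an array of shape (..., 2) holding [recency, frequency] per customer.
    """
    lam, p, T = np.broadcast_arrays(
        np.atleast_1d(np.asarray(lam, dtype=float)), p, T
    )
    t = np.zeros(lam.shape)
    n = np.zeros(lam.shape, dtype=int)
    active = np.ones(lam.shape, dtype=bool)

    while active.any():
        wait = rng.exponential(scale=1 / lam)
        # A purchase is recorded only for active customers whose wait fits before T
        buys = active & (t + wait < T)
        t = np.where(buys, t + wait, t)
        n += buys
        churn = rng.binomial(n=1, p=p).astype(bool)
        # Customers stay in the loop only if they just purchased and did not churn
        active = buys & ~churn

    return np.stack([t, n], axis=-1)


rng = np.random.default_rng(0)
data = sim_data_vectorized(rng, lam=np.full(5, 2.0), p=0.3, T=10.0)
print(data.shape)
```

Each pass draws for every customer, including those already finished, which wastes some random draws but keeps the loop count equal to the longest purchase history rather than the number of customers.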

@ColtAllen
Collaborator Author

I'll be creating a PR for this soon. @larryshamalama I see you made the original commit for the ContContract distribution class. Is there a research citation you can provide for me to add to the docstring?

@larryshamalama
Contributor

> I'll be creating a PR for this soon. @larryshamalama I see you made the original commit for the ContContract distribution class. Is there a research citation you can provide for me to add to the docstring?

Sorry, this message slipped by my attention. I did not use any research articles, since I did not find any when I was writing out the likelihood. Perhaps there is one out there that I'm unaware of. I can write out the likelihood derivation again if it would help.

@ColtAllen ColtAllen mentioned this issue Feb 24, 2023
6 tasks
@ColtAllen
Collaborator Author

ColtAllen commented Feb 24, 2023

@larryshamalama Let's refactor ContNonContract and ContContract into BetaGeoNBD and BetaGeoNBDAggregate, respectively, so we can close this out:

  • Update docstring of ContContract/BetaGeoNBDAggregate

  • If we change the logp in ContContract/BetaGeoNBDAggregate per "Use numerically stable log-likelihood for BetaGeoModel" (#98), we can close that issue as well.

  • The T0 param should be removed from both distribution classes because it will left-censor customer data. Functionality to select study start times is a good addition to utils.clv_summary() if you want to create an issue for it.

  • The RVs for both classes will have identical sim_data methods, which should be refactored like so:

        def sim_data(lam, p, T):
            # rng is the Generator passed to the enclosing rng_fn
            t = 0  # recency
            n = 0  # frequency

            churn = 0  # BG/NBD assumes all non-repeat customers are active
            wait = rng.exponential(scale=1 / lam)

            while t + wait < T and not churn:
                n += 1
                churn = rng.binomial(n=1, p=p)
                t += wait
                wait = rng.exponential(scale=1 / lam)

            return np.array([t, n])

@larryshamalama
Contributor

Sounds like a good plan, thanks for laying out a bullet-point action plan. I'm away until early March; we can chat once I'm back at work.

@larryshamalama
Contributor

larryshamalama commented Mar 3, 2023

> @larryshamalama Let's refactor ContNonContract and ContContract into BetaGeoNBD and BetaGeoNBDAggregate, respectively, so we can close this out:

Hi @ColtAllen, I am just getting back to work and browsing the progress that has been made. My understanding is that your focus is, for now, #177 and #176. I can start with #98 and we can see from there. How does that sound?

The reason I initially added ContContract and ContNonContract is that I felt those were the primary building blocks for continuous CLVs. In other words, my understanding (at the time) was that models, including BG/NBD, stem from the same or a very similar data-generating process but marginalize over different priors. Admittedly, how useful these distribution classes will be is unclear to me. We can discuss these ideas some time soon. What do you think?

Edit: Re-reading your original comment in opening this issue, I see where you are coming from. I'm still wondering if there's a better way to generalize model building blocks and make them robust for all (or at least most) model types.
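To make the "same data-generating process, different priors" idea concrete, here is a minimal sketch assuming the BG/NBD heterogeneity: gamma-distributed purchase rates and beta-distributed dropout probabilities, with the individual-level process left unchanged. The function names and the hyperparameter values are hypothetical and purely illustrative.

```python
import numpy as np


def sim_customer(rng, lam, p, T):
    """Individual-level BG/NBD process: returns [recency, frequency]."""
    t, n, churn = 0.0, 0, False
    wait = rng.exponential(scale=1 / lam)
    while t + wait < T and not churn:
        n += 1
        t += wait
        churn = bool(rng.binomial(n=1, p=p))
        wait = rng.exponential(scale=1 / lam)
    return np.array([t, n])


def sim_cohort(rng, r, alpha, a, b, T, size):
    """Draw per-customer parameters from the priors, then run the shared process."""
    lam = rng.gamma(shape=r, scale=1 / alpha, size=size)  # purchase rates
    p = rng.beta(a, b, size=size)  # dropout probabilities
    return np.array([sim_customer(rng, l, q, T) for l, q in zip(lam, p)])


rng = np.random.default_rng(1)
cohort = sim_cohort(rng, r=1.0, alpha=2.0, a=1.0, b=3.0, T=20.0, size=100)
print(cohort[:3])
```

Swapping the prior draws in `sim_cohort` while keeping `sim_customer` fixed is one way to test whether a shared building block is general enough; models such as Pareto/NBD would still need a different individual-level process.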

@ColtAllen
Collaborator Author

> Hi @ColtAllen, I am just getting back to work and browsing the progress that has been made. My understanding is that your focus is, for now, #177 and #176. I can start with #98 and we can see from there. How does that sound?

Sounds great 👍

> Edit: Re-reading your original comment in opening this issue, I see where you are coming from. I'm still wondering if there's a better way to generalize model building blocks and make them robust for all (or at least most) model types.

My main interest in model-specific distribution blocks is for use within the model like I'm doing in #177, unlocking additional functionality. That said, it could be interesting to test how well the ParetoNBD model converges on data generated from a BG/NBD process, and vice-versa. If there isn't interest in adding an individual-level BG/NBD model, we don't have a means of generating raw transaction data yet, so that could be a better way to repurpose that particular distribution block.

@ColtAllen ColtAllen removed their assignment Mar 23, 2023
@larryshamalama
Contributor

Shall we modify the building blocks to be specific to CLV models? E.g. BGNBDRV akin to ParetoNBD. This would entail reworking ContContract and ContNonContract that we currently have.

IIRC, we opted against this because all we needed was the logp method which could be provided via pm.Potential. Adding these as distribution classes would have rng_fns available for use. What do people think?

@ColtAllen
Collaborator Author

> Shall we modify the building blocks to be specific to CLV models? E.g. BGNBDRV akin to ParetoNBD. This would entail reworking ContContract and ContNonContract that we currently have.

> IIRC, we opted against this because all we needed was the logp method which could be provided via pm.Potential. Adding these as distribution classes would have rng_fns available for use. What do people think?

@larryshamalama let's rework ContNonContract into a distribution block for raw transaction data, because the other two blocks generate data in recency/frequency summary format. You can work off of the corresponding lifetimes function here:

https://github.com/ColtAllen/btyd/blob/main/btyd/generate_data.py#L75

The reason I suggest this is that, if you recall our last weekly project meeting, @twiecki wants all lifetimes functionality in this notebook added to pymc-marketing:

https://github.com/ColtAllen/marketing-case-study/blob/main/case-study.ipynb

In time, the notebook itself should be added to the docs. The first thing we need is a raw transaction block to generate the synthetic data.

We should create issues for the other lifetimes utility and plotting functions in that notebook as well.
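For the raw transaction block discussed above, a rough sketch of what per-customer output could look like, assuming a BG/NBD process and an initial purchase at time 0. The function names and the row layout are hypothetical; this is not the btyd function linked above, and it only illustrates how raw transactions relate to the recency/frequency summary format.

```python
import numpy as np


def sim_transactions(rng, lam, p, T, customer_id=0):
    """Return (customer_id, purchase_time) rows for one customer on [0, T)."""
    rows = [(customer_id, 0.0)]  # assumed first purchase that opens the window
    t = 0.0
    alive = True
    while alive:
        t += rng.exponential(scale=1 / lam)
        if t >= T:
            break  # next purchase would fall outside the observation window
        rows.append((customer_id, t))
        # BG/NBD: a dropout coin is flipped after each repeat purchase
        alive = not rng.binomial(n=1, p=p)
    return rows


def summarize(rows, T):
    """Collapse raw rows into (recency, frequency, T) summary format."""
    times = np.array([time for _, time in rows])
    frequency = len(times) - 1  # repeat purchases only
    recency = float(times.max())  # time of the last purchase
    return recency, frequency, T


rng = np.random.default_rng(7)
rows = sim_transactions(rng, lam=1.0, p=0.2, T=12.0, customer_id=101)
print(rows)
print(summarize(rows, T=12.0))
```

Generating raw rows first and summarizing afterwards keeps a single source of truth for both data formats, at the cost of holding every transaction in memory.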

3 participants