Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Negative Binomial #468

Closed
Tracked by #467
spsanderson opened this issue May 3, 2024 · 0 comments
Closed
Tracked by #467

Negative Binomial #468

spsanderson opened this issue May 3, 2024 · 0 comments
Assignees
Labels
enhancement New feature or request

Comments

@spsanderson
Copy link
Owner

spsanderson commented May 3, 2024

AIC Function:

#' Calculate Akaike Information Criterion (AIC) for Negative Binomial Distribution
#'
#' This function calculates the Akaike Information Criterion (AIC) for a negative binomial distribution fitted to the provided data.
#'
#' @family Utility
#' @author Steven P. Sanderson II, MPH
#'
#' @description
#' This function estimates the parameters size (r) and probability (prob) of a negative binomial distribution from the provided data and then calculates the AIC value based on the fitted distribution.
#'
#' @param .x A numeric vector containing the data to be fitted to a negative binomial distribution.
#'
#' @details
#' This function fits a negative binomial distribution to the provided data. It estimates the parameters size (r) and probability (prob) of the negative binomial distribution from the data.
#' Then, it calculates the AIC value based on the fitted distribution.
#' 
#' Initial parameter estimates: The function uses the method of moments estimate as a starting point for the size (r) parameter of the negative binomial distribution,
#' and the probability (prob) is estimated based on the mean and variance of the data.
#' 
#' Optimization method: Since the parameters are directly calculated from the data, no optimization is needed.
#' 
#' Goodness-of-fit: While AIC is a useful metric for model comparison, it's recommended to also assess the goodness-of-fit of the chosen
#' model using visualization and other statistical tests.
#'
#' @examples
#' # Example 1: Calculate AIC for a sample dataset
#' data <- c(0, 1, 2, 3, 1)
#' util_negative_binomial_aic(data)
#'
#' @return
#' The AIC value calculated based on the fitted negative binomial distribution to the provided data.
#'
#' @name util_negative_binomial_aic
#'
#' @export
#' @rdname util_negative_binomial_aic
util_negative_binomial_aic <- function(.x) {
  # Tidyeval
  x <- as.numeric(.x)
  
  # Estimate size (r) parameter using method of moments
  m <- mean(x)
  v <- var(x)
  r <- m^2 / (v - m)
  
  # Estimate probability (prob) parameter
  prob <- r / (r + m)
  
  # Calculate AIC
  k_negative_binomial <- 2 # Number of parameters for negative binomial distribution (r and prob)
  logLik_negative_binomial <- sum(dnbinom(x, size = r, prob = prob, log = TRUE))
  AIC_negative_binomial <- 2 * k_negative_binomial - 2 * logLik_negative_binomial
  
  # Return AIC
  return(AIC_negative_binomial)
}

Example:

> x <- rnbinom(100, 10, .1)
> util_negative_binomial_aic(x)
[1] 961.4522
> library(fitdistrplus)
Loading required package: MASS
Loading required package: survival
> fitdist(x, "nbinom")
Fitting of the distribution ' nbinom ' by maximum likelihood 
Parameters:
      estimate Std. Error
size  9.900893   1.546991
mu   89.508703   2.997807
> fitdist(x, "nbinom")$aic
[1] 961.3085

Updated param estimates

Function:

#' Estimate Negative Binomial Parameters
#'
#' @family Parameter Estimation
#' @family Binomial
#'
#' @author Steven P. Sanderson II, MPH
#'
#' @details This function will attempt to estimate the negative binomial size and prob
#' parameters given some vector of values.
#'
#' @description The function will return a list output by default, and  if the parameter
#' `.auto_gen_empirical` is set to `TRUE` then the empirical data given to the
#' parameter `.x` will be run through the `tidy_empirical()` function and combined
#' with the estimated negative binomial data.
#'
#' Three different methods of shape parameters are supplied:
#' -  MLE/MME
#' -  MMUE
#' -  MLE via \code{\link[stats]{optim}} function.
#'
#' @param .x The vector of data to be passed to the function.
#' @param .size The size parameter, the default is 1.
#' @param .auto_gen_empirical This is a boolean value of TRUE/FALSE with default
#' set to TRUE. This will automatically create the `tidy_empirical()` output
#' for the `.x` parameter and use the `tidy_combine_distributions()`. The user
#' can then plot out the data using `$combined_data_tbl` from the function output.
#'
#' @examples
#' library(dplyr)
#' library(ggplot2)
#'
#' x <- as.integer(mtcars$mpg)
#' output <- util_negative_binomial_param_estimate(x, .size = 1)
#'
#' output$parameter_tbl
#'
#' output$combined_data_tbl |>
#'   tidy_combined_autoplot()
#'
#' t <- rnbinom(50, 1, .1)
#' util_negative_binomial_param_estimate(t, .size = 1)$parameter_tbl
#'
#' @return
#' A tibble/list
#'
#' @export
#'

util_negative_binomial_param_estimate <- function(.x, .size = 1,
                                                  .auto_gen_empirical = TRUE) {
  
  # Tidyeval ----
  x_term <- as.numeric(.x)
  sum_x <- sum(x_term, na.rm = TRUE)
  minx <- min(x_term)
  maxx <- max(x_term)
  m <- mean(x_term, na.rm = TRUE)
  n <- length(x_term)
  unique_terms <- length(unique(x_term))
  size <- .size
  size_length <- length(size)
  pass <- (n == size_length) || (size_length == 1)
  
  # Checks ----
  if (!is.vector(x_term, mode = "numeric") || is.factor(x_term) ||
      !is.vector(size, mode = "numeric") || is.factor(size)) {
    rlang::abort(
      message = "'.x' and '.size' must be numeric vectors.",
      use_cli_format = TRUE
    )
  }
  
  if (!pass) {
    rlang::abort(
      message = "The length of '.size' must be 1 or the same as the length of '.x'.",
      use_cli_format = TRUE
    )
  }
  
  if (n > size_length) {
    size <- rep(size, n)
  }
  
  if (n < 1) {
    rlang::abort(
      message = "'.x' and '.size' must contain at least one non-missing pari of values.",
      use_cli_format = TRUE
    )
  }
  
  if (!all(x_term == trunc(x_term)) || any(x_term < 0) || !all(size == trunc(size)) ||
      any(size < 1)) {
    rlang::abort(
      message = "All values of '.x' must be non-negative integers, and all values
      of '.size' must be positive integers.",
      use_cli_format = TRUE
    )
  }
  
  # Get params ----
  # EnvStats
  size <- sum(size)
  
  es_mme_size <- size
  es_mme_prob <- size / (size + sum_x)
  
  es_mvue_size <- size
  es_mvue_prob <- (size - 1) / (size + sum_x - 1)
  
  # MLE Method
  # Negative log-likelihood function for optimization
  nll_func <- function(params) {
    size <- params[1]
    mu <- params[2]
    -sum(dnbinom(x, size = size, mu = mu, log = TRUE))
  }
  
  # Initial parameter guesses (you might need to adjust these based on your data)
  initial_params <- c(size = 1, mu = mean(x))
  
  # Optimize using optim()
  optim_result <- optim(initial_params, nll_func)
  
  # Extract estimated parameters
  mle_size <- optim_result$par[1]
  mle_mu <- optim_result$par[2]
  mle_prob <- mle_size / (mle_size + mle_mu)
  
  # Return Tibble ----
  if (.auto_gen_empirical) {
    te <- tidy_empirical(.x = x_term)
    td <- tidy_negative_binomial(
      .n = n, .size = round(mle_size, 3),
      .prob = round(mle_prob, 3)
    )
    combined_tbl <- tidy_combine_distributions(te, td)
  }
  
  ret <- dplyr::tibble(
    dist_type = rep("Negative Binomial", 3),
    samp_size = rep(n, 3),
    min = rep(minx, 3),
    max = rep(maxx, 3),
    mean = c(rep(m, 2), mle_mu),
    method = c("EnvStats_MME_MLE", "EnvStats_MMUE", "MLE_Optim"),
    size = c(es_mme_size, es_mvue_size, mle_size),
    prob = c(es_mme_prob, es_mvue_prob, mle_prob),
    shape_ratio = c(es_mme_size / es_mme_prob, 
                    es_mvue_size / es_mvue_prob, 
                    mle_size / mle_prob)
  )
  
  # Return ----
  attr(ret, "tibble_type") <- "parameter_estimation"
  attr(ret, "family") <- "negative_binomial"
  attr(ret, "x_term") <- .x
  attr(ret, "n") <- n
  
  if (.auto_gen_empirical) {
    output <- list(
      combined_data_tbl = combined_tbl,
      parameter_tbl     = ret
    )
  } else {
    output <- list(
      parameter_tbl = ret
    )
  }
  
  return(output)
}

Example:

> util_negative_binomial_param_estimatev2(x, .size = 10)
$combined_data_tbl
# A tibble: 200 × 8
   sim_number     x     y     dx         dy     p     q dist_type
   <fct>      <int> <dbl>  <dbl>      <dbl> <dbl> <dbl> <fct>    
 1 1              1   106 -2.39  0.00000573  0.71    29 Empirical
 2 1              2   119 -0.330 0.0000105   0.86    36 Empirical
 3 1              3   109  1.73  0.0000188   0.75    36 Empirical
 4 1              4    75  3.79  0.0000324   0.34    38 Empirical
 5 1              5    70  5.84  0.0000539   0.26    40 Empirical
 6 1              6    73  7.90  0.0000873   0.31    40 Empirical
 7 1              7    63  9.96  0.000137    0.19    43 Empirical
 8 1              8    62 12.0   0.000208    0.16    43 Empirical
 9 1              9   114 14.1   0.000306    0.8     50 Empirical
10 1             10   102 16.1   0.000439    0.68    53 Empirical
# ℹ 190 more rows
# ℹ Use `print(n = ...)` to see more rows

$parameter_tbl
# A tibble: 3 × 9
  dist_type         samp_size   min   max  mean method              size   prob shape_ratio
  <chr>                 <int> <dbl> <dbl> <dbl> <chr>              <dbl>  <dbl>       <dbl>
1 Negative Binomial       100    29   170  89.5 EnvStats_MME_MLE 1000    0.100       9951  
2 Negative Binomial       100    29   170  89.5 EnvStats_MMUE    1000    0.100       9960. 
3 Negative Binomial       100    29   170  89.5 MLE_Optim           9.90 0.0996        99.4
@spsanderson spsanderson self-assigned this May 3, 2024
@spsanderson spsanderson added the enhancement New feature or request label May 3, 2024
@spsanderson spsanderson added this to the TidyDensity 1.4.1 milestone May 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Development

No branches or pull requests

1 participant