Negative Binomial #468

spsanderson · 2024-05-03T15:41:21Z

AIC Function:

#' Calculate Akaike Information Criterion (AIC) for Negative Binomial Distribution
#'
#' This function calculates the Akaike Information Criterion (AIC) for a negative binomial distribution fitted to the provided data.
#'
#' @family Utility
#' @author Steven P. Sanderson II, MPH
#'
#' @description
#' This function estimates the parameters size (r) and probability (prob) of a negative binomial distribution from the provided data and then calculates the AIC value based on the fitted distribution.
#'
#' @param .x A numeric vector containing the data to be fitted to a negative binomial distribution.
#'
#' @details
#' This function fits a negative binomial distribution to the provided data. It estimates the parameters size (r) and probability (prob) of the negative binomial distribution from the data.
#' Then, it calculates the AIC value based on the fitted distribution.
#' 
#' Initial parameter estimates: The function uses the method of moments estimate as a starting point for the size (r) parameter of the negative binomial distribution,
#' and the probability (prob) is estimated based on the mean and variance of the data.
#' 
#' Optimization method: Since the parameters are directly calculated from the data, no optimization is needed.
#' 
#' Goodness-of-fit: While AIC is a useful metric for model comparison, it's recommended to also assess the goodness-of-fit of the chosen
#' model using visualization and other statistical tests.
#'
#' @examples
#' # Example 1: Calculate AIC for a sample dataset
#' data <- c(0, 1, 2, 3, 1)
#' util_negative_binomial_aic(data)
#'
#' @return
#' The AIC value calculated based on the fitted negative binomial distribution to the provided data.
#'
#' @name util_negative_binomial_aic
#'
#' @export
#' @rdname util_negative_binomial_aic
util_negative_binomial_aic <- function(.x) {
  # Tidyeval
  x <- as.numeric(.x)
  
  # Estimate size (r) parameter using method of moments
  m <- mean(x)
  v <- var(x)
  r <- m^2 / (v - m)
  
  # Estimate probability (prob) parameter
  prob <- r / (r + m)
  
  # Calculate AIC
  k_negative_binomial <- 2 # Number of parameters for negative binomial distribution (r and prob)
  logLik_negative_binomial <- sum(dnbinom(x, size = r, prob = prob, log = TRUE))
  AIC_negative_binomial <- 2 * k_negative_binomial - 2 * logLik_negative_binomial
  
  # Return AIC
  return(AIC_negative_binomial)
}

Example:

> x <- rnbinom(100, 10, .1)
> util_negative_binomial_aic(x)
[1] 961.4522
> library(fitdistrplus)
Loading required package: MASS
Loading required package: survival
> fitdist(x, "nbinom")
Fitting of the distribution ' nbinom ' by maximum likelihood 
Parameters:
      estimate Std. Error
size  9.900893   1.546991
mu   89.508703   2.997807
> fitdist(x, "nbinom")$aic
[1] 961.3085

Updated param estimates

Function:

#' Estimate Negative Binomial Parameters
#'
#' @family Parameter Estimation
#' @family Binomial
#'
#' @author Steven P. Sanderson II, MPH
#'
#' @details This function will attempt to estimate the negative binomial size and prob
#' parameters given some vector of values.
#'
#' @description The function will return a list output by default, and  if the parameter
#' `.auto_gen_empirical` is set to `TRUE` then the empirical data given to the
#' parameter `.x` will be run through the `tidy_empirical()` function and combined
#' with the estimated negative binomial data.
#'
#' Three different methods of shape parameters are supplied:
#' -  MLE/MME
#' -  MMUE
#' -  MLE via \code{\link[stats]{optim}} function.
#'
#' @param .x The vector of data to be passed to the function.
#' @param .size The size parameter, the default is 1.
#' @param .auto_gen_empirical This is a boolean value of TRUE/FALSE with default
#' set to TRUE. This will automatically create the `tidy_empirical()` output
#' for the `.x` parameter and use the `tidy_combine_distributions()`. The user
#' can then plot out the data using `$combined_data_tbl` from the function output.
#'
#' @examples
#' library(dplyr)
#' library(ggplot2)
#'
#' x <- as.integer(mtcars$mpg)
#' output <- util_negative_binomial_param_estimate(x, .size = 1)
#'
#' output$parameter_tbl
#'
#' output$combined_data_tbl |>
#'   tidy_combined_autoplot()
#'
#' t <- rnbinom(50, 1, .1)
#' util_negative_binomial_param_estimate(t, .size = 1)$parameter_tbl
#'
#' @return
#' A tibble/list
#'
#' @export
#'

util_negative_binomial_param_estimate <- function(.x, .size = 1,
                                                  .auto_gen_empirical = TRUE) {
  
  # Tidyeval ----
  x_term <- as.numeric(.x)
  sum_x <- sum(x_term, na.rm = TRUE)
  minx <- min(x_term)
  maxx <- max(x_term)
  m <- mean(x_term, na.rm = TRUE)
  n <- length(x_term)
  unique_terms <- length(unique(x_term))
  size <- .size
  size_length <- length(size)
  pass <- (n == size_length) || (size_length == 1)
  
  # Checks ----
  if (!is.vector(x_term, mode = "numeric") || is.factor(x_term) ||
      !is.vector(size, mode = "numeric") || is.factor(size)) {
    rlang::abort(
      message = "'.x' and '.size' must be numeric vectors.",
      use_cli_format = TRUE
    )
  }
  
  if (!pass) {
    rlang::abort(
      message = "The length of '.size' must be 1 or the same as the length of '.x'.",
      use_cli_format = TRUE
    )
  }
  
  if (n > size_length) {
    size <- rep(size, n)
  }
  
  if (n < 1) {
    rlang::abort(
      message = "'.x' and '.size' must contain at least one non-missing pari of values.",
      use_cli_format = TRUE
    )
  }
  
  if (!all(x_term == trunc(x_term)) || any(x_term < 0) || !all(size == trunc(size)) ||
      any(size < 1)) {
    rlang::abort(
      message = "All values of '.x' must be non-negative integers, and all values
      of '.size' must be positive integers.",
      use_cli_format = TRUE
    )
  }
  
  # Get params ----
  # EnvStats
  size <- sum(size)
  
  es_mme_size <- size
  es_mme_prob <- size / (size + sum_x)
  
  es_mvue_size <- size
  es_mvue_prob <- (size - 1) / (size + sum_x - 1)
  
  # MLE Method
  # Negative log-likelihood function for optimization
  nll_func <- function(params) {
    size <- params[1]
    mu <- params[2]
    -sum(dnbinom(x, size = size, mu = mu, log = TRUE))
  }
  
  # Initial parameter guesses (you might need to adjust these based on your data)
  initial_params <- c(size = 1, mu = mean(x))
  
  # Optimize using optim()
  optim_result <- optim(initial_params, nll_func)
  
  # Extract estimated parameters
  mle_size <- optim_result$par[1]
  mle_mu <- optim_result$par[2]
  mle_prob <- mle_size / (mle_size + mle_mu)
  
  # Return Tibble ----
  if (.auto_gen_empirical) {
    te <- tidy_empirical(.x = x_term)
    td <- tidy_negative_binomial(
      .n = n, .size = round(mle_size, 3),
      .prob = round(mle_prob, 3)
    )
    combined_tbl <- tidy_combine_distributions(te, td)
  }
  
  ret <- dplyr::tibble(
    dist_type = rep("Negative Binomial", 3),
    samp_size = rep(n, 3),
    min = rep(minx, 3),
    max = rep(maxx, 3),
    mean = c(rep(m, 2), mle_mu),
    method = c("EnvStats_MME_MLE", "EnvStats_MMUE", "MLE_Optim"),
    size = c(es_mme_size, es_mvue_size, mle_size),
    prob = c(es_mme_prob, es_mvue_prob, mle_prob),
    shape_ratio = c(es_mme_size / es_mme_prob, 
                    es_mvue_size / es_mvue_prob, 
                    mle_size / mle_prob)
  )
  
  # Return ----
  attr(ret, "tibble_type") <- "parameter_estimation"
  attr(ret, "family") <- "negative_binomial"
  attr(ret, "x_term") <- .x
  attr(ret, "n") <- n
  
  if (.auto_gen_empirical) {
    output <- list(
      combined_data_tbl = combined_tbl,
      parameter_tbl     = ret
    )
  } else {
    output <- list(
      parameter_tbl = ret
    )
  }
  
  return(output)
}

Example:

> util_negative_binomial_param_estimatev2(x, .size = 10)
$combined_data_tbl
# A tibble: 200 × 8
   sim_number     x     y     dx         dy     p     q dist_type
   <fct>      <int> <dbl>  <dbl>      <dbl> <dbl> <dbl> <fct>    
 1 1              1   106 -2.39  0.00000573  0.71    29 Empirical
 2 1              2   119 -0.330 0.0000105   0.86    36 Empirical
 3 1              3   109  1.73  0.0000188   0.75    36 Empirical
 4 1              4    75  3.79  0.0000324   0.34    38 Empirical
 5 1              5    70  5.84  0.0000539   0.26    40 Empirical
 6 1              6    73  7.90  0.0000873   0.31    40 Empirical
 7 1              7    63  9.96  0.000137    0.19    43 Empirical
 8 1              8    62 12.0   0.000208    0.16    43 Empirical
 9 1              9   114 14.1   0.000306    0.8     50 Empirical
10 1             10   102 16.1   0.000439    0.68    53 Empirical
# ℹ 190 more rows
# ℹ Use `print(n = ...)` to see more rows

$parameter_tbl
# A tibble: 3 × 9
  dist_type         samp_size   min   max  mean method              size   prob shape_ratio
  <chr>                 <int> <dbl> <dbl> <dbl> <chr>              <dbl>  <dbl>       <dbl>
1 Negative Binomial       100    29   170  89.5 EnvStats_MME_MLE 1000    0.100       9951  
2 Negative Binomial       100    29   170  89.5 EnvStats_MMUE    1000    0.100       9960. 
3 Negative Binomial       100    29   170  89.5 MLE_Optim           9.90 0.0996        99.4

Fixes #468

spsanderson mentioned this issue May 3, 2024

New AIC/Param Estimate/Stats Tbl functions #467

Closed

15 tasks

spsanderson self-assigned this May 3, 2024

spsanderson added the enhancement New feature or request label May 3, 2024

spsanderson added this to the TidyDensity 1.4.1 milestone May 3, 2024

spsanderson closed this as completed in 684524b May 4, 2024

spsanderson added a commit that referenced this issue May 4, 2024

Merge pull request #483 from spsanderson/development

626eac0

Fixes #468

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Negative Binomial #468

Negative Binomial #468

spsanderson commented May 3, 2024 •

edited

Loading

Negative Binomial #468

Negative Binomial #468

Comments

spsanderson commented May 3, 2024 • edited Loading

Updated param estimates

spsanderson commented May 3, 2024 •

edited

Loading