
There are a few different types of outliers:

  1. Point Outliers - values that are far from the rest of the data.
  2. Contextual Outliers - depend on the context provided by other points. For instance, a value that is not too far from the data overall, but far from the points nearby in time.
  3. Collective Outliers - a group of data points that collectively looks like an outlier.

Outliers can be detected using:

  1. Automated methods, e.g. a box-and-whisker plot
  2. Modeling error, e.g. fit an exponential smoothing model; points with very large errors might be outliers
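As a quick illustration of the box-and-whisker idea, a minimal numpy sketch using the conventional 1.5 × IQR cutoff (my choice of cutoff, not something fixed by these notes):

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag point outliers using the box-and-whisker (k * IQR) rule."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (x < lower) | (x > upper)

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])  # 25.0 is a point outlier
print(iqr_outliers(data))  # [False False False False False  True]
```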

There are 2 main reasons for limiting the number of factors in a model:

  • overfitting: especially when the number of factors is large relative to the number of data points, the model might fit too closely to random effects
  • simplicity: less data is required, there is less chance of including insignificant factors, and the model is easier to interpret

Forward Selection

Forward selection is a type of stepwise regression that begins with an empty model and adds variables one at a time. Factors with high p-values (p > 0.05) are then removed, and the final set of factors is used to fit the model.
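A rough sketch of one forward-selection variant in Python with statsmodels (my own illustration, not the notes' code): at each step add the candidate with the smallest p-value, stopping when no candidate is below the 0.05 threshold. The column names and toy data are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select(X, y, p_enter=0.05):
    """Greedily add the predictor with the lowest p-value until none pass the threshold."""
    remaining, chosen = list(X.columns), []
    while remaining:
        pvals = {}
        for col in remaining:
            model = sm.OLS(y, sm.add_constant(X[chosen + [col]])).fit()
            pvals[col] = model.pvalues[col]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= p_enter:
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen

# toy data: x1 drives y, x2 is pure noise
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
y = 3 * X["x1"] + rng.normal(size=100)
print(forward_select(X, y))  # typically ['x1']
```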

Backward Selection

Backward selection involves starting with all candidate variables and removing insignificant factors one at a time. This continues until there are no more bad factors (e.g. p > 0.15) to remove.

Stepwise Regression

There are many types of stepwise regression; one of them combines the forward and backward elimination methods. Stepwise regression is a greedy algorithm: at each step it takes the option that looks best, without considering future options.

Lasso Regression

We add a constraint to the standard regression equation that limits the total size of the coefficients, so the available budget is spent on the most important coefficients. In order to do that, we need to scale the data.

$$\min_{a_0,\dots,a_m} \sum_{i=1}^{n} \Big( y_i - \big(a_0 + \sum_{j=1}^{m} a_j x_{ij}\big) \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{m} |a_j| \le t$$

We can choose the threshold t based on the number of variables retained and the quality of the model; trying different values of t with the lasso shows the best trade-off between the two.
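A minimal scikit-learn sketch (not from the notes; scikit-learn parameterizes the lasso by a penalty weight alpha rather than the constraint threshold t, and `LassoCV` picks it by cross-validation). The toy data set is hypothetical:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# lasso needs scaled predictors; LassoCV chooses the penalty strength by cross-validation
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

lasso = LassoCV(cv=5).fit(X_scaled, y)
print("chosen alpha:", lasso.alpha_)
# coefficients of unimportant factors are shrunk toward (often exactly to) zero
print("non-zero coefficients:", (lasso.coef_ != 0).sum())
```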

Elastic Net

Elastic net applies constraints on both the absolute values of the coefficients and their squares. We need to scale the data beforehand and choose appropriate values of t and lambda.

$$\min_{a_0,\dots,a_m} \sum_{i=1}^{n} \Big( y_i - \big(a_0 + \sum_{j=1}^{m} a_j x_{ij}\big) \Big)^2 \quad \text{subject to} \quad \lambda \sum_{j=1}^{m} |a_j| + (1-\lambda) \sum_{j=1}^{m} a_j^2 \le t$$
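Similarly, a brief scikit-learn sketch of the elastic net (again my illustration, with `l1_ratio` playing the role of lambda, mixing the absolute-value and squared penalties):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# l1_ratio controls the mix between the absolute-value (lasso) and squared (ridge) penalties
enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X_scaled, y)
print("chosen l1_ratio and alpha:", enet.l1_ratio_, enet.alpha_)
```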

Change detection is used to find out when changes happen, so we can determine whether an action is needed, assess the impact of a past action, or adjust the current plan. One method we can use to detect changes is the cumulative sum (CUSUM).

xt = observed value at time t

mu = mean of x if no change has occurred

At each step we add the new evidence of change to the running total; if the total would drop below 0, we reset it to 0, because evidence in the other direction is irrelevant for detecting a change.

A constant C is included to pull the running total down a bit: the bigger C is, the harder it is for St to get large, and the less sensitive the method becomes.

Detecting an increase:

$$S_t = \max\{0,\ S_{t-1} + (x_t - \mu - C)\}$$

Detecting a decrease:

$$S_t = \max\{0,\ S_{t-1} + (\mu - x_t - C)\}$$

A change is detected when St reaches the threshold T, i.e. when St ≥ T.
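A minimal sketch of the increase-detection recursion in Python (the sample series and the values of mu, C, and T are made up for illustration):

```python
def cusum_increase(x, mu, C, T):
    """Detect an increase: S_t = max(0, S_{t-1} + (x_t - mu - C)); flag when S_t >= T."""
    S, alarms = 0.0, []
    for xt in x:
        S = max(0.0, S + (xt - mu - C))
        alarms.append(S >= T)
    return alarms

x = [10, 9, 11, 10, 14, 15, 16, 17]        # the level shifts upward near the end
print(cusum_increase(x, mu=10, C=1, T=5))  # alarms once the running sum crosses T
```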

Some machine learning models assume the data is normally distributed, so results will be biased when this assumption is not met. Unequal variance in the data is called heteroscedasticity. Higher variance in certain data points makes their estimation errors larger and pushes the model to fit those points more closely.

The Box-Cox transformation is one method for dealing with heteroscedasticity. It is a power transformation (with the logarithm as a special case) that shrinks the larger values to reduce their variability and stretches out the smaller values to increase theirs.

The idea is to find the best value of lambda, so that the transformed response t(y) becomes closer to a normal distribution. We need to find the power transformation, lambda, that maximizes the likelihood when the specified set of explanatory variables is fitted with the following transformation of y as the response:

$$t(y) = \frac{y^{\lambda} - 1}{\lambda}$$

In this formula, lambda can be positive or negative, but not zero.

For lambda = 0, the Box-Cox transformation is defined as t(y) = log(y).

For lambda = -1, the formula becomes:

$$t(y) = \frac{y^{-1} - 1}{-1} = 1 - \frac{1}{y}$$

We can use a Q-Q plot to check whether any transformation is needed.
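A short sketch using SciPy, which both applies the Box-Cox transformation and chooses lambda by maximum likelihood (the log-normal toy data is hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.lognormal(mean=0.0, sigma=0.6, size=500)  # right-skewed data with unequal spread

y_transformed, lam = stats.boxcox(y)              # lambda chosen by maximum likelihood
print("estimated lambda:", lam)                   # close to 0 here, i.e. near a log transform

# a Q-Q plot of the transformed data can confirm it is closer to normal, e.g.
# stats.probplot(y_transformed, dist="norm", plot=plt) with matplotlib's plt
```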

Trend: an increase or decrease in the data over time. A trend in a time series can cause problems for a factor-based analysis. De-trending can be applied to either the response or the predictors.

One method of de-trending is a one-dimensional regression (a linear fit against time), keeping the residuals:

$$\hat{y}_t = a_0 + a_1 t, \qquad \text{de-trended value} = y_t - \hat{y}_t$$
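A minimal numpy sketch of this linear de-trending (fit a line against time, keep the residuals); the toy series is made up:

```python
import numpy as np

def detrend_linear(y):
    """Fit y = a0 + a1*t by least squares and return the residuals (the de-trended series)."""
    t = np.arange(len(y))
    a1, a0 = np.polyfit(t, y, deg=1)  # polyfit returns [slope, intercept] for deg=1
    return y - (a0 + a1 * t)

y = 0.5 * np.arange(50) + np.random.default_rng(2).normal(size=50)
print(detrend_linear(y).mean())  # roughly zero once the linear trend is removed
```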

PCA is a dimensionality reduction technique for situations where we have too many predictors and/or high correlation between some of the predictors. PCA transforms the data by changing coordinates to remove correlation and ranking the new coordinates by importance (in order of the amount of variance they explain).

There are 2 benefits of concentrating on the first n principal components:

  • it reduces the effect of randomness
  • the earlier principal components are likely to have a higher signal-to-noise ratio (driven by actual effects rather than random effects)

Principal components are orthogonal to each other. The steps are outlined below:

  • Scale the data (for each factor, subtract the mean and divide by the standard deviation).

  • Find all the eigenvectors of $X^T X$.

  • Let V be the matrix of eigenvectors, sorted by eigenvalue: $V = [V_1, V_2, \ldots]$, where $V_j$ corresponds to the $j$th-largest eigenvalue.

  • Apply the PCA linear transformation: the 1st principal component is $X V_1$, the 2nd is $X V_2$, and so on.

  • The $k$th new factor value for the $i$th data point is $t_{ik} = \sum_{j} x_{ij} V_{jk}$.

Doing so, PCA eliminates correlation between factors. If we want fewer variables, we only include the first n principal components. We can also handle non-linear functions by using kernels (similar to SVM modeling).

To decode the coefficients back to the original (unscaled) factors, substitute the principal components into the fitted regression:

$$\hat{y} = b_0 + \sum_{k=1}^{L} b_k t_k$$

$$\hat{y} = b_0 + \sum_{k=1}^{L} b_k \sum_{j=1}^{m} V_{jk} x_j$$

$$\hat{y} = b_0 + \sum_{j=1}^{m} \Big( \sum_{k=1}^{L} b_k V_{jk} \Big) x_j$$

so the implied coefficient on the scaled factor $x_j$ is $a_j = \sum_{k=1}^{L} b_k V_{jk}$; undoing the scaling of $x_j$ then gives the coefficients in the original units.
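A hedged scikit-learn sketch of the whole pipeline: scale, project onto the first L components, regress on them, and map the coefficients back to the original (unscaled) factors. scikit-learn computes the components internally rather than via an explicit eigen-decomposition of $X^T X$, but the components agree up to sign. The toy data is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# 1. scale, 2. project onto the first L principal components, 3. regress on them
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
L = 4
pca = PCA(n_components=L).fit(X_scaled)
T = pca.transform(X_scaled)               # column k is (approximately) X_scaled @ V_k
reg = LinearRegression().fit(T, y)

# map the component coefficients b back to coefficients on the scaled originals: a_j = sum_k b_k V_jk
a_scaled = pca.components_.T @ reg.coef_
# then undo the scaling to get coefficients (and intercept) in the original units
a_original = a_scaled / scaler.scale_
intercept = reg.intercept_ - np.sum(a_scaled * scaler.mean_ / scaler.scale_)
print(a_original[:3], intercept)
```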

Question: let A be an n x n matrix. A scalar lambda is called an eigenvalue of A if there is a non-zero vector x such that $Ax = \lambda x$.

Such a vector x is called an eigenvector of A corresponding to lambda.

Example: to show that a given vector x is an eigenvector of A corresponding to lambda = 4, verify that $Ax = 4x$.


This means if lambda is an eigenvalue of A, and x is an eigenvector belonging to lambda, any non-zero multiple of x will be an eigenvector.

For any non-zero scalar c,

$$A(cx) = c(Ax) = c(\lambda x) = \lambda (cx),$$

so cx is also an eigenvector of A corresponding to lambda.
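A quick numerical check of the definition with numpy; the matrix here is just an illustrative stand-in, since the specific A and x from the original exercise are not reproduced:

```python
import numpy as np

# numerically verify A @ x == lambda * x for a small example matrix
A = np.array([[2.0, 2.0],
              [2.0, 2.0]])                      # eigenvalues of this matrix are 4 and 0
eigenvalues, eigenvectors = np.linalg.eig(A)
lam, x = eigenvalues[0], eigenvectors[:, 0]
print(np.allclose(A @ x, lam * x))              # True
print(np.allclose(A @ (3 * x), lam * (3 * x)))  # any non-zero multiple is also an eigenvector
```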

Regression can be used to determine how systems work (descriptive analytics) and to predict what will happen in the future (predictive analytics).

Note, however, that correlation does not imply causation: just because there is a correlation and our regression model makes good predictions, it does not mean the predictors cause the response.

We can also adjust the data so the fit is linear:

  • Quadratic regression
  • Response transformation, e.g. log(y)
  • Box-Cox transformation
  • Variable interaction: add an interaction term, e.g. x1·x2

We can determine whether a fitted line is good by looking at the sum of squared errors (SSE). The coefficients change as we move the line, and the best-fit regression line is the one that minimizes the SSE. Because the SSE is a convex quadratic function, we can minimize it by taking the partial derivative with respect to each coefficient, setting each derivative to zero, and solving the resulting equations simultaneously.

Prediction error for data point i:

$$e_i = y_i - \hat{y}_i = y_i - \Big( a_0 + \sum_{j=1}^{m} a_j x_{ij} \Big)$$

Sum of squared errors:

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \Big( y_i - a_0 - \sum_{j=1}^{m} a_j x_{ij} \Big)^2$$
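Setting the partial derivatives to zero yields the normal equations; numpy's least-squares routine solves them directly. A small sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 1.5 * x + rng.normal(size=100)

# minimizing SSE is equivalent to solving the normal equations (X^T X) a = X^T y
X = np.column_stack([np.ones_like(x), x])
a, sse, *_ = np.linalg.lstsq(X, y, rcond=None)
print("a0, a1:", a)     # close to (2.0, 1.5)
print("SSE:", sse[0])   # the minimized sum of squared errors
```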


How can we measure model quality?
Likelihood

The most basic measure is the likelihood. We can compute the probability (density) of the observed data for any parameter set, and the set of parameters that gives the highest probability density is the maximum likelihood estimate (the best-fit set of parameters).

Assume the errors are:

  • normally distributed with mean 0 and variance sigma squared
  • independent
  • identically distributed

Then the set of parameters that minimizes the SSE is the maximum likelihood fit (MLE).

$$SSE = \sum_{i=1}^{n} \Big( y_i - \big(a_0 + \sum_{j=1}^{m} a_j x_{ij}\big) \Big)^2$$

where $x_{ij}$ = observed predictor values and $a_0, \ldots, a_m$ = the parameters to fit.

We can use the likelihood to compare two different models by computing the likelihood ratio and conducting a hypothesis test.


Akaike Information Criterion (AIC)

Akaike's information criterion (AIC) is a penalized log-likelihood. Since adding extra parameters can lead to overfitting to random effects, AIC penalizes the number of parameters: the smallest AIC is preferred, which encourages fewer parameters k and a higher likelihood.

- L*: maximum likelihood value
- k: number of parameters being estimated

$$AIC = 2k - 2\ln(L^*)$$

With i.i.d. normal errors, substituting the maximum likelihood gives (up to an additive constant):

$$AIC = 2k + n \ln\!\left(\frac{SSE}{n}\right)$$
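As an illustration (assuming statsmodels, which reports both AIC and the BIC discussed below on a fitted OLS model; this is not part of the original notes), comparing a model with only the relevant factor against one with an extra irrelevant factor:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)            # irrelevant factor
y = 1.0 + 2.0 * x1 + rng.normal(size=200)

small = sm.OLS(y, sm.add_constant(np.column_stack([x1]))).fit()
big = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# the smaller value of each criterion is preferred
print("AIC:", small.aic, big.aic)
print("BIC:", small.bic, big.bic)
```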


Corrected AIC

AIC has nice properties if there are infinitely many data points to fit the model. There is a correction term we can add to AIC to deal with finite data sets:

$$AIC_c = AIC + \frac{2k(k+1)}{n - k - 1}$$

We can calculate the relative probability that one of the models is better than the other.

Example:

  • Model 1: AIC = 80
  • Model 2: AIC = 85

The relative likelihood that Model 2 is as good as Model 1 is

$$e^{(AIC_1 - AIC_2)/2} = e^{(80 - 85)/2} = e^{-2.5} \approx 0.082$$

so it is much more likely (roughly 12 times as likely) that the first model is better.


Bayesian Information Criterion (BIC)

$$BIC = k \ln(n) - 2 \ln(L^*)$$

L* = maximum likelihood value

k = number of parameters being estimated

n = number of data points

Similar to AIC:

  • the penalty term in BIC is larger than in AIC, so BIC encourages models with fewer parameters more strongly than AIC does
  • only use BIC when there are more data points than parameters

When comparing two models on the same data set by their BIC:

  • difference > 10, smaller BIC is very likely better
  • 6 < difference < 10, then smaller BIC is likely better
  • 2 < difference < 6, then smaller BIC is somewhat likely better
  • 0 < difference < 2, then smaller BIC is slightly likely better

The difference between AIC and BIC:

  • AIC: frequentist point of view
  • BIC: Bayesian point of view

Regression Output

  • P-values:

    • if p > 0.05, consider removing the attribute
    • higher thresholds allow more factors to be included, but might include irrelevant factors
    • lower thresholds include fewer factors, but might leave out relevant ones
    • with large amounts of data, p-values can be significant/small even when the factor is not related to the response
  • Confidence Interval (CI): the range where the coefficient probably lies, and how close it is to zero

  • T-Statistic: coefficient / standard error, related to the p-value

  • Coefficient: if the coefficient is so small that it makes little difference to the prediction, the factor is not very meaningful even with a very low p-value

  • R-squared: estimates how much of the variation the model accounts for

    • e.g. R-squared = 80%: the model accounts for 80% of the variability in the data, and the rest is randomness or other factors

We can use trees to divide the data set and specify a different model for each subset of the data.

  1. Classification Problems
    • CART (Classification and Regression Trees)
  2. Decision Making
    • Decision Tree

Trees Branching

  • Common practice is to branch on one factor at a time
  • Use half of the data to build the tree and a regression model:
    • calculate the variance of the response for all data points in the leaf
    • test each candidate split by computing the total variance of the two branches, and choose the split with the lowest total variance
    • make the split only if there are enough data points in each branch
    • afterwards, we can go backwards and prune the tree using the other half of the data
  • A common rule is to stop branching if a leaf would contain fewer than 5% of the data points, or if the improvement from splitting is small
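As a rough illustration (not from the notes), scikit-learn's CART implementation can mimic these rules: `min_samples_leaf` plays the role of the 5% stopping rule, and cost-complexity pruning (`ccp_alpha`) stands in for pruning, which scikit-learn does on the training data rather than the held-out half; the held-out half here just provides a validation score.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)
# grow the tree on half the data, keep the other half for validation
X_grow, X_val, y_grow, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

tree = DecisionTreeRegressor(min_samples_leaf=int(0.05 * len(X_grow)),  # ~5% of data per leaf
                             ccp_alpha=1.0).fit(X_grow, y_grow)
print("validation R^2:", tree.score(X_val, y_val))
```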

Random Forests

A random forest builds many trees, each on a bootstrapped sample of the data (with a random subset of factors considered at each split), and combines their predictions:

  • if it is a regression tree, use the average predicted response across trees
  • if it is a classification tree, use the most common predicted response across trees

Pros: better overall estimates; averaging across many trees tends to neutralize overfitting

Cons: harder to explain/interpret; we can't give a single specific model for the data
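A minimal scikit-learn sketch of a random forest classifier on synthetic data (my own illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# each tree sees a bootstrap sample and a random subset of factors at each split;
# the forest predicts the most common class across its trees
forest = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```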

Logistic Regression

This model is useful when the response is a probability (a number between 0 and 1) or binary (either 0 or 1).

$$p = \frac{1}{1 + e^{-(a_0 + \sum_{j} a_j x_j)}}$$

Rearranging, the log-odds are a linear function of the predictors:

$$\ln\!\left(\frac{p}{1-p}\right) = a_0 + \sum_{j} a_j x_j$$

Pseudo R-squared values reported for logistic regression do not really measure the fraction of variance explained.

A Receiver Operating Characteristic (ROC) curve can be used to choose a threshold probability. It is a quick way to judge the quality of the model, but it does not account for the costs of false negatives and false positives.

  • x-axis: 1 - Specificity = 1 - TN/(TN + FP)
  • y-axis: Sensitivity = TP/(TP + FN)
  • Sensitivity is also known as True Positive Rate, and Specificity = True Negative Rate
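A minimal scikit-learn sketch, on synthetic data, of fitting a logistic regression and computing the ROC quantities described above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p = model.predict_proba(X_te)[:, 1]        # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_te, p)  # x-axis: 1 - specificity, y-axis: sensitivity
print("AUC:", roc_auc_score(y_te, p))
```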

Confusion Matrix

| Classification | Yes (Model) | No (Model) |
|---|---|---|
| Yes (True) | Correct | Incorrect |
| No (True) | Incorrect | Correct |

Meaning:

| Classification | Yes (Model) | No (Model) |
|---|---|---|
| Yes (True) | True Positive | False Negative |
| No (True) | False Positive | True Negative |

Example of an email spam filter using SVM (in this example, the "Yes" class is a legitimate, non-spam message):

| Classification | Yes (Model) | No (Model) |
|---|---|---|
| Yes (True) | 490 (TP) | 10 (FN) |
| No (True) | 100 (FP) | 400 (TN) |

Question:

  • fraction of delivered email we expect to be spam = FP / (TP + FP) = 100 / 590 ≈ 17%
  • fraction of real email lost to the spam folder = FN / (TP + FN) = 10 / 500 = 2%

We can also calculate the cost of lost productivity:

  • assume: $0 for a correct classification, $0.04 to read a spam message, $1 to miss a real email
  • if 50% of email is spam, total cost per 1000 messages = 490 x $0 + 400 x $0 + 10 x $1 + 100 x $0.04 = $14, i.e. about $0.014 per email
  • if 40% of email is spam, total cost per 1000 messages = 490 x (60%/50%) x $0 + 400 x (40%/50%) x $0 + 10 x (60%/50%) x $1 + 100 x (40%/50%) x $0.04 = $15.20, i.e. about $0.0152 per email
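The per-message cost is simple arithmetic; a tiny Python check of the two scenarios above:

```python
# cost per 1000 messages, using the confusion matrix above:
# 10 real messages lost to the spam folder at $1 each,
# 100 spam messages that reach the inbox at $0.04 each
cost_50 = 10 * 1.00 + 100 * 0.04
print(cost_50, cost_50 / 1000)   # 14.0 total, i.e. $0.014 per message

# if only 40% of mail is spam, rescale real-mail counts by 60%/50% and spam counts by 40%/50%
cost_40 = 10 * (0.6 / 0.5) * 1.00 + 100 * (0.4 / 0.5) * 0.04
print(cost_40, cost_40 / 1000)   # 15.2 total, i.e. $0.0152 per message
```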

Regression

  1. Poisson Regression:

    • use when the response follows a Poisson distribution, $P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$

    • example: modeling counts of arrivals at an airport security line

    • estimate lambda as a function of the predictors, lambda(x) (see the sketch after this list)

  2. Regression splines:

    • functions made of polynomials that connect to each other
    • allow fitting different functions to different parts of the data set, with smooth connections between the parts
    • an order-k regression spline uses polynomials that are all of order k
  3. Bayesian regression:

    • start with an estimate of how the regression coefficients and the random error are distributed
    • then use Bayes' theorem to update the estimate with the observed data
    • most helpful when there is not much data
  4. K-Nearest-Neighbor (KNN) regression:

    • instead of trying to guess a function of the attributes that might be a good predictor
    • plot all the data, and predict a response by taking the average response of the k closest data points
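A brief statsmodels sketch of Poisson regression on simulated counts (the log link, the coefficients, and the data are all illustrative assumptions, not from the notes):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 2, size=300)
lam = np.exp(0.5 + 1.2 * x)        # true rate as a function of x
y = rng.poisson(lam)               # counts, e.g. arrivals per time slot

X = sm.add_constant(x)
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(model.params)                # roughly (0.5, 1.2)
print(model.predict(X)[:5])        # estimated lambda(x) for the first rows
```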

Time series data can be affected by:

  • trends over time e.g. stock prices
  • cyclical variations e.g. annual temperature cycles, weekly sales
  • randomness e.g. stock prices, blood pressure

St = expected baseline response at time period t

xt = observed response at time t

  • alpha -> 0: a lot of randomness (yesterday's baseline is a good indicator for today, so we are willing to trust the previous estimate St-1)
  • alpha -> 1: not much randomness (today's baseline is close to the observed data, so we are willing to trust xt)

$$S_t = \alpha x_t + (1 - \alpha) S_{t-1}, \qquad 0 \le \alpha \le 1$$

Include a trend Tt at time period t (starting condition T1 = 0, i.e. no initial trend):

$$S_t = \alpha x_t + (1 - \alpha)(S_{t-1} + T_{t-1})$$

$$T_t = \beta (S_t - S_{t-1}) + (1 - \beta) T_{t-1}$$

Include cyclic patterns:

It can be handled like the trend, as an additive component, OR

as seasonality in a multiplicative way (starting condition C = 1, i.e. no initial cyclic effect):

  • L: length of a cycle
  • Ct: multiplicative seasonality factor for time t, which inflates/deflates the observation (we use the cyclic factor from L time periods ago, because that is the most recent cyclic factor from the same part of the cycle)

$$S_t = \alpha \frac{x_t}{C_{t-L}} + (1 - \alpha)(S_{t-1} + T_{t-1})$$

If C = 1.1 on a weekend, it simply means sales were 10% higher because of the weekend.

$$C_t = \gamma \frac{x_t}{S_t} + (1 - \gamma) C_{t-L}$$

This time series model is known as single / double / triple exponential smoothing, depending on whether trend and seasonality are included.

Triple exponential smoothing is also called Winters' method or Holt-Winters.

Exponential smoothing smooths out randomness (peaks and valleys) and can also be used for forecasting. Because it puts most weight on the most recent data points, it is better for short-term forecasting.

Our forecast for the next time period is the same as the latest baseline estimate:

$$F_{t+1} = S_t$$

When a trend is included, the best estimate of the trend is the most recent trend estimate:

$$F_{t+1} = S_t + T_t$$

When multiplicative seasonality is included:

$$F_{t+1} = (S_t + T_t)\, C_{t+1-L}$$

To optimize alpha, beta, gamma, minimize the squared one-step forecast errors: $\min \sum_t (F_t - x_t)^2$.
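A hedged statsmodels sketch of triple exponential smoothing (Holt-Winters) on a made-up series with a trend and a multiplicative cycle of length 12; `.fit()` chooses alpha, beta, gamma by minimizing the squared one-step errors:

```python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(6)
t = np.arange(120)
season = 1 + 0.3 * np.sin(2 * np.pi * t / 12)                   # multiplicative cycle, L = 12
y = (10 + 0.1 * t) * season + rng.normal(scale=0.5, size=120)   # trend + seasonality + noise

# additive trend + multiplicative seasonality = Holt-Winters / triple exponential smoothing
fit = ExponentialSmoothing(y, trend="add", seasonal="mul", seasonal_periods=12).fit()
print(fit.params)        # includes the fitted smoothing parameters
print(fit.forecast(6))   # forecasts for the next six periods
```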

AutoRegressive Integrated Moving Average (ARIMA) has 3 key parts:

  1. Differences (d)
    • exponential smoothing estimates St from xt and the previous estimate, which works well if the data is stationary
    • often the data is NOT stationary, but its differences might be
  2. Autoregression (p)
    • predicting the current value based on the values from previous time periods
  3. Moving Average (q)
    • using previous prediction errors as predictors

Specific values of p, d, q:

  • ARIMA (0,0,0) = white noise
  • ARIMA (0,1,0) = random walk
  • ARIMA (p,0,0) = autoregressive model
  • ARIMA (0,0,q) = moving average model
  • ARIMA (0,1,1) = basic exponential smoothing model

ARIMA works better than exponential smoothing when the data is more stable, with fewer peaks, valleys and outliers. We also need at least about 40 data points for ARIMA to work well.
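A minimal statsmodels sketch on a toy series; the ARIMA(0,1,1) order is chosen here only because the list above notes it corresponds to basic exponential smoothing:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(7)
y = np.cumsum(rng.normal(size=200)) + 50   # a random-walk-like series

model = ARIMA(y, order=(0, 1, 1)).fit()    # (p, d, q) = (0, 1, 1)
print(model.aic)                           # model quality via AIC, as above
print(model.forecast(5))                   # forecasts for the next five periods
```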

Generalized Autoregressive Conditional Heteroscedasticity (GARCH) is used to estimate or forecast the variance. Variance estimation matters because it tells us how much the forecast might differ from the true value.

This is important in investment, e.g. in traditional portfolio optimization models, which balance the expected return of investments against the amount of volatility.

| GARCH | ARIMA |
|---|---|
| Variances, squared errors | Observations, linear errors |
| Raw variances | Differences in variances |
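A sketch using the third-party `arch` package (an assumption of mine; it is not mentioned in these notes) to fit a GARCH(1,1) model and forecast the variance of a toy return series:

```python
import numpy as np
from arch import arch_model   # assumes the `arch` package is installed

rng = np.random.default_rng(8)
returns = 100 * rng.normal(scale=0.01, size=1000)    # toy daily returns, in percent

# GARCH(1,1): today's variance depends on yesterday's squared error and yesterday's variance
model = arch_model(returns, vol="GARCH", p=1, q=1)
result = model.fit(disp="off")
print(result.params)                                 # mu, omega, alpha[1], beta[1]
print(result.forecast(horizon=5).variance.iloc[-1])  # variance forecasts for the next 5 steps
```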

