
There are a few different types of outliers:

  1. Point Outliers - values that are far from the rest of the data.
  2. Contextual Outliers - depend on the context provided by other points. For instance, a value that is not too far from the data overall, but far from the points nearby in time.
  3. Collective Outliers - a group of data points that collectively looks like an outlier.

Outliers can be detected using:

  1. Automated methods, e.g. a box-and-whisker plot
  2. Modeling error, e.g. fit an exponential smoothing model; points with very large errors might be outliers
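As a quick illustration of the box-and-whisker idea, a minimal numpy sketch using the conventional 1.5 × IQR cutoff (my choice of cutoff, not something fixed by these notes):

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag point outliers using the box-and-whisker (k * IQR) rule."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (x < lower) | (x > upper)

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])  # 25.0 is a point outlier
print(iqr_outliers(data))  # [False False False False False  True]
```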

There are 2 main reasons for limiting the number of factors in a model:

  • overfitting: especially when the number of factors is large relative to the number of data points, the model might fit too closely to random effects
  • simplicity: less data is required, there is less chance of including insignificant factors, and the model is easier to interpret

Forward Selection

Forward selection is a type of stepwise regression that begins with an empty model and adds variables one at a time. Factors with high p-values (p > 0.05) are then removed, and the final set of factors is used to fit the model.
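A rough sketch of one forward-selection variant in Python with statsmodels (my own illustration, not the notes' code): at each step add the candidate with the smallest p-value, stopping when no candidate is below the 0.05 threshold. The column names and toy data are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select(X, y, p_enter=0.05):
    """Greedily add the predictor with the lowest p-value until none pass the threshold."""
    remaining, chosen = list(X.columns), []
    while remaining:
        pvals = {}
        for col in remaining:
            model = sm.OLS(y, sm.add_constant(X[chosen + [col]])).fit()
            pvals[col] = model.pvalues[col]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= p_enter:
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen

# toy data: x1 drives y, x2 is pure noise
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
y = 3 * X["x1"] + rng.normal(size=100)
print(forward_select(X, y))  # typically ['x1']
```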

Backward Selection

Backward selection involves starting with all candidate variables and removing insignificant factors one at a time. This continues until there are no more bad factors (e.g. p > 0.15) to remove.

Stepwise Regression

There are many types of stepwise regression; one of them combines the forward and backward elimination methods. Stepwise regression is a greedy algorithm: at each step it takes the option that looks best, without considering future options.

Lasso Regression

We add a constraint to the standard regression equation that limits the total size of the coefficients, so the available budget is spent on the most important coefficients. In order to do that, we need to scale the data.

$$\min_{a_0,\dots,a_m} \sum_{i=1}^{n} \Big( y_i - \big(a_0 + \sum_{j=1}^{m} a_j x_{ij}\big) \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{m} |a_j| \le t$$

We can choose the threshold t based on the number of variables retained and the quality of the model; trying different values of t with the lasso shows the best trade-off between the two.
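A minimal scikit-learn sketch (not from the notes; scikit-learn parameterizes the lasso by a penalty weight alpha rather than the constraint threshold t, and `LassoCV` picks it by cross-validation). The toy data set is hypothetical:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# lasso needs scaled predictors; LassoCV chooses the penalty strength by cross-validation
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

lasso = LassoCV(cv=5).fit(X_scaled, y)
print("chosen alpha:", lasso.alpha_)
# coefficients of unimportant factors are shrunk toward (often exactly to) zero
print("non-zero coefficients:", (lasso.coef_ != 0).sum())
```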

Elastic Net

Elastic net applies constraints on both the absolute values of the coefficients and their squares. We need to scale the data beforehand and choose appropriate values of t and lambda.

$$\min_{a_0,\dots,a_m} \sum_{i=1}^{n} \Big( y_i - \big(a_0 + \sum_{j=1}^{m} a_j x_{ij}\big) \Big)^2 \quad \text{subject to} \quad \lambda \sum_{j=1}^{m} |a_j| + (1-\lambda) \sum_{j=1}^{m} a_j^2 \le t$$
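Similarly, a brief scikit-learn sketch of the elastic net (again my illustration, with `l1_ratio` playing the role of lambda, mixing the absolute-value and squared penalties):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# l1_ratio controls the mix between the absolute-value (lasso) and squared (ridge) penalties
enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X_scaled, y)
print("chosen l1_ratio and alpha:", enet.l1_ratio_, enet.alpha_)
```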

Change detection is used to find out when changes happen, so we can determine whether an action is needed, assess the impact of a past action, or adjust the current plan. One method we can use to detect changes is the cumulative sum (CUSUM).

xt = observed value at time t

mu = mean of x if no change has occurred

At each step we add the new evidence of change to the running total; if the total would drop below 0, we reset it to 0, because evidence in the other direction is irrelevant for detecting a change.

A constant C is included to pull the running total down a bit: the bigger C is, the harder it is for St to get large, and the less sensitive the method becomes.

Detecting an increase:

$$S_t = \max\{0,\ S_{t-1} + (x_t - \mu - C)\}$$

Detecting a decrease:

$$S_t = \max\{0,\ S_{t-1} + (\mu - x_t - C)\}$$

A change is detected when St reaches the threshold T, i.e. when St ≥ T.
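A minimal sketch of the increase-detection recursion in Python (the sample series and the values of mu, C, and T are made up for illustration):

```python
def cusum_increase(x, mu, C, T):
    """Detect an increase: S_t = max(0, S_{t-1} + (x_t - mu - C)); flag when S_t >= T."""
    S, alarms = 0.0, []
    for xt in x:
        S = max(0.0, S + (xt - mu - C))
        alarms.append(S >= T)
    return alarms

x = [10, 9, 11, 10, 14, 15, 16, 17]        # the level shifts upward near the end
print(cusum_increase(x, mu=10, C=1, T=5))  # alarms once the running sum crosses T
```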

Some machine learning models assume the data is normally distributed, so results will be biased when this assumption is not met. Unequal variance in the data is called heteroscedasticity. Higher variance in certain data points makes their estimation errors larger and pushes the model to fit those points more closely.

The Box-Cox transformation is one method for dealing with heteroscedasticity. It is a power transformation (with the logarithm as a special case) that shrinks the larger values to reduce their variability and stretches out the smaller values to increase theirs.

The idea is to find the best value of lambda, so that the transformed response t(y) becomes closer to a normal distribution. We need to find the power transformation, lambda, that maximizes the likelihood when the specified set of explanatory variables is fitted with the following transformation of y as the response:

$$t(y) = \frac{y^{\lambda} - 1}{\lambda}$$

In this formula, lambda can be positive or negative, but not zero.

For lambda = 0, the Box-Cox transformation is defined as t(y) = log(y).

For lambda = -1, the formula becomes:

$$t(y) = \frac{y^{-1} - 1}{-1} = 1 - \frac{1}{y}$$

We can use a Q-Q plot to check whether any transformation is needed.
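A short sketch using SciPy, which both applies the Box-Cox transformation and chooses lambda by maximum likelihood (the log-normal toy data is hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.lognormal(mean=0.0, sigma=0.6, size=500)  # right-skewed data with unequal spread

y_transformed, lam = stats.boxcox(y)              # lambda chosen by maximum likelihood
print("estimated lambda:", lam)                   # close to 0 here, i.e. near a log transform

# a Q-Q plot of the transformed data can confirm it is closer to normal, e.g.
# stats.probplot(y_transformed, dist="norm", plot=plt) with matplotlib's plt
```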

Trend: an increase or decrease in the data over time. A trend in a time series can cause problems for a factor-based analysis. De-trending can be applied to either the response or the predictors.

One method of de-trending is a one-dimensional regression (a linear fit against time), keeping the residuals:

$$\hat{y}_t = a_0 + a_1 t, \qquad \text{de-trended value} = y_t - \hat{y}_t$$
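A minimal numpy sketch of this linear de-trending (fit a line against time, keep the residuals); the toy series is made up:

```python
import numpy as np

def detrend_linear(y):
    """Fit y = a0 + a1*t by least squares and return the residuals (the de-trended series)."""
    t = np.arange(len(y))
    a1, a0 = np.polyfit(t, y, deg=1)  # polyfit returns [slope, intercept] for deg=1
    return y - (a0 + a1 * t)

y = 0.5 * np.arange(50) + np.random.default_rng(2).normal(size=50)
print(detrend_linear(y).mean())  # roughly zero once the linear trend is removed
```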

PCA is a dimensionality reduction technique for situations where we have too many predictors and/or high correlation between some of the predictors. PCA transforms the data by changing coordinates to remove correlation and ranking the new coordinates by importance (in order of the amount of variance they explain).

There are 2 benefits of concentrating on the first n principal components:

  • it reduces the effect of randomness
  • the earlier principal components are likely to have a higher signal-to-noise ratio (driven by actual effects rather than random effects)

Principal components are orthogonal to each other. The steps are outlined below:

  • Scale the data (for each factor, subtract the mean and divide by the standard deviation).

  • Find all the eigenvectors of $X^T X$.

  • Let V be the matrix of eigenvectors, sorted by eigenvalue: $V = [V_1, V_2, \ldots]$, where $V_j$ corresponds to the $j$th-largest eigenvalue.

  • Apply the PCA linear transformation: the 1st principal component is $X V_1$, the 2nd is $X V_2$, and so on.

  • The $k$th new factor value for the $i$th data point is $t_{ik} = \sum_{j} x_{ij} V_{jk}$.

Doing so, PCA eliminates correlation between factors. If we want fewer variables, we only include the first n principal components. We can also handle non-linear functions by using kernels (similar to SVM modeling).

To decode the coefficients back to the original (unscaled) factors, substitute the principal components into the fitted regression:

$$\hat{y} = b_0 + \sum_{k=1}^{L} b_k t_k$$

$$\hat{y} = b_0 + \sum_{k=1}^{L} b_k \sum_{j=1}^{m} V_{jk} x_j$$

$$\hat{y} = b_0 + \sum_{j=1}^{m} \Big( \sum_{k=1}^{L} b_k V_{jk} \Big) x_j$$

so the implied coefficient on the scaled factor $x_j$ is $a_j = \sum_{k=1}^{L} b_k V_{jk}$; undoing the scaling of $x_j$ then gives the coefficients in the original units.
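A hedged scikit-learn sketch of the whole pipeline: scale, project onto the first L components, regress on them, and map the coefficients back to the original (unscaled) factors. scikit-learn computes the components internally rather than via an explicit eigen-decomposition of $X^T X$, but the components agree up to sign. The toy data is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# 1. scale, 2. project onto the first L principal components, 3. regress on them
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
L = 4
pca = PCA(n_components=L).fit(X_scaled)
T = pca.transform(X_scaled)               # column k is (approximately) X_scaled @ V_k
reg = LinearRegression().fit(T, y)

# map the component coefficients b back to coefficients on the scaled originals: a_j = sum_k b_k V_jk
a_scaled = pca.components_.T @ reg.coef_
# then undo the scaling to get coefficients (and intercept) in the original units
a_original = a_scaled / scaler.scale_
intercept = reg.intercept_ - np.sum(a_scaled * scaler.mean_ / scaler.scale_)
print(a_original[:3], intercept)
```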

Question: let A be an n x n matrix. A scalar lambda is called an eigenvalue of A if there is a non-zero vector x such that $Ax = \lambda x$.

Such a vector x is called an eigenvector of A corresponding to lambda.

Example: to show that a given vector x is an eigenvector of A corresponding to lambda = 4, verify that $Ax = 4x$.


This means if lambda is an eigenvalue of A, and x is an eigenvector belonging to lambda, any non-zero multiple of x will be an eigenvector.

For any non-zero scalar c,

$$A(cx) = c(Ax) = c(\lambda x) = \lambda (cx),$$

so cx is also an eigenvector of A corresponding to lambda.
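A quick numerical check of the definition with numpy; the matrix here is just an illustrative stand-in, since the specific A and x from the original exercise are not reproduced:

```python
import numpy as np

# numerically verify A @ x == lambda * x for a small example matrix
A = np.array([[2.0, 2.0],
              [2.0, 2.0]])                      # eigenvalues of this matrix are 4 and 0
eigenvalues, eigenvectors = np.linalg.eig(A)
lam, x = eigenvalues[0], eigenvectors[:, 0]
print(np.allclose(A @ x, lam * x))              # True
print(np.allclose(A @ (3 * x), lam * (3 * x)))  # any non-zero multiple is also an eigenvector
```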

Regression can be used to determine how systems work (descriptive analytics) and to predict what will happen in the future (predictive analytics).

Note, however, that correlation does not imply causation: just because there is a correlation and our regression model makes good predictions, it does not mean the predictors cause the response.

We can also adjust the data so the fit is linear:

  • Quadratic regression
  • Response transformation, e.g. log(y)
  • Box-Cox transformation
  • Variable interaction: add an interaction term, e.g. x1·x2

We can determine whether a fitted line is good by looking at the sum of squared errors (SSE). The coefficients change as we move the line, and the best-fit regression line is the one that minimizes the SSE. Because the SSE is a convex quadratic function, we can minimize it by taking the partial derivative with respect to each coefficient, setting each derivative to zero, and solving the resulting equations simultaneously.

Prediction error for data point i:

$$e_i = y_i - \hat{y}_i = y_i - \Big( a_0 + \sum_{j=1}^{m} a_j x_{ij} \Big)$$

Sum of squared errors:

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \Big( y_i - a_0 - \sum_{j=1}^{m} a_j x_{ij} \Big)^2$$
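Setting the partial derivatives to zero yields the normal equations; numpy's least-squares routine solves them directly. A small sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 1.5 * x + rng.normal(size=100)

# minimizing SSE is equivalent to solving the normal equations (X^T X) a = X^T y
X = np.column_stack([np.ones_like(x), x])
a, sse, *_ = np.linalg.lstsq(X, y, rcond=None)
print("a0, a1:", a)     # close to (2.0, 1.5)
print("SSE:", sse[0])   # the minimized sum of squared errors
```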


How can we measure model quality?
Likelihood

The most basic measure is the likelihood. We can compute the probability (density) of the observed data for any parameter set, and the set of parameters that gives the highest probability density is the maximum likelihood estimate (the best-fit set of parameters).

Assume the errors are:

  • normally distributed with mean 0 and variance sigma squared
  • independent
  • identically distributed

Then the set of parameters that minimizes the SSE is the maximum likelihood fit (MLE).

$$SSE = \sum_{i=1}^{n} \Big( y_i - \big(a_0 + \sum_{j=1}^{m} a_j x_{ij}\big) \Big)^2$$

where $x_{ij}$ = observed predictor values and $a_0, \ldots, a_m$ = the parameters to fit.

We can use the likelihood to compare two different models by computing the likelihood ratio and conducting a hypothesis test.


Akaike Information Criterion (AIC)

Akaike's information criterion (AIC) is a penalized log-likelihood. Since adding extra parameters can lead to overfitting to random effects, AIC penalizes the number of parameters: the smallest AIC is preferred, which encourages fewer parameters k and a higher likelihood.

- L*: maximum likelihood value
- k: number of parameters being estimated

$$AIC = 2k - 2\ln(L^*)$$

With i.i.d. normal errors, substituting the maximum likelihood gives (up to an additive constant):

$$AIC = 2k + n \ln\!\left(\frac{SSE}{n}\right)$$
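As an illustration (assuming statsmodels, which reports both AIC and the BIC discussed below on a fitted OLS model; this is not part of the original notes), comparing a model with only the relevant factor against one with an extra irrelevant factor:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)            # irrelevant factor
y = 1.0 + 2.0 * x1 + rng.normal(size=200)

small = sm.OLS(y, sm.add_constant(np.column_stack([x1]))).fit()
big = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# the smaller value of each criterion is preferred
print("AIC:", small.aic, big.aic)
print("BIC:", small.bic, big.bic)
```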


Corrected AIC

AIC has nice properties if there are infinitely many data points to fit the model. There is a correction term we can add to AIC to deal with finite data sets:

$$AIC_c = AIC + \frac{2k(k+1)}{n - k - 1}$$

We can calculate the relative probability that one of the models is better than the other.

Example:

  • Model 1: AIC = 80
  • Model 2: AIC = 85

The relative likelihood that Model 2 is as good as Model 1 is

$$e^{(AIC_1 - AIC_2)/2} = e^{(80 - 85)/2} = e^{-2.5} \approx 0.082$$

so it is much more likely (roughly 12 times as likely) that the first model is better.


Bayesian Information Criterion (BIC)

$$BIC = k \ln(n) - 2 \ln(L^*)$$

L* = maximum likelihood value

k = number of parameters being estimated

n = number of data points

Similar to AIC:

  • the penalty term in BIC is larger than in AIC, so BIC encourages models with fewer parameters more strongly than AIC does
  • only use BIC when there are more data points than parameters

When comparing two models on the same data set by their BIC:

  • difference > 10, smaller BIC is very likely better
  • 6 < difference < 10, then smaller BIC is likely better
  • 2 < difference < 6, then smaller BIC is somewhat likely better
  • 0 < difference < 2, then smaller BIC is slightly likely better

The difference between AIC and BIC:

  • AIC: frequentist point of view
  • BIC: Bayesian point of view

Regression Output

  • P-values:

    • if p > 0.05, consider removing the attribute
    • higher thresholds allow more factors to be included, but might include irrelevant factors
    • lower thresholds include fewer factors, but might leave out relevant ones
    • with large amounts of data, p-values can be significant/small even when the factor is not related to the response
  • Confidence Interval (CI): the range where the coefficient probably lies, and how close it is to zero

  • T-Statistic: coefficient / standard error, related to the p-value

  • Coefficient: if the coefficient is so small that it makes little difference to the prediction, the factor is not very meaningful even with a very low p-value

  • R-squared: estimates how much of the variation the model accounts for

    • e.g. R-squared = 80%: the model accounts for 80% of the variability in the data, and the rest is randomness or other factors

We can use trees to divide the data set and specify a different model for each subset of the data.

  1. Classification Problems
    • CART (Classification and Regression Trees)
  2. Decision Making
    • Decision Tree

Trees Branching

  • Common practice is to branch on one factor at a time
  • Use half of the data to build the tree and a regression model:
    • calculate the variance of the response for all data points in the leaf
    • test each candidate split by computing the total variance of the two branches, and choose the split with the lowest total variance
    • make the split only if there are enough data points in each branch
    • afterwards, we can go backwards and prune the tree using the other half of the data
  • A common rule is to stop branching if a leaf would contain fewer than 5% of the data points, or if the improvement from splitting is small
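As a rough illustration (not from the notes), scikit-learn's CART implementation can mimic these rules: `min_samples_leaf` plays the role of the 5% stopping rule, and cost-complexity pruning (`ccp_alpha`) stands in for pruning, which scikit-learn does on the training data rather than the held-out half; the held-out half here just provides a validation score.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)
# grow the tree on half the data, keep the other half for validation
X_grow, X_val, y_grow, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

tree = DecisionTreeRegressor(min_samples_leaf=int(0.05 * len(X_grow)),  # ~5% of data per leaf
                             ccp_alpha=1.0).fit(X_grow, y_grow)
print("validation R^2:", tree.score(X_val, y_val))
```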

Random Forests

A random forest builds many trees, each on a bootstrapped sample of the data (with a random subset of factors considered at each split), and combines their predictions:

  • if it is a regression tree, use the average predicted response across trees
  • if it is a classification tree, use the most common predicted response across trees

Pros: better overall estimates; averaging across many trees tends to neutralize overfitting

Cons: harder to explain/interpret; we can't give a single specific model for the data
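A minimal scikit-learn sketch of a random forest classifier on synthetic data (my own illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# each tree sees a bootstrap sample and a random subset of factors at each split;
# the forest predicts the most common class across its trees
forest = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```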

Logistic Regression

This model is useful when the response is a probability (a number between 0 and 1) or binary (either 0 or 1).

$$p = \frac{1}{1 + e^{-(a_0 + \sum_{j} a_j x_j)}}$$

Rearranging, the log-odds are a linear function of the predictors:

$$\ln\!\left(\frac{p}{1-p}\right) = a_0 + \sum_{j} a_j x_j$$

Pseudo R-squared values reported for logistic regression do not really measure the fraction of variance explained.

A Receiver Operating Characteristic (ROC) curve can be used to choose a threshold probability. It is a quick way to judge the quality of the model, but it does not account for the costs of false negatives and false positives.

  • x-axis: 1 - Specificity = 1 - TN/(TN + FP)
  • y-axis: Sensitivity = TP/(TP + FN)
  • Sensitivity is also known as True Positive Rate, and Specificity = True Negative Rate
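A minimal scikit-learn sketch, on synthetic data, of fitting a logistic regression and computing the ROC quantities described above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p = model.predict_proba(X_te)[:, 1]        # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_te, p)  # x-axis: 1 - specificity, y-axis: sensitivity
print("AUC:", roc_auc_score(y_te, p))
```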

Confusion Matrix

| Classification | Yes (Model) | No (Model) |
|---|---|---|
| Yes (True) | Correct | Incorrect |
| No (True) | Incorrect | Correct |

Meaning:

| Classification | Yes (Model) | No (Model) |
|---|---|---|
| Yes (True) | True Positive | False Negative |
| No (True) | False Positive | True Negative |

Example of an email spam filter using SVM (in this example, the "Yes" class is a legitimate, non-spam message):

| Classification | Yes (Model) | No (Model) |
|---|---|---|
| Yes (True) | 490 (TP) | 10 (FN) |
| No (True) | 100 (FP) | 400 (TN) |

Question:

  • fraction of delivered email we expect to be spam = FP / (TP + FP) = 100 / 590 ≈ 17%
  • fraction of real email lost to the spam folder = FN / (TP + FN) = 10 / 500 = 2%

We can also calculate the cost of lost productivity:

  • assume: $0 for a correct classification, $0.04 to read a spam message, $1 to miss a real email
  • if 50% of email is spam, total cost per 1000 messages = 490 x $0 + 400 x $0 + 10 x $1 + 100 x $0.04 = $14, i.e. about $0.014 per email
  • if 40% of email is spam, total cost per 1000 messages = 490 x (60%/50%) x $0 + 400 x (40%/50%) x $0 + 10 x (60%/50%) x $1 + 100 x (40%/50%) x $0.04 = $15.20, i.e. about $0.0152 per email
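The per-message cost is simple arithmetic; a tiny Python check of the two scenarios above:

```python
# cost per 1000 messages, using the confusion matrix above:
# 10 real messages lost to the spam folder at $1 each,
# 100 spam messages that reach the inbox at $0.04 each
cost_50 = 10 * 1.00 + 100 * 0.04
print(cost_50, cost_50 / 1000)   # 14.0 total, i.e. $0.014 per message

# if only 40% of mail is spam, rescale real-mail counts by 60%/50% and spam counts by 40%/50%
cost_40 = 10 * (0.6 / 0.5) * 1.00 + 100 * (0.4 / 0.5) * 0.04
print(cost_40, cost_40 / 1000)   # 15.2 total, i.e. $0.0152 per message
```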

Regression

  1. Poisson Regression:

    • use when the response follows a Poisson distribution, $P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$

    • example: modeling counts of arrivals at an airport security line

    • estimate lambda as a function of the predictors, lambda(x) (see the sketch after this list)

  2. Regression splines:

    • functions made of polynomials that connect to each other
    • allow fitting different functions to different parts of the data set, with smooth connections between the parts
    • an order-k regression spline uses polynomials that are all of order k
  3. Bayesian regression:

    • start with an estimate of how the regression coefficients and the random error are distributed
    • then use Bayes' theorem to update the estimate with the observed data
    • most helpful when there is not much data
  4. K-Nearest-Neighbor (KNN) regression:

    • instead of trying to guess a function of the attributes that might be a good predictor
    • plot all the data, and predict a response by taking the average response of the k closest data points
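A brief statsmodels sketch of Poisson regression on simulated counts (the log link, the coefficients, and the data are all illustrative assumptions, not from the notes):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 2, size=300)
lam = np.exp(0.5 + 1.2 * x)        # true rate as a function of x
y = rng.poisson(lam)               # counts, e.g. arrivals per time slot

X = sm.add_constant(x)
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(model.params)                # roughly (0.5, 1.2)
print(model.predict(X)[:5])        # estimated lambda(x) for the first rows
```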

Time series data can be affected by:

  • trends over time e.g. stock prices
  • cyclical variations e.g. annual temperature cycles, weekly sales
  • randomness e.g. stock prices, blood pressure

St = expected baseline response at time period t

xt = observed response at time t

  • alpha -> 0: a lot of randomness (yesterday's baseline is a good indicator for today, so we are willing to trust the previous estimate St-1)
  • alpha -> 1: not much randomness (today's baseline is close to the observed data, so we are willing to trust xt)

$$S_t = \alpha x_t + (1 - \alpha) S_{t-1}, \qquad 0 \le \alpha \le 1$$

Include a trend Tt at time period t (starting condition T1 = 0, i.e. no initial trend):

$$S_t = \alpha x_t + (1 - \alpha)(S_{t-1} + T_{t-1})$$

$$T_t = \beta (S_t - S_{t-1}) + (1 - \beta) T_{t-1}$$

Include cyclic patterns:

It can be handled like the trend, as an additive component, OR

as seasonality in a multiplicative way (starting condition C = 1, i.e. no initial cyclic effect):

  • L: length of a cycle
  • Ct: multiplicative seasonality factor for time t, which inflates/deflates the observation (we use the cyclic factor from L time periods ago, because that is the most recent cyclic factor from the same part of the cycle)

$$S_t = \alpha \frac{x_t}{C_{t-L}} + (1 - \alpha)(S_{t-1} + T_{t-1})$$

If C = 1.1 on a weekend, it simply means sales were 10% higher because of the weekend.

$$C_t = \gamma \frac{x_t}{S_t} + (1 - \gamma) C_{t-L}$$

This time series model is known as single / double / triple exponential smoothing, depending on whether trend and seasonality are included.

Triple exponential smoothing is also called Winters' method or Holt-Winters.

Exponential smoothing smooths out randomness (peaks and valleys) and can also be used for forecasting. Because it puts most weight on the most recent data points, it is better for short-term forecasting.

Our forecast for the next time period is the same as the latest baseline estimate:

$$F_{t+1} = S_t$$

When a trend is included, the best estimate of the trend is the most recent trend estimate:

$$F_{t+1} = S_t + T_t$$

When multiplicative seasonality is included:

$$F_{t+1} = (S_t + T_t)\, C_{t+1-L}$$

To optimize alpha, beta, gamma, minimize the squared one-step forecast errors: $\min \sum_t (F_t - x_t)^2$.
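A hedged statsmodels sketch of triple exponential smoothing (Holt-Winters) on a made-up series with a trend and a multiplicative cycle of length 12; `.fit()` chooses alpha, beta, gamma by minimizing the squared one-step errors:

```python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(6)
t = np.arange(120)
season = 1 + 0.3 * np.sin(2 * np.pi * t / 12)                   # multiplicative cycle, L = 12
y = (10 + 0.1 * t) * season + rng.normal(scale=0.5, size=120)   # trend + seasonality + noise

# additive trend + multiplicative seasonality = Holt-Winters / triple exponential smoothing
fit = ExponentialSmoothing(y, trend="add", seasonal="mul", seasonal_periods=12).fit()
print(fit.params)        # includes the fitted smoothing parameters
print(fit.forecast(6))   # forecasts for the next six periods
```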

AutoRegressive Integrated Moving Average (ARIMA) has 3 key parts:

  1. Differences (d)
    • exponential smoothing estimates St from xt and the previous estimate, which works well if the data is stationary
    • often the data is NOT stationary, but its differences might be
  2. Autoregression (p)
    • predicting the current value based on the values from previous time periods
  3. Moving Average (q)
    • using previous prediction errors as predictors

Specific values of p, d, q:

  • ARIMA (0,0,0) = white noise
  • ARIMA (0,1,0) = random walk
  • ARIMA (p,0,0) = autoregressive model
  • ARIMA (0,0,q) = moving average model
  • ARIMA (0,1,1) = basic exponential smoothing model

ARIMA works better than exponential smoothing when the data is more stable, with fewer peaks, valleys and outliers. We also need at least about 40 data points for ARIMA to work well.
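A minimal statsmodels sketch on a toy series; the ARIMA(0,1,1) order is chosen here only because the list above notes it corresponds to basic exponential smoothing:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(7)
y = np.cumsum(rng.normal(size=200)) + 50   # a random-walk-like series

model = ARIMA(y, order=(0, 1, 1)).fit()    # (p, d, q) = (0, 1, 1)
print(model.aic)                           # model quality via AIC, as above
print(model.forecast(5))                   # forecasts for the next five periods
```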

Generalized Autoregressive Conditional Heteroscedasticity (GARCH) is used to estimate or forecast the variance. Variance estimation matters because it tells us how much the forecast might differ from the true value.

This is important in investment, e.g. in traditional portfolio optimization models, which balance the expected return of investments against the amount of volatility.

| GARCH | ARIMA |
|---|---|
| Variances, squared errors | Observations, linear errors |
| Raw variances | Differences in variances |
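A sketch using the third-party `arch` package (an assumption of mine; it is not mentioned in these notes) to fit a GARCH(1,1) model and forecast the variance of a toy return series:

```python
import numpy as np
from arch import arch_model   # assumes the `arch` package is installed

rng = np.random.default_rng(8)
returns = 100 * rng.normal(scale=0.01, size=1000)    # toy daily returns, in percent

# GARCH(1,1): today's variance depends on yesterday's squared error and yesterday's variance
model = arch_model(returns, vol="GARCH", p=1, q=1)
result = model.fit(disp="off")
print(result.params)                                 # mu, omega, alpha[1], beta[1]
print(result.forecast(horizon=5).variance.iloc[-1])  # variance forecasts for the next 5 steps
```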

