b-knight/StatExplore

A Python Package to Facilitate Statistical Research

Distance Correlation

       In statistics, the Pearson product-moment correlation coefficient (or simply "the correlation coefficient") is the standard measure of the extent and direction to which two variables move together. It ranges over the interval [-1, 1], where 1 implies perfect positive correlation and -1 implies perfect inverse correlation, and it is defined as the ratio of the two variables' covariance (the numerator) to the product of their standard deviations (the denominator).
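
Written out, with cov(X, Y) denoting the covariance and σ_X and σ_Y the standard deviations of the two variables, the coefficient is:

\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y}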

Equation 1: Pearson Product-Moment Correlation Coefficient

       A key assumption of this statistic is that the underlying relationship between the two variables is linear. However, this assumption of linearity is often not borne out in reality. Imagine we are assessing the relationship between the amount of money spent on ads targeting visitors to a given website and the rate of conversion from visitor to paying customer. We could easily imagine a scenario where, up to a certain point, more resources spent on ads tends to increase conversion. However, there may come a point where the prevalence of ads is so great that it is actually off-putting to the consumer, accomplishing the opposite of its intended purpose. This scenario is not merely theoretical, but has been validated by survey data. The implication is that while ad spend may relate intimately to conversion, the correlation coefficient between these two variables is likely to be small - to the point of approaching zero.

       The scatterplots below illustrate how, when the relationship between two variables involves a change in direction, the Pearson Product-Moment Correlation Coefficient fails to report the true degree of dependence between variables.

Image 1: Sets of Pearson Correlation Coefficients

SOURCE: https://commons.wikimedia.org/wiki/File:Correlation_examples2.svg

       In 2007, Gábor J. Székely called attention to this important limitation of the correlation coefficient and introduced the concept of 'distance correlation' as part of his framework of 'E-statistics' - statistics based on the energy distance between probability distributions. Within this framework, Székely re-formulated many classical statistical concepts: 'distance variance' in place of variance, 'distance standard deviation' in place of standard deviation, and 'distance covariance' in place of covariance. Using these, the correlation coefficient can be re-defined in such a way that a value of zero occurs if, and only if, the two variables are genuinely independent.
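
Analogous to Pearson's coefficient, the distance correlation normalizes the distance covariance by the distance standard deviations:

\mathrm{dCor}(X, Y) = \frac{\mathrm{dCov}(X, Y)}{\sqrt{\mathrm{dVar}(X) \, \mathrm{dVar}(Y)}}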

Equation 2: Distance Correlation

Image 2: Sets of Distance Correlation Coefficients

SOURCE: https://commons.wikimedia.org/wiki/File:Distance_Correlation_Examples.svg

Calculating the Distance Covariance

For example, let's create some data using R:

x = c(0, 1, 2, 3, 4) 
y = c(2, 1, 0, 1, 2) 
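
This toy data set already illustrates the limitation described above: y is a deterministic function of x (here y = |x - 2|), yet the sample Pearson correlation between them is zero, because the positive and negative co-movements cancel exactly.

cor(x, y)   # 0 (up to floating-point rounding), despite perfect dependence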

Next, we derive a matrix for each variable containing the pairwise distances for that variable. For the purposes of calculating the distance covariance, we use the Euclidean distance. If we were exploring two-dimensional observations (for example, points p and q on the Cartesian plane), the appropriate formulation of the Euclidean distance would be as follows:

d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}



However, in the example below X and Y are each univariate, and so the Euclidean distance reduces to the absolute value of the difference between observations:

d(x_j, x_k) = \lvert x_j - x_k \rvert



This can be done in R by calling the 'dist' method and specifying "euclidean" as the distance.

x_mat <- dist(x, method = "euclidean", diag = TRUE, upper = TRUE, p = 2)
y_mat <- dist(y, method = "euclidean", diag = TRUE, upper = TRUE, p = 2)
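
Note that 'dist' returns a compact distance object rather than an ordinary matrix; wrapping it in as.matrix() (as the centering function below does internally) displays the full symmetric matrix that the tables below lay out.

as.matrix(x_mat)   # full 5 x 5 matrix of pairwise distances for x
as.matrix(y_mat)   # full 5 x 5 matrix of pairwise distances for y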

We will also need the column and row means from these distance matrices, as well as the grand mean of those means. If you were to derive these manually, you might use a function like the following:

take_doubly_centered_distances <- function(x_mat) {
    library(reshape2)
    # Convert the 'dist' object to a long data frame of (row, col, value) triples
    x_df               <- melt(as.matrix(x_mat), varnames = c("row", "col"))
    # Mean distance within each row
    x_row_means        <- aggregate(x_df, list(x_df$row), mean)
    x_row_means        <- subset(x_row_means, select = -c(Group.1, col))
    names(x_row_means) <- c("row", "row_mean")
    x_df               <- merge(x = x_df, y = x_row_means, by = "row")
    # Mean distance within each column
    x_col_means        <- aggregate(x_df, list(x_df$col), mean)
    x_col_means        <- subset(x_col_means, select = -c(Group.1, row, row_mean))
    names(x_col_means) <- c("col", "col_mean")
    x_df               <- merge(x = x_df, y = x_col_means, by = "col")
    # Grand mean of all pairwise distances
    x_df$grand_mean    <- mean(c(x_row_means$row_mean, x_col_means$col_mean))
    # Doubly center: subtract the row and column means, then add back the grand mean
    x_df$X             <- x_df$value - x_df$row_mean - x_df$col_mean + x_df$grand_mean
    x_df               <- x_df[with(x_df, order(col, row)), ]
    # Reassemble the centered values into a square matrix, one column at a time
    myList <- list()
    for (i in unique(x_df[["col"]])) {
      myList[[length(myList) + 1]] <- x_df[x_df$col == i, ]$X
    }
    output <- matrix(unlist(myList), ncol = length(unique(x_df[["col"]])), byrow = TRUE)
    return(output)
}

...resulting in the following:

X Pair-Wise Distances

X         A     B     C     D     E     Row Mean
A         0     1     2     3     4     2
B         1     0     1     2     3     1.4
C         2     1     0     1     2     1.2
D         3     2     1     0     1     1.4
E         4     3     2     1     0     2
Col Mean  2     1.4   1.2   1.4   2     Grand Mean = 1.6

Y Pair-Wise Distances

Y         A     B     C     D     E     Row Mean
A         0     1     2     1     0     0.8
B         1     0     1     0     1     0.6
C         2     1     0     1     2     1.2
D         1     0     1     0     1     0.6
E         0     1     2     1     0     0.8
Col Mean  0.8   0.6   1.2   0.6   0.8   Grand Mean = 0.8

Tables 1 & 2: Pair-Wise Distances

We need to doubly center these distance matrices: 'doubly' in this context means that from each element we subtract both its row mean and its column mean, and then add back the grand mean. Every row and every column of the resulting matrices should sum to zero.

X Doubly Centered Distances

X         A      B      C      D      E      Row Sum
A        -2.4   -0.8    0.4    1.2    1.6    0
B        -0.8   -1.2    0      0.8    1.2    0
C         0.4    0     -0.8    0      0.4    0
D         1.2    0.8    0     -1.2   -0.8    0
E         1.6    1.2    0.4   -0.8   -2.4    0
Col Sum   0      0      0      0      0

Y Doubly Centered Distances

Y         A      B      C      D      E      Row Sum
A        -0.8    0.4    0.8    0.4   -0.8    0
B         0.4   -0.4    0     -0.4    0.4    0
C         0.8    0     -1.6    0      0.8    0
D         0.4   -0.4    0     -0.4    0.4    0
E        -0.8    0.4    0.8    0.4   -0.8    0
Col Sum   0      0      0      0      0

Tables 3 & 4: Distance Matrices After Doubly Centering
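
As a more compact alternative to the take_doubly_centered_distances function above, the same matrices can be produced with base R's rowMeans, colMeans, and sweep. The sketch below assumes x_mat and y_mat are the 'dist' objects created earlier; the last line is simply a sanity check on the zero-sum property.

double_center <- function(d) {
  m          <- as.matrix(d)   # full symmetric matrix of pairwise distances
  row_means  <- rowMeans(m)
  col_means  <- colMeans(m)
  grand_mean <- mean(m)
  # subtract the row and column means from every element, then add back the grand mean
  sweep(sweep(m, 1, row_means), 2, col_means) + grand_mean
}

x_centered <- double_center(x_mat)
y_centered <- double_center(y_mat)
rowSums(x_centered)            # all (numerically) zero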

Next, we take the arithmetic average of the element-wise products of the two doubly centered matrices. The sum of these products is also referred to as the Frobenius inner product, which we then multiply by 1/n^2 to yield the arithmetic average. Writing A and B for the doubly centered distance matrices of X and Y:

\mathrm{dCov}^2_n(X, Y) = \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} A_{j,k} \, B_{j,k}

Equation 3: Squared Sample Distance Covariance

We can manually do this in R via the 'matrixcalc' library.

arithmetic_average_of_products <- function(x_mat, y_mat) {
  library(matrixcalc)
  # Both arguments should be doubly centered distance matrices of identical dimensions
  if ((nrow(x_mat) == nrow(y_mat)) & (ncol(x_mat) == ncol(y_mat))) {
    val <- frobenius.prod(x_mat, y_mat)   # sum of the element-wise products
    return(val * (1 / nrow(x_mat)^2))     # divide by n^2 to obtain the average
  }
}
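
Because the Frobenius inner product is simply the sum of the element-wise products, the same quantity can also be computed without any additional package. This sketch assumes x_centered and y_centered hold the doubly centered matrices computed above.

dcov_squared <- sum(x_centered * y_centered) / nrow(x_centered)^2
sqrt(dcov_squared)   # sample distance covariance, approximately 0.438178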

Finally, we take the square root of this result to obtain the sample distance covariance. Note that the arguments passed to arithmetic_average_of_products are the doubly centered matrices, not the raw distance matrices. If we compare the result with R's 'energy' package, we see that the two agree:

> x_centered <- take_doubly_centered_distances(x_mat)
> y_centered <- take_doubly_centered_distances(y_mat)
> arithmetic_average_of_products(x_centered, y_centered)^(1/2)
[1] 0.438178
> 
> library(energy)
> dcov.test(x, y, index = 1.0, R = NULL)

	Specify the number of replicates R (R > 0) for an independence test

data:  index 1, replicates 0
nV^2 = 0.96, p-value = NA
sample estimates:
    dCov 
    0.438178 
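
For reference, the 'energy' package can also return the distance correlation itself (the quantity in Equation 2) in a single call:

dcor(x, y)   # distance correlation of x and y, computed directly by the energy package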

Calculating the Distance Standard Deviations

References