Problem
When I use this tool to synthesize multi-table databases, I cannot evaluate how well the correlations in the generated data match those in the real data across tables.
A commonly used statistical metric, such as mutual information, could be introduced for this purpose.
Proposed Solution
Mutual information can be used to measure the correlation between each pair of columns in two tables.
Let X and Y be two datasets with the same number of discrete columns/features (m), where one is the original dataset, and the other is a simulated dataset.
For each dataset, we calculate the normalized mutual information between each pair of columns.
The normalized mutual information between random variables X and Y is defined as $nMI(X,Y)=\frac{MI(X,Y)}{\min\{H(X),H(Y)\}}=\frac{H(X)+H(Y)-H(X,Y)}{\min\{H(X),H(Y)\}}$
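As a sketch of how this definition could be implemented (a minimal stdlib-only version, not tied to any particular library; the function names are hypothetical), nMI can be computed directly from the empirical entropies, with the joint entropy obtained by treating each pair of values as a single symbol:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete symbols."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in counts.values())

def normalized_mi(x, y):
    """nMI(X, Y) = (H(X) + H(Y) - H(X, Y)) / min(H(X), H(Y))."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))  # joint entropy over value pairs
    denom = min(hx, hy)
    if denom == 0:
        # Convention (assumption): a constant column is treated as
        # perfectly "explained", so return 1.0 rather than divide by zero.
        return 1.0
    return (hx + hy - hxy) / denom
```

For identical columns this yields 1, and for independent columns it yields 0, matching the bound derived below. scikit-learn's `normalized_mutual_info_score` with `average_method='min'` computes the same quantity, if a library implementation is preferred.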
Since $H(X,Y) \geq \max\{H(X),H(Y)\}$, i.e. $-H(X,Y) \leq \min\{-H(X),-H(Y)\}$, we can get $H(X)+H(Y)-H(X,Y) \leq H(X)+H(Y)+\min\{-H(X),-H(Y)\}$
That is, $MI(X,Y) \leq \min\{H(X),H(Y)\}$, so $nMI$ lies in $[0,1]$.
The element at the ith row and jth column of the pairwise mutual information matrix M is $M_{ij}=nMI(X_i,X_j)$
where $X_{i}$ and $X_{j}$ are the random variables corresponding to the ith and jth columns of the dataset X.
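The pairwise matrix above could be built as follows (a self-contained sketch; `pairwise_nmi_matrix` and its helpers are hypothetical names, and the matrix is symmetric with a unit diagonal since $nMI(X_i,X_i)=1$):

```python
import math
from collections import Counter
import numpy as np

def _entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete symbols."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in counts.values())

def _nmi(x, y):
    """nMI via entropies; constant columns map to 1.0 by convention."""
    hx, hy = _entropy(x), _entropy(y)
    hxy = _entropy(list(zip(x, y)))
    denom = min(hx, hy)
    return (hx + hy - hxy) / denom if denom > 0 else 1.0

def pairwise_nmi_matrix(columns):
    """Given a list of m discrete columns, return the m x m matrix
    M with M[i, j] = nMI(X_i, X_j)."""
    m = len(columns)
    M = np.ones((m, m))
    for i in range(m):
        for j in range(i + 1, m):  # symmetric; diagonal stays 1
            M[i, j] = M[j, i] = _nmi(columns[i], columns[j])
    return M
```

Computing only the upper triangle halves the work, since nMI is symmetric in its arguments.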
Let M and N be the pairwise mutual information matrices for X and Y. Then, mutual information similarity is defined as: $\frac{1}{m^2} \sum_{i,j} J(M_{ij},N_{ij})$
J is the Jaccard index defined as $J(A,B)=\frac{\min\{A,B\}}{\max\{A,B\}}$
This metric is bounded between 0 and 1, with 1 being the maximal and best value.
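Given two pairwise-nMI matrices, the final similarity score could then be computed element-wise (a sketch; `mi_similarity` is a hypothetical name, and the convention of scoring two zero entries as a perfect match is an assumption):

```python
import numpy as np

def mi_similarity(M, N, eps=1e-12):
    """Mean element-wise Jaccard index J(a, b) = min(a, b) / max(a, b)
    between two same-shaped matrices of nMI values in [0, 1]."""
    M = np.asarray(M, dtype=float)
    N = np.asarray(N, dtype=float)
    num = np.minimum(M, N)
    den = np.maximum(M, N)
    # Assumed convention: if both entries are 0 (no correlation in either
    # dataset), treat them as a perfect match, J = 1.
    jac = np.where(den > eps, num / np.maximum(den, eps), 1.0)
    return float(jac.mean())
```

Identical matrices score exactly 1, and the score decreases as corresponding entries diverge, consistent with the 0-to-1 bound stated above.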
Additional context
TBD
MooooCat changed the title from "Multi-Table Mutual Information Similarity" to "[Good first issue | Feature] Multi-Table Metric: Mutual Information Similarity" on Dec 29, 2023.