Problem
When I use this tool to synthesize multi-table databases, I cannot evaluate how well the correlations in the generated data match those in the real data across tables.
A commonly used statistical metric, such as mutual information, could be introduced for this purpose.
Proposed Solution
Mutual information can be used to measure the correlation between each pair of columns in two tables.
Let X and Y be two datasets with the same number of discrete columns/features (m), where one is the original dataset, and the other is a simulated dataset.
For each dataset, we calculate the normalized mutual information between each pair of columns.
The normalized mutual information between random variables X and Y is defined as $nMI(X,Y)=\frac{MI(X,Y)}{\min\{H(X),H(Y)\}}=\frac{H(X)+H(Y)-H(X,Y)}{\min\{H(X),H(Y)\}}$
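As a sketch of how this definition could be implemented (a minimal stdlib-only version, not tied to any particular library; the function names are hypothetical), nMI can be computed directly from the empirical entropies, with the joint entropy obtained by treating each pair of values as a single symbol:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete symbols."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in counts.values())

def normalized_mi(x, y):
    """nMI(X, Y) = (H(X) + H(Y) - H(X, Y)) / min(H(X), H(Y))."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))  # joint entropy over value pairs
    denom = min(hx, hy)
    if denom == 0:
        # Convention (assumption): a constant column is treated as
        # perfectly "explained", so return 1.0 rather than divide by zero.
        return 1.0
    return (hx + hy - hxy) / denom
```

For identical columns this yields 1, and for independent columns it yields 0, matching the bound derived below. scikit-learn's `normalized_mutual_info_score` with `average_method='min'` computes the same quantity, if a library implementation is preferred.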
Since $H(X,Y) \geq \max\{H(X),H(Y)\}$, i.e. $-H(X,Y) \leq \min\{-H(X),-H(Y)\}$, we can get $H(X)+H(Y)-H(X,Y) \leq H(X)+H(Y)+\min\{-H(X),-H(Y)\}$
That is, $MI(X,Y) \leq \min\{H(X),H(Y)\}$, so $nMI$ lies in $[0,1]$.
The element at the ith row and jth column of the pairwise mutual information matrix M is $M_{ij}=nMI(X_i,X_j)$
where $X_{i}$ and $X_{j}$ are the random variables corresponding to the ith and jth columns of the dataset X.
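The pairwise matrix above could be built as follows (a self-contained sketch; `pairwise_nmi_matrix` and its helpers are hypothetical names, and the matrix is symmetric with a unit diagonal since $nMI(X_i,X_i)=1$):

```python
import math
from collections import Counter
import numpy as np

def _entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete symbols."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in counts.values())

def _nmi(x, y):
    """nMI via entropies; constant columns map to 1.0 by convention."""
    hx, hy = _entropy(x), _entropy(y)
    hxy = _entropy(list(zip(x, y)))
    denom = min(hx, hy)
    return (hx + hy - hxy) / denom if denom > 0 else 1.0

def pairwise_nmi_matrix(columns):
    """Given a list of m discrete columns, return the m x m matrix
    M with M[i, j] = nMI(X_i, X_j)."""
    m = len(columns)
    M = np.ones((m, m))
    for i in range(m):
        for j in range(i + 1, m):  # symmetric; diagonal stays 1
            M[i, j] = M[j, i] = _nmi(columns[i], columns[j])
    return M
```

Computing only the upper triangle halves the work, since nMI is symmetric in its arguments.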
Let M and N be the pairwise mutual information matrices for X and Y. Then, mutual information similarity is defined as: $\frac{1}{m^2} \sum_{i,j} J(M_{ij},N_{ij})$
J is the Jaccard index defined as $J(A,B)=\frac{\min\{A,B\}}{\max\{A,B\}}$
This metric is bounded between 0 and 1, with 1 being the maximal and best value.
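Given two pairwise-nMI matrices, the final similarity score could then be computed element-wise (a sketch; `mi_similarity` is a hypothetical name, and the convention of scoring two zero entries as a perfect match is an assumption):

```python
import numpy as np

def mi_similarity(M, N, eps=1e-12):
    """Mean element-wise Jaccard index J(a, b) = min(a, b) / max(a, b)
    between two same-shaped matrices of nMI values in [0, 1]."""
    M = np.asarray(M, dtype=float)
    N = np.asarray(N, dtype=float)
    num = np.minimum(M, N)
    den = np.maximum(M, N)
    # Assumed convention: if both entries are 0 (no correlation in either
    # dataset), treat them as a perfect match, J = 1.
    jac = np.where(den > eps, num / np.maximum(den, eps), 1.0)
    return float(jac.mean())
```

Identical matrices score exactly 1, and the score decreases as corresponding entries diverge, consistent with the 0-to-1 bound stated above.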
Additional context
TBD
MooooCat changed the title from "Multi-Table Mutual Information Similarity" to "[Good first issue | Feature] Multi-Table Metric: Mutual Information Similarity" on Dec 29, 2023.