
[Good first issue | Feature] Multi-Table Metric: Mutual Information Similarity #93

Closed
Z712023 opened this issue Dec 29, 2023 · 0 comments · Fixed by #101
Labels: difficulty-medium, enhancement (New feature or request), good first issue (Good for newcomers)

Comments


Z712023 commented Dec 29, 2023

Problem

When I use this tool to synthesize multi-table databases, I have no way to evaluate how well the cross-table correlations in the generated data match those in the real data.

Commonly used statistical metrics, such as mutual information, could be introduced to measure this.

Proposed Solution

Mutual information can be used to measure the correlation between each pair of columns in two tables.

Let X and Y be two datasets with the same number of discrete columns/features (m), where one is the original dataset, and the other is a simulated dataset.
For each dataset, we calculate the normalized mutual information between each pair of columns.
The normalized mutual information between random variables X and Y is defined as
$nMI(X,Y)=\frac{MI(X,Y)}{\min\{H(X),H(Y)\}}=\frac{H(X)+H(Y)-H(X,Y)}{\min\{H(X),H(Y)\}}$

Since
$H(X,Y) \geq \max\{H(X),H(Y)\}$
and therefore
$-H(X,Y) \leq \min\{-H(X),-H(Y)\}$

we get
$H(X)+H(Y)-H(X,Y) \leq H(X)+H(Y)+\min\{-H(X),-H(Y)\} = \min\{H(X),H(Y)\}$
That is,
$MI(X,Y) \leq \min\{H(X),H(Y)\}$
so $nMI(X,Y) \leq 1$; together with $MI(X,Y) \geq 0$, the normalized value lies in $[0,1]$.
The element at the ith row and jth column of the pairwise mutual information matrix M is
$M_{ij}=nMI(X_i,X_j)$

where $X_{i}$ and $X_{j}$ are the random variables corresponding to the ith and jth columns of the dataset X.
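The pairwise matrix above can be sketched in Python. This is a minimal illustration, not code from this repository: it assumes the dataset is a pandas DataFrame of discrete (categorical) columns, and all function names are hypothetical. Entropies are computed from empirical frequencies, and a zero-entropy (constant) column is mapped to nMI = 1 by convention, which is an assumption, not part of the issue's definition.

```python
import numpy as np
import pandas as pd
from collections import Counter

def entropy(col):
    """Empirical Shannon entropy (in nats) of a discrete column."""
    p = np.array(list(Counter(col).values()), dtype=float)
    p /= p.sum()
    return float(-np.sum(p * np.log(p)))

def joint_entropy(a, b):
    """Empirical joint Shannon entropy of two discrete columns."""
    p = np.array(list(Counter(zip(a, b)).values()), dtype=float)
    p /= p.sum()
    return float(-np.sum(p * np.log(p)))

def nmi(a, b):
    """nMI(A,B) = (H(A)+H(B)-H(A,B)) / min(H(A),H(B))."""
    ha, hb = entropy(a), entropy(b)
    denom = min(ha, hb)
    if denom == 0.0:
        # Convention (assumption): a constant column is treated as
        # perfectly "explained", so nMI is set to 1.
        return 1.0
    return (ha + hb - joint_entropy(a, b)) / denom

def pairwise_nmi_matrix(df):
    """M[i, j] = nMI between the i-th and j-th columns of df."""
    cols = list(df.columns)
    m = len(cols)
    M = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            M[i, j] = nmi(df[cols[i]], df[cols[j]])
    return M
```

By construction the matrix is symmetric with ones on the diagonal, since every column has nMI = 1 with itself.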

Let M and N be the pairwise mutual information matrices for X and Y. Then, mutual information similarity is defined as:
$\frac{1}{m^2} \sum_{i,j} J(M_{ij},N_{ij})$

where $J$ is the Jaccard index defined as $J(A,B)=\frac{\min\{A,B\}}{\max\{A,B\}}$.
This metric is bounded between 0 and 1, with 1 being the maximal and best value.
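The final aggregation can be sketched as follows. This is an illustrative implementation under assumptions: the function name is hypothetical, both matrices are nonnegative (as pairwise nMI matrices are), and the 0/0 case — both entries zero, i.e. both column pairs uncorrelated — is mapped to J = 1, a convention the issue does not specify.

```python
import numpy as np

def mi_similarity(M, N, eps=1e-12):
    """Mean elementwise Jaccard index J(A,B) = min(A,B)/max(A,B)
    between two pairwise-nMI matrices; returns a score in [0, 1]."""
    M = np.asarray(M, dtype=float)
    N = np.asarray(N, dtype=float)
    num = np.minimum(M, N)
    den = np.maximum(M, N)
    # Assumed convention: 0/0 (both entries ~0) counts as perfect
    # agreement, since both column pairs show no dependence.
    ratio = np.where(den > eps, num / np.maximum(den, eps), 1.0)
    return float(ratio.mean())
```

Identical matrices score exactly 1, and any disagreement between an entry of M and the corresponding entry of N pulls the mean below 1.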

Additional context

TBD

@Z712023 Z712023 added the enhancement New feature or request label Dec 29, 2023
@MooooCat MooooCat changed the title Multi-Table Mutual Information Similarity [Good first issue | Feature] Multi-Table Metric: Mutual Information Similarity Dec 29, 2023
@MooooCat MooooCat linked a pull request Jan 10, 2024 that will close this issue