Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add utility function to simplify multi-table schemas #1874

Merged
merged 27 commits into from
Apr 5, 2024

Conversation

R-Palazzo
Copy link
Contributor

@R-Palazzo R-Palazzo commented Mar 26, 2024

CU-86azkea74
Resolve #1832

Formula to determine how many columns to drop for a child

  • $n$: number of columns of the table
  • $k$: number of parameter columns per column. For beta's distribution $k=4$
  • $m = \frac{1000}{\text{total number of relationship}}$: maximum number of column per relationship
  • minimum number of column to drop = $n + k - \sqrt{k^2 + 1 + 2m}$

Some other info:

  • I updated some methods inside HMA to be class or static methods so I can use them outside HMA. I preferred keeping them inside HMA, though, because they are very specific to this model (not used for HSA, for instance.)
  • There is an integration test over our biggest demo datasets.
    The method _simplify_data(data, metadata) makes your data match your metadata.
    It can be useful when you're updating your metadata (removing columns and tables...) to make it work along with your data. If we want, it can be public.

@R-Palazzo R-Palazzo requested a review from a team as a code owner March 26, 2024 21:54
@sdv-team
Copy link
Contributor

@R-Palazzo R-Palazzo removed the request for review from a team March 26, 2024 21:54
Copy link
Contributor

@amontanez24 amontanez24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking pretty good! I think in a couple of places we can maybe reduce the number of functions and simplify the logic a few steps. It would be helpful to add a docstring for the _get_num_column_to_drop so we can discuss it more

Also in general can we fix some of the names:

  • childs -> children
  • grandchilds -> grandchildren
  • simplify -> simplified if it is being used as an adjective

sdv/multi_table/hma.py Outdated Show resolved Hide resolved
sdv/multi_table/utils.py Outdated Show resolved Hide resolved
sdv/multi_table/utils.py Outdated Show resolved Hide resolved
sdv/multi_table/utils.py Outdated Show resolved Hide resolved
sdv/multi_table/utils.py Outdated Show resolved Hide resolved
sdv/multi_table/utils.py Outdated Show resolved Hide resolved
Comment on lines 299 to 318
all_descendants_to_keep = all_descendants_per_root[root_with_max_descendants]

simplify_metadata, childs, grandchilds = _simplify_relationships(
simplify_metadata, root_with_max_descendants, all_descendants_to_keep
)

simplify_metadata = _simplify_non_descendants_tables(
simplify_metadata, root_with_max_descendants, all_descendants_to_keep
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be simplified by:

  1. Getting a list of all table names that are within 2 of the root (all of the root's children or grandchildren)
  2. Looping through all tables and pruning the ones that aren't in that list

sdv/multi_table/utils.py Outdated Show resolved Hide resolved
sdv/multi_table/utils.py Outdated Show resolved Hide resolved
sdv/multi_table/utils.py Outdated Show resolved Hide resolved
def test_simplify_schema_invalid_metadata():
"""Test ``simplify_schema`` when the metadata is not invalid."""
# Setup
real_data, metadata = download_demo('multi_table', 'Bupa_v1')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're in the process of fixing the demo metadata, we might want to update this to modify the downloaded demo so that it is guaranteed to be invalid

root_tables = _get_root_tables(relationships)
all_descendants = {}
for root in root_tables:
order_descendances = _get_n_order_descendants(relationships, root, order)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: typo

@codecov-commenter
Copy link

codecov-commenter commented Mar 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.55%. Comparing base (27d1b25) to head (209b720).
Report is 3 commits behind head on main.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1874      +/-   ##
==========================================
+ Coverage   97.46%   97.55%   +0.08%     
==========================================
  Files          51       52       +1     
  Lines        4978     5143     +165     
==========================================
+ Hits         4852     5017     +165     
  Misses        126      126              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


Returns:
int:
Number of columns to drop.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like this returns a tuple

def test_simplify_schema_invalid_data():
"""Test ``simplify_schema`` when the data is not valid."""
# Setup
real_data, metadata = download_demo('multi_table', 'fake_hotels')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we shouldn't download data for the unit tests

sdv/multi_table/utils.py Outdated Show resolved Hide resolved
@R-Palazzo
Copy link
Contributor Author

Thanks for the review @amontanez24, I addressed latest comments in this commit 8ca4158.

Comment on lines 16 to 20
def _get_relationship_for_child(relationships, child_table):
return [rel for rel in relationships if rel['child_table_name'] == child_table]


def _get_relationship_for_parent(relationships, parent_table):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: can we make plural (ie. get_relationships)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done in 209b720

Comment on lines 61 to 68
for root in root_tables:
order_descendants = _get_n_order_descendants(relationships, root, order)
all_descendant_root = set()
for order_key in order_descendants:
all_descendant_root.update(order_descendants[order_key])

all_descendants[root] = all_descendant_root

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we prevent having to recompute children and grandchildren later by having this method return a dict in the following format:

root -> {'order_1': ..., 'order_2': ..., 'num_descendants': 6}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done in 209b720

self._estimate_columns_traversal(table_name, columns_per_table, visited)

return columns_per_table
"We recommend simplifying your metadata schema using 'sdv.utils.simplify_schema'."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: should be "using 'sdv.utils.poc.simplify_schema'"

sdv/multi_table/utils.py Show resolved Hide resolved
sdv/multi_table/utils.py Show resolved Hide resolved
elif is_child_to_drop and child_table in metadata.tables:
del metadata.tables[child_table]

return metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this function will actually modify the metadata object we pass in, we should remove the return. The convention is usually to have functions that modify arguments in-place to not return anything.

for column in columns_to_drop:
del columns[column]

return metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, we should remove the return since we're directly modifying the passed in metadata object

- num_cols_to_drop (int): Number of columns to drop from the child table.
- modelable_columns (dict): Dictionary that maps the modelable sdtype to their columns.
"""
num_column_parameter = 4 # for beta distribution
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we pull this from HMA instead of using a magic number? Maybe update the distribution to number of parameters dict to a constant, and pull it and HMA's default distribution (in case it ever changes) directly from HMA?

Comment on lines 154 to 163
columns_to_sdtypes = {
column: columns[column]['sdtype'] for column in columns
}
sdtypes_to_columns = defaultdict(list)
for col, sdtype in columns_to_sdtypes.items():
sdtypes_to_columns[sdtype].append(col)

modelable_columns = {
key: value for key, value in sdtypes_to_columns.items() if key in MODELABLE_SDTYPE
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be simplified to something like:

modelable_columns = defaultdict(list)
for col, col_meta in columns.items():
    if col_meta['sdtype'] in MODELABLE_SDTYPE:
        modelable_columns[col_meta['sdtype'].append(col)

@R-Palazzo
Copy link
Contributor Author

Thanks @frances-h, @amontanez24 for the review. All comments should have been addressed.

@R-Palazzo R-Palazzo requested a review from frances-h April 5, 2024 08:48
Copy link
Contributor

@amontanez24 amontanez24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM🍾 🎆

Copy link
Contributor

@frances-h frances-h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for addressing! 🎉

@R-Palazzo R-Palazzo merged commit fe72d91 into main Apr 5, 2024
37 checks passed
@R-Palazzo R-Palazzo deleted the issue-1832-simplify-schema branch April 5, 2024 18:15
@R-Palazzo R-Palazzo restored the issue-1832-simplify-schema branch April 5, 2024 19:33
@R-Palazzo R-Palazzo deleted the issue-1832-simplify-schema branch April 5, 2024 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add utility function to simplify multi-table schemas
5 participants