Add a utility to drop unknown references (and enforce referential integrity) #1792

Closed
npatki opened this issue Feb 14, 2024 · 0 comments · Fixed by #1800
Labels: data:multi-table (Related to multi-table, relational datasets), feature request (Request for a new feature)

npatki commented Feb 14, 2024

Problem Description

For multi-table datasets, SDV currently expects every foreign key value to be present in the corresponding primary key (aka referential integrity). For various reasons* I may be in possession of a dataset that does not have referential integrity. This prevents SDV from modeling my data; it raises an error instead.

*Reasons may include some messiness in the data source, or having random (incomplete) data from various data sources

Expected behavior

Add a utility function called utils.drop_unknown_references:

Parameters:

  • (required) metadata: A MultiTableMetadata object
  • (required) data: A dictionary that maps each table name (string) to the data for that table (pandas.DataFrame)
  • drop_missing_values: A boolean describing whether or not to also drop foreign keys with missing values
    • (default) True: Drop a row if a foreign key has missing values
    • False: Allow rows to contain missing values as foreign keys

Output: A dictionary that maps each of the original table names (string) to cleaned data for that table (pandas.DataFrame). The cleaned data should have referential integrity.

from sdv.utils import drop_unknown_references
from sdv.multi_table import HMASynthesizer

cleaned_data = drop_unknown_references(metadata=my_metadata, data=original_data)

synth = HMASynthesizer(my_metadata)
synth.fit(cleaned_data)  # now synthesizers should accept the cleaned data
...

Note that if a table has multiple foreign keys, then the function should only keep rows where all foreign keys have references. If any one foreign key has an unknown reference, the entire row should be dropped.
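As a rough illustration of the behavior described above, here is a minimal pandas sketch. It is not the SDV implementation: the function name, the plain list of relationship dicts (and its key names), the repeat-until-stable loop, and the use of ValueError in place of InvalidDataError are all assumptions made for the example.

```python
import pandas as pd


def drop_unknown_references_sketch(relationships, data, drop_missing_values=True):
    """Illustrative only. `relationships` is a hypothetical list of dicts,
    one per parent/child link, standing in for the metadata object."""
    cleaned = {name: table.copy() for name, table in data.items()}

    # Assumption: repeat until nothing changes, because dropping rows from a
    # table can leave rows in its own child tables without a parent.
    changed = True
    while changed:
        changed = False
        for rel in relationships:
            parent = cleaned[rel["parent_table_name"]]
            child = cleaned[rel["child_table_name"]]
            foreign_key = child[rel["child_foreign_key"]]

            # A row is valid for this relationship if its foreign key value
            # exists in the parent's primary key column.
            valid = foreign_key.isin(parent[rel["parent_primary_key"]])
            if drop_missing_values:
                valid = valid & foreign_key.notna()
            else:
                valid = valid | foreign_key.isna()

            # Filtering per relationship means a row survives only if all of
            # its foreign keys have known references.
            if not valid.all():
                cleaned[rel["child_table_name"]] = child[valid]
                changed = True

    for name, table in cleaned.items():
        if table.empty:
            # Stand-in for the InvalidDataError described in the issue.
            raise ValueError(
                f"All references in table '{name}' are unknown and must be "
                "dropped. Try providing different data for this table."
            )
    return cleaned


# Made-up example data: 'transactions' has two foreign keys.
original_data = {
    "users": pd.DataFrame({"user_id": [1, 2, 3]}),
    "stores": pd.DataFrame({"store_id": [10, 20]}),
    "transactions": pd.DataFrame({
        "user_id": [1, 2, 9, None, 3],
        "store_id": [10, 99, 20, 10, 20],
    }),
}
relationships = [
    {"parent_table_name": "users", "parent_primary_key": "user_id",
     "child_table_name": "transactions", "child_foreign_key": "user_id"},
    {"parent_table_name": "stores", "parent_primary_key": "store_id",
     "child_table_name": "transactions", "child_foreign_key": "store_id"},
]

cleaned = drop_unknown_references_sketch(relationships, original_data)
kept = drop_unknown_references_sketch(
    relationships, original_data, drop_missing_values=False
)
```

In this made-up data, the row with store_id 99 is dropped even though its user_id is known, and the row with a missing user_id survives only when drop_missing_values is False.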

Update the error message: If the passed-in data does not have referential integrity, the error should point the user towards the drop_unknown_references utility. This check happens in metadata.validate_data.

synth.fit(original_data)
InvalidDataError: The provided data does not match the metadata:
Relationships:
Error: foreign key column 'parent_id' contains unknown references: <values>. Please use the utility method 'drop_unknown_references' to clean the data.

Additional context

  • Validate the passed-in data. It must contain the same table names and the same primary/foreign key column names as described in the metadata.
  • After dropping rows with unknown references, there should be at least 1 row remaining in every table. If a table has 0 rows remaining, then raise an error.
InvalidDataError: All references in table 'transactions' are unknown and must be dropped. Try providing different data for this table.
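
The input validation in the first bullet could look roughly like the sketch below. It is only an illustration under assumptions: the helper name and the relationship dict layout are hypothetical, it checks only tables named in a relationship, and it returns plain strings rather than raising SDV's own error type.

```python
import pandas as pd


def find_table_and_key_problems(relationships, data):
    """Hypothetical helper: list missing tables and missing key columns."""
    problems = []
    for rel in relationships:
        for table_key, column_key in (
            ("parent_table_name", "parent_primary_key"),
            ("child_table_name", "child_foreign_key"),
        ):
            table_name, column = rel[table_key], rel[column_key]
            if table_name not in data:
                problems.append(f"Table '{table_name}' is missing from the data.")
            elif column not in data[table_name].columns:
                problems.append(
                    f"Column '{column}' is missing from table '{table_name}'."
                )
    # Drop repeated messages while keeping their original order.
    return list(dict.fromkeys(problems))


# Made-up input: 'users' lacks its key column and 'transactions' is absent.
sample_relationships = [
    {"parent_table_name": "users", "parent_primary_key": "user_id",
     "child_table_name": "transactions", "child_foreign_key": "user_id"},
]
sample_data = {"users": pd.DataFrame({"id": [1, 2]})}

problems = find_table_and_key_problems(sample_relationships, sample_data)
```

Collecting every problem before reporting lets the user fix all mismatches in one pass instead of one error at a time.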