Add a utility to drop unknown references (and enforce referential integrity) #1792

npatki · 2024-02-14T20:00:18Z

Problem Description

For multi-table datasets, SDV currently expects every foreign key value to be present in the corresponding primary key (aka referential integrity). For various reasons* I may be in possession of a dataset that does not have referential integrity. This prevents SDV from modeling my data; it raises an error instead.

*Reasons may include some messiness in the data source, or having random (incomplete) data from various data sources

Expected behavior

Add a utility function called utils.drop_unknown_references:

Parameters:

(required) metadata: A MultiTableMetadata object
(required) data: A dictionary that maps each table name (string) to the data for that table (pandas.DataFrame)
drop_missing_values: A boolean describing whether or not to also drop rows whose foreign keys have missing values
- (default) True: Drop a row if a foreign key has missing values
- False: Allow rows to contain missing values as foreign keys

Output: A dictionary that maps each of the original table names (string) to cleaned data for that table (pandas.DataFrame). The cleaned data should have referential integrity.

from sdv.utils import drop_unknown_references
from sdv.multi_table import HMASynthesizer

cleaned_data = drop_unknown_references(metadata=my_metadata, data=original_data)

synth = HMASynthesizer(my_metadata)
synth.fit(cleaned_data) # now synthesizers should accept the cleaned data
...

Note that if a table has multiple foreign keys, then the script should only keep rows where all foreign keys have references. If any one foreign key has an unknown reference, the entire row should be dropped.
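
To make the intended behavior concrete, here is a rough sketch of the dropping logic in plain pandas. It is not the proposed implementation: `drop_unknown_references_sketch` and the `relationships` list of `(parent_table, primary_key, child_table, foreign_key)` tuples are hypothetical stand-ins for what the real function would read from the MultiTableMetadata object. The loop repeats until nothing changes, on the assumption that dropping rows from one table can leave dangling references in its own children.

```python
import pandas as pd


def drop_unknown_references_sketch(relationships, data, drop_missing_values=True):
    """Illustrative only: keep child rows whose foreign keys all resolve.

    `relationships` is a hypothetical list of
    (parent_table, primary_key, child_table, foreign_key) tuples.
    """
    cleaned = {name: table.copy() for name, table in data.items()}
    changed = True
    while changed:  # dropping rows in one table can orphan rows further down
        changed = False
        for parent, primary_key, child, foreign_key in relationships:
            child_table = cleaned[child]
            fk_values = child_table[foreign_key]
            keep = fk_values.isin(cleaned[parent][primary_key])
            if drop_missing_values:
                keep = keep & fk_values.notna()
            else:
                keep = keep | fk_values.isna()
            if not keep.all():
                # Each relationship filters the same child table in turn, so a
                # row survives only if every one of its foreign keys is known.
                cleaned[child] = child_table[keep]
                changed = True
    return cleaned


# Made-up example data: one parent table and one child table.
parents = pd.DataFrame({'id': [1, 2, 3]})
children = pd.DataFrame({
    'child_id': [10, 11, 12, 13],
    'parent_id': [1, 2, 99, None],  # 99 is unknown, None is missing
})
example_relationships = [('parents', 'id', 'children', 'parent_id')]
example_data = {'parents': parents, 'children': children}

strict = drop_unknown_references_sketch(example_relationships, example_data)
lenient = drop_unknown_references_sketch(
    example_relationships, example_data, drop_missing_values=False
)
```

With the default setting, the rows with the unknown reference and the missing reference are both dropped; with drop_missing_values=False, only the row with the unknown reference is dropped.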

Update the error message: If the passed-in data does not have referential integrity, update the error message to point the user towards the drop_unknown_references method. This error check happens in metadata.validate_data.

synth.fit(original_data)

InvalidDataError: The provided data does not match the metadata:
Relationships:
Error: foreign key column 'parent_id' contains unknown references: <values>. Please use the utility method 'drop_unknown_references' to clean the data.

Additional context

Do some error validation on the passed-in data. It must contain the same table names and the same primary/foreign key column names as described in the metadata.
After dropping rows with unknown references, there should be at least 1 row remaining in every table. If a table has 0 rows remaining, then throw an error.

InvalidDataError: All references in table 'transactions' are unknown and must be dropped. Try providing different data for this table.
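
The two checks above could look roughly like the sketch below. The function names and the `key_columns` mapping (table name to a list of key column names) are hypothetical stand-ins for information that would come from the metadata, and `ValueError` is used here only to keep the example self-contained; the real utility would raise the InvalidDataError shown above.

```python
import pandas as pd


def validate_input_tables(key_columns, data):
    """Check table names and key columns before dropping anything.

    `key_columns` is a hypothetical mapping of table name to the list of
    primary/foreign key column names described in the metadata.
    """
    if set(key_columns) != set(data):
        raise ValueError('The data and the metadata must contain the same table names.')
    for table_name, columns in key_columns.items():
        missing = [col for col in columns if col not in data[table_name].columns]
        if missing:
            raise ValueError(f"Table '{table_name}' is missing key columns: {missing}")


def check_rows_remaining(cleaned_data):
    """Raise if any table was left with 0 rows after dropping."""
    for table_name, table in cleaned_data.items():
        if table.empty:
            raise ValueError(
                f"All references in table '{table_name}' are unknown and must be "
                'dropped. Try providing different data for this table.'
            )
```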


npatki added the feature request and data:multi-table labels Feb 14, 2024

npatki mentioned this issue Feb 14, 2024

Cleanup utils module: Make internal functions private #1793

Closed

R-Palazzo mentioned this issue Feb 20, 2024

Add a utility to drop unknown references (and enforce referential integrity) #1800

Merged

R-Palazzo closed this as completed in #1800 Feb 26, 2024

npatki mentioned this issue Mar 11, 2024

Add verbosity to drop_unknown_references #1845

Closed

frances-h added this to the 1.11.0 milestone Mar 21, 2024

frances-h assigned R-Palazzo Mar 21, 2024
