Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Primary keys may not be unique for variable length regexes #2116

Closed
npatki opened this issue Jul 10, 2024 · 6 comments · Fixed by #2161
Closed

Primary keys may not be unique for variable length regexes #2116

npatki opened this issue Jul 10, 2024 · 6 comments · Fixed by #2161
Assignees
Labels
bug Something isn't working
Milestone

Comments

@npatki
Copy link
Contributor

npatki commented Jul 10, 2024

Environment Details

  • SDV version: 1.14.0

Error Description

If a regex has variable length (eg. 3 or 4 characters long with a format such as '[0-9]{3,4}'), the primary keys that SDV generates may not always be unique. This is supported by the diagnostic report being less than 100%.

Steps to reproduce

import pandas as pd

from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
from rdt.transformers.text import RegexGenerator
from sdv.evaluation.single_table import run_diagnostic

data = pd.DataFrame(data={
    'id': [123, 456, 1234, , 5678],
    'col': [23.4, 12.3, 10.0, 90.4, 10.3]
})

metadata = SingleTableMetadata.load_from_dict({
    'columns': {
        'id': { 'sdtype': 'id', 'regex_format': '[0-9]{3,4}' },
        'col': { 'sdtype': 'numerical' }
    },
    'primary_key': 'id'
})

synth = GaussianCopulaSynthesizer(metadata)

# I am specifically assigning the RegexGenerator for illustration purposes
synth.auto_assign_transformers(data)
synth.update_transformers({
    'id': RegexGenerator(regex_format='[0-9]{3,4}', enforce_uniqueness=True, generation_order='alphanumeric')
})

synth.fit(data)
synthetic_data = synth.sample(num_rows=10_000)

diagnostic = run_diagnostic(data, synthetic_data, metadata)
diagnostic.get_details('Data Validity')

Output:
image

Expected Fix

This case can only happen when:

  • The real data has a primary key (or alternate key) stored as an int AND
  • The metadata for it includes a Regex AND
  • The Regex allows for a '0' to be in the first position of the string

If and only if all 3 conditions are met, the synthesizer should throw an error.

synthesizer = HSASynthesizer(metadata)
synthesiser.fit(data)
SynthesizerInputError: Primary key for table 'users' is stored as an int but the Regex allows it to start with '0'. Please remove the Regex or update it to correspond to valid ints.
@npatki npatki added the bug Something isn't working label Jul 10, 2024
@amontanez24
Copy link
Contributor

@npatki Do we only need to check primary keys, or should this error for other ids that have regexes?

@npatki
Copy link
Contributor Author

npatki commented Jul 18, 2024

@amontanez24 is there a reason why we should do this for other IDs?

The problem we are trying to solve is that duplicate values may be generated from a regex. This is only an issue if the column is a primary key (or alternate key) right?

@amontanez24
Copy link
Contributor

I meant unique ids, so I guess alternate keys

@npatki
Copy link
Contributor Author

npatki commented Jul 18, 2024

Got it! I updated the original issue text to include alternate keys as well.

@amontanez24
Copy link
Contributor

@npatki The issue includes the table name in the warning, but it would be kind of annoying to make that happen since at the point that the single table synthesizer gets fit, it doesn't know the table name. Is including the primary key name enough?

@npatki
Copy link
Contributor Author

npatki commented Aug 1, 2024

Discussed today with @amontanez24. Ideally we include the table name because the primary key name is not unique enough to identify a table. Eg. I've seen schemas where all of the primary keys are just called ID.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants