Problem Description
For any column that has sdtype id without a user-provided regex, the SDV currently generates index values sequentially (e.g., 0, 1, 2, ...). The resulting data doesn't look realistic.
Expected behavior
For any columns of sdtype id that do not have a user-provided regex, ensure that the synthetic data is created randomly.
In technical terms: We currently generate sequential values by assigning the IDGenerator transformer (from RDT) to these columns. Instead, we should assign the AnonymizedFaker to those columns and use the bothify function to make random strings. The exact params to bothify depend on the dtype (storage type) of the data.
If the column is numeric (int, float, etc), we need to ensure that the resulting synthetic values can be cast back to this dtype. In this case, assign the following transformer: AnonymizedFaker(provider_name=None, function_name='bothify', function_kwargs={'text': '##########'})
This allows for 10 billion (10^10) possible values, since the pattern has ten digit placeholders
When cast back to numbers, they will be completely randomized
If it's a primary key, then also set cardinality_rule='unique'
Otherwise, the synthetic values can remain as a string (object) type. In this case, assign the following transformer: AnonymizedFaker(provider_name=None, function_name='bothify', function_kwargs={'text': 'sdv-id-??????'})
This allows for well over 1 billion possible values
If it's a primary key, then also set cardinality_rule='unique'
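To make the two patterns concrete, here is a minimal sketch that imitates the substitution rule using only the standard library. It assumes that bothify replaces each `#` with a random digit and each `?` with a random ASCII letter. The helpers `bothify_sketch` and `unique_ids` are hypothetical names for illustration, not part of Faker, RDT, or the SDV.

```python
import random
import string


def bothify_sketch(text, rng):
    """Stand-in for bothify (assumption): '#' -> digit, '?' -> ASCII letter."""
    out = []
    for ch in text:
        if ch == '#':
            out.append(rng.choice(string.digits))
        elif ch == '?':
            out.append(rng.choice(string.ascii_letters))
        else:
            out.append(ch)  # literal characters such as 'sdv-id-' pass through
    return ''.join(out)


def unique_ids(text, n, seed=0):
    """Draw n distinct values, mimicking what cardinality_rule='unique' asks for."""
    rng = random.Random(seed)
    seen = set()
    ids = []
    while len(ids) < n:
        value = bothify_sketch(text, rng)
        if value not in seen:  # retry on the (rare) collision
            seen.add(value)
            ids.append(value)
    return ids


# Numeric storage: ten digit placeholders, then cast back to int.
numeric_ids = [int(s) for s in unique_ids('##########', 5)]

# String storage: fixed prefix plus six random letters.
string_ids = unique_ids('sdv-id-??????', 5)
```

Because every numeric string has the same fixed length, two different strings always cast to two different integers, so uniqueness survives the cast even when leading zeros are dropped.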
Additional context
This change only applies if a column is sdtype 'id' AND there is no 'regex_format' available.
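The selection rule described in this issue can be sketched as a small pure function. This is only an illustration of the proposed logic, not the SDV's actual code: `choose_id_transformer` is a hypothetical helper, `dtype_kind` is assumed to be a NumPy-style dtype kind character (`'i'`, `'u'`, `'f'` for numeric), and the function returns the transformer name with its keyword arguments rather than constructing the transformer.

```python
NUMERIC_KINDS = {'i', 'u', 'f'}  # assumption: NumPy dtype.kind codes for int/uint/float


def choose_id_transformer(sdtype, regex_format, dtype_kind, is_primary_key):
    """Return (transformer_name, kwargs), or None when the change does not apply."""
    # The change only applies to sdtype 'id' columns with no regex_format.
    if sdtype != 'id' or regex_format is not None:
        return None

    # Numeric columns need digits only so the values can be cast back.
    text = '##########' if dtype_kind in NUMERIC_KINDS else 'sdv-id-??????'
    kwargs = {
        'provider_name': None,
        'function_name': 'bothify',
        'function_kwargs': {'text': text},
    }
    if is_primary_key:
        kwargs['cardinality_rule'] = 'unique'
    return 'AnonymizedFaker', kwargs


# Example: an integer primary key with no regex.
name, kwargs = choose_id_transformer('id', None, 'i', True)
```

Columns that do have a `regex_format` fall through with `None` here, meaning their existing handling is left untouched.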