PARSynthesizer: Duplicate sequence index values when `sequence_length` is higher than real data #2031

srinify · 2024-05-28T21:23:06Z

Environment Details

SDV version: SDV 1.12 and SDV 1.13.1

Error Description

When the desired sequence length is higher than the real data's sequence length and min-max enforcement is enabled, PARSynthesizer can generate duplicate sequence index values. This seems to happen especially when the sequence index column is a datetime column. When synthesizing values for the sequence index column, PARSynthesizer runs into the max value and repeats it.
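To make the suspected mechanism concrete, here is a minimal sketch. It is not SDV's actual implementation: the function `generate_clipped_index`, the fixed one-day step, and the dates are all assumptions chosen for illustration. It only shows how clipping an increasing datetime to the real data's maximum makes every step past that ceiling collapse onto the same value.

```python
from datetime import date, timedelta


def generate_clipped_index(start, step_days, length, max_value):
    """Build `length` increasing dates, clipping each one to `max_value`.

    Hypothetical stand-in for min/max enforcement on a sequence index.
    """
    values = []
    for i in range(length):
        candidate = start + timedelta(days=i * step_days)
        # Enforcing the max: anything past the ceiling becomes the ceiling.
        values.append(min(candidate, max_value))
    return values


# Real data spans 5 unique dates (Jan 1 to Jan 5), so the ceiling is Jan 5.
real_max = date(2024, 1, 5)

# Asking for 7 rows forces the last rows to pile up on the ceiling.
synthetic = generate_clipped_index(date(2024, 1, 1), 1, 7, real_max)
duplicates = len(synthetic) - len(set(synthetic))

print(synthetic)
print("duplicate rows:", duplicates)
```

With 5 unique real values and a requested length of 7, the last three rows all land on the maximum date, which matches the repeated values described above.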

Steps to reproduce

Original Data

2 sequences, each with 5 unique values for the visits column (the sequence index)

Synthetic Data

Synthetic Data example when you set sequence_length parameter to 25:

Synthetic Data example when you set sequence_length parameter to 7:

Full code in Internal Colab Notebook here

Workarounds

You can set enforce_min_max_values to False, which removes the max value ceiling on the datetime sequence index column. However, the synthesized data will then be less representative of your real data, so this is a big tradeoff until this bug is fixed.
You can set sequence_length to match the number of unique sequence index values in the smallest, least unique sequence from your real data. E.g. if you have a small sequence with 5 unique values for the sequence index, don't generate more than 5 rows per sequence. This is also a limitation of SDV until this bug is fixed.
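For the second workaround, the per-sequence row limit can be derived from the real data instead of being counted by hand. The sketch below uses only the standard library and is an assumption-laden illustration: the helper names, the `id` key column, and the sample dates are hypothetical (only the `visits` column name comes from the example above). It also includes a small check for spotting sequences that contain repeated index values in sampled output.

```python
from collections import defaultdict


def min_unique_index_count(rows, key_col, index_col):
    """Smallest number of unique index values found in any one sequence."""
    unique = defaultdict(set)
    for row in rows:
        unique[row[key_col]].add(row[index_col])
    return min(len(values) for values in unique.values())


def sequences_with_duplicates(rows, key_col, index_col):
    """Return the sorted keys of sequences that repeat an index value."""
    seen = defaultdict(set)
    flagged = set()
    for row in rows:
        key, value = row[key_col], row[index_col]
        if value in seen[key]:
            flagged.add(key)
        seen[key].add(value)
    return sorted(flagged)


# Hypothetical real data: 2 sequences, each with 5 unique `visits` values.
real_rows = [
    {"id": key, "visits": f"2024-01-0{day}"}
    for key in ("A", "B")
    for day in range(1, 6)
]

# Upper bound for rows per sequence under the second workaround.
safe_length = min_unique_index_count(real_rows, "id", "visits")

# Hypothetical synthetic output where sequence "A" repeats its max value.
synthetic_rows = real_rows + [{"id": "A", "visits": "2024-01-05"}]
affected = sequences_with_duplicates(synthetic_rows, "id", "visits")

print(safe_length, affected)
```

The same duplicate check can be run on sampled data after applying either workaround to confirm that no sequence repeats an index value.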

Original Discussion here: #2004


Scit3ch · 2024-05-29T07:02:29Z

@srinify Thank you very much for investigating further and providing a minimum working example.

A thought about your second workaround, which would be a good solution for my problem with one addition.
If I limit the sequence length via the synthesizer's sequence_length parameter, all sequences will have the same length, which doesn't represent my real data, and very likely not that of other users either.
What about a parameter that limits the maximum sequence length to a defined value but still allows sequences to be shorter?
This could be achieved by finding the longest sequence in the training data. Let's say it's 100 and the maximum sequence length should be 50, which means a 50% decrease for this sequence. This percentage could then be applied to all sequences (e.g. a sequence of 30 in the real data becomes 15).
This would preserve the relative ratio of the sequence lengths between the training and generated data.
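The proportional scaling idea can be sketched in a few lines. This is only an illustration of the proposal, not an existing SDV parameter: the function name `scale_sequence_lengths` is hypothetical, and rounding to the nearest integer with a floor of 1 row is an assumption the proposal does not specify.

```python
def scale_sequence_lengths(real_lengths, max_length):
    """Shrink all lengths by the ratio that maps the longest to `max_length`.

    Lengths are left untouched if the longest already fits the cap.
    """
    longest = max(real_lengths)
    if longest <= max_length:
        return list(real_lengths)
    ratio = max_length / longest
    # Assumption: round to the nearest integer, never dropping below 1 row.
    return [max(1, round(length * ratio)) for length in real_lengths]


# Longest real sequence is 100 and the cap is 50, so the ratio is 0.5.
scaled = scale_sequence_lengths([100, 30, 8], 50)
print(scaled)
```

With the numbers from the comment, the sequence of 100 becomes 50 and the sequence of 30 becomes 15, so relative lengths are preserved up to rounding.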

Regarding your first workaround: Setting enforce_min_max_values to False will affect all columns. An option to exclude certain user-definable columns (like the sequence_index column in this case) would be beneficial.

srinify · 2024-05-29T13:12:31Z

@Scit3ch all great ideas! In this case, it may be faster for the team to fix the core issue than add workarounds :) Stay tuned! 📺

srinify added the bug (Something isn't working) and data:sequential (Related to timeseries datasets) labels May 28, 2024

srinify mentioned this issue May 28, 2024

Repeated sequence_index values in specific situations #2004

Closed

srinify changed the title ~~PARSynthesizer: Duplicate sequence keys when sequence_length is more than unique value count of a sequence~~ PARSynthesizer: Duplicate sequence index values when sequence_length is higher than real data May 28, 2024

lajohn4747 mentioned this issue May 31, 2024

Do not enforce min/max on sequence index column #2043

Merged

lajohn4747 self-assigned this Jun 5, 2024

lajohn4747 added this to the 1.13.2 milestone Jun 5, 2024

lajohn4747 closed this as completed in #2043 Jun 5, 2024

npatki mentioned this issue Jun 17, 2024

Release notes should not include PRs #2074

Closed
