Handle PARSynthesizer model if sequence_index is missing #114

Closed
wants to merge 1 commit into from

lajohn4747
Contributor

resolves sdv-dev/SDV#1972
CU-86b08wr44

When the sequence index is missing, par.py adds a constant column so that the data can still be modeled. That added context column does not exist in the data, however, which causes KeyErrors. This PR adds a check to prevent those failures.

@lajohn4747 lajohn4747 requested a review from a team as a code owner May 17, 2024 16:43
@lajohn4747 lajohn4747 requested review from frances-h and amontanez24 and removed request for a team May 17, 2024 16:43
@@ -181,7 +182,8 @@ def assemble_sequences(
     groupby_columns = entity_columns[0] if len(entity_columns) == 1 else entity_columns
     for _, sequence in data.groupby(groupby_columns):
         sequence.drop(entity_columns, axis=1, inplace=True)
-        if context_columns:
+        missing_columns = [col for col in context_columns if col not in sequence.columns]
+        if context_columns and not missing_columns:
Contributor

Should we still process the context columns that are present in the sequence instead of skipping them all? Or is the fake column the only one in context_columns?
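
To make the question concrete, here is a minimal sketch of the two behaviours being compared. It uses plain dictionaries in place of pandas DataFrames so that it runs without dependencies. The helper names `context_all_or_nothing` and `context_present_only` and the column name `dummy_context` are hypothetical and are not part of DeepEcho; the sketch also assumes that context values are constant within a sequence.

```python
def context_all_or_nothing(sequence, context_columns):
    """Mirror the check in this diff: use context only if every column exists."""
    missing_columns = [col for col in context_columns if col not in sequence]
    if context_columns and not missing_columns:
        # Context is assumed constant within a sequence, so take the first row.
        return [sequence[col][0] for col in context_columns]
    # Any missing column means the whole context is skipped.
    return []


def context_present_only(sequence, context_columns):
    """Alternative raised in review: keep the context columns that do exist."""
    present = [col for col in context_columns if col in sequence]
    return [sequence[col][0] for col in present]


# One sequence as a column -> values mapping (a stand-in for a DataFrame group).
sequence = {"region": ["EU", "EU"], "value": [1.0, 2.0]}
# "dummy_context" is a hypothetical added column that is not in the data.
context_columns = ["region", "dummy_context"]

all_or_nothing = context_all_or_nothing(sequence, context_columns)
present_only = context_present_only(sequence, context_columns)
```

With the check as written in the diff, a single missing column drops the real `region` context as well; the second helper keeps it. Which behaviour is right depends on whether the fake column can ever appear alongside real context columns.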

Contributor

@frances-h frances-h left a comment

Why are missing context columns being passed to PAR in the first place? My understanding is that the UUID column gets added so that we have a dummy column for the context synthesizer. Presumably it should either (1) have been added to the data as a dummy constant column or (2) not be passed along to PAR at all.

@frances-h
Contributor

Looking into this more, I think the problem is actually that we're adding the UUID column to self._extra_context_columns. This attribute should only be used for context generated when transforming/preprocessing the data. We'll need to modify how we create the metadata for the context synthesizer so that the UUID column gets added there.

@lajohn4747
Contributor Author

Why does the UUID column need to be added for modeling purposes? The issue appears to be resolved and all tests pass (except for a unit test that checks for the added column) when I remove the added UUID column, so I am not sure it is still needed.

@frances-h
Contributor

I think the problem is that we can't fit on an empty DataFrame, so when there are no context columns we have to add a dummy column to create the context model without erroring out.
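
The constraint described here can be sketched roughly as follows. This is an illustration with plain dictionaries and a hypothetical helper, `build_context_table`, not SDV's actual code; the placeholder name `dummy_context` and the constant value `0` are assumptions. When no context columns are configured, the per-sequence context table would have no columns, so a constant column is added purely to give the context model something to fit.

```python
def build_context_table(sequence_keys, context_by_key, context_columns,
                        placeholder="dummy_context"):
    """Return one context row per sequence, adding a constant column if empty."""
    rows = []
    for key in sequence_keys:
        row = {col: context_by_key[key][col] for col in context_columns}
        if not context_columns:
            # No real context: add a constant value so the table is not empty.
            row[placeholder] = 0
        rows.append(row)
    # Report which columns exist only in this table, so that callers do not
    # look them up in the original data (which would raise a KeyError).
    added_columns = [] if context_columns else [placeholder]
    return rows, added_columns


# Two sequences and no configured context columns.
rows, added = build_context_table(["a", "b"], {"a": {}, "b": {}}, [])
```

Returning the added column names separately is one way to keep the placeholder out of any list of columns that is later looked up in the original data.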

@lajohn4747
Contributor Author

@lajohn4747 lajohn4747 closed this May 20, 2024
Development

Successfully merging this pull request may close these issues.

PARSynthesizer model won't fit if sequence_index is missing
3 participants