
Improving Multi-Table Synthetic Data (Healthcare dataset) -- NaN values getting created #1755

Closed
npatki opened this issue Jan 24, 2024 · 32 comments
Labels
data:multi-table (Related to multi-table, relational datasets)
question (General question about the software)
resolution:resolved (The issue was fixed, the question was answered, etc.)

Comments

@npatki
Contributor

npatki commented Jan 24, 2024

I'm filing this issue on behalf of a user.

Environment details

  • SDV version: ?

Problem description

We tried running the HMASynthesizer on three tables:

  • MemInput_COM_2019 with columns Member_ID, Age, Gender and Exposure_Months. Basically a membership dataset. Total around 150k records.
  • PharmInput_COM_2019 with columns Member_ID, NDC, FillDate, MR_Allowed, MR_Paid, Days_Supplied and Qty_Dispensed. Basically a drug dataset. Total around 1,227k records.
  • MedInput_COM_2019 with columns Member_ID, ToDate, ICDDiag01-25, ProcCode, POS, MR_Allowed and MR_Paid. Basically a medical diagnosis dataset. Total around 3,404k records.

The tables are linked by one key, Member_ID. However, when we generated synthesized data at a 1% scale, the relationships between dates, NDC codes, and ICD codes do not seem to show up properly, judging from the screenshots of the synthesized datasets. Can you advise how we might be able to improve it? Thanks.

[screenshots of the synthesized datasets]

@npatki npatki added question (General question about the software) and new (Automatic label applied to new issues) labels Jan 24, 2024
@npatki

npatki commented Jan 24, 2024

Hello,

I just wanted to confirm my understanding of the problem:

  1. The synthesizer is faithfully reconstructing NDC and ICD codes that were present in the original data. It is not inventing entirely new or invalid NDC/ICD codes -- such as random values, missing values, or codes that do not make sense.
  2. For a given member (a row in MemInput_COM_2019), you are looking at the associated drugs (rows in PharmInput_COM_2019) as well as associated medical diagnoses (MedInput_COM_2019). Some of these associations are not realistic. For example, you may be seeing a specific drug (Acetaminophen) that is not useful for a diagnosis (Diabetes).

Could you confirm if this is accurate?

Additional Info

It would also be useful if you could provide a bit more information about how the three tables are connected/what they represent.

  • In MedInput_COM_2019, I see that there are 25 columns for ICD Diagnoses. Does this mean:
    • That there are up to 25 diagnoses possible per person?
    • That if there are <25 diagnoses, there are NaNs for the remaining ones? Eg. you may fill up ICDDiag01-ICDDiag06, but then leave ICDDiag07-ICDDiag25 blank?
  • Are there any restrictions for the number of connections between the tables?
    • From the sizes of the tables, it appears that there are many members (MemInput_COM_2019) that do not have any associated diagnoses or drugs?
    • Is it possible for a member to have a diagnosis but no drugs? Is it possible for a member to have a drug but no diagnosis?
    • Is it possible for a member (row in MemInput_COM_2019) to correspond to 2 or more rows in MedInput_COM_2019? Or is it at most 1 row?
    • Is it possible for a member (row in MemInput_COM_2019) to correspond to 2 or more rows in PharmInput_COM_2019? Or is it at most 1 row?

@npatki npatki added under discussion (Issue is currently being discussed) and removed new (Automatic label applied to new issues) labels Jan 24, 2024
@leeyuntien

Yes, points 1 and 2 mentioned above are accurate. Our initial question would be why there are out-of-range date values and N/A's, given that there are no N/A's in columns like NDC, FillDate, or MR_Allowed in the original datasets.

For the questions on MedInput_COM_2019: yes, there are up to 25 diagnoses possible per person, and if there are <25 diagnoses the remaining ones are left blank. For the questions on restrictions on the number of connections between the tables: there are no restrictions, i.e. there could be members without any Med or Pharm, and there could be other members with more than one Med or Pharm, or both.

@npatki

npatki commented Jan 24, 2024

Thanks for the information. Very helpful. We can focus on this:

Our initial question would be why there are out-of-range date values and N/A's, given that there are no N/A's in columns like NDC, FillDate, or MR_Allowed in the original datasets.

Missing Values

You are saying that the real data does not have any missing values (all values are filled in), but the synthetic data does have missing values.

In this case, I believe the root cause is issue #1691 -- there is currently a bug in the HMASynthesizer that we hope to fix soon. I have included 2 possible workarounds in that issue.

Out-of-Range Values

By default, the HMASynthesizer should note down the min/max value of each column in the original data. It should ensure that the synthetic data does not go out-of-bounds. Is this not the case for your data?

Would you be able to provide more details as to which particular column(s) this is happening for?

Better yet -- I would recommend running the Diagnostic Report on the real vs. synthetic data. This report is designed to capture and provide more insights into the exact problems you're mentioning (inventing new values like NaN, and going out-of-bounds). If the score is not 1.0 here, it means there is a bug. You can share with us any detailed breakdowns where you are noticing that the score is <1.0.

@leeyuntien

Sure, will see if a diagnostic report can be generated.

@leeyuntien

Just updated to sdv 1.9.0, and the learning process of HMASynthesizer.fit finished with the same set of tables, i.e. the parent MemInput_COM_2019 table linked to two child tables, PharmInput_COM_2019 and MedInput_COM_2019, by Member_ID. However, the following error messages were generated when calling HMASynthesizer.sample. Do you know why?

C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\multi_table\hma.py:444: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set pd.set_option('future.no_silent_downcasting', True)
flat_parameters = parent_row[keys].fillna(0)
[the FutureWarning above repeats several more times]
C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\scipy\stats\_continuous_distns.py:700: RuntimeWarning: Error in function boost::math::tgamma(%1%,%1%): Series evaluation exceeded %1% iterations, giving up now.
return _boost._beta_ppf(q, a, b)
Traceback (most recent call last):
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\data_processing\data_processor.py", line 906, in reverse_transform
reversed_data[column_name] = reversed_data[column_name].astype(dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\generic.py", line 6637, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\internals\managers.py", line 431, in astype
return self.apply(
^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\internals\managers.py", line 364, in apply
applied = getattr(b, f)(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\internals\blocks.py", line 758, in astype
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\dtypes\astype.py", line 237, in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\dtypes\astype.py", line 182, in astype_array
values = _astype_nansafe(values, dtype, copy=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\dtypes\astype.py", line 101, in _astype_nansafe
return _astype_float_to_int_nansafe(arr, dtype, copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\dtypes\astype.py", line 145, in _astype_float_to_int_nansafe
raise IntCastingNaNError(
pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 1, in
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\multi_table\base.py", line 393, in sample
sampled_data = self._sample(scale=scale)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\sampling\hierarchical_sampler.py", line 222, in _sample
self._sample_children(table_name=table, sampled_data=sampled_data)
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\sampling\hierarchical_sampler.py", line 142, in _sample_children
self._add_child_rows(
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\sampling\hierarchical_sampler.py", line 108, in _add_child_rows
sampled_rows = self._sample_rows(child_synthesizer, num_rows)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\sampling\hierarchical_sampler.py", line 71, in _sample_rows
return synthesizer._sample_batch(int(num_rows), keep_extra_columns=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\single_table\base.py", line 602, in _sample_batch
sampled, num_valid = self._sample_rows(
^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\single_table\base.py", line 519, in _sample_rows
sampled = self._data_processor.reverse_transform(raw_sampled)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\data_processing\data_processor.py", line 920, in reverse_transform
raise ValueError(e)
ValueError: Cannot convert non-finite values (NA or inf) to integer

@leeyuntien

Also in sdv 1.9.0 there is no HSASynthesizer?

Traceback (most recent call last):
File "", line 1, in
ImportError: cannot import name 'HSASynthesizer' from 'sdv.multi_table' (C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\multi_table\__init__.py)

@npatki

npatki commented Jan 29, 2024

Hi @leeyuntien, thanks for getting back. Were you able to resolve the original problem at the beginning of this issue? Or are you retrying everything with the newest SDV version now?

Error Message

the following error messages were generated when calling HMASynthesizer.sample. Do you know why?

This is strange indeed, because the actual line of code that is causing the issue is not supposed to crash. We are catching the ValueError and allowing the sampling to proceed:

try:
    reversed_data[column_name] = reversed_data[column_name].astype(dtype)
except ValueError as e:
    column_metadata = self.metadata.columns.get(column_name)

The fact that yours crashes anyways (with a ValueError) probably means the newest version of SDV (1.9.0) is not being used for some reason.

In the past, I've noticed that there are sometimes caching issues if you are using a notebook type environment. To sanity check, could you run the following and verify that it prints '1.9.0'?

import sdv
print(sdv.__version__)

HSA

Also in sdv 1.9.0 there is no HSASynthesizer?

The HSASynthesizer is available in the SDV Enterprise SDK, not the public SDV. To get access to the SDV Enterprise SDK, you'd need to purchase a license with us.


@leeyuntien

sdv version
[screenshot]

@npatki

npatki commented Jan 29, 2024

Hi @leeyuntien, thanks for confirming. We were able to dig in a little further, and it looks like this is actually happening due to the same root cause as issue #1691 (linked above). Have you tried the workarounds listed in that issue (using 'norm' or 'truncnorm')?

Something else that might help as a workaround: if any columns are stored as integers in memory (in Python), I would suggest casting them to float for the sake of running them through SDV. To see which column(s) are represented as ints, you can run the following for each of the table names:

print(data[TABLE_NAME].dtypes)

Then you can convert any columns that are listed as int or int64 into floats:

data[TABLE_NAME][COLUMN_NAME] = data[TABLE_NAME][COLUMN_NAME].astype('float')
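If there are many tables and columns, the same cast can be applied in a loop. A minimal sketch using pandas, with hypothetical stand-in tables (replace `data` with the real dictionary of DataFrames):

```python
import pandas as pd

# Hypothetical stand-in tables; replace with the real data dictionary
data = {
    'mem': pd.DataFrame({'Member_ID': [1, 2], 'Exposure_Months': [12, 6]}),
    'pharm': pd.DataFrame({'Member_ID': [1, 1], 'Days_Supplied': [30, 90]}),
}

# Cast every integer-typed column in every table to float before fitting
for table_name, table in data.items():
    for column in table.columns:
        if pd.api.types.is_integer_dtype(table[column]):
            data[table_name][column] = table[column].astype('float')

print(data['mem'].dtypes)  # Member_ID and Exposure_Months are now float64
```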

The good news is that we are actively working on the underlying issue and hope to have a fix up in the near future. Thanks for bearing with us.

@leeyuntien-milli

Just tried the workarounds listed in the issue but still got this message. Will change int to float to test.

Traceback (most recent call last):
File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\data_processing\data_processor.py", line 906, in reverse_transform
reversed_data[column_name] = reversed_data[column_name].astype(dtype)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py", line 5546, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 595, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 406, in apply
applied = getattr(b, f)(**kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py", line 595, in astype
values = astype_nansafe(vals1d, dtype, copy=True)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 966, in astype_nansafe
raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
ValueError: Cannot convert non-finite values (NA or inf) to integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 1, in
File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\multi_table\base.py", line 393, in sample
sampled_data = self._sample(scale=scale)
File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\sampling\hierarchical_sampler.py", line 222, in _sample
self._sample_children(table_name=table, sampled_data=sampled_data)
File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\sampling\hierarchical_sampler.py", line 142, in _sample_children
self._add_child_rows(
File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\sampling\hierarchical_sampler.py", line 108, in _add_child_rows
sampled_rows = self._sample_rows(child_synthesizer, num_rows)
File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\sampling\hierarchical_sampler.py", line 71, in _sample_rows
return synthesizer._sample_batch(int(num_rows), keep_extra_columns=True)
File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\single_table\base.py", line 602, in _sample_batch
sampled, num_valid = self._sample_rows(
File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\single_table\base.py", line 519, in _sample_rows
sampled = self._data_processor.reverse_transform(raw_sampled)
File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\data_processing\data_processor.py", line 920, in reverse_transform
raise ValueError(e)
ValueError: Cannot convert non-finite values (NA or inf) to integer

@npatki

npatki commented Jan 30, 2024

Sounds good. The change to 'truncnorm' or 'norm' generally makes it less likely to run into the problem, but it is not guaranteed. I hope the int-to-float workaround is able to resolve the crash.

@leeyuntien-milli

All datasets were put into the fitting process, but a scale of 0.01 was used to sample.

mem table seems normal
[screenshot]

However, there are still NaN's and NaT's in the med and pharm tables.
[screenshot]

@npatki

npatki commented Jan 31, 2024

Great to hear that it's no longer crashing! This was the immediate goal so at least you have some synthetic data to work with for v1.9.0.

The NaN values are expected right now due to issue #1691. Since the suggested workaround* is not guaranteed, you would have to wait until we resolve this issue. Rest assured that we are actively looking into the root cause and hope to have a resolution in a future release.

*The suggested workaround is to use 'truncnorm' (or 'norm'). You may want to try using 'truncnorm' in addition to converting the columns to floats. This, too, is a temporary workaround that is not 100% guaranteed at the moment.

@npatki

npatki commented Feb 20, 2024

Hi @leeyuntien -- good news! We have released an updated version of SDV (v1.10.0) that should resolve this issue.

You should no longer have to apply any workarounds. The HMASynthesizer should now be able to run by default without running into any Errors and without creating any unnecessary NaN/NaT values.

Please upgrade to the latest version and give it a try. If you continue to run into this problem, feel free to reply and we can always re-open the issue to continue the investigation. (For any other problems unrelated to NaNs, please feel free to file a new issue.) Thanks.

@npatki npatki closed this as completed Feb 20, 2024
@npatki npatki added resolution:resolved (The issue was fixed, the question was answered, etc.) and removed under discussion (Issue is currently being discussed) labels Feb 20, 2024
@leeyuntien-milli

sdv has been updated to 1.10.0, but there are still NaNs and NaTs in the synthesized datasets even though there are none in the source datasets. Can you advise other ways to deal with it?
[screenshot]

@npatki

npatki commented Mar 1, 2024

Hi @leeyuntien-milli, sorry to hear that. I'm reopening the issue for discussion.

Just to confirm: upgrading to SDV 1.10.0 means that you'd have to create and train a new synthesizer on 1.10.0 (it is not sufficient to load a pre-existing synthesizer on 1.10.0). Can you confirm that that is what you've done?

Since our bug fix went out in 1.10.0, I'm wondering if something else is going on now. (I can confirm that our HSA algorithm works OK, but it seems maybe something is still wrong with the public HMA.) Could you provide more information?

  • Can you show us the metadata schema visualization for this? I think you have 3 tables. Are they connected in a straight line A --> B --> C, or is the schema branched? Using metadata.visualize() will be insightful.
  • Which columns are having this problem? I am particularly interested in whether it is only the columns of one particular table (eg. a child table or parent table). And whether they are only of a particular type (eg. datetime)

That will help us narrow down what's going wrong.

@npatki npatki reopened this Mar 1, 2024
@npatki npatki added data:multi-table (Related to multi-table, relational datasets) and under discussion (Issue is currently being discussed) and removed resolution:resolved (The issue was fixed, the question was answered, etc.) labels Mar 1, 2024
@leeyuntien-milli

print(metadata.visualize())

digraph Metadata {
node [fillcolor=lightgoldenrod1 shape=Mrecord style=filled]
mem [label="{mem|Member_ID : id\lDOB : datetime\lGender : categorical\lExposure_Months : numerical\l|Primary key: Member_ID\l}"]
med [label="{med|Member_ID : id\lClaimID : unknown\lFromDate : datetime\lToDate : datetime\lPaidDate : datetime\lICDDiag01 : categorical\lICDDiag02 : categorical\lICDDiag03 : categorical\lICDDiag04 : categorical\lICDDiag05 : categorical\lICDDiag06 : categorical\lICDDiag07 : categorical\lICDDiag08 : categorical\lICDDiag09 : categorical\lICDDiag10 : categorical\lICDDiag11 : categorical\lICDDiag12 : categorical\lICDDiag13 : categorical\lICDDiag14 : categorical\lICDDiag15 : categorical\lICDDiag16 : categorical\lICDDiag17 : categorical\lICDDiag18 : categorical\lICDDiag19 : categorical\lICDDiag20 : categorical\lICDDiag21 : categorical\lICDDiag22 : categorical\lICDDiag23 : categorical\lICDDiag24 : categorical\lICDDiag25 : categorical\lICDDiag26 : categorical\lICDDiag27 : categorical\lICDDiag28 : categorical\lICDDiag29 : categorical\lICDDiag30 : categorical\lProcCode : categorical\lPOS : categorical\lMR_Allowed : numerical\lMR_Paid : numerical\l|Primary key: None\lForeign key (mem): Member_ID\l}"]
pharm [label="{pharm|Member_ID : id\lNDC : categorical\lClaimID : unknown\lFillDate : datetime\lProviderID : categorical\lMR_Allowed : numerical\lMR_Paid : numerical\lDays_Supplied : numerical\lQty_Dispensed : numerical\l|Primary key: None\lForeign key (mem): Member_ID\l}"]
mem -> med [label=" Member_ID → Member_ID" arrowhead=oinv]
mem -> pharm [label=" Member_ID → Member_ID" arrowhead=oinv]
}

@leeyuntien-milli

There are no NA's in synthetic_data['mem'].
synthetic_data['med'] shows NaT's only in columns ['FromDate', 'ToDate', 'PaidDate'].
synthetic_data['pharm'] shows NaT's in column ['FillDate'] and NaN's in columns ['MR_Allowed', 'MR_Paid', 'Days_Supplied', 'Qty_Dispensed'].

@npatki

npatki commented Mar 1, 2024

Hi @leeyuntien-milli, could you copy-paste the visualization of the metadata produced when you run metadata.visualize()? Similar to what we have in the demo notebook, this command should render an actual image. Visuals are more helpful for us to understand your metadata.

Or if it's easier, please share your metadata JSON (accessible by print(metadata) or metadata.save_to_json()). Thanks.

Example:
[screenshot]

@leeyuntien-milli

metadata.pdf

@leeyuntien-milli

print(metadata)
{
"tables": {
"mem": {
"primary_key": "Member_ID",
"columns": {
"Member_ID": {
"sdtype": "id"
},
"DOB": {
"sdtype": "datetime"
},
"Gender": {
"sdtype": "categorical"
},
"Exposure_Months": {
"sdtype": "numerical"
}
}
},
"med": {
"columns": {
"Member_ID": {
"sdtype": "id"
},
"ClaimID": {
"sdtype": "unknown",
"pii": true
},
"FromDate": {
"sdtype": "datetime"
},
"ToDate": {
"sdtype": "datetime"
},
"PaidDate": {
"sdtype": "datetime"
},
"ICDDiag01": {
"sdtype": "categorical"
},
"ICDDiag02": {
"sdtype": "categorical"
},
"ICDDiag03": {
"sdtype": "categorical"
},
"ICDDiag04": {
"sdtype": "categorical"
},
"ICDDiag05": {
"sdtype": "categorical"
},
"ICDDiag06": {
"sdtype": "categorical"
},
"ICDDiag07": {
"sdtype": "categorical"
},
"ICDDiag08": {
"sdtype": "categorical"
},
"ICDDiag09": {
"sdtype": "categorical"
},
"ICDDiag10": {
"sdtype": "categorical"
},
"ICDDiag11": {
"sdtype": "categorical"
},
"ICDDiag12": {
"sdtype": "categorical"
},
"ICDDiag13": {
"sdtype": "categorical"
},
"ICDDiag14": {
"sdtype": "categorical"
},
"ICDDiag15": {
"sdtype": "categorical"
},
"ICDDiag16": {
"sdtype": "categorical"
},
"ICDDiag17": {
"sdtype": "categorical"
},
"ICDDiag18": {
"sdtype": "categorical"
},
"ICDDiag19": {
"sdtype": "categorical"
},
"ICDDiag20": {
"sdtype": "categorical"
},
"ICDDiag21": {
"sdtype": "categorical"
},
"ICDDiag22": {
"sdtype": "categorical"
},
"ICDDiag23": {
"sdtype": "categorical"
},
"ICDDiag24": {
"sdtype": "categorical"
},
"ICDDiag25": {
"sdtype": "categorical"
},
"ICDDiag26": {
"sdtype": "categorical"
},
"ICDDiag27": {
"sdtype": "categorical"
},
"ICDDiag28": {
"sdtype": "categorical"
},
"ICDDiag29": {
"sdtype": "categorical"
},
"ICDDiag30": {
"sdtype": "categorical"
},
"ProcCode": {
"sdtype": "categorical"
},
"POS": {
"sdtype": "categorical"
},
"MR_Allowed": {
"sdtype": "numerical"
},
"MR_Paid": {
"sdtype": "numerical"
}
}
},
"pharm": {
"columns": {
"Member_ID": {
"sdtype": "id"
},
"NDC": {
"sdtype": "categorical"
},
"ClaimID": {
"sdtype": "unknown",
"pii": true
},
"FillDate": {
"sdtype": "datetime"
},
"ProviderID": {
"sdtype": "categorical"
},
"MR_Allowed": {
"sdtype": "numerical"
},
"MR_Paid": {
"sdtype": "numerical"
},
"Days_Supplied": {
"sdtype": "numerical"
},
"Qty_Dispensed": {
"sdtype": "numerical"
}
}
}
},
"relationships": [
{
"parent_table_name": "mem",
"child_table_name": "med",
"parent_primary_key": "Member_ID",
"child_foreign_key": "Member_ID"
},
{
"parent_table_name": "mem",
"child_table_name": "pharm",
"parent_primary_key": "Member_ID",
"child_foreign_key": "Member_ID"
}
],
"METADATA_SPEC_VERSION": "MULTI_TABLE_V1"
}

@npatki

npatki commented Mar 1, 2024

Hi @leeyuntien-milli, thank you. I realize you had already sent the metadata before so apologies for the confusion.

Unfortunately, I am not able to reproduce this issue. I am providing some next steps to unblock you asap.

Running Diagnostics

The SDV is designed to only generate NaN/NaT values if it recognizes that NaN/NaT are possible in the real data.

I would strongly recommend running the diagnostic report to see what's happening. We expect the score to be 100% (for more info, see the docs). What is the score for you?

from sdv.evaluation.multi_table import run_diagnostic

diagnostic_report = run_diagnostic(
    real_data=data,
    synthetic_data=synthetic_data,
    metadata=metadata)
Generating report ...
(1/3) Evaluating Data Validity: : 100%|██████████| 52/52 [00:00<00:00, 366.64it/s]
(2/3) Evaluating Data Structure: : 100%|██████████| 3/3 [00:00<00:00, 137.16it/s]
(3/3) Evaluating Relationship Validity: : 100%|██████████| 2/2 [00:00<00:00, 36.78it/s]

Overall Score: 100.0%

Properties:
- Data Validity: 100.0%
- Data Structure: 100.0%
- Relationship Validity: 100.0%

If it is 100%, it indicates that the SDV is working as intended. The problem may be in how the data is loaded into Python. Python may be reading in some values as NaN or NaT. Let me know what the score is and we can discuss next steps.
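To check whether NaN/NaT values are entering during loading, you can count missing values per column in each table right after reading the data. A minimal sketch with a hypothetical stand-in table (replace `data` with the real loaded tables):

```python
import pandas as pd

# Hypothetical stand-in for the loaded real tables; replace with your data
data = {
    'pharm': pd.DataFrame({
        'FillDate': pd.to_datetime(['2019-01-02', None]),
        'MR_Paid': [10.0, float('nan')],
    }),
}

# Count missing values per column in each table of the real data;
# any nonzero count means NaN/NaT was introduced during loading
for table_name, table in data.items():
    missing = table.isna().sum()
    print(table_name)
    print(missing[missing > 0])
```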

Running Test Data

Using your metadata, I created some random test data. Modeling and sampling using HMA, I did not observe any NaN or NaT values. I have attached it here. Could you try it out?

test_data.zip

@leeyuntien-milli

Generating report ...
(1/3) Evaluating Data Validity: : 100%|████████████████████████████████████████████████| 52/52 [00:10<00:00, 5.10it/s]
(2/3) Evaluating Data Structure: : 100%|████████████████████████████████████████████████| 3/3 [00:00<00:00, 192.03it/s]
(3/3) Evaluating Relationship Validity: : 100%|██████████████████████████████████████████| 2/2 [00:01<00:00, 1.83it/s]

Overall Score: 97.56%

Properties:

  • Data Validity: 92.68%
  • Data Structure: 100.0%
  • Relationship Validity: 100.0%

@leeyuntien-milli

Using the test data there are still NaT's and NaN's, so maybe there are some settings that are not set properly here.
[screenshots]

@npatki

npatki commented Mar 4, 2024

Hi @leeyuntien-milli, thanks for confirming.

Right, if the test data is also producing NaN/NaT values, I wonder if this is related to your Python environment or the way you're loading the data into Python. Could you please share the code you are using to read the data into Python? Along with anything you may be doing to modify that data once it's loaded into Python?

The recommended approach is to use the load_csvs function, as specified in our docs:

from sdv.datasets.local import load_csvs
from sdv.multi_table import HMASynthesizer

# assume you have unzipped test_data.zip
data = load_csvs(folder_name='test_data/')

# should you need to inspect it, the data is available under each file name
med_table = data['med']
pharm_table = data['pharm']
mem_table = data['mem']

# NO further modification of the data is necessary
# you can directly use it with SDV
synthesizer = HMASynthesizer(metadata)
synthesizer.fit(data)

@leeyuntien-milli

leeyuntien-milli commented Mar 4, 2024

Please refer to the code below, which uses your suggested load_csvs function, but the results are similar. The three tables from test_data are placed under the folder data/.

from sdv.multi_table import HMASynthesizer
from sdv.metadata import MultiTableMetadata
from sdv.evaluation.multi_table import run_diagnostic
from sdv.datasets.local import load_csvs

all_data = load_csvs(folder_name='data/')
metadata = MultiTableMetadata()
metadata.detect_from_dataframes(data = all_data)
synthesizer = HMASynthesizer(metadata)

for table_name in all_data.keys():
    synthesizer.set_table_parameters(
        table_name=table_name,
        table_parameters={
            'enforce_min_max_values': True,
            'default_distribution': 'truncnorm'
        }
    )

synthesizer.fit(all_data)
synthetic_data = synthesizer.sample()

diagnostic_report = run_diagnostic(
    real_data=all_data,
    synthetic_data=synthetic_data,
    metadata=metadata)

Generating report ...
(1/3) Evaluating Data Validity: : 100%|██████████████████████████████████████████████| 52/52 [00:00<00:00, 1683.87it/s]
(2/3) Evaluating Data Structure: : 100%|█████████████████████████████████████████████████████████| 3/3 [00:00<?, ?it/s]
(3/3) Evaluating Relationship Validity: : 100%|█████████████████████████████████████████| 2/2 [00:00<00:00, 128.01it/s]

Overall Score: 94.88%

Properties:

  • Data Validity: 84.63%
  • Data Structure: 100.0%
  • Relationship Validity: 100.0%
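Independent of the diagnostic score, a direct way to confirm whether the sampled tables contain NaN/NaT values is to count missing cells per column in each table. This sketch uses a toy stand-in dict; in practice you would pass the dict of DataFrames returned by HMASynthesizer.sample():

```python
import pandas as pd

# toy stand-in for the dict of DataFrames returned by HMASynthesizer.sample()
synthetic_data = {
    'pharm': pd.DataFrame({
        'FillDate': pd.to_datetime(['2019-01-02', None]),  # None becomes NaT
        'NDC': ['00093-7214', None],                       # None becomes NaN
    }),
}

# count missing cells per column, per table, reporting only affected columns
for table_name, df in synthetic_data.items():
    missing = df.isna().sum()
    print(table_name, dict(missing[missing > 0]))
```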

@npatki
Contributor Author

npatki commented Mar 4, 2024

@leeyuntien-milli so using the exact same dataset and SDV version, your results are different from what we're seeing. Very interesting. This may indicate an issue with the versions of other libraries in your environment, or with the platform.

Could you provide more information about your setup? This includes:

  • Python version
  • Version of other software in your Python environment such as numpy, pandas, scipy, etc. (you can use pip freeze > requirements.txt)
  • Your OS (Linux? Windows?) and any other relevant platform details
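For reference, a quick stdlib-only way to dump this information (the package list here is just an example; adjust it to whatever is relevant in your environment):

```python
import platform
from importlib import metadata

print("Python:", platform.python_version())
for pkg in ("sdv", "numpy", "pandas", "scipy"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```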

@leeyuntien-milli

leeyuntien-milli commented Mar 4, 2024

  • Python version
    3.8.5
  • Version of other software in your Python environment such as numpy, pandas, scipy, etc. (you can use pip freeze > requirements.txt)
    requirements.txt
  • Your OS (Linux? Windows?) and any other relevant platform details
    Windows 10 Enterprise with Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz 3.60 GHz with 128 GB (128 GB usable)

@npatki
Contributor Author

npatki commented Mar 4, 2024

Hi @leeyuntien-milli, thanks for the info. We noticed a key difference between my previous comment and the code you provided: in your code, you are using the set_table_parameters command to update the distribution to 'truncnorm'. Is this intentional?

For SDV 1.10.0, you no longer need to update the distribution. It works for me if you remove this and just directly fit the synthesizer.

all_data = load_csvs(folder_name='data/')
metadata = MultiTableMetadata()
metadata.detect_from_dataframes(data = all_data)
synthesizer = HMASynthesizer(metadata)

# directly fit the data
# no need to update the synthesizer
synthesizer.fit(all_data)
synthetic_data = synthesizer.sample()

diagnostic_report = run_diagnostic(
    real_data=all_data,
    synthetic_data=synthetic_data,
    metadata=metadata)

Let me know if that works. In the meantime, we will investigate why truncnorm was causing it to create NaN values.

@leeyuntien-milli

Thanks, test_data passed, so I'm going to see if the original datasets work.

Generating report ...
(1/3) Evaluating Data Validity: : 100%|██████████████████████████████████████████████| 52/52 [00:00<00:00, 1662.86it/s]
(2/3) Evaluating Data Structure: : 100%|█████████████████████████████████████████████████████████| 3/3 [00:00<?, ?it/s]
(3/3) Evaluating Relationship Validity: : 100%|██████████████████████████████████████████████████| 2/2 [00:00<?, ?it/s]

Overall Score: 100.0%

Properties:

  • Data Validity: 100.0%
  • Data Structure: 100.0%
  • Relationship Validity: 100.0%

@leeyuntien-milli

Fitting the original datasets shows good validity results as well.
[screenshot: diagnostic report for the original datasets]
However, we observe some issues that we hope can be resolved with some adjustments to the package settings.
In med, FromDate, ToDate, PayDate and the ICD codes would usually vary across different claims for the same member (Member_ID), but this does not seem to be the case in the synthesized data.
[screenshot: synthesized med table with repeated dates and ICD codes per member]
In pharm, FillDate and the NDC codes would usually vary across different claims for the same member, but this does not seem to be the case in the synthesized data.
[screenshot: synthesized pharm table with repeated FillDate and NDC per member]
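One way to quantify this lack of within-member variation (using the column names from this thread on a toy stand-in table) is to compare the average number of distinct values per member between the real and synthesized tables:

```python
import pandas as pd

# toy stand-in for a med table; real data would come from HMASynthesizer.sample()
med = pd.DataFrame({
    'Member_ID': [1, 1, 1, 2, 2],
    'ToDate': pd.to_datetime(['2019-01-01', '2019-01-01', '2019-01-01',
                              '2019-02-01', '2019-03-05']),
})

# average number of distinct ToDate values per member;
# a value near 1.0 means dates barely vary within a member
per_member = med.groupby('Member_ID')['ToDate'].nunique()
print(per_member.mean())  # → 1.5
```

Running this on both the real and the synthetic med/pharm tables makes the gap concrete: a real-data average well above 1.0 against a synthetic average near 1.0 confirms the repeated-values issue shown in the screenshots.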

@npatki
Contributor Author

npatki commented Mar 12, 2024

Hi @leeyuntien-milli, thanks for the detailed response. In the interest of keeping our space clean, we usually keep 1 GitHub issue open per technical problem. Since we were able to resolve the problem of NaNs (and this issue is getting pretty long), let me close this one. Let's use #1848 for this.

@npatki npatki closed this as completed Mar 12, 2024
@npatki npatki added resolution:resolved The issue was fixed, the question was answered, etc. and removed under discussion Issue is currently being discussed labels Mar 12, 2024
@npatki npatki changed the title Improving Multi-Table Synthetic Data (Healthcare dataset) Improving Multi-Table Synthetic Data (Healthcare dataset) -- NaN values getting created Mar 12, 2024