
Improving Multi-Table Synthetic Data (Healthcare dataset) -- NaN values getting created #1755

Closed
npatki opened this issue Jan 24, 2024 · 32 comments
Labels
data:multi-table (Related to multi-table, relational datasets)
question (General question about the software)
resolution:resolved (The issue was fixed, the question was answered, etc.)

Comments

@npatki
Contributor

npatki commented Jan 24, 2024

I'm filing this issue on behalf of a user.

Environment details

  • SDV version: ?

Problem description

We tried running the HMASynthesizer on three tables:

  • MemInput_COM_2019 with columns Member_ID, Age, Gender and Exposure_Months. Basically a membership dataset. Total around 150k records.
  • PharmInput_COM_2019 with columns Member_ID, NDC, FillDate, MR_Allowed, MR_Paid, Days_Supplied and Qty_Dispensed. Basically a drug dataset. Total around 1,227k records.
  • MedInput_COM_2019 with columns Member_ID, ToDate, ICDDiag01-25, ProcCode, POS, MR_Allowed and MR_Paid. Basically a medical diagnosis dataset. Total around 3,404k records.

The tables are linked by one key, Member_ID. However, when we generated synthesized data at a 1% scale, the relationships between dates, NDC codes, and ICD codes do not seem to show up properly, judging from the screenshots of the synthesized datasets. Can you advise how we might be able to improve it? Thanks.

[screenshots of the synthesized datasets]

@npatki npatki added question (General question about the software) and new (Automatic label applied to new issues) labels Jan 24, 2024
@npatki

npatki commented Jan 24, 2024

Hello,

I just wanted to confirm my understanding of the problem:

  1. The synthesizer is faithfully reconstructing NDC and ICD codes that were present in the original data. It is not inventing entirely new or invalid NDC/ICD codes -- such as random values, missing values, or codes that do not make sense.
  2. For a given member (a row in MemInput_COM_2019), you are looking at the associated drugs (rows in PharmInput_COM_2019) as well as associated medical diagnoses (MedInput_COM_2019). Some of these associations are not realistic. For example, you may be seeing a specific drug (Acetaminophen) that is not useful for a diagnosis (Diabetes).

Could you confirm if this is accurate?

Additional Info

It would also be useful if you could provide a bit more information about how the three tables are connected/what they represent.

  • In MedInput_COM_2019, I see that there are 25 columns for ICD Diagnoses. Does this mean:
    • That there are up to 25 diagnoses possible per person?
    • That if there are <25 diagnoses, there are NaNs for the remaining ones? Eg. you may fill up ICDDiag01-ICDDiag06, but then leave ICDDiag07-ICDDiag25 blank?
  • Are there any restrictions for the number of connections between the tables?
    • From the sizes of the tables, it appears that there are many members (MemInput_COM_2019) that do not have any associated diagnoses or drugs?
    • Is it possible for a member to have a diagnosis but no drugs? Is it possible for a member to have a drug but no diagnosis?
    • Is it possible for a member (row in MemInput_COM_2019) to correspond to 2 or more rows in MedInput_COM_2019? Or is it at most 1 row?
    • Is it possible for a member (row in MemInput_COM_2019) to correspond to 2 or more rows in PharmInput_COM_2019? Or is it at most 1 row?

@npatki npatki added under discussion (Issue is currently being discussed) and removed new (Automatic label applied to new issues) labels Jan 24, 2024
@leeyuntien

Yes, points 1 and 2 mentioned above are accurate. Our initial question would be why there are out-of-range date values and N/A's, given that there are no N/A's in columns like NDC, FillDate, or MR_Allowed in the original datasets.

For the questions on MedInput_COM_2019: yes, there are up to 25 diagnoses possible per person, and if there are <25 diagnoses the remaining ones are left blank. For the questions on restrictions on the number of connections between the tables: there are no restrictions, i.e. there could be members without any Med or Pharm, and there could be other members with more than one Med or Pharm, or both.

@npatki

npatki commented Jan 24, 2024

Thanks for the information. Very helpful. We can focus on this:

Our initial question would be why there are out-of-range date values and N/A's, given that there are no N/A's in columns like NDC, FillDate, or MR_Allowed in the original datasets.

Missing Values

You are saying that the real data does not have any missing values (all values are filled in), but the synthetic data does have missing values.

In this case, I believe the root cause is issue #1691 -- there is currently a bug in the HMASynthesizer that we hope to fix soon. I have included 2 possible workarounds in that issue.

Out-of-Range Values

By default, the HMASynthesizer should note down the min/max value of each column in the original data. It should ensure that the synthetic data does not go out-of-bounds. Is this not the case for your data?

Would you be able to provide more details as to which particular column(s) this is happening for?

Better yet -- I would recommend running the Diagnostic Report on the real vs. synthetic data. This report is designed to capture and provide more insights into the exact problems you're mentioning (inventing new values like NaN, and going out-of-bounds). If the score is not 1.0 here, it means there is a bug. You can share with us any detailed breakdowns where you are noticing that the score is <1.0.

@leeyuntien

Sure, will see if a diagnostic report can be generated.

@leeyuntien

Just updated to sdv 1.9.0, and the learning process of HMASynthesizer.fit finished with the same set of tables, i.e. the parent MemInput_COM_2019 table linked to two child tables, PharmInput_COM_2019 and MedInput_COM_2019, by Member_ID. However, the following error messages were generated when calling HMASynthesizer.sample. Do you know why?

C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\multi_table\hma.py:444: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set pd.set_option('future.no_silent_downcasting', True)
flat_parameters = parent_row[keys].fillna(0)
[the FutureWarning above repeats several more times]
C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\scipy\stats\_continuous_distns.py:700: RuntimeWarning: Error in function boost::math::tgamma(%1%,%1%): Series evaluation exceeded %1% iterations, giving up now.
return _boost._beta_ppf(q, a, b)
Traceback (most recent call last):
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\data_processing\data_processor.py", line 906, in reverse_transform
reversed_data[column_name] = reversed_data[column_name].astype(dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\generic.py", line 6637, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\internals\managers.py", line 431, in astype
return self.apply(
^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\internals\managers.py", line 364, in apply
applied = getattr(b, f)(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\internals\blocks.py", line 758, in astype
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\dtypes\astype.py", line 237, in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\dtypes\astype.py", line 182, in astype_array
values = _astype_nansafe(values, dtype, copy=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\dtypes\astype.py", line 101, in _astype_nansafe
return _astype_float_to_int_nansafe(arr, dtype, copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\dtypes\astype.py", line 145, in _astype_float_to_int_nansafe
raise IntCastingNaNError(
pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 1, in
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\multi_table\base.py", line 393, in sample
sampled_data = self._sample(scale=scale)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\sampling\hierarchical_sampler.py", line 222, in _sample
self._sample_children(table_name=table, sampled_data=sampled_data)
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\sampling\hierarchical_sampler.py", line 142, in _sample_children
self._add_child_rows(
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\sampling\hierarchical_sampler.py", line 108, in _add_child_rows
sampled_rows = self._sample_rows(child_synthesizer, num_rows)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\sampling\hierarchical_sampler.py", line 71, in _sample_rows
return synthesizer._sample_batch(int(num_rows), keep_extra_columns=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\single_table\base.py", line 602, in _sample_batch
sampled, num_valid = self._sample_rows(
^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\single_table\base.py", line 519, in _sample_rows
sampled = self._data_processor.reverse_transform(raw_sampled)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\data_processing\data_processor.py", line 920, in reverse_transform
raise ValueError(e)
ValueError: Cannot convert non-finite values (NA or inf) to integer

@leeyuntien

Also in sdv 1.9.0 there is no HSASynthesizer?

Traceback (most recent call last):
File "", line 1, in
ImportError: cannot import name 'HSASynthesizer' from 'sdv.multi_table' (C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\multi_table\__init__.py)

@npatki

npatki commented Jan 29, 2024

Hi @leeyuntien, thanks for getting back. Were you able to resolve the original problem at the beginning of this issue? Or are you retrying everything with the newest SDV version now?

Error Message

the following error messages were generated when calling HMASynthesizer.sample. Do you know why?

This is strange indeed, because the actual line of code that is causing the issue is not supposed to crash. We are catching the ValueError and allowing the sampling to proceed:

try:
    reversed_data[column_name] = reversed_data[column_name].astype(dtype)
except ValueError as e:
    column_metadata = self.metadata.columns.get(column_name)

The fact that yours crashes anyways (with a ValueError) probably means the newest version of SDV (1.9.0) is not being used for some reason.

In the past, I've noticed that there are sometimes caching issues if you are using a notebook type environment. To sanity check, could you run the following and verify that it prints '1.9.0'?

import sdv
print(sdv.__version__)

HSA

Also in sdv 1.9.0 there is no HSASynthesizer?

The HSASynthesizer is available in the SDV Enterprise SDK, not the public SDV. To get access to the SDV Enterprise SDK, you'd need to purchase a license with us.


@leeyuntien

sdv version
[screenshot]

@npatki

npatki commented Jan 29, 2024

Hi @leeyuntien, thanks for confirming. We were able to dig in a little further, and it looks like this is actually happening due to the same root cause as issue #1691 (linked above). Have you tried the workarounds listed in that issue (using 'norm' or 'truncnorm')?

Something else that might help as a workaround: if any columns are stored as integers in memory (in Python), I would suggest casting them to float for the sake of running them through SDV. To see which column(s) are represented as ints, you can run the following for each of the table names:

print(data[TABLE_NAME].dtypes)

Then you can convert any columns that are listed as int or int64 into floats:

data[TABLE_NAME][COLUMN_NAME] = data[TABLE_NAME][COLUMN_NAME].astype('float')
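If there are many tables and columns, the same cast can be applied in a loop. A minimal sketch using pandas, with hypothetical stand-in tables (replace `data` with the real dictionary of DataFrames):

```python
import pandas as pd

# Hypothetical stand-in tables; replace with the real data dictionary
data = {
    'mem': pd.DataFrame({'Member_ID': [1, 2], 'Exposure_Months': [12, 6]}),
    'pharm': pd.DataFrame({'Member_ID': [1, 1], 'Days_Supplied': [30, 90]}),
}

# Cast every integer-typed column in every table to float before fitting
for table_name, table in data.items():
    for column in table.columns:
        if pd.api.types.is_integer_dtype(table[column]):
            data[table_name][column] = table[column].astype('float')

print(data['mem'].dtypes)  # Member_ID and Exposure_Months are now float64
```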

The good news is that we are actively working on the underlying issue and hope to have a fix up in the near future. Thanks for bearing with us.

@leeyuntien-milli

Just tried the workarounds listed in the issue but still got this message. Will change int to float to test.

Traceback (most recent call last):
File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\data_processing\data_processor.py", line 906, in reverse_transform
reversed_data[column_name] = reversed_data[column_name].astype(dtype)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py", line 5546, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 595, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 406, in apply
applied = getattr(b, f)(**kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py", line 595, in astype
values = astype_nansafe(vals1d, dtype, copy=True)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 966, in astype_nansafe
raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
ValueError: Cannot convert non-finite values (NA or inf) to integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 1, in
File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\multi_table\base.py", line 393, in sample
sampled_data = self._sample(scale=scale)
File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\sampling\hierarchical_sampler.py", line 222, in _sample
self._sample_children(table_name=table, sampled_data=sampled_data)
File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\sampling\hierarchical_sampler.py", line 142, in _sample_children
self._add_child_rows(
File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\sampling\hierarchical_sampler.py", line 108, in _add_child_rows
sampled_rows = self._sample_rows(child_synthesizer, num_rows)
File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\sampling\hierarchical_sampler.py", line 71, in _sample_rows
return synthesizer._sample_batch(int(num_rows), keep_extra_columns=True)
File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\single_table\base.py", line 602, in _sample_batch
sampled, num_valid = self._sample_rows(
File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\single_table\base.py", line 519, in _sample_rows
sampled = self._data_processor.reverse_transform(raw_sampled)
File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\data_processing\data_processor.py", line 920, in reverse_transform
raise ValueError(e)
ValueError: Cannot convert non-finite values (NA or inf) to integer

@npatki

npatki commented Jan 30, 2024

Sounds good. The change to 'truncnorm' or 'norm' generally makes it less likely to run into the problem, but it is not guaranteed. I hope the int-to-float workaround is able to resolve the crash.

@leeyuntien-milli

All datasets were put into the fitting process, but a scale of 0.01 was used to sample.

mem table seems normal
[screenshot]

However, there are still NaN's and NaT's in the med and pharm tables.
[screenshot]

@npatki

npatki commented Jan 31, 2024

Great to hear that it's no longer crashing! This was the immediate goal so at least you have some synthetic data to work with for v1.9.0.

The NaN values are expected right now due to issue #1691. Since the suggested workaround* is not guaranteed, you would have to wait until we resolve this issue. Rest assured that we are actively looking into the root cause and hope to have a resolution in a future release.

*The suggested workaround is to use 'truncnorm' (or 'norm'). You may want to try using 'truncnorm' in addition to converting the columns to floats. This, too, is a temporary workaround that is not 100% guaranteed at the moment.

@npatki

npatki commented Feb 20, 2024

Hi @leeyuntien -- good news! We have released an updated version of SDV (v1.10.0) that should resolve this issue.

You should no longer have to apply any workarounds. The HMASynthesizer should now be able to run by default without running into any Errors and without creating any unnecessary NaN/NaT values.

Please upgrade to the latest version and give it a try. If you continue to run into this problem, feel free to reply and we can always re-open the issue to continue the investigation. (For any other problems unrelated to NaNs, please feel free to file a new issue.) Thanks.

@npatki npatki closed this as completed Feb 20, 2024
@npatki npatki added resolution:resolved (The issue was fixed, the question was answered, etc.) and removed under discussion (Issue is currently being discussed) labels Feb 20, 2024
@leeyuntien-milli

sdv has been updated to 1.10.0, but there are still NaNs and NaTs in the synthesized datasets even though there are none in the source datasets. Can you advise other ways to deal with it?
[screenshot]

@npatki

npatki commented Mar 1, 2024

Hi @leeyuntien-milli, sorry to hear that. I'm reopening the issue for discussion.

Just to confirm: upgrading to SDV 1.10.0 means that you'd have to create and train a new synthesizer on 1.10.0 (it is not sufficient to load a pre-existing synthesizer on 1.10.0). Can you confirm that that is what you've done?

Since our bug fix went out in 1.10.0, I'm wondering if something else is going on now. (I can confirm that our HSA algorithm works OK, but it seems maybe something is still wrong with the public HMA.) Could you provide more information?

  • Can you show us the metadata schema visualization for this? I think you have 3 tables. Are they connected in a straight line A --> B --> C, or is the schema branched? Using metadata.visualize() will be insightful.
  • Which columns are having this problem? I am particularly interested in whether it is only the columns of one particular table (eg. a child table or parent table). And whether they are only of a particular type (eg. datetime)

That will help us narrow down what's going wrong.

@npatki npatki reopened this Mar 1, 2024
@npatki npatki added data:multi-table (Related to multi-table, relational datasets) and under discussion (Issue is currently being discussed) and removed resolution:resolved (The issue was fixed, the question was answered, etc.) labels Mar 1, 2024
@leeyuntien-milli

print(metadata.visualize())

digraph Metadata {
node [fillcolor=lightgoldenrod1 shape=Mrecord style=filled]
mem [label="{mem|Member_ID : id\lDOB : datetime\lGender : categorical\lExposure_Months : numerical\l|Primary key: Member_ID\l}"]
med [label="{med|Member_ID : id\lClaimID : unknown\lFromDate : datetime\lToDate : datetime\lPaidDate : datetime\lICDDiag01 : categorical\lICDDiag02 : categorical\lICDDiag03 : categorical\lICDDiag04 : categorical\lICDDiag05 : categorical\lICDDiag06 : categorical\lICDDiag07 : categorical\lICDDiag08 : categorical\lICDDiag09 : categorical\lICDDiag10 : categorical\lICDDiag11 : categorical\lICDDiag12 : categorical\lICDDiag13 : categorical\lICDDiag14 : categorical\lICDDiag15 : categorical\lICDDiag16 : categorical\lICDDiag17 : categorical\lICDDiag18 : categorical\lICDDiag19 : categorical\lICDDiag20 : categorical\lICDDiag21 : categorical\lICDDiag22 : categorical\lICDDiag23 : categorical\lICDDiag24 : categorical\lICDDiag25 : categorical\lICDDiag26 : categorical\lICDDiag27 : categorical\lICDDiag28 : categorical\lICDDiag29 : categorical\lICDDiag30 : categorical\lProcCode : categorical\lPOS : categorical\lMR_Allowed : numerical\lMR_Paid : numerical\l|Primary key: None\lForeign key (mem): Member_ID\l}"]
pharm [label="{pharm|Member_ID : id\lNDC : categorical\lClaimID : unknown\lFillDate : datetime\lProviderID : categorical\lMR_Allowed : numerical\lMR_Paid : numerical\lDays_Supplied : numerical\lQty_Dispensed : numerical\l|Primary key: None\lForeign key (mem): Member_ID\l}"]
mem -> med [label=" Member_ID → Member_ID" arrowhead=oinv]
mem -> pharm [label=" Member_ID → Member_ID" arrowhead=oinv]
}

@leeyuntien-milli

There are no NA's in synthetic_data['mem'].
synthetic_data['med'] shows NaT's only in columns ['FromDate', 'ToDate', 'PaidDate'].
synthetic_data['pharm'] shows NaT's in column ['FillDate'] and NaN's in columns ['MR_Allowed', 'MR_Paid', 'Days_Supplied', 'Qty_Dispensed'].

@npatki

npatki commented Mar 1, 2024

Hi @leeyuntien-milli, could you copy-paste the visualization of the metadata produced when you run metadata.visualize()? Similar to what we have in the demo notebook, this command should render an actual image. Visuals are more helpful for us to understand your metadata.

Or if it's easier, please share your metadata JSON (accessible by print(metadata) or metadata.save_to_json()). Thanks.

Example:
[screenshot]

@leeyuntien-milli

metadata.pdf

@leeyuntien-milli

print(metadata)
{
"tables": {
"mem": {
"primary_key": "Member_ID",
"columns": {
"Member_ID": {
"sdtype": "id"
},
"DOB": {
"sdtype": "datetime"
},
"Gender": {
"sdtype": "categorical"
},
"Exposure_Months": {
"sdtype": "numerical"
}
}
},
"med": {
"columns": {
"Member_ID": {
"sdtype": "id"
},
"ClaimID": {
"sdtype": "unknown",
"pii": true
},
"FromDate": {
"sdtype": "datetime"
},
"ToDate": {
"sdtype": "datetime"
},
"PaidDate": {
"sdtype": "datetime"
},
"ICDDiag01": {
"sdtype": "categorical"
},
"ICDDiag02": {
"sdtype": "categorical"
},
"ICDDiag03": {
"sdtype": "categorical"
},
"ICDDiag04": {
"sdtype": "categorical"
},
"ICDDiag05": {
"sdtype": "categorical"
},
"ICDDiag06": {
"sdtype": "categorical"
},
"ICDDiag07": {
"sdtype": "categorical"
},
"ICDDiag08": {
"sdtype": "categorical"
},
"ICDDiag09": {
"sdtype": "categorical"
},
"ICDDiag10": {
"sdtype": "categorical"
},
"ICDDiag11": {
"sdtype": "categorical"
},
"ICDDiag12": {
"sdtype": "categorical"
},
"ICDDiag13": {
"sdtype": "categorical"
},
"ICDDiag14": {
"sdtype": "categorical"
},
"ICDDiag15": {
"sdtype": "categorical"
},
"ICDDiag16": {
"sdtype": "categorical"
},
"ICDDiag17": {
"sdtype": "categorical"
},
"ICDDiag18": {
"sdtype": "categorical"
},
"ICDDiag19": {
"sdtype": "categorical"
},
"ICDDiag20": {
"sdtype": "categorical"
},
"ICDDiag21": {
"sdtype": "categorical"
},
"ICDDiag22": {
"sdtype": "categorical"
},
"ICDDiag23": {
"sdtype": "categorical"
},
"ICDDiag24": {
"sdtype": "categorical"
},
"ICDDiag25": {
"sdtype": "categorical"
},
"ICDDiag26": {
"sdtype": "categorical"
},
"ICDDiag27": {
"sdtype": "categorical"
},
"ICDDiag28": {
"sdtype": "categorical"
},
"ICDDiag29": {
"sdtype": "categorical"
},
"ICDDiag30": {
"sdtype": "categorical"
},
"ProcCode": {
"sdtype": "categorical"
},
"POS": {
"sdtype": "categorical"
},
"MR_Allowed": {
"sdtype": "numerical"
},
"MR_Paid": {
"sdtype": "numerical"
}
}
},
"pharm": {
"columns": {
"Member_ID": {
"sdtype": "id"
},
"NDC": {
"sdtype": "categorical"
},
"ClaimID": {
"sdtype": "unknown",
"pii": true
},
"FillDate": {
"sdtype": "datetime"
},
"ProviderID": {
"sdtype": "categorical"
},
"MR_Allowed": {
"sdtype": "numerical"
},
"MR_Paid": {
"sdtype": "numerical"
},
"Days_Supplied": {
"sdtype": "numerical"
},
"Qty_Dispensed": {
"sdtype": "numerical"
}
}
}
},
"relationships": [
{
"parent_table_name": "mem",
"child_table_name": "med",
"parent_primary_key": "Member_ID",
"child_foreign_key": "Member_ID"
},
{
"parent_table_name": "mem",
"child_table_name": "pharm",
"parent_primary_key": "Member_ID",
"child_foreign_key": "Member_ID"
}
],
"METADATA_SPEC_VERSION": "MULTI_TABLE_V1"
}

@npatki

npatki commented Mar 1, 2024

Hi @leeyuntien-milli, thank you. I realize you had already sent the metadata before so apologies for the confusion.

Unfortunately, I am not able to reproduce this issue. I am providing some next steps to unblock you asap.

Running Diagnostics

The SDV is designed to only generate NaN/NaT values if it recognizes that NaN/NaT are possible in the real data.

I would strongly recommend running the diagnostic report to see what's happening. We expect the score to be 100% (for more info, see the docs). What is the score for you?

from sdv.evaluation.multi_table import run_diagnostic

diagnostic_report = run_diagnostic(
    real_data=data,
    synthetic_data=synthetic_data,
    metadata=metadata)
Generating report ...
(1/3) Evaluating Data Validity: : 100%|██████████| 52/52 [00:00<00:00, 366.64it/s]
(2/3) Evaluating Data Structure: : 100%|██████████| 3/3 [00:00<00:00, 137.16it/s]
(3/3) Evaluating Relationship Validity: : 100%|██████████| 2/2 [00:00<00:00, 36.78it/s]

Overall Score: 100.0%

Properties:
- Data Validity: 100.0%
- Data Structure: 100.0%
- Relationship Validity: 100.0%

If it is 100%, it indicates that the SDV is working as intended. The problem may be in how the data is loaded into Python. Python may be reading in some values as NaN or NaT. Let me know what the score is and we can discuss next steps.
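To check whether NaN/NaT values are entering during loading, you can count missing values per column in each table right after reading the data. A minimal sketch with a hypothetical stand-in table (replace `data` with the real loaded tables):

```python
import pandas as pd

# Hypothetical stand-in for the loaded real tables; replace with your data
data = {
    'pharm': pd.DataFrame({
        'FillDate': pd.to_datetime(['2019-01-02', None]),
        'MR_Paid': [10.0, float('nan')],
    }),
}

# Count missing values per column in each table of the real data;
# any nonzero count means NaN/NaT was introduced during loading
for table_name, table in data.items():
    missing = table.isna().sum()
    print(table_name)
    print(missing[missing > 0])
```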

Running Test Data

Using your metadata, I created some random test data. Modeling and sampling using HMA, I did not observe any NaN or NaT values. I have attached it here. Could you try it out?

test_data.zip

@leeyuntien-milli

Generating report ...
(1/3) Evaluating Data Validity: : 100%|████████████████████████████████████████████████| 52/52 [00:10<00:00, 5.10it/s]
(2/3) Evaluating Data Structure: : 100%|████████████████████████████████████████████████| 3/3 [00:00<00:00, 192.03it/s]
(3/3) Evaluating Relationship Validity: : 100%|██████████████████████████████████████████| 2/2 [00:01<00:00, 1.83it/s]

Overall Score: 97.56%

Properties:

  • Data Validity: 92.68%
  • Data Structure: 100.0%
  • Relationship Validity: 100.0%

@leeyuntien-milli

Using the test data there are still NaT's and NaN's, so maybe there are some settings that are not set properly here.
[screenshots]

@npatki

npatki commented Mar 4, 2024

Hi @leeyuntien-milli, thanks for confirming.

Right, if the test data is also producing NaN/NaT values, I wonder if this is related to your Python environment or the way you're loading the data into Python. Could you please share the code you are using to read the data into Python? Along with anything you may be doing to modify that data once it's loaded into Python?

The recommended approach is to use the load_csvs function, as specified in our docs:

from sdv.datasets.local import load_csvs
from sdv.multi_table import HMASynthesizer

# assume you have unzipped test_data.zip
data = load_csvs(folder_name='test_data/')

# should you need to inspect it, the data is available under each file name
med_table = data['med']
pharm_table = data['pharm']
mem_table = data['mem']

# NO further modification of the data is necessary
# you can directly use it with SDV
synthesizer = HMASynthesizer(metadata)
synthesizer.fit(data)

@leeyuntien-milli

leeyuntien-milli commented Mar 4, 2024

Please refer to the code below, which uses your suggested load_csvs function, but the results are similar. The three tables from test_data are placed under the folder data/.

from sdv.multi_table import HMASynthesizer
from sdv.metadata import MultiTableMetadata
from sdv.evaluation.multi_table import run_diagnostic
from sdv.datasets.local import load_csvs

all_data = load_csvs(folder_name='data/')
metadata = MultiTableMetadata()
metadata.detect_from_dataframes(data = all_data)
synthesizer = HMASynthesizer(metadata)

for table_name in all_data.keys():
    synthesizer.set_table_parameters(
        table_name=table_name,
        table_parameters={
            'enforce_min_max_values': True,
            'default_distribution': 'truncnorm'
        }
    )

synthesizer.fit(all_data)
synthetic_data = synthesizer.sample()

diagnostic_report = run_diagnostic(
    real_data=all_data,
    synthetic_data=synthetic_data,
    metadata=metadata)

Generating report ...
(1/3) Evaluating Data Validity: : 100%|██████████████████████████████████████████████| 52/52 [00:00<00:00, 1683.87it/s]
(2/3) Evaluating Data Structure: : 100%|█████████████████████████████████████████████████████████| 3/3 [00:00<?, ?it/s]
(3/3) Evaluating Relationship Validity: : 100%|█████████████████████████████████████████| 2/2 [00:00<00:00, 128.01it/s]

Overall Score: 94.88%

Properties:

  • Data Validity: 84.63%
  • Data Structure: 100.0%
  • Relationship Validity: 100.0%
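Independent of the diagnostic score, a direct way to confirm whether the sampled tables contain NaN/NaT values is to count missing cells per column in each table. This sketch uses a toy stand-in dict; in practice you would pass the dict of DataFrames returned by HMASynthesizer.sample():

```python
import pandas as pd

# toy stand-in for the dict of DataFrames returned by HMASynthesizer.sample()
synthetic_data = {
    'pharm': pd.DataFrame({
        'FillDate': pd.to_datetime(['2019-01-02', None]),  # None becomes NaT
        'NDC': ['00093-7214', None],                       # None becomes NaN
    }),
}

# count missing cells per column, per table, reporting only affected columns
for table_name, df in synthetic_data.items():
    missing = df.isna().sum()
    print(table_name, dict(missing[missing > 0]))
```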

@npatki
Contributor Author

npatki commented Mar 4, 2024

@leeyuntien-milli so using the exact same dataset and SDV version, your results are different from what we're seeing. Very interesting. This may indicate an issue with the versions of other libraries in your environment, or with the platform.

Could you provide more information about your setup? This includes:

  • Python version
  • Version of other software in your Python environment such as numpy, pandas, scipy, etc. (you can use pip freeze > requirements.txt)
  • Your OS (Linux? Windows?) and any other relevant platform details
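For reference, a quick stdlib-only way to dump this information (the package list here is just an example; adjust it to whatever is relevant in your environment):

```python
import platform
from importlib import metadata

print("Python:", platform.python_version())
for pkg in ("sdv", "numpy", "pandas", "scipy"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```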

@leeyuntien-milli

leeyuntien-milli commented Mar 4, 2024

  • Python version
    3.8.5
  • Version of other software in your Python environment such as numpy, pandas, scipy, etc. (you can use pip freeze > requirements.txt)
    requirements.txt
  • Your OS (Linux? Windows?) and any other relevant platform details
    Windows 10 Enterprise with Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz 3.60 GHz with 128 GB (128 GB usable)

@npatki
Contributor Author

npatki commented Mar 4, 2024

Hi @leeyuntien-milli, thanks for the info. We noticed a key difference between my previous comment and the code you provided: in your code, you are using the set_table_parameters command to update the distribution to 'truncnorm'. Is this intentional?

For SDV 1.10.0, you no longer need to update the distribution. It works for me if you remove this and just directly fit the synthesizer.

all_data = load_csvs(folder_name='data/')
metadata = MultiTableMetadata()
metadata.detect_from_dataframes(data = all_data)
synthesizer = HMASynthesizer(metadata)

# directly fit the data
# no need to update the synthesizer
synthesizer.fit(all_data)
synthetic_data = synthesizer.sample()

diagnostic_report = run_diagnostic(
    real_data=all_data,
    synthetic_data=synthetic_data,
    metadata=metadata)

Let me know if that works. In the meantime, we will investigate why truncnorm was causing it to create NaN values.

@leeyuntien-milli

Thanks, test_data passed, so I'm going to see if the original datasets work.

Generating report ...
(1/3) Evaluating Data Validity: : 100%|██████████████████████████████████████████████| 52/52 [00:00<00:00, 1662.86it/s]
(2/3) Evaluating Data Structure: : 100%|█████████████████████████████████████████████████████████| 3/3 [00:00<?, ?it/s]
(3/3) Evaluating Relationship Validity: : 100%|██████████████████████████████████████████████████| 2/2 [00:00<?, ?it/s]

Overall Score: 100.0%

Properties:

  • Data Validity: 100.0%
  • Data Structure: 100.0%
  • Relationship Validity: 100.0%

@leeyuntien-milli

Fitting the original datasets shows good validity results as well.
[screenshot: diagnostic report for the original datasets]
However, we observe some issues that we hope can be resolved with some adjustments to the package settings.
In med, FromDate, ToDate, PayDate and the ICD codes would usually vary across different claims for the same member (Member_ID), but this does not seem to be the case in the synthesized data.
[screenshot: synthesized med table with repeated dates and ICD codes per member]
In pharm, FillDate and the NDC codes would usually vary across different claims for the same member, but this does not seem to be the case in the synthesized data.
[screenshot: synthesized pharm table with repeated FillDate and NDC per member]
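One way to quantify this lack of within-member variation (using the column names from this thread on a toy stand-in table) is to compare the average number of distinct values per member between the real and synthesized tables:

```python
import pandas as pd

# toy stand-in for a med table; real data would come from HMASynthesizer.sample()
med = pd.DataFrame({
    'Member_ID': [1, 1, 1, 2, 2],
    'ToDate': pd.to_datetime(['2019-01-01', '2019-01-01', '2019-01-01',
                              '2019-02-01', '2019-03-05']),
})

# average number of distinct ToDate values per member;
# a value near 1.0 means dates barely vary within a member
per_member = med.groupby('Member_ID')['ToDate'].nunique()
print(per_member.mean())  # → 1.5
```

Running this on both the real and the synthetic med/pharm tables makes the gap concrete: a real-data average well above 1.0 against a synthetic average near 1.0 confirms the repeated-values issue shown in the screenshots.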

@npatki
Contributor Author

npatki commented Mar 12, 2024

Hi @leeyuntien-milli, thanks for the detailed response. In the interest of keeping our space clean, we usually keep 1 GitHub issue open per technical problem. Since we were able to resolve the problem of NaNs (and this issue is getting pretty long), let me close this one. Let's use #1848 for this.

@npatki npatki closed this as completed Mar 12, 2024
@npatki npatki added resolution:resolved The issue was fixed, the question was answered, etc. and removed under discussion Issue is currently being discussed labels Mar 12, 2024
@npatki npatki changed the title Improving Multi-Table Synthetic Data (Healthcare dataset) Improving Multi-Table Synthetic Data (Healthcare dataset) -- NaN values getting created Mar 12, 2024