Improving Multi-Table Synthetic Data (Healthcare dataset) -- NaN values getting created #1755
Hello, I just wanted to confirm my understanding of the problem:
Could you confirm if this is accurate?

Additional Info
It would also be useful if you could provide a bit more information about how the three tables are connected and what they represent.
Yes, points 1 and 2 mentioned above are accurate. Our initial question is why there are out-of-range date values and N/A's, given that there are no N/A's in columns like NDC, FillDate or MR_Allowed in the original datasets. On MedInput_COM_2019: yes, there are up to 25 diagnoses possible per person, and if there are fewer than 25 diagnoses the remaining ones are left blank. On restrictions for the number of connections between the tables: there are none, i.e. there could be members without any Med or Pharm, and there could be other members with more than one Med or Pharm, or both.
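For illustration, the structure described above (one parent table keyed by Member_ID, two children with no cardinality restrictions, and blank unused diagnosis slots) can be sketched with toy data. All table names, column names, and code values here are stand-ins based on this thread, not the real dataset:

```python
import pandas as pd

# Toy stand-ins for the three tables, linked by Member_ID
mem = pd.DataFrame({'Member_ID': [1, 2, 3]})
pharm = pd.DataFrame({
    'Member_ID': [1, 1, 3],
    'NDC': ['00002-1433', '00002-1433', '00006-0749'],  # made-up codes
    'FillDate': pd.to_datetime(['2019-01-05', '2019-02-07', '2019-03-01']),
})
med = pd.DataFrame({
    'Member_ID': [1, 3],
    'Diag1': ['E11.9', 'I10'],  # up to 25 diagnosis slots exist in the real table
    'Diag2': ['I10', ''],       # unused slots are left blank, not NaN
})

# No cardinality restrictions: member 2 has no pharmacy or medical rows at all
rows_for_member_2 = int((pharm['Member_ID'] == 2).sum() + (med['Member_ID'] == 2).sum())
print(rows_for_member_2)  # -> 0
```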
Thanks for the information. Very helpful. We can focus on this:
Missing Values
You are saying that the real data does not have any missing values (all values are filled in), but the synthetic data does have missing values. In this case, I believe the root cause is issue #1691 -- there is currently a bug in the HMASynthesizer that we hope to fix soon. I have included 2 possible workarounds in that issue.

Out-of-Range Values
By default, the HMASynthesizer should note down the min/max value of each column in the original data and ensure that the synthetic data does not go out-of-bounds. Is this not the case for your data? Would you be able to provide more details as to which particular column(s) this is happening for?

Better yet -- I would recommend running the Diagnostic Report on the real vs. synthetic data. This report is designed to capture and provide more insight into the exact problems you're mentioning (inventing new values like NaN, and going out-of-bounds). If the score is not 1.0 here, it means there is a bug. You can share with us any detailed breakdowns where you are noticing that the score is <1.0.
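As a quick manual version of the min/max bounds check described above, each numeric column's synthetic range can be compared against the real range with plain pandas. This is only a sketch with made-up values, not the SDV diagnostic itself:

```python
import pandas as pd

# Made-up real and synthetic values for a single numeric column
real = pd.DataFrame({'MR_Allowed': [10.0, 99.5, 250.0]})
synthetic = pd.DataFrame({'MR_Allowed': [12.3, 240.0, 400.0]})  # 400.0 exceeds the real max

violations = {}
for col in real.select_dtypes('number').columns:
    lo, hi = real[col].min(), real[col].max()
    mask = (synthetic[col] < lo) | (synthetic[col] > hi)
    violations[col] = int(mask.sum())

print(violations)  # -> {'MR_Allowed': 1}
```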
Sure, will see if a diagnostic report can be generated.
Just updated to sdv 1.9.0 and the learning process of HMASynthesizer.fit finished with the same set of tables, i.e. the parent MemInput_COM_2019 table linked to the two child tables PharmInput_COM_2019 and MedInput_COM_2019 by Member_ID. However, the following error messages were generated when calling HMASynthesizer.sample. Do you know why?

```
C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\multi_table\hma.py:444: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
```
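For context, the FutureWarning quoted above comes from pandas (2.x), not from SDV itself; it appears whenever `fillna` would silently downcast an object-dtype column. A minimal sketch of the pattern the warning recommends, using toy data (requires pandas 2.x for the `copy` keyword):

```python
import pandas as pd

# An object-dtype column with a missing value, similar to what hma.py:444 handles
s = pd.Series([1, None, 3], dtype='object')

# In pandas 2.x, plain s.fillna(0) may warn about silent downcasting;
# the warning's suggested fix is to infer dtypes explicitly afterwards
filled = s.fillna(0).infer_objects(copy=False)
print(filled.tolist())  # -> [1, 0, 3]
```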
Also, in sdv 1.9.0 there is no HSASynthesizer?

Traceback (most recent call last):
Hi @leeyuntien, thanks for getting back. Were you able to resolve the original problem at the beginning of this issue? Or are you retrying everything with the newest SDV version now?

Error Message
This is strange indeed, because the actual line of code that is causing the issue is not supposed to crash. We are actually excepting the error here:

SDV/sdv/data_processing/data_processor.py, lines 905 to 908 in 334ba02

The fact that yours crashes anyway is unexpected. In the past, I've noticed that there are sometimes caching issues if you are using a notebook-type environment. To sanity check, could you run the following and verify that it prints 1.9.0?

```python
import sdv
print(sdv.__version__)
```

HSA
The HSASynthesizer is available in the SDV Enterprise SDK, not the public SDV. To get access to the SDV Enterprise SDK, you'd need to purchase a license with us.
Hi @leeyuntien, thanks for confirming. We were able to dig in a little further and it looks like it is actually happening due to the same cause as issue #1691 (linked above). Have you tried the workarounds listed in that issue?

Something else that might help as a workaround: if any columns are stored as integers in memory (in Python), I would recommend casting them to float for the sake of running them through SDV. To see which column(s) are represented as ints, you can run the following for each of the table names:

```python
print(data[TABLE_NAME].dtypes)
```

Then you can convert any columns that are listed as int:

```python
data[TABLE_NAME][COLUMN_NAME] = data[TABLE_NAME][COLUMN_NAME].astype('float')
```

The good news is that we are actively working on the underlying issue and hope to have a fix up in the near future. Thanks for bearing with us.
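As a concrete sketch of the casting workaround above, looping over every table and converting all integer columns at once (the table and column names here are hypothetical):

```python
import pandas as pd

# A toy table dict mimicking the SDV input format
data = {'mem': pd.DataFrame({'Member_ID': [1, 2, 3], 'Age': [34, 51, 29]})}

# Cast every integer column to float before fitting
for table_name, table in data.items():
    for column_name in table.select_dtypes('int').columns:
        table[column_name] = table[column_name].astype('float')

print(data['mem'].dtypes.astype(str).to_dict())
# -> {'Member_ID': 'float64', 'Age': 'float64'}
```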
Just tried the workarounds listed in the issue but still got this message. Will change int to float to test.

```
Traceback (most recent call last):

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
```
Sounds good. The change to float is worth trying.
Great to hear that it's no longer crashing! This was the immediate goal, so at least you have some synthetic data to work with for v1.9.0.

The NaN values are expected right now due to issue #1691. Since the suggested workaround* is not guaranteed, you would have to wait until we resolve this issue. Rest assured that we are actively looking into the root cause and hope to have a resolution in a future release.

*Suggested workaround is to use …
Hi @leeyuntien -- good news! We have released an updated version of SDV (v1.10.0) that should resolve this issue. You should no longer have to apply any workarounds. The HMASynthesizer should now run by default without hitting any errors and without creating any unnecessary NaN/NaT values.

Please upgrade to the latest version and give it a try. If you continue to run into this problem, feel free to reply and we can always re-open the issue to continue the investigation. (For any other problems unrelated to NaNs, please feel free to file a new issue.) Thanks.
Hi @leeyuntien-milli, sorry to hear that. I'm reopening the issue for discussion.

Just to confirm: upgrading to SDV 1.10.0 means that you'd have to create and train a new synthesizer on 1.10.0 (it is not sufficient to load a pre-existing synthesizer into 1.10.0). Can you confirm that this is what you've done?

Since our bug fix went out in 1.10.0, I'm wondering if something else is going on now. (I can confirm that our HSA algorithm works ok, but maybe something is still wrong with the public HMA.) Could you provide more information? That will help us narrow down what's going wrong.
```python
print(metadata.visualize())
```

```
digraph Metadata {
```
There are no NA's in synthetic_data['mem'].
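The same check can be run across all three tables at once by counting missing cells per table. A sketch with toy frames standing in for the real synthetic output (only the structure is meaningful):

```python
import numpy as np
import pandas as pd

# Stand-in synthetic tables with deliberately planted missing values
synthetic_data = {
    'mem': pd.DataFrame({'Member_ID': [1, 2]}),
    'pharm': pd.DataFrame({'Member_ID': [1, 2],
                           'FillDate': pd.to_datetime(['2019-01-05', None])}),  # one NaT
    'med': pd.DataFrame({'Member_ID': [1], 'Diag1': [np.nan]}),                 # one NaN
}

# isna() catches NaN and NaT alike
missing = {name: int(t.isna().sum().sum()) for name, t in synthetic_data.items()}
print(missing)  # -> {'mem': 0, 'pharm': 1, 'med': 1}
```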
Hi @leeyuntien-milli, could you copy-paste the visualization of the metadata? Or, if it's easier, please share your metadata JSON.
Hi @leeyuntien-milli, thank you. I realize you had already sent the metadata before, so apologies for the confusion. Unfortunately, I am not able to reproduce this issue. I am providing some next steps to unblock you asap.

Running Diagnostics
The SDV is designed to only generate NaN/NaT values if it recognizes that NaN/NaT are possible in the real data. I would strongly recommend running the diagnostic report to see what's happening. We expect the score to be 100% (see the docs for more info). What is the score for you?

```python
from sdv.evaluation.multi_table import run_diagnostic

diagnostic_report = run_diagnostic(
    real_data=data,
    synthetic_data=synthetic_data,
    metadata=metadata)
```

If it is 100%, it indicates that the SDV is working as intended. The problem may be in how the data is loaded into Python: Python may be reading in some values as NaN or NaT. Let me know what the score is and we can discuss next steps.

Running Test Data
Using your metadata, I created some random test data. Modeling and sampling using HMA, I did not observe any NaN or NaT values. I have attached it here. Could you try it out?
```
Generating report ...
Overall Score: 97.56%

Properties:
```
Hi @leeyuntien-milli, thanks for confirming. Right, if the test data is also producing NaN/NaT values, I wonder if this is related to your Python environment or the way you're loading the data into Python.

Could you please share the code you are using to read the data into Python, along with anything you may be doing to modify that data once it's loaded?

The recommended approach is to use the load_csvs function:

```python
from sdv.datasets.local import load_csvs
from sdv.multi_table import HMASynthesizer

# assume you have unzipped tests_data.zip
data = load_csvs(folder_name='test_data/')

# should you need to inspect it, the data is available under each file name
med_table = data['med']
pharm_table = data['pharm']
mem_table = data['mem']

# NO further modification of the data is necessary
# you can directly use it with SDV
synthesizer = HMASynthesizer(metadata)
synthesizer.fit(data)
```
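One common way blank cells turn into NaN before SDV ever sees the data is pandas' default NA parsing at load time. The sketch below shows the effect with pd.read_csv directly; the CSV content is made up, and whether this matches load_csvs's internal behavior is an assumption:

```python
import io
import pandas as pd

csv = "Member_ID,Diag1,Diag2\n1,E11.9,\n2,I10,NA\n"

# Default parsing: empty cells and the literal string "NA" both become NaN
default = pd.read_csv(io.StringIO(csv))
print(default['Diag2'].isna().tolist())  # -> [True, True]

# keep_default_na=False preserves them as plain strings instead
strict = pd.read_csv(io.StringIO(csv), keep_default_na=False)
print(strict['Diag2'].tolist())  # -> ['', 'NA']
```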
Please refer to the code using your suggested load_csvs function; the results are similar. The three tables in test_data are put under the folder data/.
```
Generating report ...
Overall Score: 94.88%

Properties:
```
@leeyuntien-milli, so using the same exact dataset and SDV version, your results are different from what we're seeing. Very interesting. This possibly indicates an issue with the versions of other libraries or the platform. Could you provide more information about your setup? This includes:
Hi @leeyuntien-milli, thanks for the info. We realized that there is a key difference between my previous comment and the code you provided: in your code, you are updating the distribution. For SDV 1.10.0, you no longer need to update the distribution. It works for me if you remove this and just directly fit the synthesizer.

```python
all_data = load_csvs(folder_name='data/')

metadata = MultiTableMetadata()
metadata.detect_from_dataframes(data=all_data)

synthesizer = HMASynthesizer(metadata)

# directly fit the data
# no need to update the synthesizer
synthesizer.fit(all_data)
synthetic_data = synthesizer.sample()

diagnostic_report = run_diagnostic(
    real_data=all_data,
    synthetic_data=synthetic_data,
    metadata=metadata)
```

Let me know if that works. In the meantime, we will investigate why updating the distribution causes this.
Thanks, test_data passed, so going to see if the original datasets work.

```
Generating report ...
Overall Score: 100.0%

Properties:
```
Hi @leeyuntien-milli, thanks for the detailed response. In the interest of keeping our space clean, we usually keep 1 GitHub issue open per technical problem. Since we were able to resolve the problem of NaNs (and this issue is getting pretty long), let me close this one. Let's use #1848 for this.
I'm filing this issue on behalf of a user.
Environment details
Problem description
We tried to fit an HMA synthesizer on three tables.
The tables are linked by one key, Member_ID. However, when we generated synthesized data with a 1% portion, the relationships between dates and NDC and ICD codes do not seem to show up properly, judging from the screenshots of the synthesized datasets. Can you advise how we might be able to improve this? Thanks.
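A simple way to spot-check the Member_ID relationships described above is to verify that every key in a synthesized child table also exists in the synthesized parent. This is a sketch with stand-in frames, not the real output:

```python
import pandas as pd

# Stand-in synthesized parent and child tables
mem = pd.DataFrame({'Member_ID': [1, 2, 3]})
pharm = pd.DataFrame({'Member_ID': [1, 1, 4]})  # 4 would be a broken reference

# Any child key missing from the parent indicates broken referential integrity
orphans = set(pharm['Member_ID']) - set(mem['Member_ID'])
print(orphans)  # -> {4}
```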