Releases: sdv-dev/SDV

v0.13.1 - 2021-12-22

22 Dec 20:35

This release adds support for passing tabular constraints to the HMA1 model, and adds more explicit error handling for
metric evaluation. It also includes a fix for using categorical columns in the PAR model and documentation updates
for metadata and HMA1.

Bugs Fixed

  • Categorical column after sequence_index column - Issue #314 by @fealho

New Features

  • Support passing tabular constraints to the HMA1 model - Issue #296 by @katxiao
  • Metric evaluation error handling metrics - Issue #638 by @katxiao

Documentation Changes

  • Make true/false values lowercase in Metadata Schema specification - Issue #664 by @katxiao
  • Update docstrings for hma1 methods - Issue #642 by @katxiao

v0.13.0 - 2021-11-22

22 Nov 21:06

This release makes multiple improvements to different Constraint classes. The Unique constraint can now
handle columns with the name index and no longer crashes on subsets of the original data. The Between
constraint can now handle columns with nulls properly. The memory usage of all constraints was also reduced.

Various other features and fixes were added. Conditional sampling no longer crashes when the num_rows argument
is not provided. Multiple localizations can now be used for PII fields. Scaffolding for integration tests was added
and the workflows now run pip check.

Additionally, this release adds support for Python 3.9!

Bugs Fixed

  • Gaussian Copula – Memory Issue in Release 0.10.0 - Issue #459 by @xamm
  • Applying Unique Constraint errors when calling model.fit() on a subset of data - Issue #610 by @xamm
  • Calling sampling with conditions and without num_rows crashes - Issue #614 by @xamm
  • Metadata.visualize with path parameter throws AttributeError - Issue #634 by @xamm
  • The Unique constraint crashes when the data contains a column called index - Issue #616 by @xamm
  • The Unique constraint cannot handle non-default index - Issue #617 by @xamm
  • ConstraintsNotMetError when applying Between constraint on datetime columns containing null values - Issue #632 by @katxiao

New Features

  • Adds Multi localisations feature for PII fields defined in #308 - PR #609 by @xamm

Housekeeping Tasks

Internal Improvements

Documentation Changes

  • Anonymizing PII in single table tutorials states address field as e-mail type - Issue #604 by @xamm

Special thanks to @xamm, @katxiao, @pvk-developer and @amontanez24 for all the work that made this release possible!

v0.12.1 - 2021-10-12

12 Oct 19:44

This release fixes bugs in constraints, metadata behavior, and SDV documentation. Specifically, we added
proper handling of data containing null values for constraints and timeseries data, and updated the
default metadata detection behavior.
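
As a rough illustration of what null handling in a comparison constraint can look like, the sketch below treats a missing value on either side as satisfying the check. This is an assumption about the intended behavior, not SDV's implementation, and the helper names `is_null` and `greater_than_with_nulls` are hypothetical.

```python
import math
from datetime import date


def is_null(value):
    """Return True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def greater_than_with_nulls(low, high):
    """Return True when high >= low, or when either side is null.

    Assumption: a row with a missing value cannot violate the
    comparison, so it is counted as valid instead of raising.
    """
    if is_null(low) or is_null(high):
        return True
    return high >= low


# Works for numbers and for datetime values alike.
numeric_ok = greater_than_with_nulls(1.0, 2.5)
nan_ok = greater_than_with_nulls(float("nan"), 2.5)
date_ok = greater_than_with_nulls(date(2021, 1, 1), date(2021, 2, 1))
date_null_ok = greater_than_with_nulls(date(2021, 1, 1), None)
date_bad = greater_than_with_nulls(date(2021, 2, 1), date(2021, 1, 1))
```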

Bugs Fixed

  • ValueError: The parameter loc has invalid values - Issue #353 by @fealho
  • Gaussian Copula is generating different data with metadata and without metadata - Issue #576 by @katxiao
  • Make pomegranate an optional dependency - Issue #567 by @katxiao
  • Small wording change for Question Issue Template - Issue #571 by @katxiao
  • ConstraintsNotMetError when using GreaterThan constraint with datetime - Issue #590 by @katxiao
  • GreaterThan constraint crashing with NaN values - Issue #592 by @katxiao
  • Null values in GreaterThan constraint raises error - Issue #589 by @katxiao
  • ColumnFormula raises ConstraintsNotMetError when checking NaN values - Issue #593 by @katxiao
  • GreaterThan constraint raises TypeError when using datetime - Issue #596 by @katxiao
  • Fix repository language - Issue #464 by @fealho
  • Update __init__.py - Issue #578 by @dyuliu
  • IndexingError: Unalignable boolean - Issue #446 by @fealho

v0.12.0 - 2021-08-17

19 Aug 05:29

This release focuses on improving and expanding upon the existing constraints. More specifically, users can now
(1) specify multiple columns in Positive and Negative constraints, (2) use the new Unique constraint and
(3) use datetime data with the Between constraint. Additionally, error messages have been added and updated
to provide more useful feedback to the user.

Besides the added features, several bugs regarding the UniqueCombinations and ColumnFormula constraints have been fixed,
and an error in the metadata.json for the student_placements dataset was corrected. The release also added documentation
for fit_columns_model, which affects the majority of the available constraints.

New Features

  • Change default fit_columns_model to False - Issue #550 by @katxiao
  • Support multi-column specification for positive and negative constraint - Issue #545 by @sarahmish
  • Raise error when multiple constraints can't be enforced - Issue #541 by @amontanez24
  • Create Unique Constraint - Issue #532 by @amontanez24
  • Passing invalid conditions when using constraints produces unreadable errors - Issue #511 by @katxiao
  • Improve error message for ColumnFormula constraint when constraint column used in formula - Issue #508 by @katxiao
  • Add datetime functionality to Between constraint - Issue #504 by @katxiao

Bugs Fixed

  • UniqueCombinations constraint with handling_strategy = 'transform' yields synthetic data with nan values - Issue #521 by @katxiao and @csala
  • UniqueCombinations constraint outputting wrong data type - Issue #510 by @katxiao and @csala
  • UniqueCombinations constraint on only one column gets stuck in an infinite loop - Issue #509 by @katxiao
  • Conditioning on a non-constraint column using the ColumnFormula constraint - Issue #507 by @katxiao
  • Conditioning on the constraint column of the ColumnFormula constraint - Issue #506 by @katxiao
  • Update metadata.json for duration of student_placements dataset - Issue #503 by @amontanez24
  • Unit test for HMA1 when working with a single child row per parent row - Issue #497 by @pvk-developer
  • UniqueCombinations constraint for more than 2 columns - Issue #494 by @katxiao and @csala

Documentation Changes

  • Add explanation of fit_columns_model to API docs - Issue #517 by @katxiao

v0.11.0 - 2021-07-12

12 Jul 22:44

This release primarily addresses bugs and feature requests related to using constraints for the single-table models. Users can now enforce scalar comparison with the existing GreaterThan constraint and apply 5 new constraints: OneHotEncoding, Positive, Negative, Between and Rounding. Additionally, the SDV will now auto-apply constraints for rounding numerical values, and for keeping the data within the observed bounds. All related user guides are updated with the new functionality.
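
To make the idea of these checks concrete, here is a minimal sketch of row-level validity tests in plain Python. It is not SDV's API: the function names, the dict-based rows, and the convention that a string means "column name" while anything else means "scalar" are all assumptions made for illustration.

```python
def greater_than(row, low, high, strict=False):
    """Check high > low (or >=), where each side is a column name or a scalar.

    Assumption: a str is looked up as a column, any other value is used
    directly as a scalar.
    """
    low_value = row[low] if isinstance(low, str) else low
    high_value = row[high] if isinstance(high, str) else high
    return high_value > low_value if strict else high_value >= low_value


def between(row, column, low, high):
    """Check that the column value lies within the closed range [low, high]."""
    return low <= row[column] <= high


def one_hot(row, columns):
    """Check that exactly one of the columns is 1 and the rest are 0."""
    values = sorted(row[name] for name in columns)
    return values == [0] * (len(columns) - 1) + [1]


row = {"start": 3, "end": 7, "is_a": 0, "is_b": 1, "is_c": 0}

column_vs_column = greater_than(row, "start", "end")
column_vs_scalar = greater_than(row, 0, "start")
in_range = between(row, "end", 0, 10)
valid_one_hot = one_hot(row, ["is_a", "is_b", "is_c"])
```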

New Features

  • Add OneHotEncoding Constraint - Issue #303 by @fealho
  • GreaterThan Constraint should apply to scalars - Issue #410 by @amontanez24
  • Improve GreaterThan constraint - Issue #368 by @amontanez24
  • Add Non-negative and Positive constraints across multiple columns - Issue #409 by @amontanez24
  • Add Between values constraint - Issue #367 by @fealho
  • Ensure values fall within the specified range - Issue #423 by @amontanez24
  • Add Rounding constraint - Issue #482 by @katxiao
  • Add rounding and min/max arguments that are passed down to the NumericalTransformer - Issue #491 by @amontanez24

Bugs Fixed

  • GreaterThan constraint between Date columns raises TypeError - Issue #421 by @amontanez24
  • GreaterThan constraint's transform strategy fails on columns that are not float - Issue #448 by @amontanez24
  • AttributeError on UniqueCombinations constraint with non-strings - Issue #196 by @katxiao
  • Use reject sampling to sample missing columns for constraints - Issue #435 by @amontanez24

Documentation Changes

  • Ensure privacy metrics are available in the API docs - Issue #458 by @fealho
  • Ensure formula constraint is called ColumnFormula everywhere in the docs - Issue #449 by @fealho

v0.10.1 - 2021-06-10

11 Jun 01:48

This release changes the way we sample conditions to not only group by the conditions passed by the user, but also by the transformed conditions that result from them.

Issues resolved

  • Conditionally sampling on variable in constraint should have variety for other variables - Issue #440 by @amontanez24

v0.10.0 - 2021-05-21

21 May 21:23

This release improves the constraint functionality by allowing constraints and conditions
to be used at the same time. Additional changes were made to update tutorials.

Issues resolved

  • Not able to use constraints and conditions in the same time - Issue #379 by @amontanez24
  • Update benchmarking user guide for reading private datasets - Issue #427 by @katxiao

v0.9.1 - 2021-04-29

29 Apr 21:10

This release broadens the constraint functionality by allowing the ColumnFormula
constraint to take lambda functions and returned functions as input for its formula.
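
The difference between the two kinds of callable can be sketched without SDV. In the simplified example below, a formula is any callable that maps a row to a value, rows are plain dicts, and `apply_column_formula` and `make_scaled_sum` are hypothetical helpers, not part of the library.

```python
def make_scaled_sum(factor):
    """Return a formula function (a "returned function", i.e. a closure)."""
    def formula(row):
        return factor * (row["base"] + row["bonus"])
    return formula


def apply_column_formula(rows, column, formula):
    """Recompute `column` for every row from the given formula callable."""
    return [{**row, column: formula(row)} for row in rows]


rows = [{"base": 10, "bonus": 2}, {"base": 5, "bonus": 0}]

# A lambda passed directly as the formula.
with_lambda = apply_column_formula(
    rows, "total", lambda row: row["base"] + row["bonus"]
)

# A function returned by another function.
with_closure = apply_column_formula(rows, "total", make_scaled_sum(2))
```

Both calls go through the same code path, which is the point: the caller only needs something callable, regardless of how it was created.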

It also improves conditional sampling by ensuring that any id fields generated by the
model remain unique throughout the sampled data.

The CTGAN model was improved by adjusting a default parameter to be more mathematically
correct.

Additional changes were made to improve tutorials as well as fix fragile tests.

Issues resolved

v0.9.0 - 2021-03-31

01 Apr 01:14

This release brings new privacy metrics to the evaluate framework, which help to determine whether the real data could be obtained or deduced from the synthetic samples. Additionally, the metrics now have a normalized score, which stays between 0 and 1.

There are improvements that reduce memory usage when sampling new data. There is also a new parameter to control the reject sampling crash, graceful_reject_sampling: if it is set to True and it is not possible to generate all the requested rows, sampling will just issue a warning and return whatever it was able to generate.
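
The behavior described for graceful_reject_sampling can be sketched as follows. Only the parameter name comes from the release notes; the `reject_sample` function, its `max_tries` limit, and the dict-based rows are assumptions made for this illustration and do not reflect SDV's internals.

```python
import random
import warnings


def reject_sample(sampler, condition, num_rows, max_tries=100,
                  graceful_reject_sampling=False):
    """Draw rows from `sampler`, keeping only those that satisfy `condition`."""
    accepted = []
    for _ in range(max_tries):
        row = sampler()
        if condition(row):
            accepted.append(row)
            if len(accepted) == num_rows:
                return accepted

    if graceful_reject_sampling:
        # Warn and return the partial result instead of crashing.
        warnings.warn(
            f"Only {len(accepted)} of {num_rows} rows could be sampled."
        )
        return accepted

    raise ValueError(
        f"Could not sample {num_rows} valid rows in {max_tries} tries."
    )


rng = random.Random(0)


def sampler():
    return {"age": rng.randint(0, 99)}


def impossible(row):
    # No sampled row can satisfy this, so sampling always falls short.
    return row["age"] > 200


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    rows = reject_sample(sampler, impossible, num_rows=5,
                         graceful_reject_sampling=True)
```

With the flag left at False, the same call raises an error instead of returning the partial result.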

The Metadata object can now be visualized using different combinations of names and details, each of which can be set to True or False to control whether the table names and the details are displayed. Validation has also improved: it now displays all the errors found at the end of the validation instead of only the first one.
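
Reporting every validation error at once, rather than stopping at the first, usually comes down to accumulating messages and raising a single exception at the end. The sketch below shows that pattern on a simplified metadata dict; the `validate_metadata` function and the two checks it performs are hypothetical and far smaller than what the library actually validates.

```python
def validate_metadata(metadata):
    """Collect every problem found, then raise once with the full list."""
    errors = []
    for table_name, table in metadata.get("tables", {}).items():
        fields = table.get("fields", {})
        primary_key = table.get("primary_key")
        if primary_key is not None and primary_key not in fields:
            errors.append(
                f"{table_name}: primary key '{primary_key}' is not a field"
            )
        for field_name, field in fields.items():
            if "type" not in field:
                errors.append(f"{table_name}.{field_name}: missing 'type'")

    if errors:
        raise ValueError("Invalid metadata:\n" + "\n".join(errors))


bad_metadata = {
    "tables": {
        "users": {
            "primary_key": "user_id",
            "fields": {"name": {}},
        }
    }
}

try:
    validate_metadata(bad_metadata)
    message = ""
except ValueError as error:
    message = str(error)
```

Here `message` contains both the primary key problem and the missing type, so the user can fix them in one pass.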

This version also exposes all the hyperparameters of the CTGAN and TVAE models to allow more advanced usage. There is also a fix for the TVAE model on small datasets, and its performance with NaN values has been improved. There is a fix for using the UniqueCombinations constraint with the transform strategy.

Issues resolved

  • Memory Usage Gaussian Copula Trained Model consuming high memory when generating synthetic data - Issue #304 by @pvk-developer
  • Add option to visualize metadata with only table names - Issue #347 by @csala
  • Add sample parameter to control reject sampling crash - Issue #343 by @fealho
  • Verbose metadata validation - Issue #348 by @csala
  • Missing the introduction of custom specification for hyperparameters in the TVAE model - Issue #344 by @pvk-developer

v0.8.0 - 2021-02-24

24 Feb 22:16

This version adds conditional sampling for tabular models by combining a reject-sampling
strategy with the native conditional sampling capabilities from the Gaussian Copulas.

It also introduces several upgrades to the HMA1 algorithm that improve data quality and
robustness in the multi-table scenarios by making changes in how the parameters of the child
tables are aggregated on the parent tables, including a complete rework of how the correlation
matrices are modeled and rebuilt after sampling.

Issues resolved

  • Fix probabilities contain NaN error - Issue #326 by @csala
  • Conditional Sampling for tabular models - Issue #316 by @fealho and @csala
  • HMA1: LinAlgError: SVD did not converge - Issue #240 by @csala