Releases: sdv-dev/SDV
v0.13.1 - 2021-12-22
This release adds support for passing tabular constraints to the HMA1 model, and adds more explicit error handling for
metric evaluation. It also includes a fix for using categorical columns in the PAR model and documentation updates
for metadata and HMA1.
Bugs Fixed
New Features
- Support passing tabular constraints to the HMA1 model - Issue #296 by @katxiao
- Metric evaluation error handling metrics - Issue #638 by @katxiao
Documentation Changes
v0.13.0 - 2021-11-22
This release makes multiple improvements to different Constraint classes. The Unique constraint can now handle columns with the name index and no longer crashes on subsets of the original data. The Between constraint can now handle columns with nulls properly. The memory usage of all constraints was also improved.
Various other features and fixes were added. Conditional sampling no longer crashes when the num_rows argument is not provided. Multiple localizations can now be used for PII fields. Scaffolding for integration tests was added and the workflows now run pip check.
Additionally, this release adds support for Python 3.9!
Bugs Fixed
- Gaussian Copula – Memory Issue in Release 0.10.0 - Issue #459 by @xamm
- Applying Unique Constraint errors when calling model.fit() on a subset of data - Issue #610 by @xamm
- Calling sampling with conditions and without num_rows crashes - Issue #614 by @xamm
- Metadata.visualize with path parameter throws AttributeError - Issue #634 by @xamm
- The Unique constraint crashes when the data contains a column called index - Issue #616 by @xamm
- The Unique constraint cannot handle non-default index - Issue #617 by @xamm
- ConstraintsNotMetError when applying Between constraint on datetime columns containing null values - Issue #632 by @katxiao
New Features
Housekeeping Tasks
- Support latest version of Faker - Issue #621 by @katxiao
- Add scaffolding for Metadata integration tests - Issue #624 by @katxiao
- Add support for Python 3.9 - Issue #631 by @amontanez24
Internal Improvements
- Add pip check to CI workflows - Issue #626 by @pvk-developer
Documentation Changes
Special thanks to @xamm, @katxiao, @pvk-developer and @amontanez24 for all the work that made this release possible!
v0.12.1 - 2021-10-12
This release fixes bugs in constraints, metadata behavior, and SDV documentation. Specifically, we added
proper handling of data containing null values for constraints and timeseries data, and updated the
default metadata detection behavior.
Bugs Fixed
- ValueError: The parameter loc has invalid values - Issue #353 by @fealho
- Gaussian Copula is generating different data with metadata and without metadata - Issue #576 by @katxiao
- Make pomegranate an optional dependency - Issue #567 by @katxiao
- Small wording change for Question Issue Template - Issue #571 by @katxiao
- ConstraintsNotMetError when using GreaterThan constraint with datetime - Issue #590 by @katxiao
- GreaterThan constraint crashing with NaN values - Issue #592 by @katxiao
- Null values in GreaterThan constraint raises error - Issue #589 by @katxiao
- ColumnFormula raises ConstraintsNotMetError when checking NaN values - Issue #593 by @katxiao
- GreaterThan constraint raises TypeError when using datetime - Issue #596 by @katxiao
- Fix repository language - Issue #464 by @fealho
- Update init.py - Issue #578 by @dyuliu
- IndexingError: Unalignable boolean - Issue #446 by @fealho
v0.12.0 - 2021-08-17
This release focuses on improving and expanding upon the existing constraints. More specifically, users can now (1) specify multiple columns in Positive and Negative constraints, (2) use the new Unique constraint and (3) use datetime data with the Between constraint. Additionally, error messages have been added and updated to provide more useful feedback to the user.
Besides the added features, several bugs regarding the UniqueCombinations and ColumnFormula constraints have been fixed, and an error in the metadata.json for the student_placements dataset was corrected. The release also added documentation for the fit_columns_model argument, which affects the majority of the available constraints.
New Features
- Change default fit_columns_model to False - Issue #550 by @katxiao
- Support multi-column specification for positive and negative constraint - Issue #545 by @sarahmish
- Raise error when multiple constraints can't be enforced - Issue #541 by @amontanez24
- Create Unique Constraint - Issue #532 by @amontanez24
- Passing invalid conditions when using constraints produces unreadable errors - Issue #511 by @katxiao
- Improve error message for ColumnFormula constraint when constraint column used in formula - Issue #508 by @katxiao
- Add datetime functionality to Between constraint - Issue #504 by @katxiao
Bugs Fixed
- UniqueCombinations constraint with handling_strategy = 'transform' yields synthetic data with nan values - Issue #521 by @katxiao and @csala
- UniqueCombinations constraint outputting wrong data type - Issue #510 by @katxiao and @csala
- UniqueCombinations constraint on only one column gets stuck in an infinite loop - Issue #509 by @katxiao
- Conditioning on a non-constraint column using the ColumnFormula constraint - Issue #507 by @katxiao
- Conditioning on the constraint column of the ColumnFormula constraint - Issue #506 by @katxiao
- Update metadata.json for duration of student_placements dataset - Issue #503 by @amontanez24
- Unit test for HMA1 when working with a single child row per parent row - Issue #497 by @pvk-developer
- UniqueCombinations constraint for more than 2 columns - Issue #494 by @katxiao and @csala
Documentation Changes
v0.11.0 - 2021-07-12
This release primarily addresses bugs and feature requests related to using constraints for the single-table models. Users can now enforce scalar comparison with the existing GreaterThan constraint and apply 5 new constraints: OneHotEncoding, Positive, Negative, Between and Rounding. Additionally, the SDV will now auto-apply constraints for rounding numerical values, and for keeping the data within the observed bounds. All related user guides are updated with the new functionality.
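The auto-applied rounding and bounds behaviour can be pictured with a small sketch. This is only an illustration in plain Python, not the SDV's implementation: the function names learn_rounding_and_bounds and enforce are hypothetical, and the sketch assumes finite numeric values whose precision can be read from their string form.

```python
from decimal import Decimal


def learn_rounding_and_bounds(observed):
    """Learn the decimal precision and the min/max of the observed values."""
    # Number of decimal digits of each value, taken from its string form.
    decimals = max(
        max(0, -Decimal(str(value)).as_tuple().exponent) for value in observed
    )
    return decimals, min(observed), max(observed)


def enforce(value, decimals, lower, upper):
    """Clip a sampled value to the observed bounds, then round it."""
    clipped = min(max(value, lower), upper)
    return round(clipped, decimals)


# "Real" data: at most two decimal digits, values between 1.5 and 10.0.
observed = [1.5, 2.25, 10.0]
decimals, lower, upper = learn_rounding_and_bounds(observed)

# Hypothetical raw model output before the learned limits are applied.
sampled = [12.3456, 3.14159, 0.2]
print([enforce(value, decimals, lower, upper) for value in sampled])
```

Values outside the observed range are pulled back to the nearest bound, and every value is rounded to the largest number of decimal digits seen in the observed data.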
New Features
- Add OneHotEncoding Constraint - Issue #303 by @fealho
- GreaterThan Constraint should apply to scalars - Issue #410 by @amontanez24
- Improve GreaterThan constraint - Issue #368 by @amontanez24
- Add Non-negative and Positive constraints across multiple columns - Issue #409 by @amontanez24
- Add Between values constraint - Issue #367 by @fealho
- Ensure values fall within the specified range - Issue #423 by @amontanez24
- Add Rounding constraint - Issue #482 by @katxiao
- Add rounding and min/max arguments that are passed down to the NumericalTransformer - Issue #491 by @amontanez24
Bugs Fixed
- GreaterThan constraint between Date columns raises TypeError - Issue #421 by @amontanez24
- GreaterThan constraint's transform strategy fails on columns that are not float - Issue #448 by @amontanez24
- AttributeError on UniqueCombinations constraint with non-strings - Issue #196 by @katxiao
- Use reject sampling to sample missing columns for constraints - Issue #435 by @amontanez24
Documentation Changes
v0.10.1 - 2021-06-10
This release changes the way we sample conditions to not only group by the conditions passed by the user, but also by the transformed conditions that result from them.
Issues resolved
- Conditionally sampling on variable in constraint should have variety for other variables - Issue #440 by @amontanez24
v0.10.0 - 2021-05-21
This release improves the constraint functionality by allowing constraints and conditions
at the same time. Additional changes were made to update tutorials.
Issues resolved
- Not able to use constraints and conditions in the same time - Issue #379 by @amontanez24
- Update benchmarking user guide for reading private datasets - Issue #427 by @katxiao
v0.9.1 - 2021-04-29
This release broadens the constraint functionality by allowing the ColumnFormula constraint to take lambda functions and returned functions as input for its formula.
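To show what accepting "lambda functions and returned functions" means in practice, here is a minimal sketch in plain Python. The FormulaColumn class and the scaled_by helper are hypothetical stand-ins written for this example; they are not the SDV's ColumnFormula class, and rows are modelled as simple dictionaries.

```python
def scaled_by(source, factor):
    """Return a formula function that multiplies one column by a factor."""
    def formula(row):
        return row[source] * factor
    return formula


class FormulaColumn:
    """Hypothetical stand-in for a formula constraint; not SDV's class."""

    def __init__(self, column, formula):
        # Any callable is accepted: a def, a lambda, or a returned closure.
        if not callable(formula):
            raise TypeError("formula must be callable")
        self.column = column
        self.formula = formula

    def apply(self, rows):
        """Return copies of the rows with the column recomputed."""
        return [{**row, self.column: self.formula(row)} for row in rows]

    def is_valid(self, rows):
        """Check, row by row, whether the column matches the formula."""
        return [row.get(self.column) == self.formula(row) for row in rows]


rows = [{"price": 10, "quantity": 3}, {"price": 4, "quantity": 5}]

# Formula given as a lambda.
total = FormulaColumn("total", lambda row: row["price"] * row["quantity"])
# Formula given as a function returned by another function.
doubled = FormulaColumn("double_price", scaled_by("price", 2))

print(total.apply(rows))
print(doubled.apply(rows))
```

Because the sketch only requires the formula to be callable, a lambda and a closure returned by a factory function are treated the same way as a regular named function.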
It also improves conditional sampling by ensuring that any id fields generated by the model remain unique throughout the sampled data.
The CTGAN model was improved by adjusting a default parameter to be more mathematically correct.
Additional changes were made to improve tutorials as well as fix fragile tests.
Issues resolved
- Tutorials test sometimes fails - Issue #355 by @fealho
- Duplicate IDs when using reject-sampling - Issue #331 by @amontanez24 and @csala
- discriminator_decay should be initialized at 1e-6 but it's 0 - Issue #401 by @fealho and @YoucefZemmouri
- Tutorial typo - Issue #380 by @fealho
- Request for sdv.constraint.ColumnFormula for a wider range of function - Issue #373 by @amontanez24 and @JetfiRex
v0.9.0 - 2021-03-31
This release brings new privacy metrics to the evaluate framework, which help determine whether the real data could be obtained or deduced from the synthetic samples. Additionally, the metrics now have a normalized score, which stays between 0 and 1.
Memory usage when sampling new data has been reduced. There is also a new parameter, graceful_reject_sampling, that controls what happens when reject sampling fails: if it is set to True and not all the requested rows can be generated, the model issues a warning and returns whatever rows it was able to generate instead of crashing.
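The behaviour described for graceful_reject_sampling can be sketched as a generic reject-sampling loop. This is an illustration in plain Python under stated assumptions, not the SDV's code: reject_sample, its graceful and max_retries arguments, and the counter-based generator are all hypothetical names introduced for the example.

```python
import itertools
import warnings


def reject_sample(generate_batch, is_valid, num_rows, max_retries=100,
                  graceful=False):
    """Collect num_rows valid rows, rejecting rows that fail is_valid."""
    accepted = []
    for _ in range(max_retries):
        # Only ask for as many rows as are still missing.
        batch = generate_batch(num_rows - len(accepted))
        accepted.extend(row for row in batch if is_valid(row))
        if len(accepted) >= num_rows:
            return accepted[:num_rows]

    message = f"Could only sample {len(accepted)} of {num_rows} valid rows."
    if graceful:
        # Graceful mode: warn and return the partial result.
        warnings.warn(message)
        return accepted
    raise ValueError(message)


def make_counter_generator():
    """Deterministic stand-in for a model: yields 0, 1, 2, ... in batches."""
    counter = itertools.count()
    return lambda size: [next(counter) for _ in range(size)]


def is_even(value):
    """Stand-in for a constraint or condition check."""
    return value % 2 == 0


# Enough retries: all four requested rows are produced.
print(reject_sample(make_counter_generator(), is_even, num_rows=4))

# A single attempt is not enough; graceful mode returns the partial result.
print(reject_sample(make_counter_generator(), is_even, num_rows=4,
                    max_retries=1, graceful=True))
```

With graceful left at False, the second call would raise a ValueError instead of returning the rows collected so far.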
The Metadata object can now be visualized using different combinations of names and details, which can be set to True or False in order to display the table names with or without details. Validation has also been improved: it now displays all the errors found at the end of the validation instead of only the first one.
This version also exposes all the hyperparameters of the CTGAN and TVAE models to allow more advanced usage. There is also a fix for the TVAE model on small datasets, and its performance with NaN values has been improved. Finally, there is a fix for using UniqueCombinationConstraint with the transform strategy.
Issues resolved
- Memory Usage Gaussian Copula Trained Model consuming high memory when generating synthetic data - Issue #304 by @pvk-developer
- Add option to visualize metadata with only table names - Issue #347 by @csala
- Add sample parameter to control reject sampling crash - Issue #343 by @fealho
- Verbose metadata validation - Issue #348 by @csala
- Missing the introduction of custom specification for hyperparameters in the TVAE model - Issue #344 by @pvk-developer
v0.8.0 - 2021-02-24
This version adds conditional sampling for tabular models by combining a reject-sampling
strategy with the native conditional sampling capabilities of the Gaussian Copulas.
It also introduces several upgrades to the HMA1 algorithm that improve data quality and
robustness in multi-table scenarios by changing how the parameters of the child
tables are aggregated on the parent tables, including a complete rework of how the correlation
matrices are modeled and rebuilt after sampling.
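The idea behind "native" conditional sampling in a Gaussian model can be illustrated with the standard bivariate normal identity: once one variable is fixed, the other is still normal, with a shifted mean and a reduced standard deviation. The sketch below is a simplified illustration and not the SDV's implementation; the function names are hypothetical, and a real Gaussian copula would first map each column to a normal marginal before conditioning.

```python
import math
import random


def conditional_normal(mu, sigma, rho, x1):
    """Mean and standard deviation of x2 given x1 for a bivariate normal.

    mu and sigma are (first, second) pairs; rho is the correlation.
    """
    mu1, mu2 = mu
    sigma1, sigma2 = sigma
    mean = mu2 + rho * (sigma2 / sigma1) * (x1 - mu1)
    std = sigma2 * math.sqrt(1 - rho ** 2)
    return mean, std


def sample_given(mu, sigma, rho, x1, num_rows, seed=None):
    """Sample rows (x1, x2) with x1 fixed to the requested condition."""
    rng = random.Random(seed)
    mean, std = conditional_normal(mu, sigma, rho, x1)
    return [(x1, rng.gauss(mean, std)) for _ in range(num_rows)]


mean, std = conditional_normal(mu=(0.0, 0.0), sigma=(1.0, 2.0), rho=0.5, x1=2.0)
print(mean, std)

for row in sample_given((0.0, 0.0), (1.0, 2.0), 0.5, x1=2.0, num_rows=3, seed=0):
    print(row)
```

Every sampled row satisfies the condition exactly, so no rows need to be rejected for the conditioned column; reject sampling is then only needed for requirements the model cannot condition on directly.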