Update SDG Readme #139

Merged
merged 13 commits into from
Feb 21, 2024
60 changes: 33 additions & 27 deletions README.md
@@ -26,26 +26,40 @@
</p>
</div>

Synthetic Data Generator (SDG) is a framework focused on quickly generating high-quality structured tabular data. It supports many single-table and multi-table data synthesis algorithms, achieving up to 120 times performance improvement, and supports differential privacy and other methods to enhance the security of synthesized data.
The Synthetic Data Generator (SDG) is a specialized framework designed to rapidly generate high-quality structured tabular data. It incorporates a wide range of single-table and multi-table data synthesis algorithms and also integrates an LLM-based synthetic data generation model.

Synthetic data is generated by machines based on real data and algorithms; it does not contain sensitive information, but it can retain the characteristics of real data.
There is no correspondence between synthetic data and real data, and it is not subject to privacy regulations such as GDPR and ADPPA.
In practical applications, there is no need to worry about the risk of privacy leakage.
High-quality synthetic data can also be used in various fields such as data opening, model training and debugging, system development and testing, etc.
Synthetic data, generated by machines using real data, metadata, and algorithms, does not contain any sensitive information, yet it retains the essential characteristics of the original data. There is no direct correlation between synthetic data and real data, making it exempt from privacy regulations such as GDPR and ADPPA. This eliminates the risk of privacy breaches in practical applications.

## 🎉 Features
High-quality synthetic data can be safely utilized across various domains including data sharing, model training and debugging, system development and testing, etc. Read [the latest API docs](https://synthetic-data-generator.readthedocs.io/en/latest/) for more details!

- High performance
  - Supports a wide range of statistical data synthesis algorithms to achieve up to 120x performance improvement, without the need for GPU devices;
## 🔧 Features

- Technological advancements
  - Supports a wide range of statistical data synthesis algorithms, and also integrates an LLM-based synthetic data generation model;
  - Optimised for big data scenarios, effectively reducing memory consumption;
  - Continuously tracks the latest advances in academia and industry, and introduces support for excellent algorithms and models in a timely manner;
  - Provides distributed training support for deep learning models with frameworks such as torch.
- Privacy enhancements:
- Privacy enhancements
  - SDG supports differential privacy, anonymization and other methods to enhance the security of synthetic data.
- Easy to extend
  - Supports expansion of models, data processing, data connectors, etc. in the form of plug-in packages.
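
As a rough illustration of the differential privacy idea mentioned in the feature list, the sketch below adds Laplace noise to a counting query. It is the generic textbook mechanism written with the Python standard library only, not SDG's implementation; `laplace_noise`, `private_count`, the sample ages, and the `epsilon` value are all hypothetical choices made for this example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count; a counting query has sensitivity 1."""
    true_count = sum(1 for value in values if predicate(value))
    # Noise scale = sensitivity / epsilon (sensitivity is 1 for a count).
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # fixed seed so the demo is repeatable
ages = [23, 35, 41, 52, 29, 67, 44]  # made-up data; four values are >= 40
noisy = private_count(ages, lambda age: age >= 40, epsilon=1.0, rng=rng)
print(f"noisy count of ages >= 40: {noisy:.2f}")
```

A smaller `epsilon` gives a larger noise scale, which means stronger privacy at the cost of a less accurate count.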

Read [the latest API docs](https://synthetic-data-generator.readthedocs.io/en/latest/) for more details.
### 🎉 LLM-integrated synthetic data generation

LLMs have long been used to understand and generate various types of data, and they are also capable of generating tabular data. In addition, they offer some abilities that traditional approaches (GAN-based or statistical methods) cannot achieve.

Our `sdgx.models.LLM.single_table.gpt.SingleTableGPTModel` implements two new features:

#### Synthetic data generation without Data

No training data is required; synthetic data can be generated from metadata alone.

![Synthetic data generation without Data](assets/LLM_Case_1.gif)
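
One way to picture metadata-only generation is to describe each column to an LLM, ask for rows in CSV form, and parse the reply. The sketch below covers only the prompt-building and parsing steps. It is an assumption about how such a flow could look, not the internals of `SingleTableGPTModel`; the `ColumnSpec` type, the function names, and the sample reply are hypothetical, and no LLM is actually called.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class ColumnSpec:
    """Hypothetical column description; not an sdgx type."""
    name: str
    dtype: str
    description: str


def build_prompt(columns: list[ColumnSpec], n_rows: int) -> str:
    """Turn column metadata into a plain-text request for CSV rows."""
    lines = [f"Generate {n_rows} rows of realistic tabular data as CSV with a header row.", "Columns:"]
    for col in columns:
        lines.append(f"- {col.name} ({col.dtype}): {col.description}")
    return "\n".join(lines)


def parse_csv_reply(reply: str, columns: list[ColumnSpec]) -> list[dict[str, str]]:
    """Parse a CSV reply and check that its header matches the metadata."""
    reader = csv.DictReader(io.StringIO(reply.strip()))
    expected = [col.name for col in columns]
    if reader.fieldnames != expected:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return list(reader)


columns = [
    ColumnSpec("age", "int", "age in years"),
    ColumnSpec("city", "str", "city of residence"),
]
prompt = build_prompt(columns, n_rows=2)
sample_reply = "age,city\n34,Shenzhen\n51,Harbin"  # hand-written stand-in for a model reply
rows = parse_csv_reply(sample_reply, columns)
print(prompt)
print(rows)
```

Checking the header before accepting the rows is a cheap guard against replies that drift from the requested schema.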

#### Off-Table feature inference

Infers new column data from the existing data in the table and the knowledge held by the LLM.

![Off-Table feature inference](assets/LLM_Case_2.gif)
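
Off-table feature inference can be thought of as asking a model, row by row, for the value of a column that the table does not contain. The sketch below is a minimal illustration under that assumption and is not how `SingleTableGPTModel` is implemented; `row_to_prompt`, `infer_column`, and `fake_llm` are hypothetical names, and `fake_llm` is a hard-coded stand-in for a real model call.

```python
from typing import Callable


def row_to_prompt(row: dict[str, str], new_column: str) -> str:
    """Serialize one record and ask for the value of a column it lacks."""
    known = ", ".join(f"{key}={value}" for key, value in row.items())
    return f"Given a record with {known}, answer with only the most likely value of '{new_column}'."


def infer_column(
    rows: list[dict[str, str]],
    new_column: str,
    ask: Callable[[str], str],
) -> list[dict[str, str]]:
    """Return copies of the rows with the inferred column attached."""
    enriched = []
    for row in rows:
        answer = ask(row_to_prompt(row, new_column)).strip()
        enriched.append({**row, new_column: answer})
    return enriched


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; it only pattern-matches the prompt."""
    return "China" if "city=Shenzhen" in prompt else "unknown"


table = [
    {"name": "A", "city": "Shenzhen"},
    {"name": "B", "city": "Paris"},
]
enriched = infer_column(table, "country", fake_llm)
print(enriched)
```

Passing the model as a plain callable keeps the table logic testable without network access; a real client could be swapped in for `fake_llm`.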

## 🔛 Quick Start

@@ -57,9 +71,15 @@ You can use pre-built images to quickly experience the latest features.
docker pull idsteam/sdgx:latest
```

### Install from PyPI

```bash
pip install sdgx
```

### Local Install (Recommended)

At present, the code of this project is updated very quickly. We recommend that you use SDG by installing it through the source code.
You can also use SDG by installing it from source.

```bash
git clone git@github.com:hitsz-ids/synthetic-data-generator.git
@@ -68,12 +88,6 @@ pip install .
pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git
```

### Install from PyPI

```bash
pip install sdgx
```

### Quick Demo of Single Table Data Generation and Metric

#### Demo code
@@ -156,21 +170,13 @@ The SDG project was initiated by **Institute of Data Security, Harbin Institute

## 👩‍🎓 Related Work

### Research Paper

- CTGAN: [Modeling Tabular Data using Conditional GAN](https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html)
- C3-TGAN: [C3-TGAN- Controllable Tabular Data Synthesis with Explicit Correlations and Property Constraints](https://www.researchgate.net/publication/374652636_C3-TGAN-_Controllable_Tabular_Data_Synthesis_with_Explicit_Correlations_and_Property_Constraints)
- TVAE: [Modeling Tabular Data using Conditional GAN](https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html)
- table-GAN: [Data Synthesis based on Generative Adversarial Networks](https://arxiv.org/pdf/1806.03384.pdf)
- CTAB-GAN: [CTAB-GAN: Effective Table Data Synthesizing](https://proceedings.mlr.press/v157/zhao21a/zhao21a.pdf)
- OCT-GAN: [OCT-GAN: Neural ODE-based Conditional Tabular GANs](https://arxiv.org/pdf/2105.14969.pdf)

### Dataset

- [Adult](http://archive.ics.uci.edu/ml/datasets/adult)
- [Satellite](http://archive.ics.uci.edu/dataset/146/statlog+landsat+satellite)
- [Rossmann](https://www.kaggle.com/competitions/rossmann-store-sales/data)
- [Telstra](https://www.kaggle.com/competitions/telstra-recruiting-network/data)

## 📄 License

The SDG open source project is released under the Apache-2.0 license; see [LICENSE](https://github.com/hitsz-ids/synthetic-data-generator/blob/main/LICENSE) for details.
61 changes: 36 additions & 25 deletions README_ZH_CN.md
@@ -27,24 +27,41 @@
</p>
</div>

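Synthetic Data Generator (SDG) is a component focused on quickly generating high-quality structured tabular data. It supports many single-table and multi-table data synthesis algorithms, achieves up to a 120x performance improvement, and supports differential privacy and other methods to strengthen the security of synthetic data.
The Synthetic Data Generator (SDG) is a data component focused on quickly generating high-quality structured tabular data. SDG supports single-table and multi-table data synthesis algorithms and integrates a synthetic data generation model based on large language models (LLMs).

Synthetic data is generated by machines from real data and algorithms. It contains no sensitive information but retains the behavioral characteristics of the real data. Synthetic data has no correspondence with real data and is not subject to privacy regulations such as GDPR and ADPPA, so there is no need to worry about privacy leakage in practical applications. High-quality synthetic data can be used in many fields, such as secure data opening, model training and debugging, and system development and testing.
Synthetic data is generated by computers using real data, metadata, and algorithms. It does not contain any sensitive information, yet it retains the essential characteristics of the original data. There is no direct association between synthetic data and real data, which exempts it from privacy regulations such as GDPR and ADPPA and eliminates the risk of privacy leakage in practical applications.

## 🎉 Main Features
High-quality synthetic data can be used safely and in diverse ways across many domains, including data sharing, model training and debugging, and system development and testing. Read [the latest API docs](https://synthetic-data-generator.readthedocs.io/en/latest/) for more details.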

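- High performance
  - Supports a variety of statistical data synthesis algorithms, achieving up to a 120x performance improvement with no GPU required;
## 🔧 Main Features

- Continuous advancement:
  - Supports a variety of statistical data synthesis algorithms, as well as LLM-based synthetic data generation methods;
  - Optimised for big data scenarios, effectively reducing memory consumption;
  - Continuously tracks the latest advances in academia and industry, and promptly introduces support for excellent algorithms and models.
  - Provides distributed training support for deep learning models with frameworks such as torch
- Privacy enhancement
- Privacy enhancement:
  - Provides automatic detection of Chinese sensitive data, covering 17 common sensitive fields such as full names, ID card numbers, and personal names;
  - Supports differential privacy, anonymization, and other methods to strengthen the security of synthetic data.
- Easy to extend
  - Supports extending models, data processing, data connectors, and other functionality in the form of plug-in packages
- Easy to extend:
  - Supports extending models, data processing, data connectors, and other functionality in the form of plug-in packages.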

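### 🎉 LLM-integrated synthetic data generation

LLMs have long been used to understand and generate many types of data. In fact, LLMs also perform fairly strongly at tabular data generation, and they have some capabilities that traditional approaches (GAN-based or statistical methods) cannot achieve.

Our `sdgx.models.LLM.single_table.gpt.SingleTableGPTModel` implements two new features:

#### Data synthesis without original records

No original training data is required; synthetic data can be generated from metadata.

Read [the latest docs](https://synthetic-data-generator.readthedocs.io/en/latest/) for more details.
![Synthetic data generation without Data](assets/LLM_Case_1.gif)

#### Off-table feature inference

Infers off-table features, that is, new column data, from the existing data in the table and the knowledge held by the LLM.

![Off-Table feature inference](assets/LLM_Case_2.gif)

## 🔛 Quick Start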

@@ -56,9 +73,15 @@
docker pull idsteam/sdgx:latest
```

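### Local Install (Currently Recommended)
### Install from PyPI

At present, the code of this project is updated quickly, so we recommend using SDG by installing it from source.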
```bash
pip install sdgx
```

### Local Install

You can use SDG by installing it from source.

```bash
git clone git@github.com:hitsz-ids/synthetic-data-generator.git
@@ -67,12 +90,6 @@ pip install .
pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git
```

### Install from PyPI

```bash
pip install sdgx
```

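### Quick Demo of Single-Table Data Synthesis

#### Demo code
@@ -158,18 +175,12 @@ The SDG open source project was initiated by the **Institute of Data Security, Harbin Institute of Technology (Shenzhen)** …
### Papers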

- CTGAN: [Modeling Tabular Data using Conditional GAN](https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html)
- C3-TGAN: [C3-TGAN- Controllable Tabular Data Synthesis with Explicit Correlations and Property Constraints](https://www.researchgate.net/publication/374652636_C3-TGAN-_Controllable_Tabular_Data_Synthesis_with_Explicit_Correlations_and_Property_Constraints)
- TVAE: [Modeling Tabular Data using Conditional GAN](https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html)
- table-GAN: [Data Synthesis based on Generative Adversarial Networks](https://arxiv.org/pdf/1806.03384.pdf)
- CTAB-GAN: [CTAB-GAN: Effective Table Data Synthesizing](https://proceedings.mlr.press/v157/zhao21a/zhao21a.pdf)
- OCT-GAN: [OCT-GAN: Neural ODE-based Conditional Tabular GANs](https://arxiv.org/pdf/2105.14969.pdf)

### Datasets

- [Adult dataset](http://archive.ics.uci.edu/ml/datasets/adult)
- [Satellite dataset](http://archive.ics.uci.edu/dataset/146/statlog+landsat+satellite)
- [Rossmann dataset](https://www.kaggle.com/competitions/rossmann-store-sales/data)
- [Telstra dataset](https://www.kaggle.com/competitions/telstra-recruiting-network/data)

## 📄 License

The SDG open source project is released under the Apache-2.0 license; see [LICENSE](https://github.com/hitsz-ids/synthetic-data-generator/blob/main/LICENSE) for details.
Binary file added assets/LLM_Case_1.gif
Binary file added assets/LLM_Case_2.gif
Binary file modified assets/sdg_logo.png