Merge pull request #21 from hitsz-ids/document-english_version
Update Readme (English Version)
MooooCat committed Sep 21, 2023
2 parents 7fc57e2 + 91c31e4 commit 89b7c07
Showing 4 changed files with 172 additions and 91 deletions.
135 changes: 47 additions & 88 deletions README.md
@@ -1,50 +1,55 @@
# Synthetic Data Generator -- Quickly Generate High-Quality Synthetic Data!
<div align="center">
<img src="docs/sdg_logo.png" width="400" >
</div>

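Synthetic Data Generator (SDG) is a component focused on quickly generating high-quality structured tabular data. It supports more than 10 single-table and multi-table data synthesis algorithms, achieves up to a 120x performance improvement, and supports methods such as differential privacy to strengthen the security of synthetic data.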
# 🚀 Introduction

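Synthetic data is generated by machines from real data and algorithms. It contains no sensitive information, yet preserves the behavioral characteristics of the real data. Synthetic data has no correspondence with real data and is not subject to privacy regulations such as GDPR and ADPPA, so there is no need to worry about privacy leakage in practical applications. High-quality synthetic data can be used in many areas, including secure data opening, model training and debugging, and system development and testing.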
[![License](https://img.shields.io/badge/License-Apache%202-2162A3.svg)](https://www.apache.org/licenses/LICENSE-2.0.html) [![CN doc](https://img.shields.io/badge/文档-中文版-2162A3.svg)](README_ZH_CN.md)

| Important links | |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------- |
| 📖 [Documentation](https://github.com/hitsz-ids/synthetic-data-generator/tree/main/docs) | Project API documentation |
| :octocat: [Repository](https://github.com/hitsz-ids/synthetic-data-generator) | Project GitHub repository |
| 📜 [License](https://github.com/hitsz-ids/synthetic-data-generator/blob/main/LICENSE) | Apache-2.0 license |
| Example 🌰 | Run the SDG example on [AI靶场](https://datai.pcl.ac.cn/) (TBD) |
Synthetic Data Generator (SDG) is a framework focused on quickly generating high-quality structured tabular data. It supports more than 10 single-table and multi-table data synthesis algorithms, achieves up to a 120x performance improvement, and supports differential privacy and other methods to enhance the security of synthetic data.

## Table of Contents
Synthetic data is generated by machines from real data and algorithms. It contains no sensitive information but retains the characteristics of the real data.
There is no correspondence between synthetic data and real data, and synthetic data is not subject to privacy regulations such as GDPR and ADPPA.
In practical applications, there is therefore no need to worry about the risk of privacy leakage.
High-quality synthetic data can also be used in fields such as open data release, model training and debugging, and system development and testing.


- [Quick Start](#快速开始)
- [Features](#主要特性)
- [Algorithm List](#算法列表)
- [Links to Related Papers and Datasets](#相关论文和数据集链接)
- [API](#API)
- [Maintainers](#维护者)
- [How to Contribute](#如何贡献)
- [License](#许可证)
## 🎉 Features

## Quick Start
+ High performance
  + SDG supports a variety of statistical methods, runs up to 120 times faster, and reduces dependence on GPU devices;
  + SDG is optimized for large datasets and consumes less memory than other frameworks or GAN-based algorithms;
  + SDG will continue to track the latest progress in academia and industry, and will introduce and support strong algorithms and models in a timely manner.
+ Rapid deployment in production environments
  + Optimized for real production needs: improved model performance, reduced memory overhead, and support for practical features such as single-machine multi-GPU and multi-machine multi-GPU setups;
  + Provides technologies required in production environments, such as automated deployment, containerization, and automated monitoring and alerting, and supports rapid one-click deployment;
  + Specially optimized for load balancing and fault tolerance to improve availability.
+ Privacy enhancements
  + SDG supports differential privacy, anonymization, and other methods to enhance the security of synthetic data.
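To make the differential privacy item in the feature list more concrete, here is a minimal sketch of the classic Laplace mechanism applied to a counting query. It is only an illustration of the general idea under stated assumptions; it is not SDG's implementation, and the names `laplace_noise` and `private_count` are hypothetical helpers that do not exist in `sdgx`.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw zero-mean Laplace noise with the given scale.

    The difference of two independent exponential variables with rate
    1/scale follows a Laplace(0, scale) distribution.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the items that satisfy `predicate`.

    A counting query changes by at most 1 when one record is added or
    removed (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Toy data for illustration only (hypothetical, not taken from SDG).
rng = random.Random(0)
ages = [27, 27, 25, 46, 45, 43, 44, 23, 45, 25]
noisy = private_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
print(f"noisy count of ages >= 40: {noisy:.2f}")
```

A smaller `epsilon` adds more noise and gives stronger privacy; a larger `epsilon` keeps the answer closer to the true count.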

### Install from PyPI

## 🔛 Quick Start

### Install from PyPI

```bash
pip install sdgx
```
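After installing, it can be useful to confirm that the package is visible to the current interpreter. The sketch below uses only the standard library; `installed_version` is a hypothetical helper introduced here for illustration and is not part of `sdgx`.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("sdgx")
if version is None:
    print("sdgx is not installed; run `pip install sdgx` first")
else:
    print(f"sdgx {version} is installed")
```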

### Quick Example of Single-Table Data Synthesis
### Quick Demo of Single Table Data Generation

```python
# Import the required modules
# Import modules
from sdgx.models.single_table.ctgan import CTGAN
from sdgx.transform.sampler import DataSamplerCTGAN
from sdgx.transform.transformer import DataTransformerCTGAN
from sdgx.utils.io.csv_utils import *

# Read the data
# Read data from demo
demo_data, discrete_cols = get_demo_single_table()
```

A sample of the real data is shown below
The real data looks like this:

```
age workclass fnlwgt ... hours-per-week native-country class
@@ -65,18 +70,18 @@ demo_data, discrete_cols = get_demo_single_table()
```

```python
# Define the model
model = GeneratorCTGAN(epochs=10,\
# Define model
model = CTGAN(epochs=10,\
transformer= DataTransformerCTGAN,\
sampler=DataSamplerCTGAN)
# Train the model
# Model training
model.fit(demo_data, discrete_cols)

# Generate the synthetic data
# Generate synthetic data
sampled_data = model.generate(1000)
```

The synthetic data is shown below
The synthetic data looks like this:

```
age workclass fnlwgt ... hours-per-week native-country class
@@ -92,79 +97,33 @@ sampled_data = model.generate(1000)
9 28 State-gov 837932 ... 99 United-States <=50K
```
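One simple way to sanity-check synthetic output like the table above is to compare how often each category appears in a real column and in the matching synthetic column. The sketch below computes the total variation distance between the two frequency distributions using only the standard library. The helper names and the toy values are assumptions made for illustration; they are not part of `sdgx` and are not real model output.

```python
from collections import Counter


def category_frequencies(values):
    """Map each category to its relative frequency (assumes a non-empty input)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(real, synthetic):
    """Return half the L1 distance between two categorical distributions.

    0.0 means identical frequencies; 1.0 means no category is shared.
    """
    p = category_frequencies(real)
    q = category_frequencies(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Toy values loosely modeled on the `workclass` column (hypothetical data).
real_workclass = ["Private", "Private", "Private", "Local-gov", "Private"]
synthetic_workclass = ["Private", "Self-emp-not-inc", "Private", "Private", "State-gov"]
print(total_variation_distance(real_workclass, synthetic_workclass))
```

If `demo_data` and `sampled_data` are pandas DataFrames, as the printed output suggests, the same function could be called with `demo_data["workclass"]` and `sampled_data["workclass"]`, since a column can be iterated like a list.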

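## Features

+ High performance
  + Supports more than 10 single-table and multi-table data synthesis algorithms and achieves up to a 120x performance improvement;
  + SDG will continue to track the latest progress in academia and industry, and will introduce and support strong algorithms and models in a timely manner.
+ Rapid deployment in production environments
  + Optimized for real production needs: improved model performance, reduced memory overhead, and support for practical features such as single-machine multi-GPU and multi-machine multi-GPU setups;
  + Provides technologies required in production environments, such as automated deployment, containerization, and automated monitoring and alerting, and supports rapid one-click containerized deployment;
  + Specially optimized for load balancing and fault tolerance to improve component availability.
+ Privacy enhancements
  + Provides automatic detection of sensitive data in Chinese, covering 17 common sensitive field types such as names and ID card numbers;
  + Supports differential privacy, anonymization, and other methods to strengthen the security of synthetic data.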

## Algorithm List
## 🤝 Join Community

### Table 1: Comparison of single-table synthesis algorithms (F1-score)
The SDG project was initiated by the **Institute of Data Security, Harbin Institute of Technology**. If you are interested in our project, you are welcome to join our community. We welcome organizations, teams, and individuals who share our commitment to data protection and security through open source:

| Model | Adult (binary classification dataset) (%) | Satellite (multi-class classification dataset) (%) |
| :--------: | :--------------------: | :------------------------: |
| Original dataset | 69.5 | 89.23 |
| CTGAN | 60.38 | 69.43 |
| TVAE | 59.52 | 83.58 |
| table-GAN | 63.29 | 79.15 |
| CTAB-GAN | 58.59 | 79.24 |
| OCT-GAN | 55.18 | 80.98 |
| CorTGAN | **67.13** | **84.27** |
- Submit an issue via [View First Good Issue](https://github.com/hitsz-ids/synthetic-data-generator/issues/new) or submit a Pull Request.
- For development setup, see the [developer documents](./docs/develop/readme.md).

### Table 2: Comparison of multi-table synthesis algorithms
## 👩‍🎓 Related Work

| Model | Rossmann (regression dataset) (rmspe) | Telstra (classification dataset) (mlogloss) |
| :--------: | :-------------------------: | :---------------------------: |
| Original dataset | 0.2217 | 0.5381 |
| SDV | 0.6897 | 1.1719 |
| CWAMT | **0.4348** | **0.818** |

### Links to Related Papers and Datasets

#### Papers
### Research Paper

- CTGAN: [Modeling Tabular Data using Conditional GAN](https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html)
- TVAE: [Modeling Tabular Data using Conditional GAN](https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html)
- table-GAN: [Data Synthesis based on Generative Adversarial Networks](https://arxiv.org/pdf/1806.03384.pdf)
- CTAB-GAN: [CTAB-GAN: Effective Table Data Synthesizing](https://proceedings.mlr.press/v157/zhao21a/zhao21a.pdf)
- OCT-GAN: [OCT-GAN: Neural ODE-based Conditional Tabular GANs](https://arxiv.org/pdf/2105.14969.pdf)
- SDV: [The Synthetic data vault](https://sci-hub.se/10.1109/DSAA.2016.49 "multi-table synthesis")

#### Datasets

- [Adult dataset](http://archive.ics.uci.edu/ml/datasets/adult)
- [Satellite dataset](http://archive.ics.uci.edu/dataset/146/statlog+landsat+satellite)
- [Rossmann dataset](https://www.kaggle.com/competitions/rossmann-store-sales/data)
- [Telstra dataset](https://www.kaggle.com/competitions/telstra-recruiting-network/data)


## API

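For detailed interface parameters, see the [API documentation](https://SDG.readthedocs.io/en/latest/api/index.html) (TBD).

## Maintainers

The SDG open source project was initiated by the **Institute of Data Security, Harbin Institute of Technology (Shenzhen)**. If you are interested in SDG and would like to help improve it, you are welcome to join our open source community.

## How to Contribute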
### Dataset

You are very welcome to join! [Open an issue](https://github.com/hitsz-ids/synthetic-data-generator/issues/new) or submit a Pull Request.
- [Adult](http://archive.ics.uci.edu/ml/datasets/adult)
- [Satellite](http://archive.ics.uci.edu/dataset/146/statlog+landsat+satellite)
- [Rossmann](https://www.kaggle.com/competitions/rossmann-store-sales/data)
- [Telstra](https://www.kaggle.com/competitions/telstra-recruiting-network/data)

For development environment setup, see the [developer documentation](./DEVELOP.md).

## License
## 📄 License

The SDG open source project is released under the Apache-2.0 license. For details, see the [LICENSE](https://github.com/hitsz-ids/synthetic-data-generator/blob/main/LICENSE).
The SDG open source project uses the Apache-2.0 license; please refer to the [LICENSE](https://github.com/hitsz-ids/synthetic-data-generator/blob/main/LICENSE).

[Documentation]: https://sgd.github.io/
[Repository]: https://github.com/hitsz-ids/synthetic-data-generator
[License]: https://github.com/hitsz-ids/synthetic-data-generator/blob/main/LICENSE
[AI靶场]: https://datai.pcl.ac.cn/
125 changes: 125 additions & 0 deletions README_ZH_CN.md
@@ -0,0 +1,125 @@
<div align="center">
<img src="docs/sdg_logo.png" width="400" >
</div>

# 🚀 Synthetic Data Generator -- Quickly Generate High-Quality Synthetic Data!

[![License](https://img.shields.io/badge/License-Apache%202-2162A3.svg)](https://www.apache.org/licenses/LICENSE-2.0.html) [![CN doc](https://img.shields.io/badge/Doc-English-2162A3.svg)](README.md)

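Synthetic Data Generator (SDG) is a component focused on quickly generating high-quality structured tabular data. It supports more than 10 single-table and multi-table data synthesis algorithms, achieves up to a 120x performance improvement, and supports methods such as differential privacy to strengthen the security of synthetic data.

Synthetic data is generated by machines from real data and algorithms. It contains no sensitive information, yet preserves the behavioral characteristics of the real data. Synthetic data has no correspondence with real data and is not subject to privacy regulations such as GDPR and ADPPA, so there is no need to worry about privacy leakage in practical applications. High-quality synthetic data can be used in many areas, including secure data opening, model training and debugging, and system development and testing.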

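## 🎉 Features

+ High performance
  + Supports a variety of statistical data synthesis algorithms, achieves up to a 120x performance improvement, and does not require GPU devices;
  + Optimized for big-data scenarios, effectively reducing memory consumption;
  + Continuously tracks the latest progress in academia and industry, and introduces and supports strong algorithms and models in a timely manner.
+ Rapid deployment in production environments
  + Optimized for real production needs: improved model performance, reduced memory overhead, and support for practical features such as single-machine multi-GPU and multi-machine multi-GPU setups;
  + Provides technologies required in production environments, such as automated deployment, containerization, and automated monitoring and alerting, and supports rapid one-click containerized deployment;
  + Specially optimized for load balancing and fault tolerance to improve component availability.
+ Privacy enhancements
  + Provides automatic detection of sensitive data in Chinese, covering 17 common sensitive field types such as names and ID card numbers;
  + Supports differential privacy, anonymization, and other methods to strengthen the security of synthetic data.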


## 🔛 Quick Start

### Install from PyPI

```bash
pip install sdgx
```

### Quick Example of Single-Table Data Synthesis

```python
# Import the required modules
from sdgx.models.single_table.ctgan import CTGAN
from sdgx.transform.sampler import DataSamplerCTGAN
from sdgx.transform.transformer import DataTransformerCTGAN
from sdgx.utils.io.csv_utils import *

# Read the data
demo_data, discrete_cols = get_demo_single_table()
```

A sample of the real data is shown below:

```
age workclass fnlwgt ... hours-per-week native-country class
0 27 Private 177119 ... 44 United-States <=50K
1 27 Private 216481 ... 40 United-States <=50K
2 25 Private 256263 ... 40 United-States <=50K
3 46 Private 147640 ... 40 United-States <=50K
4 45 Private 172822 ... 76 United-States >50K
... ... ... ... ... ... ... ...
32556 43 Local-gov 33331 ... 40 United-States >50K
32557 44 Private 98466 ... 35 United-States <=50K
32558 23 Private 45317 ... 40 United-States <=50K
32559 45 Local-gov 215862 ... 45 United-States >50K
32560 25 Private 186925 ... 48 United-States <=50K
[32561 rows x 15 columns]
```

```python
# Define the model
model = CTGAN(epochs=10,\
transformer= DataTransformerCTGAN,\
sampler=DataSamplerCTGAN)
# Train the model
model.fit(demo_data, discrete_cols)

# Generate the synthetic data
sampled_data = model.generate(1000)
```

The synthetic data is shown below:

```
age workclass fnlwgt ... hours-per-week native-country class
0 33 Private 276389 ... 41 United-States >50K
1 33 Self-emp-not-inc 296948 ... 54 United-States <=50K
2 67 Without-pay 266913 ... 51 Columbia <=50K
3 49 Private 423018 ... 41 United-States >50K
4 22 Private 295325 ... 39 United-States >50K
5 63 Private 234140 ... 65 United-States <=50K
6 42 Private 243623 ... 52 United-States <=50K
7 75 Private 247679 ... 41 United-States <=50K
8 79 Private 332237 ... 41 United-States >50K
9 28 State-gov 837932 ... 99 United-States <=50K
```

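## 🤝 How to Contribute

The SDG open source project was initiated by the **Institute of Data Security, Harbin Institute of Technology (Shenzhen)**. If you are interested in SDG and would like to help improve it, you are welcome to join our open source community:

- You are very welcome to join! [Open an issue](https://github.com/hitsz-ids/synthetic-data-generator/issues/new) or submit a Pull Request.
- For development environment setup, see the [developer documentation](./DEVELOP.md).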


## 👩‍🎓 Related Work

### Papers

- CTGAN: [Modeling Tabular Data using Conditional GAN](https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html)
- TVAE: [Modeling Tabular Data using Conditional GAN](https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html)
- table-GAN: [Data Synthesis based on Generative Adversarial Networks](https://arxiv.org/pdf/1806.03384.pdf)
- CTAB-GAN: [CTAB-GAN: Effective Table Data Synthesizing](https://proceedings.mlr.press/v157/zhao21a/zhao21a.pdf)
- OCT-GAN: [OCT-GAN: Neural ODE-based Conditional Tabular GANs](https://arxiv.org/pdf/2105.14969.pdf)

### Datasets

- [Adult dataset](http://archive.ics.uci.edu/ml/datasets/adult)
- [Satellite dataset](http://archive.ics.uci.edu/dataset/146/statlog+landsat+satellite)
- [Rossmann dataset](https://www.kaggle.com/competitions/rossmann-store-sales/data)
- [Telstra dataset](https://www.kaggle.com/competitions/telstra-recruiting-network/data)


## 📄 License

The SDG open source project is released under the Apache-2.0 license. For details, see the [LICENSE](https://github.com/hitsz-ids/synthetic-data-generator/blob/main/LICENSE).
3 changes: 0 additions & 3 deletions docs/readme.md
@@ -161,6 +161,3 @@ sampled = model.generate(num_rows=10)
[10 rows x 10 columns]}}
```

## API

Besides the Python component, SDG can also be called through a RESTful interface. For detailed interface parameters, see the [API documentation](https://SDG.readthedocs.io/en/latest/api/index.html).
Binary file added docs/sdg_logo.png