Breaking changes: Refactoring of new architecture #56

Merged: 20 commits, Dec 11, 2023
35 changes: 35 additions & 0 deletions .github/workflows/extension.yml
@@ -0,0 +1,35 @@
+name: Testing example extensions
+
+on:
+  push:
+    branches: ["master", "main"]
+  pull_request:
+    branches: ["master", "main"]
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.8", "3.11"]
+
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+          cache: 'pip'
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install -e .[test]
+      - name: Install all packages in example/extension
+        run: |
+          python -m pip install -e example/extension/dummydataconnector[test]
+          python -m pip install -e example/extension/dummydataprocessor[test]
+          python -m pip install -e example/extension/dummymodel[test]
+      - name: Test with pytest
+        run: |
+          pytest -vv example/extension
5 changes: 3 additions & 2 deletions .github/workflows/python-package.yml
@@ -17,16 +17,17 @@ jobs:
     steps:
       - uses: actions/checkout@v3
       - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v3
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
+          cache: 'pip'
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
           python -m pip install -e .[test]
       - name: Test with pytest
         run: |
-          pytest -vv --cov=sdgx
+          pytest -vv --cov=sdgx tests
       - name: Install dependencies for building
         run: |
           pip install build twine hatch
2 changes: 2 additions & 0 deletions .gitignore
@@ -250,3 +250,5 @@ cython_debug/
 #.idea/

 # End of https://www.toptal.com/developers/gitignore/api/macos,emacs,python
+
+*.log
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -17,7 +17,7 @@ Pre-commit will automatically format the code before each commit, It can also be
 pre-commit run --all-files
 ```

-Comment style is [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings).
+Comment style follows [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings).

 ## Install Locally with Test Dependencies

@@ -56,4 +56,4 @@ Use [start-docs-host.sh](dev-tools/start-docs-host.sh) to deploy a local http se
 cd ./dev-tools && ./start-docs-host.sh
 ```

-Access `http://localhost:8080` for docs.
+Access `http://localhost:8910` for docs.
15 changes: 7 additions & 8 deletions README.md
@@ -22,7 +22,7 @@
 </p>
 </div>

-Synthetic Data Generator (SDG) is a framework focused on quickly generating high-quality structured tabular data. It supports more than 10 single-table and multi-table data synthesis algorithms, achieving up to 120 times performance improvement, and supports differential privacy and other methods to enhance the security of synthesized data.
+Synthetic Data Generator (SDG) is a framework focused on quickly generating high-quality structured tabular data. It supports many single-table and multi-table data synthesis algorithms, achieving up to 120 times performance improvement, and supports differential privacy and other methods to enhance the security of synthesized data.

 Synthetic data is generated by machines based on real data and algorithms, it does not contain sensitive information, but can retain the characteristics of real data.
 There is no correspondence between synthetic data and real data, and it is not subject to privacy regulations such as GDPR and ADPPA.
@@ -32,15 +32,14 @@ High-quality synthetic data can also be used in various fields such as data open
 ## 🎉 Features

 - high performance
-  - SDG supports a variety of statistical methods, achieving up to 120 times faster execution speed, and reduces dependence on GPU devices;
-  - SDG is optimized for large dataset, consumes less memory than other frameworks or GAN-based algorithms;
-  - SDG will continue to track the latest progress in academia and industry, and introduce and support excellent algorithms and models in a timely manner.
-- Rapid deployment in production environment
-  - Optimize for actual production needs, improve model performance, reduce memory overhead, and support practical features such as single machine multiple cards, multiple machines multiple cards;
-  - Provide technologies required for production environments such as automated deployment, containerization, automated monitoring and alarming, and support rapid one-key deployment;
-  - Specially optimized for load balancing and fault tolerance to improve high availability.
+  - Supports a wide range of statistical data synthesis algorithms to achieve up to 120x performance improvement, without the need for GPU devices;
+  - Optimised for big data scenarios, effectively reducing memory consumption;
+  - Continuously tracking the latest advances in academia and industry, and introducing support for excellent algorithms and models in a timely manner.
+  - Provide distributed training support for deep learning models with frameworks such as torch.
 - Privacy enhancements:
   - SDG supports differential privacy, anonymization and other methods to enhance the security of synthetic data.
+- Easy to Extend
+  - Supports expansion of models, data processing, data connectors, etc. in the form of plug-in packages

 Read [the latest API docs](https://synthetic-data-generator.readthedocs.io/en/latest/) for more details.

11 changes: 5 additions & 6 deletions README_ZH_CN.md
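@@ -23,7 +23,7 @@
 </p>
 </div>

-Synthetic Data Generator (SDG) is a component focused on quickly generating high-quality structured tabular data. It supports more than 10 single-table and multi-table data synthesis algorithms, achieves up to a 120x performance improvement, and supports differential privacy and other methods to strengthen the security of synthetic data.
+Synthetic Data Generator (SDG) is a component focused on quickly generating high-quality structured tabular data. It supports a variety of single-table and multi-table data synthesis algorithms, achieves up to a 120x performance improvement, and supports differential privacy and other methods to strengthen the security of synthetic data.

 Synthetic data is generated by machines from real data and algorithms. It contains no sensitive information but retains the behavioral characteristics of the real data. Synthetic data has no correspondence with real data, is not subject to privacy regulations such as GDPR and ADPPA, and can be used in practice without concern about privacy leakage. High-quality synthetic data can be used in many fields, such as secure data sharing, model training and debugging, and system development and testing.

@@ -33,13 +33,12 @@
   - Supports a variety of statistical data synthesis algorithms, achieving up to a 120x performance improvement, with no GPU required;
   - Optimized for big-data scenarios, effectively reducing memory consumption;
   - Continuously tracks the latest progress in academia and industry, and promptly introduces support for outstanding algorithms and models.
-- Rapid deployment in production environments
-  - Optimized for real production needs: improves model performance, reduces memory overhead, and supports practical features such as single-machine multi-GPU and multi-machine multi-GPU training;
-  - Provides technologies required in production environments, such as automated deployment, containerization, and automated monitoring and alerting, and supports rapid one-click containerized deployment;
-  - Specifically optimized for load balancing and fault tolerance to improve component availability.
-- Privacy enhancements:
+  - Provides distributed training support for deep learning models with frameworks such as torch
+- Privacy enhancements
   - Provides automatic recognition of Chinese sensitive data, covering 17 common sensitive field types such as names, ID card numbers, and personal names;
   - Supports differential privacy, anonymization, and other methods to strengthen the security of synthetic data.
+- Easy to extend
+  - Supports extending models, data processing, data connectors, and other functionality in the form of plug-in packages

 Read [the latest documentation](https://synthetic-data-generator.readthedocs.io/en/latest/) for more details.
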
2 changes: 1 addition & 1 deletion dev-tools/start-docs-host.sh
@@ -6,6 +6,6 @@
 set -e
 docker run --rm \
   -it \
-  -p 8080:80 \
+  -p 8910:80 \
   -v $(pwd)/../docs/build/html:/usr/share/nginx/html:ro \
   nginx
2 changes: 1 addition & 1 deletion docs/README.md
@@ -10,7 +10,7 @@
 # Import the relevant modules
 from sdgx.models.single_table.ctgan import CTGAN
 from sdgx.data_process.sampling.sampler import DataSamplerCTGAN
-from sdgx.data_process.transform.transform import DataTransformer
+from sdgx.data_processors.transformers.transform import DataTransformer
 from sdgx.utils.io.csv_utils import *

 # Read the data
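The hunk above moves `DataTransformer` from `sdgx.data_process.transform.transform` to `sdgx.data_processors.transformers.transform`, so code written against the old layout stops importing. Below is a minimal sketch of a fallback import for downstream code that has to run against both layouts. The helper name `import_first` is hypothetical and not part of sdgx; only the two module paths come from this diff.

```python
import importlib


def import_first(attr, *module_paths):
    """Return `attr` from the first listed module that imports and defines it.

    Hypothetical helper for illustration only; it is not part of sdgx.
    """
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            continue  # this layout is not installed; try the next path
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError(f"{attr!r} not found in any of: {', '.join(module_paths)}")


# Intended use (left commented out because sdgx may not be installed):
# DataTransformer = import_first(
#     "DataTransformer",
#     "sdgx.data_processors.transformers.transform",  # path after this PR
#     "sdgx.data_process.transform.transform",        # path before this PR
# )
```

Listing the new path first keeps the common case to a single import attempt; the old path is only tried when the new one is missing.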
12 changes: 6 additions & 6 deletions docs/develop/single_table_GAN.md
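@@ -7,7 +7,7 @@
 To develop a module, follow these 5 steps:

 1. Decide which type of module you need to develop;
-1. The algorithm module must inherit from the `BaseSynthesizerModel` class and implement several required functions;
+1. The algorithm module must inherit from the `SynthesizerModel` class and implement several required functions;
 1. Define the `Discriminator` class required by the model (optional);
 1. Define the `Generator` class required by the model (optional);
 1. Install and test your model locally.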
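Step 2 of the list above asks an algorithm module to inherit from `SynthesizerModel` and implement several required functions, but this diff does not show which ones. The sketch below only illustrates the shape of that contract and rests on stated assumptions: `SynthesizerModelStandIn` is a hypothetical stand-in for the real base class, the method names `fit` and `sample` are assumed rather than taken from sdgx, and the toy model is a simple per-column resampler, not a GAN.

```python
import random
from abc import ABC, abstractmethod


class SynthesizerModelStandIn(ABC):
    """Hypothetical stand-in for sdgx.models.base.SynthesizerModel.

    The method names below are assumptions made for this sketch.
    """

    @abstractmethod
    def fit(self, rows):
        """Learn from a list of dict rows."""

    @abstractmethod
    def sample(self, count):
        """Return `count` synthetic rows."""


class ColumnResampler(SynthesizerModelStandIn):
    """Toy model: draws every column independently from the training values."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._columns = {}

    def fit(self, rows):
        if not rows:
            raise ValueError("fit() needs at least one row")
        # Store the observed values of each column separately.
        self._columns = {key: [row[key] for row in rows] for key in rows[0]}

    def sample(self, count):
        if not self._columns:
            raise RuntimeError("call fit() before sample()")
        return [
            {key: self._rng.choice(values) for key, values in self._columns.items()}
            for _ in range(count)
        ]


model = ColumnResampler(seed=1)
model.fit([{"age": 31, "city": "A"}, {"age": 45, "city": "B"}])
synthetic_rows = model.sample(3)
```

Because the base class is abstract, a subclass that omits one of the required methods fails at instantiation time rather than later during training.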
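@@ -30,13 +30,13 @@

 ## Step 2: Define your model class

-Broadly speaking, defining an algorithm module means inheriting from the `BaseSynthesizerModel` base class and implementing several required functions; the result is your own model module.
+Broadly speaking, defining an algorithm module means inheriting from the `SynthesizerModel` base class and implementing several required functions; the result is your own model module.

 The specific steps are as follows:

 1. Create a Python script named xxx.py in the [single_table directory](../../sdgx/models/single_table/), where xxx is the module you intend to develop.

-1. Inherit from the `BaseSynthesizerModel` base class.
+1. Inherit from the `SynthesizerModel` base class.

    - First import the base class from `sdgx/models/base.py`, along with any other required Python packages, for example:

@@ -56,15 +56,15 @@
     Sequential,
     functional,
 )
-from sdgx.models.base import BaseSynthesizerModel
+from sdgx.models.base import SynthesizerModel
 from sdgx.data_process.sampling.sampler import DataSamplerCTGAN
-from sdgx.data_process.transform.transform import DataTransformer
+from sdgx.data_processors.transformers.transform import DataTransformer
 ```

    - Complete the `__init__` function in your module and define the corresponding class variables, taking CTGAN as an example:

 ```python
-class CTGAN(BaseSynthesizerModel):
+class CTGAN(SynthesizerModel):
     def __init__(
         self,
         embedding_dim=128,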