hitsz-ids · MooooCat · Dec 11, 2023 · Dec 8, 2023 · Dec 8, 2023 · Dec 8, 2023
diff --git a/.github/workflows/extension.yml b/.github/workflows/extension.yml
@@ -0,0 +1,35 @@
+name: Testing example extensions
+
+on:
+  push:
+    branches: ["master", "main"]
+  pull_request:
+    branches: ["master", "main"]
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.8", "3.11"]
+
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+          cache: 'pip'
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install -e .[test]
+      - name: Install all packages in example/extension
+        run: |
+          python -m pip install -e example/extension/dummydataconnector[test]
+          python -m pip install -e example/extension/dummydataprocessor[test]
+          python -m pip install -e example/extension/dummymodel[test]
+      - name: Test with pytest
+        run: |
+          pytest -vv example/extension
diff --git a/.github/workflows/python-package.yml b/.github/workflows/python-package.yml
@@ -17,16 +17,17 @@ jobs:
     steps:
       - uses: actions/checkout@v3
       - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v3
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
+          cache: 'pip'
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
           python -m pip install -e .[test]
       - name: Test with pytest
         run: |
-          pytest -vv --cov=sdgx
+          pytest -vv --cov=sdgx tests
       - name: Install dependencies for building
         run: |
           pip install build twine hatch

diff --git a/.gitignore b/.gitignore
@@ -250,3 +250,5 @@ cython_debug/
 #.idea/
 
 # End of https://www.toptal.com/developers/gitignore/api/macos,emacs,python
+
+*.log
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -17,7 +17,7 @@ Pre-commit will automatically format the code before each commit, It can also be
 pre-commit run --all-files
 ```
 
-Comment style is [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings).
+Comment style follows the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings).
 
 ## Install Locally with Test Dependencies
 
@@ -56,4 +56,4 @@ Use [start-docs-host.sh](dev-tools/start-docs-host.sh) to deploy a local http se
 cd ./dev-tools && ./start-docs-host.sh
 ```
 
-Access `http://localhost:8080` for docs.
+Access `http://localhost:8910` for docs.
diff --git a/README.md b/README.md
@@ -22,7 +22,7 @@
 </p>
 </div>
 
-Synthetic Data Generator (SDG) is a framework focused on quickly generating high-quality structured tabular data. It supports more than 10 single-table and multi-table data synthesis algorithms, achieving up to 120 times performance improvement, and supports differential privacy and other methods to enhance the security of synthesized data.
+Synthetic Data Generator (SDG) is a framework focused on quickly generating high-quality structured tabular data. It supports many single-table and multi-table data synthesis algorithms, achieves up to a 120x performance improvement, and supports differential privacy and other methods to enhance the security of synthesized data.
 
 Synthetic data is generated by machines based on real data and algorithms, it does not contain sensitive information, but can retain the characteristics of real data.
 There is no correspondence between synthetic data and real data, and it is not subject to privacy regulations such as GDPR and ADPPA.
@@ -32,15 +32,14 @@ High-quality synthetic data can also be used in various fields such as data open
 ## 🎉 Features
 
 - high performance
-  - SDG supports a variety of statistical methods, achieving up to 120 times faster execution speed, and reduces dependence on GPU devices;
-  - SDG is optimized for large dataset, consumes less memory than other frameworks or GAN-based algorithms;
-  - SDG will continue to track the latest progress in academia and industry, and introduce and support excellent algorithms and models in a timely manner.
-- Rapid deployment in production environment
-  - Optimize for actual production needs, improve model performance, reduce memory overhead, and support practical features such as single machine multiple cards, multiple machines multiple cards;
-  - Provide technologies required for production environments such as automated deployment, containerization, automated monitoring and alarming, and support rapid one-key deployment;
-  - Specially optimized for load balancing and fault tolerance to improve high availability.
+  - Supports a wide range of statistical data synthesis algorithms, achieving up to a 120x performance improvement without requiring GPU devices;
+  - Optimised for big data scenarios, effectively reducing memory consumption;
+  - Continuously tracks the latest advances in academia and industry, and introduces support for excellent algorithms and models in a timely manner;
+  - Provides distributed training support for deep learning models with frameworks such as torch.
 - Privacy enhancements:
   - SDG supports differential privacy, anonymization and other methods to enhance the security of synthetic data.
+- Easy to extend
+  - Supports extending models, data processing, data connectors, etc. through plug-in packages.
 
 Read [the latest API docs](https://synthetic-data-generator.readthedocs.io/en/latest/) for more details.
 

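The "Easy to extend" item in the README change above mentions plug-in packages. As a rough illustration only, the sketch below shows one common way a Python application can detect installed plug-in packages, namely entry points declared in package metadata. It is a generic example and not a confirmed description of how SDG loads its extensions; the helper `discover_plugins` and the group name `example.plugins` are hypothetical.

```python
from importlib.metadata import entry_points


def discover_plugins(group):
    """Return {name: EntryPoint} for every installed entry point in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points can be filtered with select().
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


if __name__ == "__main__":
    # "example.plugins" is a made-up group name used only for illustration.
    for name, ep in discover_plugins("example.plugins").items():
        print(name, "->", ep.value)
```

The version check covers both Python versions in the CI matrix (3.8 and 3.11), since the return type of `entry_points()` changed between them.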
diff --git a/README_ZH_CN.md b/README_ZH_CN.md
@@ -23,7 +23,7 @@
 </p>
 </div>
 
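-Synthetic Data Generator (SDG) is a component focused on quickly generating high-quality structured tabular data. It supports more than 10 single-table and multi-table data synthesis algorithms, achieves up to a 120x performance improvement, and supports differential privacy and other methods to strengthen the security of synthetic data.
+Synthetic Data Generator (SDG) is a component focused on quickly generating high-quality structured tabular data. It supports a variety of single-table and multi-table data synthesis algorithms, achieves up to a 120x performance improvement, and supports differential privacy and other methods to strengthen the security of synthetic data.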
 
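 Synthetic data is generated by machines from real data and algorithms. It contains no sensitive information, yet retains the behavioral characteristics of the real data. Synthetic data has no correspondence with real data and is not subject to privacy regulations such as GDPR and ADPPA, so there is no need to worry about privacy leakage in practical applications. High-quality synthetic data can be used in many fields, such as secure data sharing, model training and debugging, and system development and testing.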
 
@@ -33,13 +33,12 @@
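   - Supports a variety of statistical data synthesis algorithms, achieving up to a 120x performance improvement without requiring GPU devices;
   - Optimized for big data scenarios, effectively reducing memory consumption;
   - Continuously tracks the latest advances in academia and industry, and promptly introduces support for excellent algorithms and models.
-- Rapid deployment in production environments
-  - Optimized for real production needs, improving model performance, reducing memory overhead, and supporting practical features such as single-machine multi-GPU and multi-machine multi-GPU setups;
-  - Provides technologies required in production environments, such as automated deployment, containerization, and automated monitoring and alerting, and supports fast one-click containerized deployment;
-  - Specially optimized for load balancing and fault tolerance to improve component availability.
-- Privacy enhancement:
+  - Provides distributed training support for deep learning models with frameworks such as torch
+- Privacy enhancement
   - Provides automatic recognition of Chinese sensitive data, covering 17 common sensitive fields such as full names, ID card numbers, and personal names;
   - Supports differential privacy, anonymization, and other methods to strengthen the security of synthetic data.
+- Easy to extend
+  - Supports extending models, data processing, data connectors, and other functionality through plug-in packages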
 
 Read [the latest docs](https://synthetic-data-generator.readthedocs.io/en/latest/) for more details.
 

diff --git a/dev-tools/start-docs-host.sh b/dev-tools/start-docs-host.sh
@@ -6,6 +6,6 @@
 set -e
 docker run --rm \
 -it \
--p 8080:80 \
+-p 8910:80 \
 -v $(pwd)/../docs/build/html:/usr/share/nginx/html:ro \
 nginx
diff --git a/docs/README.md b/docs/README.md
@@ -10,7 +10,7 @@
 # Import the relevant modules
 from sdgx.models.single_table.ctgan import CTGAN
 from sdgx.data_process.sampling.sampler import DataSamplerCTGAN
-from sdgx.data_process.transform.transform import DataTransformer
+from sdgx.data_processors.transformers.transform import DataTransformer
 from sdgx.utils.io.csv_utils import *
 
 # Read the data

diff --git a/docs/develop/single_table_GAN.md b/docs/develop/single_table_GAN.md
@@ -7,7 +7,7 @@
 Developing a module takes the following 5 steps:
 
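 1. Determine the type of module to develop;
-1. The algorithm module must inherit from the `BaseSynthesizerModel` class and implement several specified functions;
+1. The algorithm module must inherit from the `SynthesizerModel` class and implement several specified functions;
 1. Define the `Discriminator` class the model requires (optional);
 1. Define the `Generator` class the model requires (optional);
 1. Install and test your model locally.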
@@ -30,13 +30,13 @@
 
 ## Step 2: Define your model class
 
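-Broadly speaking, to define an algorithm module you inherit from the `BaseSynthesizerModel` base class and implement several specified functions; the result is your own model module.
+Broadly speaking, to define an algorithm module you inherit from the `SynthesizerModel` base class and implement several specified functions; the result is your own model module.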
 
 The specific steps are as follows:
 
 1. In the [single_table directory](../../sdgx/models/single_table/), create a Python script file named xxx.py, where xxx is the module you intend to develop.
 
-1. Inherit from the `BaseSynthesizerModel` base class.
+1. Inherit from the `SynthesizerModel` base class.
 
    - First import the base class from `sdgx/models/base.py`, along with any other required Python packages, for example:
 
@@ -56,15 +56,15 @@
              Sequential,
              functional,
          )
-         from sdgx.models.base import BaseSynthesizerModel
+         from sdgx.models.base import SynthesizerModel
          from sdgx.data_process.sampling.sampler import DataSamplerCTGAN
-         from sdgx.data_process.transform.transform import DataTransformer
+         from sdgx.data_processors.transformers.transform import DataTransformer
      ```
 
    - Implement the `__init__` function of your module and define the corresponding class variables, taking CTGAN as an example:
 
      ```python
-       class CTGAN(BaseSynthesizerModel):
+       class CTGAN(SynthesizerModel):
            def __init__(
                self,
                embedding_dim=128,