Add Regex Inspector and Email Inspector example. #115

Merged
merged 37 commits into from
Jan 29, 2024
Changes from 13 commits
eaaac1e  add InspectorInitError  (MooooCat, Jan 18, 2024)
b88689b  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot], Jan 18, 2024)
bec9bec  [Sweep GHA Fix] The GitHub Actions run failed with... (#116)  (sweep-ai[bot], Jan 18, 2024)
058b810  add regex inspector (still draft)  (MooooCat, Jan 18, 2024)
eaaae27  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot], Jan 18, 2024)
2b4daac  Merge branch 'main' into feature-regex-inspector  (MooooCat, Jan 19, 2024)
d8e8ab1  add regex base inspector  (MooooCat, Jan 19, 2024)
6843762  add some personal info inspector  (MooooCat, Jan 19, 2024)
c4f5c61  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot], Jan 19, 2024)
3686323  fix hookimpl  (MooooCat, Jan 19, 2024)
1de7d6a  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot], Jan 19, 2024)
b1df970  Merge branch 'main' into feature-regex-inspector  (MooooCat, Jan 20, 2024)
4bfde14  Merge branch 'main' into feature-regex-inspector  (MooooCat, Jan 20, 2024)
3223ca9  add _inspect_level  (MooooCat, Jan 22, 2024)
bd69436  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot], Jan 22, 2024)
2deef03  discard weird change from sweep  (MooooCat, Jan 22, 2024)
de216e5  fix typo in sweep commits  (MooooCat, Jan 22, 2024)
b0404ef  add PII attribute  (MooooCat, Jan 22, 2024)
f752406  add email test case (still draft)  (MooooCat, Jan 22, 2024)
4e8d1b6  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot], Jan 22, 2024)
73fce1b  add personal info inspector  (MooooCat, Jan 23, 2024)
1248f11  add test cases (still draft)  (MooooCat, Jan 23, 2024)
91bf735  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot], Jan 23, 2024)
861b9a0  add domain_verification  (MooooCat, Jan 24, 2024)
10f27b2  update localized inspectors  (MooooCat, Jan 24, 2024)
b1cad2d  add test cases  (MooooCat, Jan 24, 2024)
212eb98  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot], Jan 24, 2024)
945b548  discard sweep change  (MooooCat, Jan 24, 2024)
bc1d724  fix col name typo  (MooooCat, Jan 24, 2024)
9ebe668  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot], Jan 24, 2024)
0c9d08b  fix expection type  (MooooCat, Jan 24, 2024)
17c117d  Merge branch 'main' into feature-regex-inspector  (MooooCat, Jan 24, 2024)
f78a710  add corner test cases  (MooooCat, Jan 24, 2024)
7e064b9  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot], Jan 24, 2024)
62b576e  fix version typo  (MooooCat, Jan 26, 2024)
1166c32  add inspector manager testcase  (MooooCat, Jan 26, 2024)
378d0a9  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot], Jan 26, 2024)
4 changes: 2 additions & 2 deletions LICENSE
MooooCat marked this conversation as resolved.
@@ -1,4 +1,4 @@
-Apache License
+Apache License, Version 2.0
Version 2.0, January 2004
http://www.apache.org/licenses/

@@ -188,7 +188,7 @@

Copyright 2023 hitsz-ids

-Licensed under the Apache License, Version 2.0 (the "License");
+Licensed under the Apache License, Version 2.0, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

8 changes: 4 additions & 4 deletions docs/source/design/motivation.rst
@@ -58,11 +58,11 @@ What can we do?
To address these challenges, we intend to design a new system that is efficient, scalable,
and capable of simulating databases at the scale of tens of millions.
This new system will be designed with a flexible architecture
-that can easily incorporate additional algorithms and support different types of data.
-Furthermore, it will be licensed under the `Apache 2.0 license <https://github.com/hitsz-ids/synthetic-data-generator/blob/main/LICENSE>`_,
-which allows for greater freedom in terms of modifications and derivative works.
+that can easily incorporate additional algorithms, support different types of data, and provide efficient scalability.
+Furthermore, it will be now licensed under the Apache License, Version 2.0
+which allows for greater freedom in terms of modifications and derivative works and is more permissive for open source contributions.

-By developing this new system, we aim to advance the research and development of synthetic data,
+In response to these challenges, we aim to design a new system, we aim to advance the research and development of synthetic data,
providing a more robust and flexible tool for data scientists and researchers.
This will not only enhance the quality and representativeness of synthetic data but also promote its use
in a wider range of applications,
5 changes: 3 additions & 2 deletions docs/source/developer_guides/extension/index.rst
@@ -10,7 +10,7 @@ which is based on the `entry-points of Python project <https://packaging.python.

A plugin project is made up of three parts:

-- A class, inherits from the ``register_type`` of :ref:`Manager <manager>`, which contains your own logic.
+- A class that inherits from the ``register_type`` of :ref:`Manager <manager>`, containing your own logic.
- A register function, whose name is defined (decorated) by ``@hookspec``.
You need to implement it and use ``@hookimpl`` to declare it as a registered hook.
- An ``entry-points`` entry in ``pyproject.toml``, which points to the hookimpl function. The subdomain of the entry-point
@@ -26,6 +26,7 @@ Plugin-supported modules
- :ref:`API Reference for extended Data Connector <api_reference/data-connectors-extension>`:
:ref:`Data Connector <Data Connector>` is used to connect to data sources.
- :ref:`API Reference for extended Cacher for DataLoader <api_reference/cachers-extension>`:
+:ref:`Cacher <Cacher>` is used for improving performance, reducing network overhead, and supporting large datasets.:
:ref:`Cacher <Cacher>` is used for improving performance,
reducing network overhead and support large datasets.
- :ref:`API Reference for extended Data Processor <api_reference/data-processors-extension>`:
@@ -34,7 +35,7 @@ Plugin-supported modules
- :ref:`API Reference for extended Inspector for Metadata <api_reference/data-models-inspectors-extension>`:
:ref:`Inspector <Inspector>` is used to extract metadata such as patterns, types, etc. from raw data.
- :ref:`API Reference for extended Model <api_reference/models-extension>`:
-:ref:`Model <SynthesizerModel>`, the model fitted by processed data and used to generate synthetic data.
+:ref:`Model <SynthesizerModel>`: The model fitted by processed data and used to generate synthetic data., the model fitted by processed data and used to generate synthetic data.
- :ref:`API Reference for extended Data Exporter <api_reference/data-exporters-extension>`:
:ref:`Data Exporter <Data Exporter>` is used to export data to an external destination.
Use it from the CLI or as a library to save your processed data or synthetic data.
28 changes: 28 additions & 0 deletions sdgx/data_models/inspectors/personal.py
@@ -0,0 +1,28 @@
from sdgx.data_models.inspectors.extension import hookimpl
from sdgx.data_models.inspectors.regex import RegexInspector


class EmailInspector(RegexInspector):
    pattern = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
    data_type_name = "email"


class ChinaMainlandIDInspector(RegexInspector):
    # 18 characters: 6-digit region code, 8-digit birth date, 3-digit sequence, check character
    pattern = r"^[1-9]\d{5}(18|19|20)\d{2}((0[1-9])|(10|11|12))(([0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]$"

    data_type_name = "china_mainland_id"


class ChinaMainlandMobilePhoneInspector(RegexInspector):
    # 11 digits, starting with 1 followed by a digit in 3-9
    pattern = r"^1[3-9]\d{9}$"

    data_type_name = "china_mainland_mobile_phone"


@hookimpl
def register(manager):
    manager.register("EmailInspector", EmailInspector)

    manager.register("ChinaMainlandIDInspector", ChinaMainlandIDInspector)

    manager.register("ChinaMainlandMobilePhoneInspector", ChinaMainlandMobilePhoneInspector)
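Two pitfalls are worth checking when writing patterns for regex-based inspectors like these. Inside a raw string, `\\d` denotes a literal backslash followed by the letter `d` rather than a digit, and an empty alternative (a trailing `|`) lets `re.match` succeed on any input. The sketch below uses only the standard `re` module; the helper name `matches` and the sample values are made up for illustration and are not part of sdgx.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Return True when re.match finds a match at the start of value."""
    return re.match(pattern, value) is not None


# In a raw string, r"\d" is the digit class ...
digit = r"^\d{3}$"
# ... while r"\\d" is a literal backslash followed by the letter d.
escaped = r"^\\d{3}$"

# A trailing "|" adds an empty alternative, which matches at the start of any string.
empty_alternative = r"(^\d{3}$)|"

# Mobile-phone pattern as it appears in personal.py.
mobile = r"^1[3-9]\d{9}$"

print(matches(digit, "123"))                       # digit class matches three digits
print(matches(escaped, "123"))                     # literal backslash is required, so no match
print(matches(empty_alternative, "not a number"))  # empty alternative always matches
print(matches(mobile, "13812345678"))              # well-formed sample number
print(matches(mobile, "12812345678"))              # second digit outside 3-9
```

Because a match rate is computed per column, a pattern with an empty alternative would report every column as a full match, so it is worth testing each pattern against a few values that should be rejected.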
113 changes: 113 additions & 0 deletions sdgx/data_models/inspectors/regex.py
@@ -0,0 +1,113 @@
from __future__ import annotations

import re
from typing import Any

import pandas as pd

from sdgx.data_models.inspectors.base import Inspector
from sdgx.exceptions import InspectorInitError

# By default, RegexInspector is not registered directly with the Inspector Manager.
# Instead, use it as a base class or with a user-defined regex, then register it with the Inspector Manager or use it alone.


class RegexInspector(Inspector):
    """RegexInspector

    RegexInspector is an sdgx inspector that uses regular expression rules to detect column data types. It can be initialized with a custom expression, or it can be subclassed for specific data types, such as email, US address, HKID, etc.
    """

    pattern: str = None
    """
    pattern is the regular expression string of the current inspector.
    """

    data_type_name: str = None
    """
    data_type_name is the name of the data type, such as email, US address, HKID, etc.
    """

    _match_percentage: float = 0.8
    """
    match_percentage should be > 0.5 and <= 1.

    Because a column may contain empty or malformed values, match_percentage is the fraction of values that must match the regular expression. When the matched fraction of a column is higher than this ratio, the column is considered to fit the current data type.
    """

    @property
    def match_percentage(self):
        return self._match_percentage

    @match_percentage.setter
    def match_percentage(self, value):
        if value > 0.5 and value <= 1:
            self._match_percentage = value
        else:
            raise InspectorInitError("The match_percentage should be set in (0.5, 1].")

    def __init__(
        self,
        pattern: str = None,
        data_type_name: str = None,
        match_percentage: float = None,
        *args,
        **kwargs,
    ):
        super().__init__(*args, **kwargs)
        self.regex_columns: set[str] = set()

        # the pattern should be a valid `re` pattern string
        if pattern:
            self.pattern = pattern
        # check that a pattern came from either the argument or the class attribute
        if self.pattern is None:
            raise InspectorInitError("Regular expression NOT found.")
        self.p = re.compile(self.pattern)

        # set data_type_name; inspect() appends the "_columns" suffix itself
        if data_type_name:
            if data_type_name.endswith("_columns"):
                self.data_type_name = data_type_name[:-8]
            else:
                self.data_type_name = data_type_name
        elif not self.data_type_name:
            self.data_type_name = f"regex_{self.pattern}"
        # then check the data type name
        if self.data_type_name is None:
            raise InspectorInitError("Inspector's data type undefined.")

        # set percentage; the setter validates the range
        if match_percentage is not None:
            self.match_percentage = match_percentage

    def fit(self, raw_data: pd.DataFrame, *args, **kwargs):
        """Fit the inspector.

        Finds the list of regex columns from the raw data.

        Args:
            raw_data (pd.DataFrame): Raw data
        """
        for each_col in raw_data.columns:
            each_match_rate = self._fit_column(raw_data[each_col])
            if each_match_rate > self.match_percentage:
                self.regex_columns.add(each_col)

        self.ready = True

    def _fit_column(self, column_data: pd.Series):
        """
        Regular expression matching for a single column, returning the matching ratio.
        """
        length = len(column_data)
        # an empty column cannot match; this also avoids division by zero
        if length == 0:
            return 0.0
        match_cnt = 0
        for i in column_data:
            m = self.p.match(str(i))
            if m:
                match_cnt += 1
        return match_cnt / length

    def inspect(self, *args, **kwargs) -> dict[str, Any]:
        """Inspect raw data and generate metadata."""

        return {self.data_type_name + "_columns": list(self.regex_columns)}
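To make the fit/inspect flow concrete, here is a minimal standalone sketch that mirrors the matching logic of `RegexInspector` without depending on `sdgx` or pandas. The class name `MiniRegexInspector`, the dict-of-lists input, the use of `ValueError` in place of `InspectorInitError`, and the sample data are all assumptions for illustration; they are not part of the library.

```python
from __future__ import annotations

import re
from typing import Any


class MiniRegexInspector:
    """Hypothetical stand-in for RegexInspector, operating on a dict of columns."""

    def __init__(self, pattern: str, data_type_name: str, match_percentage: float = 0.8):
        # Same accepted range as the setter in the diff: (0.5, 1].
        if not 0.5 < match_percentage <= 1:
            raise ValueError("match_percentage should be in (0.5, 1].")
        self.p = re.compile(pattern)
        self.data_type_name = data_type_name
        self.match_percentage = match_percentage
        self.regex_columns: set[str] = set()

    def _fit_column(self, values: list[Any]) -> float:
        """Return the fraction of values whose string form matches the pattern."""
        if not values:
            return 0.0
        return sum(1 for v in values if self.p.match(str(v))) / len(values)

    def fit(self, raw_data: dict[str, list[Any]]) -> None:
        # A column qualifies only when its match rate is strictly above the threshold.
        for name, values in raw_data.items():
            if self._fit_column(values) > self.match_percentage:
                self.regex_columns.add(name)

    def inspect(self) -> dict[str, list[str]]:
        return {self.data_type_name + "_columns": sorted(self.regex_columns)}


# Made-up sample data: five of six "contact" values are email-shaped.
data = {
    "contact": [
        "a@example.com",
        "b@example.org",
        "c@example.net",
        "d@example.com",
        "e@example.org",
        "n/a",
    ],
    "age": [31, 42, 27, 55, 60, 18],
}

inspector = MiniRegexInspector(
    r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$", "email"
)
inspector.fit(data)
result = inspector.inspect()
print(result)
```

Since the comparison is strict, a column in which exactly 80% of the values match would not be selected at the default threshold of 0.8.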
4 changes: 4 additions & 0 deletions sdgx/exceptions.py
@@ -137,3 +137,7 @@ class MetadataCombinerInvalidError(MetadataCombinerError):

class MetadataCombinerInitError(MetadataCombinerError):
ERROR_CODE = 9006


class InspectorInitError(DataModelError):
ERROR_CODE = 9007