[Good first issue | Feature] Add Regex Inspector #94

Closed
MooooCat opened this issue Dec 29, 2023 · 0 comments · Fixed by #115
Labels
difficulty-medium, enhancement (New feature or request), good first issue (Good for newcomers)
Milestone

Comments

@MooooCat
Contributor

🚅Search before asking

I have searched for issues similar to this one.

🚅Description

This inspector determines which columns match a user-supplied regular expression and outputs their names.

Add an Inspector that accepts two parameters:

  • User defined regular expression (string type);
  • Whether it is a PII column (bool type): whether the column contains private information.

🏕Solution(optional)

The content is as follows:

  • Inherit from sdgx.data_models.inspector.base.Inspector and implement the fit method;
  • In the same class, implement the inspect method;
  • Add examples of using this Inspector to infer data types;
  • Add the necessary test cases.

🍰Detail(optional)

For the __init__ method:

  • This method should accept the regular expression as an input parameter;
  • The regular expression should be validated before use (for example, checking that it is a string and that it compiles).
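
A minimal sketch of the __init__ checks described above. The class name RegexInspectorSketch and the argument names pattern, pii and match_rate are hypothetical; the real class would inherit from the sdgx Inspector base class, which is omitted here so the snippet runs on its own.

```python
import re


class RegexInspectorSketch:
    """Hypothetical stand-in; the real class would inherit sdgx's Inspector."""

    def __init__(self, pattern, pii=False, match_rate=0.8):
        # The pattern must be a non-empty string.
        if not isinstance(pattern, str) or not pattern:
            raise ValueError("pattern must be a non-empty string")
        # Compile once up front so an invalid pattern fails early.
        try:
            self.regex = re.compile(pattern)
        except re.error as exc:
            raise ValueError(f"invalid regular expression: {pattern!r}") from exc
        # match_rate is a ratio, so it has to lie in [0, 1].
        if not 0 <= match_rate <= 1:
            raise ValueError("match_rate must be between 0 and 1")
        self.pii = bool(pii)
        self.match_rate = match_rate
```

Compiling in __init__ means a malformed pattern is reported when the inspector is created rather than later, during fit.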

For the fit method, the input parameters should be:

  • raw_data (pd.DataFrame): the input data;
  • match_rate (recommended; default 0.8 or another suitable value): a ratio between 0 and 1. When at least this fraction of the values in a column match the regular expression, the column should appear in the inspect results.
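
The match_rate rule can be sketched as follows. To stay self-contained, this uses a plain dict of column name to list of values instead of a pd.DataFrame. The helper names match_ratio and matching_columns are hypothetical. Using fullmatch and skipping missing values are assumptions too, since the issue does not specify either.

```python
import re


def match_ratio(values, regex):
    """Fraction of non-missing values that fully match the compiled regex."""
    present = [str(v) for v in values if v is not None]
    if not present:
        return 0.0
    hits = sum(1 for v in present if regex.fullmatch(v))
    return hits / len(present)


def matching_columns(columns, pattern, match_rate=0.8):
    """Names of columns whose match ratio reaches match_rate."""
    regex = re.compile(pattern)
    return [
        name
        for name, values in columns.items()
        if match_ratio(values, regex) >= match_rate
    ]


columns = {
    "zip": ["10001", "94105", "n/a", "60601", "73301"],  # 4 of 5 match
    "name": ["Ann", "Bob", "Cy", "Di", "Ed"],            # 0 of 5 match
}
print(matching_columns(columns, r"\d{5}"))  # ['zip']
```

With match_rate raised to 0.9, the "zip" column would no longer qualify, because only 80% of its values match.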

For the inspect method:

  • Like other inspectors, it should output the names of the columns that match the data type inferred by this inspector.
  • It should also output the PII attribute so that it can easily be merged into the metadata.
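
One possible shape for the inspect result is sketched below. The helper name build_inspect_result and the dictionary keys "regex_columns" and "pii_columns" are assumptions made for illustration; the issue does not define the exact keys that Metadata expects.

```python
def build_inspect_result(matched_columns, pii, data_type_name="regex"):
    """Build a dict of matched column names, plus a PII entry when flagged."""
    # Key naming is hypothetical: "<data_type_name>_columns".
    result = {f"{data_type_name}_columns": list(matched_columns)}
    if pii:
        # Every matched column is treated as PII when the inspector is flagged.
        result["pii_columns"] = list(matched_columns)
    return result


print(build_inspect_result(["zip"], pii=True))
# {'regex_columns': ['zip'], 'pii_columns': ['zip']}
```

Returning a plain dict keeps the result easy to pass to a metadata update call such as the one in the example below.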

🍰Example(optional)

# Build the configured inspectors and fit each one on the data.
inspectors = InspectorManager().init_inspcetors(
    include_inspectors, exclude_inspectors, **(inspector_init_kwargs or {})
)
for inspector in inspectors:
    inspector.fit(df)

# Merge every inspector's findings into the metadata.
metadata = Metadata(primary_keys=[df.columns[0]], column_list=list(df.columns))
for inspector in inspectors:
    metadata.update(inspector.inspect())

@MooooCat MooooCat added enhancement New feature or request good first issue Good for newcomers difficulty-medium labels Dec 29, 2023
@MooooCat MooooCat added this to the 0.2.0 milestone Dec 29, 2023
@MooooCat MooooCat linked a pull request Jan 18, 2024 that will close this issue