Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce inspect_level in inspector and metadata #113

Merged
merged 19 commits into from
Jan 20, 2024

Conversation

MooooCat
Copy link
Contributor

Description

Inspected level is a concept newly introduced in version 0.1.5.

Since a single column in the table may be marked by different inspectors at the same time, which may cause confuse.

Thus, a inspect_level is introduced when determining the specific type of a column.

Motivation and Context

A single column, can be labeled by different multiple inspector.

For example, A email column may be recognized as email, but it may also be recognized as the id column, and it may also be recognized by different inspectors at the same time identified as a discrete column, which will cause confusion in subsequent processing.

Also, you can see from our multi-table demo dataset, child table "train", the Date column is marked twice, as discrete type and date type.

We will preset different inspector levels for different inspectors, usually more specific inspectors will get higher levels, and general inspectors (like discrete) will have inspect_level. In baseclass, the inspect_level is set to 1.

How has this been tested?

Types of changes

  • Maintenance (no change in code, maintain the project's CI, docs, etc.)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

Copy link
Contributor

sweep-ai bot commented Jan 17, 2024

Apply Sweep Rules to your PR?

  • Apply: All new business logic should have corresponding unit tests.
  • Apply: Refactor large functions to be more modular.
  • Apply: Add docstrings to all functions and file headers.

This is an automated message generated by Sweep AI.

@codecov-commenter
Copy link

codecov-commenter commented Jan 17, 2024

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (1bbf3f5) 78.94% compared to head (4e43bdb) 79.13%.

Files Patch % Lines
sdgx/data_models/metadata.py 79.54% 9 Missing ⚠️
sdgx/data_models/inspectors/base.py 87.50% 2 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #113      +/-   ##
==========================================
+ Coverage   78.94%   79.13%   +0.19%     
==========================================
  Files          64       64              
  Lines        2764     2823      +59     
==========================================
+ Hits         2182     2234      +52     
- Misses        582      589       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@hitsz-ids hitsz-ids deleted a comment from sweep-ai bot Jan 17, 2024
@Wh1isper
Copy link
Collaborator

I suggest limiting the priority to a range such as (1-100), the larger the priority, the sooner the execution.

Our built-in inspector should be presented in a hierarchy (10,20,30...), so that users can insert a custom inspector from the middle.

@Guo-Yunzhe
Copy link
Collaborator

Guo-Yunzhe commented Jan 17, 2024 via email

@MooooCat MooooCat requested a review from Z712023 January 18, 2024 04:48
@MooooCat MooooCat marked this pull request as ready for review January 18, 2024 04:48
Copy link
Collaborator

@Wh1isper Wh1isper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Request some teses on inspect_level

@Wh1isper Wh1isper enabled auto-merge (squash) January 19, 2024 14:32
sdgx/data_models/metadata.py Outdated Show resolved Hide resolved
Co-authored-by: Zhongsheng Ji <9573586@qq.com>
@MooooCat
Copy link
Contributor Author

It seems that all our package build test are failed because the SSL error of datahub.io., causing the fail of some unit test (downloading the demo dataset).

@Wh1isper
Copy link
Collaborator

Wh1isper commented Jan 19, 2024

Can a more stable data source be found? Or hitz-ids could provide a public download source like s3?

@Guo-Yunzhe
Copy link
Collaborator

Guo-Yunzhe commented Jan 20, 2024 via email

@MooooCat MooooCat enabled auto-merge (squash) January 20, 2024 11:14
Copy link
Collaborator

@Wh1isper Wh1isper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one more question, should we add some tests in test_metadata.py for column_inspect_level? We can leave it in next PR.

@MooooCat MooooCat merged commit f2c12a2 into main Jan 20, 2024
11 checks passed
@Wh1isper Wh1isper deleted the feature-inspector-level branch January 20, 2024 11:18
@MooooCat
Copy link
Contributor Author

LGTM, just one more question, should we add some tests in test_metadata.py for column_inspect_level? We can leave it in next PR.

Sure, currently some test are in each inspectors, we can add test cases in test_metadata.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants