Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[0.2.0] Metadata Implementation #81

Merged
merged 37 commits into from
Dec 27, 2023
Merged

[0.2.0] Metadata Implementation #81

merged 37 commits into from
Dec 27, 2023

Conversation

MooooCat
Copy link
Contributor

@MooooCat MooooCat commented Dec 19, 2023

Description

The metadata component for single table and multi table.

Motivation and Context

This metadata is used to describe:

  • the base attribute (data type, pii, format, etc) of each column in table;
  • the relationship between columns;
  • the relationship between tables (parent table, child table);
  • etc.

Also, metadata will also provide methods to:

  • auto inspections;
  • add/change attributes;
  • add/change/remove relationship;
  • check the validation of relationship;
  • etc.

How has this been tested?

Types of changes

  • Maintenance (no change in code, maintain the project's CI, docs, etc.)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@MooooCat MooooCat added this to the 0.1.0 milestone Dec 19, 2023
@MooooCat
Copy link
Contributor Author

Currently, metadata is in the design stage.

I will probably complete the specific design tomorrow, I plan to initiate a discussion when completing the design.

Copy link
Contributor

sweep-ai bot commented Dec 19, 2023

Apply Sweep Rules to your PR?

  • Apply: All new business logic should have corresponding unit tests.
  • Apply: Refactor large functions to be more modular.
  • Apply: Add docstrings to all functions and file headers.

@Wh1isper
Copy link
Collaborator

We can ignore the multi-table scenario for now and get the single table fields done. The table-to-table relationship (A.a vs. B.a) can be considered separately from the field (a,b,c,d) case within the table.

I'd like to know what is necessarily contained in a single table Metadata, such as the currently implemented discrete columns: https://github.com/hitsz-ids/synthetic-data-generator/blob/main/sdgx/data_models/inspectors/discrete.py

@MooooCat
Copy link
Contributor Author

We can ignore the multi-table scenario for now and get the single table fields done. The table-to-table relationship (A.a vs. B.a) can be considered separately from the field (a,b,c,d) case within the table.

Yes, we will first update the metadata of a single table.

Currently, the metadata of a single table has the following new information (please help me check if there are any gaps):

  • Primary key: primary key information, here we need to be compatible with composite primary keys;
  • Data type of each column: Use dict description, key is the column name, and value is the DataType object of each column. Different attributes/methods can be provided according to the different DataType.

@MooooCat
Copy link
Contributor Author

I'd like to know what is necessarily contained in a single table Metadata, such as the currently implemented discrete columns: https://github.com/hitsz-ids/synthetic-data-generator/blob/main/sdgx/data_models/inspectors/discrete.py

I think this is very necessary. We may start from support 5 basic types:

  • ID
  • Numeric value
  • Discrete value
  • DataTime
  • Bool value

@MooooCat
Copy link
Contributor Author

However, the multi-table scenario is also the focus of our work.

We have faced the need for multi tables in some actual work, and existing solutions cannot solve their needs.

After finishing single table metadata, I will implement metadata for multi tables based on SingleTableMetadata.

@Wh1isper Wh1isper changed the title [0.1.0] Metadata Implementation [0.2.0] Metadata Implementation Dec 20, 2023
@Wh1isper Wh1isper modified the milestones: 0.1.0, 0.2.0 Dec 20, 2023
@Wh1isper
Copy link
Collaborator

Seems no need to intro a submodule as metadata

metadata.py for single table and relationship.py + The-Multi-Table-Combiner.py for multi table is enough.

@MooooCat
Copy link
Contributor Author

Seems no need to intro a submodule as metadata

metadata.py for single table and relationship.py + The-Multi-Table-Combiner.py for multi table is enough.

It’s okay, there aren’t many new modules in metadata.

@codecov-commenter
Copy link

codecov-commenter commented Dec 23, 2023

Codecov Report

Attention: 52 lines in your changes are missing coverage. Please review.

Comparison is base (897e252) 76.29% compared to head (ad312c8) 75.87%.

Files Patch % Lines
sdgx/data_models/multi_table_combiner.py 0.00% 28 Missing ⚠️
sdgx/data_models/relationship.py 0.00% 12 Missing ⚠️
sdgx/data_models/metadata.py 76.19% 10 Missing ⚠️
sdgx/data_models/inspectors/datetime.py 93.75% 2 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #81      +/-   ##
==========================================
- Coverage   76.29%   75.87%   -0.42%     
==========================================
  Files          52       58       +6     
  Lines        2054     2230     +176     
==========================================
+ Hits         1567     1692     +125     
- Misses        487      538      +51     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

MooooCat and others added 3 commits December 25, 2023 11:14
Use a single list for single or composite primary key.

Co-authored-by: Zhongsheng Ji <9573586@qq.com>
relationship_cnt = len(self.relationships)
metadata_cnt = len(self.metadata_dict.keys())
if metadata_cnt != relationship_cnt + 1:
raise ValueError("Number of tables should corresponds to relationships.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a new exception in sdgx/exceptions.py? This will help us differentiate between what are expected exceptions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something I didn't consider, I'll add new exceptions MultiTableCombineError and MetadataError in sdgx.exceptions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better unit testing and bug fixing, I'd suggest synchronizing the main branch, which also includes Exceptions with error codes and exit codes

sdgx/data_models/relationship.py Outdated Show resolved Hide resolved
@MooooCat
Copy link
Contributor Author

is the current design sufficient for this?

I think the current design of relationship is sufficient to describe existing multi-table relationships.

In synthetic data, the multi-table relationships we currently implement are limited to “one-to-one” or “one-to-many”, that is, a parent table is associated with different child tables through foreign keys. In the current design, relationship is like a line in this correspondence, and each relationship only describes the relationship between two tables. We use multi-table combiner to combine table's meatadata and relationships.

@Wh1isper
Copy link
Collaborator

Sounds fine, looking forward to more commits!

MooooCat and others added 7 commits December 25, 2023 12:02
Key are described using list, which is compatible with single or composite foreign key.

Co-authored-by: Zhongsheng Ji <9573586@qq.com>
Unit testing also needs to be implemented.
sdgx/data_models/metadata.py Outdated Show resolved Hide resolved
Comment on lines 189 to 198
# check missing columns
for each_column in self.column_list:
if each_column not in all_dtype_columns:
raise MetadaError(f"Undefined data type for column {each_column}.")

# check unfamiliar columns in dtypes
for each_dtype_column in all_dtype_columns:
if each_dtype_column not in self.column_list:
raise MetadaError(f"Found undefined column: {each_dtype_column}.")

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can use set(self.column_list) - set(all_dtype_columns) here, which we can indicate all invalid keys once.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe we can rename MetadaError to MetadataInvalidError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can use set(self.column_list) - set(all_dtype_columns) here, which we can indicate all invalid keys once.

This is a more concise way of writing.

The current implementation is because the error information raised in these two cases is different, which may make it easier for users to locate the two error types.

Copy link
Collaborator

@Wh1isper Wh1isper Dec 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation is because the error information raised in these two cases is different, which may make it easier for users to locate the two error types.

There is no conflict, just use two if-expressions instead of two for-ifs.

@Wh1isper
Copy link
Collaborator

Could you please add some tests on Metadata.save and Metadata.load? Although they still seem to be working fine.

I think this PR can be merged at the moment, and one or more PRs may be needed next to intro Relationship and multi-table model code.

@Wh1isper Wh1isper marked this pull request as ready for review December 27, 2023 08:50
@MooooCat
Copy link
Contributor Author

Could you please add some tests on Metadata.save and Metadata.load? Although they still seem to be working fine.

No problem, I will merge after completing the tests of Metadata.save and Metadata.load.

@MooooCat MooooCat merged commit 9b4c683 into main Dec 27, 2023
11 checks passed
@MooooCat MooooCat deleted the feature-metadata branch December 27, 2023 09:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants