Implement MetadataCombiner, partial refactoring on Metadata (#96)
* Implementing MetadataCombiner

* MetadataCombiner save and load

* Intro identity as property

* Intro MetadataCombiner.from_dataframe

* Adding docs

* Adding tests for relationship

* Update docs

* Update metadata imp

* Fixing metadata helper

* Fixing CI

* Add docstring

* Add more docstring

* Add more docstring
Wh1isper committed Jan 3, 2024
1 parent 00ba7f1 commit 160f6a2
Showing 29 changed files with 735 additions and 119 deletions.
5 changes: 5 additions & 0 deletions docs/source/api_reference/data_models/combiner.rst
@@ -0,0 +1,5 @@
MetadataCombiner
=======================

.. automodule:: sdgx.data_models.combiner
:members:
5 changes: 4 additions & 1 deletion docs/source/api_reference/data_models/index.rst
@@ -9,11 +9,14 @@ Metadata
:maxdepth: 2

metadata <metadata>
combiner <combiner>
relationship <relationship>


Built-in Inspectors and InspectorManager
-----------------------------------------

.. toctree::
:maxdepth: 2
:maxdepth: 3

inspectors <inspectors/index>
9 changes: 9 additions & 0 deletions docs/source/api_reference/data_models/inspectors/bool.rst
@@ -0,0 +1,9 @@
BoolInspector
============================

.. autoclass:: sdgx.data_models.inspectors.bool.BoolInspector
:members:
:undoc-members:
:inherited-members:
:show-inheritance:
:private-members:
9 changes: 9 additions & 0 deletions docs/source/api_reference/data_models/inspectors/datetime.rst
@@ -0,0 +1,9 @@
DatetimeInspector
============================

.. autoclass:: sdgx.data_models.inspectors.datetime.DatetimeInspector
:members:
:undoc-members:
:inherited-members:
:show-inheritance:
:private-members:
9 changes: 9 additions & 0 deletions docs/source/api_reference/data_models/inspectors/i_id.rst
@@ -0,0 +1,9 @@
IDInspector
============================

.. autoclass:: sdgx.data_models.inspectors.i_id.IDInspector
:members:
:undoc-members:
:inherited-members:
:show-inheritance:
:private-members:
4 changes: 4 additions & 0 deletions docs/source/api_reference/data_models/inspectors/index.rst
@@ -13,6 +13,10 @@ Built-in Inspector
:maxdepth: 2

DiscreteInspector <discrete>
NumericInspector <numeric>
BoolInspector <bool>
DatetimeInspector <datetime>
IDInspector <i_id>


Custom Inspectors Relevant
9 changes: 9 additions & 0 deletions docs/source/api_reference/data_models/inspectors/numeric.rst
@@ -0,0 +1,9 @@
NumericInspector
============================

.. autoclass:: sdgx.data_models.inspectors.numeric.NumericInspector
:members:
:undoc-members:
:inherited-members:
:show-inheritance:
:private-members:
5 changes: 5 additions & 0 deletions docs/source/api_reference/data_models/relationship.rst
@@ -0,0 +1,5 @@
Relationship
=======================

.. automodule:: sdgx.data_models.relationship
:members:
10 changes: 7 additions & 3 deletions docs/source/user_guides/library.rst
@@ -6,6 +6,10 @@ Use Synthetic Data Generator as a library
Learn more about :ref:`Architecture <architecture>` of our project.


.. warning::

   This guide is not complete yet. Contributions are welcome.

Using SDG as a library allows researchers and developers to build their own
projects on top of SDG. Using SDG as a library is highly recommended for anyone
with some basic programming experience.
@@ -160,6 +164,6 @@ Evaluation
Next Step
---------------------------------------------------------------------------------

- Learn more about :ref:`Synthetic single-table data <Synthetic single-table data>`
- Learn more about :ref:`Synthetic multi-table data <Synthetic multi-table data>`
- Learn more about :ref:`Evaluation synthetic data <Evaluation synthetic data>`
- :ref:`Synthetic single-table data <Synthetic single-table data>`
- :ref:`Synthetic multi-table data <Synthetic multi-table data>`
- :ref:`Evaluation synthetic data <Evaluation synthetic data>`
4 changes: 4 additions & 0 deletions sdgx/data_loader.py
@@ -28,6 +28,8 @@ class DataLoader:
cacher (:ref:`Cacher`, optional): The cacher. Defaults to None.
cache_mode (str, optional): The cache mode(cachers' name). Defaults to "DiskCache", more info in :ref:`DiskCache`.
cacher_kwargs (dict, optional): The kwargs for cacher. Defaults to None
identity (str, optional): The identity of the data source.
When using :ref:`GeneratorConnector`, it can point to the original data source, making it possible to work with :ref:`MetadataCombiner`.
Example:
@@ -95,10 +97,12 @@ def __init__(
chunksize: int = 10000,
cacher: Cacher | str | type[Cacher] | None = None,
cacher_kwargs: None | dict[str, Any] = None,
identity: str | None = None,
) -> None:
self.data_connector = data_connector
self.chunksize = chunksize
self.cache_manager = CacherManager()
self.identity = identity or self.data_connector.identity or str(id(self))

if not cacher_kwargs:
cacher_kwargs = {}
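The identity fallback added to `DataLoader.__init__` is a simple or-chain: explicit argument, then the connector's identity, then a unique per-instance value. A minimal standalone sketch of the pattern (`ConnectorStub` and `LoaderStub` are hypothetical stand-ins, not the real sdgx classes):

```python
class ConnectorStub:
    """Hypothetical stand-in for a data connector; not the real sdgx class."""

    def __init__(self, identity=None):
        self.identity = identity


class LoaderStub:
    """Sketch of the DataLoader identity resolution added in this commit."""

    def __init__(self, connector, identity=None):
        self.data_connector = connector
        # First truthy value wins: explicit argument, then the connector's
        # identity, then a unique per-instance fallback based on id(self).
        self.identity = identity or connector.identity or str(id(self))


explicit = LoaderStub(ConnectorStub("users.csv"), identity="users")
from_connector = LoaderStub(ConnectorStub("users.csv"))
fallback = LoaderStub(ConnectorStub())

print(explicit.identity)        # "users"
print(from_connector.identity)  # "users.csv"
print(fallback.identity)        # a unique per-instance string
```

The `str(id(self))` fallback guarantees each loader gets *some* identity, but note it is not stable across processes, so persisted combiners should prefer an explicit identity.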
255 changes: 255 additions & 0 deletions sdgx/data_models/combiner.py
@@ -0,0 +1,255 @@
from __future__ import annotations

from pathlib import Path
from typing import Dict, List

import pandas as pd
from pydantic import BaseModel

from sdgx.data_loader import DataLoader
from sdgx.data_models.inspectors.base import Inspector
from sdgx.data_models.inspectors.manager import InspectorManager
from sdgx.data_models.metadata import Metadata
from sdgx.data_models.relationship import Relationship
from sdgx.exceptions import MetadataCombinerInitError, MetadataCombinerInvalidError
from sdgx.utils import logger


class MetadataCombiner(BaseModel):
"""
Combine metadata for multiple tables, describing the relationships between them.
Args:
named_metadata (Dict[str, Metadata]): Mapping of table name to its :ref:`Metadata`.
relationships (List[Relationship]): Relationships between the tables.
"""

version: str = "1.0"

named_metadata: Dict[str, Metadata] = {}

relationships: List[Relationship] = []

def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.check()

def check(self):
"""Do necessary checks:
- Whether the number of tables corresponds to the relationships.
- Whether table names correspond to the relationships between tables.
"""
for m in self.named_metadata.values():
m.check()

table_names = set(self.named_metadata.keys())
relationship_parents = set(r.parent_table for r in self.relationships)
relationship_children = set(r.child_table for r in self.relationships)

# each relationship's table must have metadata
if not table_names.issuperset(relationship_parents):
raise MetadataCombinerInvalidError(
f"Relationships' parent table {relationship_parents - table_names} is missing."
)
if not table_names.issuperset(relationship_children):
raise MetadataCombinerInvalidError(
f"Relationships' child table {relationship_children - table_names} is missing."
)

# each table in metadata must in a relationship
if not (relationship_parents | relationship_children).issuperset(table_names):
raise MetadataCombinerInvalidError(
f"Table {table_names - (relationship_parents | relationship_children)} is missing in relationships."
)

logger.info("MetadataCombiner check finished.")
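The `check()` validation above boils down to three set comparisons. A standalone sketch of that logic (`check_relationships` is a hypothetical helper; it raises `ValueError` where the real code raises `MetadataCombinerInvalidError`):

```python
def check_relationships(table_names, relationships):
    """Validate table names against (parent, child) relationship pairs."""
    tables = set(table_names)
    parents = {p for p, _ in relationships}
    children = {c for _, c in relationships}

    # Every table referenced by a relationship must have metadata.
    if not tables.issuperset(parents):
        raise ValueError(f"Missing parent metadata: {parents - tables}")
    if not tables.issuperset(children):
        raise ValueError(f"Missing child metadata: {children - tables}")

    # Every table with metadata must appear in at least one relationship.
    if not (parents | children).issuperset(tables):
        raise ValueError(
            f"Tables not in any relationship: {tables - (parents | children)}"
        )


check_relationships(["users", "orders"], [("users", "orders")])  # passes silently
```

Note that set union is `|`, not `+`; using `+` on two `set` objects raises `TypeError`.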

@classmethod
def from_dataloader(
cls,
dataloaders: list[DataLoader],
max_chunk: int = 10,
metadata_from_dataloader_kwargs: None | dict = None,
relationshipe_inspector: None | str | type[Inspector] = "DefaultRelationshipInspector",
relationships_inspector_kwargs: None | dict = None,
relationships: None | list[Relationship] = None,
):
"""
Combine multiple dataloaders using their relationships.
Args:
dataloaders (list[DataLoader]): list of dataloaders
max_chunk (int): max chunk count for relationship inspector.
metadata_from_dataloader_kwargs (dict): kwargs for :func:`Metadata.from_dataloader`
relationshipe_inspector (str | type[Inspector]): relationship inspector
relationships_inspector_kwargs (dict): kwargs for :func:`InspectorManager.init`
relationships (list[Relationship]): list of relationships
"""
if not isinstance(dataloaders, list):
dataloaders = [dataloaders]

metadata_from_dataloader_kwargs = metadata_from_dataloader_kwargs or {}
metadata_from_dataloader_kwargs.setdefault("max_chunk", max_chunk)
named_metadata = {
d.identity: Metadata.from_dataloader(d, **metadata_from_dataloader_kwargs)
for d in dataloaders
}

if relationships is None and relationshipe_inspector is not None:
if relationships_inspector_kwargs is None:
relationships_inspector_kwargs = {}

inspector = InspectorManager().init(
relationshipe_inspector, **relationships_inspector_kwargs
)
for d in dataloaders:
for i, chunk in enumerate(d.iter()):
inspector.fit(chunk)
if inspector.ready or i > max_chunk:
break
relationships = inspector.inspect()["relationships"]

return cls(named_metadata=named_metadata, relationships=relationships)

@classmethod
def from_dataframe(
cls,
dataframes: list[pd.DataFrame],
names: list[str],
metadata_from_dataloader_kwargs: None | dict = None,
relationshipe_inspector: None | str | type[Inspector] = "DefaultRelationshipInspector",
relationships_inspector_kwargs: None | dict = None,
relationships: None | list[Relationship] = None,
) -> "MetadataCombiner":
"""
Combine multiple dataframes using their relationships.
Args:
dataframes (list[pd.DataFrame]): list of dataframes
names (list[str]): list of names
metadata_from_dataloader_kwargs (dict): kwargs for :func:`Metadata.from_dataloader`
relationshipe_inspector (str | type[Inspector]): relationship inspector
relationships_inspector_kwargs (dict): kwargs for :func:`InspectorManager.init`
relationships (list[Relationship]): list of relationships
"""
if not isinstance(dataframes, list):
dataframes = [dataframes]
if not isinstance(names, list):
names = [names]

metadata_from_dataloader_kwargs = metadata_from_dataloader_kwargs or {}

if len(dataframes) != len(names):
raise MetadataCombinerInitError("dataframes and names should have same length.")

named_metadata = {
n: Metadata.from_dataframe(d, **metadata_from_dataloader_kwargs)
for n, d in zip(names, dataframes)
}

if relationships is None and relationshipe_inspector is not None:
if relationships_inspector_kwargs is None:
relationships_inspector_kwargs = {}

inspector = InspectorManager().init(
relationshipe_inspector, **relationships_inspector_kwargs
)
for d in dataframes:
inspector.fit(d)
relationships = inspector.inspect()["relationships"]

return cls(named_metadata=named_metadata, relationships=relationships)

def _dump_json(self):
return self.model_dump_json()

def save(
self,
save_dir: str | Path,
metadata_subdir: str = "metadata",
relationship_subdir: str = "relationship",
):
"""
Save metadata and relationships to JSON files.
This will create subdirectories for metadata and relationships.
Args:
save_dir (str | Path): directory to save
metadata_subdir (str): subdirectory for metadata, default is "metadata"
relationship_subdir (str): subdirectory for relationship, default is "relationship"
"""
save_dir = Path(save_dir).expanduser().resolve()
version_file = save_dir / "version"
version_file.write_text(self.version)

metadata_subdir = save_dir / metadata_subdir
relationship_subdir = save_dir / relationship_subdir

metadata_subdir.mkdir(parents=True, exist_ok=True)
for name, metadata in self.named_metadata.items():
metadata.save(metadata_subdir / f"{name}.json")

relationship_subdir.mkdir(parents=True, exist_ok=True)
for relationship in self.relationships:
relationship.save(
relationship_subdir / f"{relationship.parent_table}_{relationship.child_table}.json"
)

@classmethod
def load(
cls,
save_dir: str | Path,
metadata_subdir: str = "metadata",
relationship_subdir: str = "relationship",
version: None | str = None,
) -> "MetadataCombiner":
"""
Load metadata and relationships from JSON files.
Args:
save_dir (str | Path): directory to load from
metadata_subdir (str): subdirectory for metadata, default is "metadata"
relationship_subdir (str): subdirectory for relationship, default is "relationship"
version (str): Manual version; if not specified, try to load it from the version file
"""

save_dir = Path(save_dir).expanduser().resolve()
if not version:
logger.debug("No version specified, try to load from version file.")
version_file = save_dir / "version"
if version_file.exists():
version = version_file.read_text().strip()
else:
logger.info("No version file found, assume version is 1.0")
version = "1.0"

named_metadata = {p.stem: Metadata.load(p) for p in (save_dir / metadata_subdir).glob("*")}

relationships = [Relationship.load(p) for p in (save_dir / relationship_subdir).glob("*")]

cls.upgrade(version, named_metadata, relationships)

return cls(
version=version,
named_metadata=named_metadata,
relationships=relationships,
)

@classmethod
def upgrade(
cls,
old_version: str,
named_metadata: dict[str, Metadata],
relationships: list[Relationship],
) -> None:
"""
Upgrade metadata from an old version to the current one.
:ref:`Metadata.upgrade` and :ref:`Relationship.upgrade` already attempt upgrades when loading,
so only the combiner-level upgrade is handled here.
"""

pass
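The on-disk layout used by `save`/`load` above (a `version` file plus `metadata/` and `relationship/` subdirectories of JSON files) can be sketched with plain `pathlib` and stub JSON payloads instead of real `Metadata` objects (`save_combined`/`load_combined` are hypothetical helpers):

```python
import json
import tempfile
from pathlib import Path


def save_combined(save_dir, version, metadata, relationships):
    """Mirror the layout: a version file plus two subdirectories of JSON files."""
    save_dir = Path(save_dir).expanduser().resolve()
    (save_dir / "version").write_text(version)
    meta_dir = save_dir / "metadata"
    rel_dir = save_dir / "relationship"
    meta_dir.mkdir(parents=True, exist_ok=True)
    rel_dir.mkdir(parents=True, exist_ok=True)
    for name, payload in metadata.items():
        (meta_dir / f"{name}.json").write_text(json.dumps(payload))
    for parent, child in relationships:
        # One file per relationship, named after parent and child tables.
        (rel_dir / f"{parent}_{child}.json").write_text(json.dumps([parent, child]))


def load_combined(save_dir):
    save_dir = Path(save_dir).expanduser().resolve()
    version_file = save_dir / "version"
    # Fall back to "1.0" when no version file exists, as the loader does.
    version = version_file.read_text().strip() if version_file.exists() else "1.0"
    metadata = {p.stem: json.loads(p.read_text()) for p in (save_dir / "metadata").glob("*")}
    relationships = [json.loads(p.read_text()) for p in (save_dir / "relationship").glob("*")]
    return version, metadata, relationships


with tempfile.TemporaryDirectory() as d:
    save_combined(d, "1.0", {"users": {"id_columns": ["id"]}}, [("users", "orders")])
    version, metadata, relationships = load_combined(d)
```

Using the file stem as the table name on load is why `save` writes each metadata file as `{name}.json`: the round trip recovers `named_metadata` keys without a separate index file.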
File renamed without changes.
6 changes: 3 additions & 3 deletions sdgx/data_models/inspectors/manager.py
@@ -1,6 +1,6 @@
from __future__ import annotations

from typing import Any
from typing import Any, Iterable

from sdgx.data_models import inspectors
from sdgx.data_models.inspectors import extension
@@ -29,8 +29,8 @@ def init_all_inspectors(self, **kwargs: Any) -> list[Inspector]:

def init_inspcetors(
self,
includes: list[str] | None = None,
excludes: list[str] | None = None,
includes: Iterable[str] | None = None,
excludes: Iterable[str] | None = None,
**kwargs: Any,
) -> list[Inspector]:
includes = includes or self.registed_inspectors.keys()
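The widening of `includes`/`excludes` from `list[str]` to `Iterable[str]` matters because the default falls back to the registry's `dict.keys()`, which is iterable but not a list. A minimal sketch of the pattern (`REGISTRY` and `init_inspectors` are made-up stand-ins for the manager's internals):

```python
from __future__ import annotations

from typing import Iterable

# Hypothetical registry standing in for registed_inspectors; names are made up.
REGISTRY = {"discrete": object, "numeric": object, "bool": object}


def init_inspectors(
    includes: Iterable[str] | None = None,
    excludes: Iterable[str] | None = None,
) -> list[str]:
    # dict.keys() is an Iterable[str] but not a list, which is why the
    # annotation was widened from list[str] to Iterable[str].
    includes = includes or REGISTRY.keys()
    excludes = set(excludes or ())
    return [name for name in includes if name not in excludes]


print(init_inspectors())                   # ['discrete', 'numeric', 'bool']
print(init_inspectors(excludes={"bool"}))  # ['discrete', 'numeric']
```

Accepting any iterable also lets callers pass tuples, sets, or generator expressions without converting to a list first.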