Implement MetadataCombiner, partial refactoring on Metadata #96

Merged
merged 13 commits into from
Jan 3, 2024
5 changes: 5 additions & 0 deletions docs/source/api_reference/data_models/combiner.rst
@@ -0,0 +1,5 @@
MetadataCombiner
=======================

.. automodule:: sdgx.data_models.combiner
:members:
5 changes: 4 additions & 1 deletion docs/source/api_reference/data_models/index.rst
@@ -9,11 +9,14 @@ Metadata
:maxdepth: 2

metadata <metadata>
combiner <combiner>
relationship <relationship>


Built-in Inspectors and InspectorManager
-----------------------------------------

.. toctree::
:maxdepth: 2
:maxdepth: 3

inspectors <inspectors/index>
9 changes: 9 additions & 0 deletions docs/source/api_reference/data_models/inspectors/bool.rst
@@ -0,0 +1,9 @@
BoolInspector
============================

.. autoclass:: sdgx.data_models.inspectors.bool.BoolInspector
:members:
:undoc-members:
:inherited-members:
:show-inheritance:
:private-members:
9 changes: 9 additions & 0 deletions docs/source/api_reference/data_models/inspectors/datetime.rst
@@ -0,0 +1,9 @@
DatetimeInspector
============================

.. autoclass:: sdgx.data_models.inspectors.datetime.DatetimeInspector
:members:
:undoc-members:
:inherited-members:
:show-inheritance:
:private-members:
9 changes: 9 additions & 0 deletions docs/source/api_reference/data_models/inspectors/i_id.rst
@@ -0,0 +1,9 @@
IDInspector
============================

.. autoclass:: sdgx.data_models.inspectors.i_id.IDInspector
:members:
:undoc-members:
:inherited-members:
:show-inheritance:
:private-members:
4 changes: 4 additions & 0 deletions docs/source/api_reference/data_models/inspectors/index.rst
@@ -13,6 +13,10 @@ Built-in Inspector
:maxdepth: 2

DiscreteInspector <discrete>
NumericInspector <numeric>
BoolInspector <bool>
DatetimeInspector <datetime>
IDInspector <i_id>


Custom Inspectors Relevant
9 changes: 9 additions & 0 deletions docs/source/api_reference/data_models/inspectors/numeric.rst
@@ -0,0 +1,9 @@
NumericInspector
============================

.. autoclass:: sdgx.data_models.inspectors.numeric.NumericInspector
:members:
:undoc-members:
:inherited-members:
:show-inheritance:
:private-members:
5 changes: 5 additions & 0 deletions docs/source/api_reference/data_models/relationship.rst
@@ -0,0 +1,5 @@
Relationship
=======================

.. automodule:: sdgx.data_models.relationship
:members:
10 changes: 7 additions & 3 deletions docs/source/user_guides/library.rst
@@ -6,6 +6,10 @@ Use Synthetic Data Generator as a library
Learn more about :ref:`Architecture <architecture>` of our project.


.. warning::

   This guide is not complete yet. Contributions are welcome.

Using SDG as a library allows researchers and developers to build their own
projects on top of SDG. It's highly recommended to use SDG as a library if you
have some basic programming experience.
@@ -160,6 +164,6 @@ Evaluation
Next Step
---------------------------------------------------------------------------------

- Learn more about :ref:`Synthetic single-table data <Synthetic single-table data>`
- Learn more about :ref:`Synthetic multi-table data <Synthetic multi-table data>`
- Learn more about :ref:`Evaluation synthetic data <Evaluation synthetic data>`
- :ref:`Synthetic single-table data <Synthetic single-table data>`
- :ref:`Synthetic multi-table data <Synthetic multi-table data>`
- :ref:`Evaluation synthetic data <Evaluation synthetic data>`
4 changes: 4 additions & 0 deletions sdgx/data_loader.py
@@ -28,6 +28,8 @@ class DataLoader:
cacher (:ref:`Cacher`, optional): The cacher. Defaults to None.
cache_mode (str, optional): The cache mode(cachers' name). Defaults to "DiskCache", more info in :ref:`DiskCache`.
cacher_kwargs (dict, optional): The kwargs for cacher. Defaults to None.
identity (str, optional): The identity of the data source.
When using :ref:`GeneratorConnector`, it can point to the original data source, which makes it possible to work with :ref:`MetadataCombiner`.

Example:

@@ -95,10 +97,12 @@ def __init__(
chunksize: int = 10000,
cacher: Cacher | str | type[Cacher] | None = None,
cacher_kwargs: None | dict[str, Any] = None,
identity: str | None = None,
) -> None:
self.data_connector = data_connector
self.chunksize = chunksize
self.cache_manager = CacherManager()
self.identity = identity or self.data_connector.identity or str(id(self))

if not cacher_kwargs:
cacher_kwargs = {}
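The new `identity` resolution in `DataLoader.__init__` can be sketched standalone. `FakeConnector` and `LoaderSketch` below are hypothetical stand-ins for illustration, not sdgx classes:

```python
# Minimal sketch of the identity-resolution order added to DataLoader.__init__:
# an explicit argument wins, then the connector's own identity, then a unique
# per-instance fallback. FakeConnector and LoaderSketch are hypothetical.
class FakeConnector:
    identity = "users.csv"


class LoaderSketch:
    def __init__(self, connector, identity=None):
        self.identity = identity or connector.identity or str(id(self))


assert LoaderSketch(FakeConnector()).identity == "users.csv"
assert LoaderSketch(FakeConnector(), identity="orders").identity == "orders"
```

Because the fallback chain always produces a non-empty string, every loader can be used as a key in `MetadataCombiner.named_metadata`.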
255 changes: 255 additions & 0 deletions sdgx/data_models/combiner.py
@@ -0,0 +1,255 @@
from __future__ import annotations

from pathlib import Path
from typing import Dict, List

import pandas as pd
from pydantic import BaseModel

from sdgx.data_loader import DataLoader
from sdgx.data_models.inspectors.base import Inspector
from sdgx.data_models.inspectors.manager import InspectorManager
from sdgx.data_models.metadata import Metadata
from sdgx.data_models.relationship import Relationship
from sdgx.exceptions import MetadataCombinerInitError, MetadataCombinerInvalidError
from sdgx.utils import logger


class MetadataCombiner(BaseModel):
"""
Combine different tables with relationships; used to describe how the tables relate to each other.

Args:
named_metadata (Dict[str, Metadata]): Mapping from table name to its Metadata.

relationships (List[Relationship]): Relationships between the named tables.
"""

version: str = "1.0"

named_metadata: Dict[str, Metadata] = {}

relationships: List[Relationship] = []

def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.check()

def check(self):
"""Do necessary checks:

- Whether every table referenced by a relationship has metadata;
- Whether every table with metadata takes part in at least one relationship.
"""
for m in self.named_metadata.values():
m.check()

table_names = set(self.named_metadata.keys())
relationship_parents = set(r.parent_table for r in self.relationships)
relationship_children = set(r.child_table for r in self.relationships)

# each relationship's table must have metadata
if not table_names.issuperset(relationship_parents):
raise MetadataCombinerInvalidError(
f"Relationships' parent table {relationship_parents - table_names} is missing."
)
if not table_names.issuperset(relationship_children):
raise MetadataCombinerInvalidError(
f"Relationships' child table {relationship_children - table_names} is missing."
)

# each table in metadata must in a relationship
if not (relationship_parents | relationship_children).issuperset(table_names):
raise MetadataCombinerInvalidError(
f"Table {table_names - (relationship_parents | relationship_children)} is missing in relationships."
)

logger.info("MetadataCombiner check finished.")
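The set-based validation in `check()` can be sketched standalone; the table and relationship names below are hypothetical examples:

```python
# Standalone sketch of the consistency checks in MetadataCombiner.check().
table_names = {"users", "orders"}
relationships = [("users", "orders")]  # (parent_table, child_table) pairs

parents = {p for p, _ in relationships}
children = {c for _, c in relationships}

# every table referenced by a relationship must have metadata
missing = (parents | children) - table_names
# every table with metadata must take part in at least one relationship
unrelated = table_names - (parents | children)

assert not missing and not unrelated  # both violations raise in the real check
```

Note that the union of parent and child sets is taken with `|`, since Python sets do not support `+`.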

@classmethod
def from_dataloader(
cls,
dataloaders: list[DataLoader],
max_chunk: int = 10,
metadata_from_dataloader_kwargs: None | dict = None,
relationshipe_inspector: None | str | type[Inspector] = "DefaultRelationshipInspector",
relationships_inspector_kwargs: None | dict = None,
relationships: None | list[Relationship] = None,
):
"""
Combine multiple dataloaders with relationships.

Args:
dataloaders (list[DataLoader]): list of dataloaders
max_chunk (int): max chunk count for relationship inspector.
metadata_from_dataloader_kwargs (dict): kwargs for :func:`Metadata.from_dataloader`
relationshipe_inspector (str | type[Inspector]): relationship inspector
relationships_inspector_kwargs (dict): kwargs for :func:`InspectorManager.init`
relationships (list[Relationship]): list of relationships
"""
if not isinstance(dataloaders, list):
dataloaders = [dataloaders]

metadata_from_dataloader_kwargs = metadata_from_dataloader_kwargs or {}
metadata_from_dataloader_kwargs.setdefault("max_chunk", max_chunk)
named_metadata = {
d.identity: Metadata.from_dataloader(d, **metadata_from_dataloader_kwargs)
for d in dataloaders
}

if relationships is None and relationshipe_inspector is not None:
if relationships_inspector_kwargs is None:
relationships_inspector_kwargs = {}

inspector = InspectorManager().init(
relationshipe_inspector, **relationships_inspector_kwargs
)
for d in dataloaders:
for i, chunk in enumerate(d.iter()):
inspector.fit(chunk)
if inspector.ready or i > max_chunk:
break
relationships = inspector.inspect()["relationships"]

return cls(named_metadata=named_metadata, relationships=relationships)
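The chunk-limited fitting loop above can be sketched without sdgx. `CountingInspector` is a hypothetical stand-in for a relationship `Inspector`:

```python
# Sketch of the from_dataloader loop: chunks are fed to the inspector until
# it reports ready or the chunk index exceeds max_chunk.
class CountingInspector:  # hypothetical stand-in, not a sdgx class
    def __init__(self, chunks_needed=3):
        self.seen = 0
        self.chunks_needed = chunks_needed

    @property
    def ready(self):
        return self.seen >= self.chunks_needed

    def fit(self, chunk):
        self.seen += 1


max_chunk = 10
inspector = CountingInspector()
for i, chunk in enumerate(range(100)):  # stand-in for dataloader.iter()
    inspector.fit(chunk)
    if inspector.ready or i > max_chunk:
        break

assert inspector.seen == 3  # stopped as soon as the inspector was ready
```

This keeps relationship inspection cheap on large sources: at most `max_chunk + 1` chunks are scanned per dataloader even if the inspector never becomes ready.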

@classmethod
def from_dataframe(
cls,
dataframes: list[pd.DataFrame],
names: list[str],
metadata_from_dataloader_kwargs: None | dict = None,
relationshipe_inspector: None | str | type[Inspector] = "DefaultRelationshipInspector",
relationships_inspector_kwargs: None | dict = None,
relationships: None | list[Relationship] = None,
) -> "MetadataCombiner":
"""
Combine multiple dataframes with relationships.

Args:
dataframes (list[pd.DataFrame]): list of dataframes
names (list[str]): list of names
metadata_from_dataloader_kwargs (dict): kwargs for :func:`Metadata.from_dataloader`
relationshipe_inspector (str | type[Inspector]): relationship inspector
relationships_inspector_kwargs (dict): kwargs for :func:`InspectorManager.init`
relationships (list[Relationship]): list of relationships
"""
if not isinstance(dataframes, list):
dataframes = [dataframes]
if not isinstance(names, list):
names = [names]

metadata_from_dataloader_kwargs = metadata_from_dataloader_kwargs or {}

if len(dataframes) != len(names):
raise MetadataCombinerInitError("dataframes and names should have the same length.")

named_metadata = {
n: Metadata.from_dataframe(d, **metadata_from_dataloader_kwargs)
for n, d in zip(names, dataframes)
}

if relationships is None and relationshipe_inspector is not None:
if relationships_inspector_kwargs is None:
relationships_inspector_kwargs = {}

inspector = InspectorManager().init(
relationshipe_inspector, **relationships_inspector_kwargs
)
for d in dataframes:
inspector.fit(d)
relationships = inspector.inspect()["relationships"]

return cls(named_metadata=named_metadata, relationships=relationships)

def _dump_json(self):
return self.model_dump_json()

def save(
self,
save_dir: str | Path,
metadata_subdir: str = "metadata",
relationship_subdir: str = "relationship",
):
"""
Save metadata to json file.

This will create several subdirectories for metadata and relationship.

Args:
save_dir (str | Path): directory to save
metadata_subdir (str): subdirectory for metadata, default is "metadata"
relationship_subdir (str): subdirectory for relationship, default is "relationship"
"""
save_dir = Path(save_dir).expanduser().resolve()
version_file = save_dir / "version"
version_file.write_text(self.version)

metadata_subdir = save_dir / metadata_subdir
relationship_subdir = save_dir / relationship_subdir

metadata_subdir.mkdir(parents=True, exist_ok=True)
for name, metadata in self.named_metadata.items():
metadata.save(metadata_subdir / f"{name}.json")

relationship_subdir.mkdir(parents=True, exist_ok=True)
for relationship in self.relationships:
relationship.save(
relationship_subdir / f"{relationship.parent_table}_{relationship.child_table}.json"
)

@classmethod
def load(
cls,
save_dir: str | Path,
metadata_subdir: str = "metadata",
relationship_subdir: str = "relationship",
version: None | str = None,
) -> "MetadataCombiner":
"""
Load metadata from json file.

Args:
save_dir (str | Path): directory to save
metadata_subdir (str): subdirectory for metadata, default is "metadata"
relationship_subdir (str): subdirectory for relationship, default is "relationship"
version (str): Manual version; if not specified, it is loaded from the version file.
"""

save_dir = Path(save_dir).expanduser().resolve()
if not version:
logger.debug("No version specified, try to load from version file.")
version_file = save_dir / "version"
if version_file.exists():
version = version_file.read_text().strip()
else:
logger.info("No version file found, assume version is 1.0")
version = "1.0"

named_metadata = {p.stem: Metadata.load(p) for p in (save_dir / metadata_subdir).glob("*")}

relationships = [Relationship.load(p) for p in (save_dir / relationship_subdir).glob("*")]

cls.upgrade(version, named_metadata, relationships)

return cls(
version=version,
named_metadata=named_metadata,
relationships=relationships,
)

@classmethod
def upgrade(
cls,
old_version: str,
named_metadata: dict[str, Metadata],
relationships: list[Relationship],
) -> None:
"""
Upgrade metadata from old version to new version

:ref:`Metadata.upgrade` and :ref:`Relationship.upgrade` already attempt their own
upgrades when loading, so only the Combiner-level upgrade is handled here.
"""

pass
File renamed without changes.
6 changes: 3 additions & 3 deletions sdgx/data_models/inspectors/manager.py
@@ -1,6 +1,6 @@
from __future__ import annotations

from typing import Any
from typing import Any, Iterable

from sdgx.data_models import inspectors
from sdgx.data_models.inspectors import extension
@@ -29,8 +29,8 @@ def init_all_inspectors(self, **kwargs: Any) -> list[Inspector]:

def init_inspcetors(
self,
includes: list[str] | None = None,
excludes: list[str] | None = None,
includes: Iterable[str] | None = None,
excludes: Iterable[str] | None = None,
**kwargs: Any,
) -> list[Inspector]:
includes = includes or self.registed_inspectors.keys()