Introduce LLM-based single-table model. #129

Merged · 38 commits · Feb 20, 2024

Commits
26b5c71 Introduce SingleTableGPTModel (MooooCat, Jan 26, 2024)
08480df Update init (MooooCat, Jan 29, 2024)
90d6604 Update SingleTableGPT.py (MooooCat, Jan 29, 2024)
8767c08 Update SingleTableGPT.py (MooooCat, Jan 29, 2024)
0ab296f Create test_singletableGPT.py (MooooCat, Jan 30, 2024)
65b098d Update SingleTableGPT.py (MooooCat, Jan 30, 2024)
5d2990c Update SingleTableGPT.py (MooooCat, Jan 30, 2024)
a2e4fdc Update SingleTableGPT.py (MooooCat, Jan 31, 2024)
5d6d73f Update test_singletableGPT.py (MooooCat, Jan 31, 2024)
0dd724c Improve code formatting (MooooCat, Jan 31, 2024)
61a5888 update testcases (MooooCat, Jan 31, 2024)
fd361d2 Update SingleTableGPT.py (MooooCat, Jan 31, 2024)
43d6d6b Update SingleTableGPT.py (MooooCat, Jan 31, 2024)
e6eb018 add more test cases (MooooCat, Jan 31, 2024)
9870b0a add testcases (MooooCat, Jan 31, 2024)
6b7681f bugfix: pd.df initialize (MooooCat, Jan 31, 2024)
9627c13 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Jan 31, 2024)
7e3b8d3 Merge branch 'main' into feature-LLM-models (MooooCat, Jan 31, 2024)
73d4992 add dependencies (MooooCat, Jan 31, 2024)
2931250 update dependency (MooooCat, Jan 31, 2024)
ac5a64f fix typo (MooooCat, Jan 31, 2024)
495a4d4 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Jan 31, 2024)
5664215 add comments (MooooCat, Feb 1, 2024)
54abf71 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Feb 1, 2024)
fc7e8ed add test cases (MooooCat, Feb 1, 2024)
449e512 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Feb 1, 2024)
d84ecb2 enable get API_key from env (MooooCat, Feb 1, 2024)
44ae29e add testcases (MooooCat, Feb 1, 2024)
b259a33 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Feb 1, 2024)
9721936 Merge branch 'main' into feature-LLM-models (MooooCat, Feb 12, 2024)
a41d1b3 use snakecase on filename (MooooCat, Feb 20, 2024)
ce1b7bb use snakecase for filenames (MooooCat, Feb 20, 2024)
c7fb325 add logs. correct function name. (MooooCat, Feb 20, 2024)
5abf052 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Feb 20, 2024)
ca7d247 Merge branch 'main' into feature-LLM-models (MooooCat, Feb 20, 2024)
a5a69db Split single_table.gpt model to base class (MooooCat, Feb 20, 2024)
4f024a7 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Feb 20, 2024)
a043c5c fix typo (MooooCat, Feb 20, 2024)
1 change: 1 addition & 0 deletions pyproject.toml
@@ -26,6 +26,7 @@ dependencies = [
"pydantic>=2",
"cloudpickle",
"importlib_metadata",
"openai>=1.10.0",
]
dynamic = ["version"]
classifiers = [
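For reference, the new `openai>=1.10.0` pin corresponds to the 1.x client-style API that the model code relies on. Below is a minimal sketch of that call pattern; the model name and prompt contents are placeholders, not taken from this PR, and the API key is read from the environment (cf. commit d84ecb2, "enable get API_key from env"):

import os

from openai import OpenAI

# Sketch of the openai>=1.10.0 client-style call pattern (placeholder values).
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # key taken from the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name, not specified by this diff
    messages=[
        {"role": "system", "content": "You are a powerful synthetic data generation model."},
        {"role": "user", "content": "Generate 3 synthetic data samples."},
    ],
)
print(response.choices[0].message.content)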
1 change: 1 addition & 0 deletions sdgx/models/LLM/__init__.py
@@ -0,0 +1 @@
from . import single_table
123 changes: 123 additions & 0 deletions sdgx/models/LLM/base.py
@@ -0,0 +1,123 @@
from sdgx.exceptions import SynthesizerInitError
from sdgx.models.base import SynthesizerModel
from sdgx.utils import logger


class LLMBaseModel(SynthesizerModel):
    """
    This is a base class for generating synthetic data using an LLM (Large Language Model).

    Note:
        - When using the data loader, the original data is converted to pd.DataFrame format for subsequent processing.
        - This model is not recommended for large data tables, as they consume tokens excessively with some expensive LLM services.
        - Generating data from metadata alone is a potential way to synthesize tables whose original rows contain sensitive information and cannot be made public.
    """

    use_raw_data = False
    """
    By default, raw data is used for data access.

    When using the data loader, because a randomization step is required, we currently call `.load_all()` to convert the original data into pd.DataFrame format for subsequent processing.

    Due to the characteristics of the OpenAI GPT service, we do not recommend running this model on large data tables, as doing so will consume tokens excessively.
    """

    use_metadata = False
    """
    This model accepts a data generation paradigm in which only metadata is provided.

    When only metadata is provided, sdgx formats the dataset's metadata into a message and transmits it to GPT, and GPT generates similar data based on what it knows.

    This is a potential way to generate data whose original rows contain sensitive information and cannot be made public.
    """

    _metadata = None
    """
    The metadata of the dataset.
    """

    off_table_features = []
Collaborator review comment: The introduction of variable off_table_features is an interesting idea. :)

"""
* Experimental Feature

Whether infer data columns that do not exist in the real data table, the effect may not be very good.
"""

prompts = {
"message_prefix": """Suppose you are the best data generating model in this world, we have some data samples with the following information:\n\n""",
"message_suffix": """\nGenerate synthetic data samples based on the above information and your knowledge, each sample should be output on one line (do not output in multiple lines), the output format of the sample is the same as the example in this message, such as "column_name_1 is value_1", the count of the generated data samples is """,
"system_role_content": "You are a powerful synthetic data generation model.",
}
"""
Prompt words for generating data (preliminary version, improvements welcome).
"""

columns = []
"""
The columns of the data set.
"""

dataset_description = ""
"""
The description of the data set.
"""

_responses = []
"""
A list to store the responses received from the LLM.
"""

_message_list = []
"""
A list to store the messages used to ask LLM.
"""

def _check_access_type(self):
"""
Checks the data access type.

Raises:
SynthesizerInitError: If data access type is not specified or if duplicate data access type is found.
"""
if self.use_dataloader == self.use_raw_data == self.use_metadata == False:
raise SynthesizerInitError(
"Data access type not specified, please use `use_raw_data: bool` or `use_dataloader: bool` to specify data access type."
)
if self.use_dataloader == self.use_raw_data == True:
raise SynthesizerInitError("Duplicate data access type found.")

def _form_columns_description(self):
"""
We believe that giving information about a column helps improve data quality.

Currently, we leave this function to Good First Issue until March 2024, if unclaimed we will implement it quickly.
"""

raise NotImplementedError

def _form_message_with_offtable_features(self):
"""
This function forms a message with off-table features.

If there are more off-table columns, additional processing is excuted here.
"""
if self.off_table_features:
logger.info(f"Use off_table_feature = {self.off_table_features}.")
return f"Also, you should try to infer another {len(self.off_table_features)} columns based on your knowledge, the name of these columns are : {self.off_table_features}, attach these columns after the original table. \n"
else:
logger.info("No off_table_feature needed in current model.")
return ""

def _form_dataset_description(self):
"""
This function is used to form the dataset description.

Returns:
str: The description of the generated table.
"""
if self.dataset_description:
logger.info(f"Use dataset_description = {self.dataset_description}.")
return "\nThe description of the generated table is " + self.dataset_description + "\n"
else:
logger.info("No dataset_description given in current model.")
return ""