Introduce LLM-based single-table model. #129
Merged
38 commits
26b5c71 Introduce SingleTableGPTModel (MooooCat)
08480df Update init (MooooCat)
90d6604 Update SingleTableGPT.py (MooooCat)
8767c08 Update SingleTableGPT.py (MooooCat)
0ab296f Create test_singletableGPT.py (MooooCat)
65b098d Update SingleTableGPT.py (MooooCat)
5d2990c Update SingleTableGPT.py (MooooCat)
a2e4fdc Update SingleTableGPT.py (MooooCat)
5d6d73f Update test_singletableGPT.py (MooooCat)
0dd724c Improve code formatting (MooooCat)
61a5888 update testcases (MooooCat)
fd361d2 Update SingleTableGPT.py (MooooCat)
43d6d6b Update SingleTableGPT.py (MooooCat)
e6eb018 add more test cases (MooooCat)
9870b0a add testcases (MooooCat)
6b7681f bugfix: pd.df initialize (MooooCat)
9627c13 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
7e3b8d3 Merge branch 'main' into feature-LLM-models (MooooCat)
73d4992 add dependencies (MooooCat)
2931250 update dependency (MooooCat)
ac5a64f fix typo (MooooCat)
495a4d4 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
5664215 add comments (MooooCat)
54abf71 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
fc7e8ed add test cases (MooooCat)
449e512 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
d84ecb2 enable get API_key from env (MooooCat)
44ae29e add testcases (MooooCat)
b259a33 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
9721936 Merge branch 'main' into feature-LLM-models (MooooCat)
a41d1b3 use snakecase on filename (MooooCat)
ce1b7bb use snakecase for filenames (MooooCat)
c7fb325 add logs. correct function name. (MooooCat)
5abf052 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
ca7d247 Merge branch 'main' into feature-LLM-models (MooooCat)
a5a69db Split single_table.gpt model to base class (MooooCat)
4f024a7 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
a043c5c fix typo (MooooCat)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
from . import single_table |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,123 @@ | ||
from sdgx.exceptions import SynthesizerInitError | ||
from sdgx.models.base import SynthesizerModel | ||
from sdgx.utils import logger | ||
|
||
|
||
class LLMBaseModel(SynthesizerModel): | ||
""" | ||
This is a base class for generating synthetic data using LLM (Large Language Model). | ||
|
||
Note: | ||
- When using the data loader, the original data is transformed to pd.DataFrame format for subsequent processing. | ||
- It is not recommended to use this model with large data tables, because some LLM services are expensive and token consumption would be excessive. | ||
- Generating data based on metadata is a potential way to generate data that cannot be made public and contains sensitive information. | ||
""" | ||
|
||
use_raw_data = False | ||
""" | ||
Whether to use raw_data for data access. The default value is False. | ||
|
||
When using the data loader, because the rows need to be randomized, we currently use `.load_all()` to transform the original data to pd.DataFrame format for subsequent processing. | ||
|
||
Due to the characteristics of the OpenAI GPT service, we do not recommend running this model with large data tables, which will consume your tokens excessively. | ||
""" | ||
|
||
use_metadata = False | ||
""" | ||
In this model, we accept a data generation paradigm that only provides metadata. | ||
|
||
When only metadata is provided, sdgx will format the metadata of the data set into a message and transmit it to GPT, and GPT will generate similar data based on what it knows. | ||
|
||
This is a potential way to generate data that cannot be made public and contains sensitive information. | ||
""" | ||
|
||
_metadata = None | ||
""" | ||
the metadata. | ||
""" | ||
|
||
off_table_features = [] | ||
""" | ||
* Experimental Feature | ||
|
||
Whether to infer data columns that do not exist in the real data table; the results may not be very good. | ||
""" | ||
|
||
prompts = { | ||
"message_prefix": """Suppose you are the best data generating model in this world, we have some data samples with the following information:\n\n""", | ||
"message_suffix": """\nGenerate synthetic data samples based on the above information and your knowledge, each sample should be output on one line (do not output in multiple lines), the output format of the sample is the same as the example in this message, such as "column_name_1 is value_1", the count of the generated data samples is """, | ||
"system_role_content": "You are a powerful synthetic data generation model.", | ||
} | ||
""" | ||
Prompt words for generating data (preliminary version, improvements welcome). | ||
""" | ||
|
||
columns = [] | ||
""" | ||
The columns of the data set. | ||
""" | ||
|
||
dataset_description = "" | ||
""" | ||
The description of the data set. | ||
""" | ||
|
||
_responses = [] | ||
""" | ||
A list to store the responses received from the LLM. | ||
""" | ||
|
||
_message_list = [] | ||
""" | ||
A list to store the messages used to ask LLM. | ||
""" | ||
|
||
def _check_access_type(self): | ||
""" | ||
Checks the data access type. | ||
|
||
Raises: | ||
SynthesizerInitError: If data access type is not specified or if duplicate data access type is found. | ||
""" | ||
if self.use_dataloader == self.use_raw_data == self.use_metadata == False: | ||
raise SynthesizerInitError( | ||
"Data access type not specified, please use `use_raw_data: bool` or `use_dataloader: bool` to specify data access type." | ||
) | ||
if self.use_dataloader == self.use_raw_data == True: | ||
raise SynthesizerInitError("Duplicate data access type found.") | ||
|
||
def _form_columns_description(self): | ||
""" | ||
We believe that giving information about a column helps improve data quality. | ||
|
||
Currently, we leave this function as a Good First Issue until March 2024; if it is unclaimed, we will implement it quickly. | ||
""" | ||
|
||
raise NotImplementedError | ||
|
||
def _form_message_with_offtable_features(self): | ||
""" | ||
This function forms a message with off-table features. | ||
|
||
If there are more off-table columns, additional processing is executed here. | ||
""" | ||
if self.off_table_features: | ||
logger.info(f"Use off_table_feature = {self.off_table_features}.") | ||
return f"Also, you should try to infer another {len(self.off_table_features)} columns based on your knowledge, the name of these columns are : {self.off_table_features}, attach these columns after the original table. \n" | ||
else: | ||
logger.info("No off_table_feature needed in current model.") | ||
return "" | ||
|
||
def _form_dataset_description(self): | ||
""" | ||
This function is used to form the dataset description. | ||
|
||
Returns: | ||
str: The description of the generated table. | ||
""" | ||
if self.dataset_description: | ||
logger.info(f"Use dataset_description = {self.dataset_description}.") | ||
return "\nThe description of the generated table is " + self.dataset_description + "\n" | ||
else: | ||
logger.info("No dataset_description given in current model.") | ||
return "" |
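
The flag validation in `_check_access_type` relies on chained comparisons (`a == b == c == False`), which Python evaluates pairwise, so for boolean flags the first check fires only when all three are `False`. A minimal standalone sketch of the same logic with explicit boolean operators follows. The function name `check_access_type` and the local `SynthesizerInitError` class are stand-ins introduced here so the snippet runs without `sdgx`; they are not part of this PR.

```python
class SynthesizerInitError(Exception):
    """Local stand-in for sdgx.exceptions.SynthesizerInitError (assumption)."""


def check_access_type(use_dataloader: bool, use_raw_data: bool, use_metadata: bool) -> None:
    """Hypothetical free-function version of the flag check shown in the diff."""
    # Equivalent to `use_dataloader == use_raw_data == use_metadata == False`
    # when all three flags are plain booleans: nothing was selected.
    if not (use_dataloader or use_raw_data or use_metadata):
        raise SynthesizerInitError("Data access type not specified.")
    # Equivalent to `use_dataloader == use_raw_data == True`:
    # the data loader and raw data cannot both be selected.
    if use_dataloader and use_raw_data:
        raise SynthesizerInitError("Duplicate data access type found.")


if __name__ == "__main__":
    check_access_type(False, False, True)  # metadata only: accepted
    try:
        check_access_type(True, True, False)
    except SynthesizerInitError as err:
        print(f"rejected: {err}")
```

Note that, as in the diff, combining `use_metadata` with one of the other two flags is accepted by this check.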
The introduction of variable `off_table_features` is an interesting idea. :)
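
To make the effect of `off_table_features` concrete, here is a rough sketch of how the hint could end up in the prompt. It assumes the fragments are concatenated in the order prefix, samples, off-table hint, suffix, count; that order, the helper names `form_offtable_hint` and `build_message`, and the shortened prompt strings are assumptions for illustration, not the code merged in this PR. The "column is value" sample format follows the wording of `message_suffix` in the diff.

```python
from typing import Dict, List, Sequence

# Abbreviated stand-ins for the "message_prefix" / "message_suffix" prompts (assumption).
MESSAGE_PREFIX = "We have some data samples with the following information:\n\n"
MESSAGE_SUFFIX = (
    "\nGenerate synthetic data samples in the same format, "
    "the count of the generated data samples is "
)


def form_offtable_hint(off_table_features: Sequence[str]) -> str:
    """Mirrors _form_message_with_offtable_features from the diff, without logging."""
    if not off_table_features:
        return ""
    names = list(off_table_features)
    return (
        f"Also, you should try to infer another {len(names)} columns based on your knowledge, "
        f"the name of these columns are : {names}, attach these columns after the original table. \n"
    )


def build_message(
    rows: List[Dict[str, object]], off_table_features: Sequence[str], count: int
) -> str:
    """Hypothetical assembly: prefix, one line per sample row, off-table hint, suffix, count."""
    samples = "\n".join(
        ", ".join(f"{column} is {value}" for column, value in row.items()) for row in rows
    )
    return (
        MESSAGE_PREFIX
        + samples
        + "\n"
        + form_offtable_hint(off_table_features)
        + MESSAGE_SUFFIX
        + str(count)
    )


if __name__ == "__main__":
    rows = [{"age": 39, "city": "Paris"}]
    print(build_message(rows, ["income"], count=5))
```

With an empty `off_table_features` list the hint collapses to an empty string, so the prompt is unchanged for models that do not use the feature.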