Introduce LLM-based single-table model. #129

MooooCat · 2024-01-31T07:57:46Z

Description

LLMs have long been used to understand and generate many types of data, and they are also capable of generating tabular data. Moreover, they offer some abilities that traditional approaches (GAN-based or statistical methods) cannot achieve.

In this PR, we introduce sdgx.models.LLM.single_table.SingleTableGPT.SingleTableGPTModel, our first synthetic data generation model that integrates an LLM.

Motivation and Context

Compared with existing models, SingleTableGPTModel implements two new features:

Generation without Data: no training data is required; synthetic data can be generated from metadata alone.
Off-Table Feature Inference: new column data is inferred from the existing data in the table and from the knowledge the LLM has learned.
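
To make the two features above concrete, here is a minimal sketch of how a prompt could be assembled in each case. It is an illustration under stated assumptions, not the code in this PR: `build_prompt` and all of its parameters are hypothetical names, and the prompt wording is invented for the example.

```python
from typing import Dict, List, Optional, Sequence


def build_prompt(
    columns: Sequence[str],
    sample_rows: Optional[List[Dict[str, object]]] = None,
    off_table_features: Sequence[str] = (),
    n_rows: int = 5,
) -> str:
    """Assemble a plain-text prompt asking an LLM for synthetic CSV rows.

    Hypothetical helper for illustration only; it is not part of sdgx.
    """
    # Off-table features are requested as extra columns after the known ones.
    extra = [f for f in off_table_features if f not in columns]
    all_columns = list(columns) + extra
    lines = [
        f"Generate {n_rows} rows of synthetic tabular data in CSV format.",
        "Columns: " + ", ".join(all_columns),
    ]
    if extra:
        lines.append(
            "Infer these columns from the other columns and your own knowledge: "
            + ", ".join(extra)
        )
    if sample_rows:
        # With data: show a few existing rows as examples.
        lines.append("Example rows:")
        for row in sample_rows:
            lines.append(", ".join(str(row.get(c, "")) for c in columns))
    else:
        # Generation without data: only the column names from metadata are known.
        lines.append("No example rows are available; rely on the column names.")
    return "\n".join(lines)


# Generation without data: metadata (column names) only.
metadata_only = build_prompt(["age", "occupation"], n_rows=3)

# Off-table feature inference: ask for an "income" column that is not in the table.
with_inference = build_prompt(
    ["age", "occupation"],
    sample_rows=[{"age": 34, "occupation": "nurse"}],
    off_table_features=["income"],
)
```

The only difference between the two calls is whether example rows and extra column names are supplied, which is why both features can plausibly share one prompt-building path.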

In addition, SingleTableGPTModel generates data directly, without complicated and time-consuming steps such as manual labeling and feature engineering. This saves operators considerable time and lets them focus on creative work.

How has this been tested?

We currently provide some test cases at tests/models/test_singletableGPT.py. The test file contains recorded content returned by GPT, so the unit tests do not send repeated requests to GPT and avoid consuming a large number of tokens.

I will continue to improve these test cases.
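
The testing approach described above, checking behavior against recorded GPT output instead of calling the API, could look roughly like the sketch below. Everything in it is an assumption made for illustration: `parse_csv_response`, `request_rows`, and the recorded text are hypothetical and are not taken from the test file in this PR.

```python
import csv
import io
from typing import Callable, Dict, List

# A reply recorded once and stored with the tests
# (made-up content here, for illustration only).
RECORDED_RESPONSE = "age,occupation\n34,nurse\n51,teacher\n"


def parse_csv_response(text: str) -> List[Dict[str, str]]:
    """Turn a CSV-formatted LLM reply into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text.strip())))


def request_rows(ask_llm: Callable[[str], str], prompt: str) -> List[Dict[str, str]]:
    """Request rows through an injectable callable so tests can stub it."""
    return parse_csv_response(ask_llm(prompt))


def test_request_rows_uses_recorded_response() -> None:
    calls = []

    def fake_llm(prompt: str) -> str:
        # Stand-in for the real client: no network access, no tokens consumed.
        calls.append(prompt)
        return RECORDED_RESPONSE

    rows = request_rows(fake_llm, "Generate 2 rows with columns age, occupation.")
    assert len(calls) == 1
    assert rows == [
        {"age": "34", "occupation": "nurse"},
        {"age": "51", "occupation": "teacher"},
    ]


test_request_rows_uses_recorded_response()
```

Passing the LLM call in as a plain callable keeps the parsing logic testable offline; a real client would be substituted only outside the unit tests.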

Types of changes

Maintenance (no change in code, maintain the project's CI, docs, etc.)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.

Still Draft


codecov-commenter · 2024-01-31T08:15:46Z

Codecov Report

Attention: 64 lines in your changes are missing coverage. Please review.

Comparison is base (c55e340) 80.35% compared to head (a043c5c) 79.87%.

Files	Patch %	Lines
sdgx/models/LLM/single_table/gpt.py	71.21%	59 Missing ⚠️
sdgx/models/LLM/base.py	87.80%	5 Missing ⚠️


Additional details and impacted files

@@            Coverage Diff             @@
##             main     #129      +/-   ##
==========================================
- Coverage   80.35%   79.87%   -0.48%     
==========================================
  Files          66       69       +3     
  Lines        3003     3250     +247     
==========================================
+ Hits         2413     2596     +183     
- Misses        590      654      +64


MooooCat · 2024-01-31T08:17:30Z

Some of the code comments are still incomplete; I will add them as soon as possible.

In addition, the unit test coverage is insufficient, so I will add more test cases.

Once this is done, I will set the PR status to Ready. Other developers are also welcome to help improve these two areas.


Wh1isper · 2024-02-08T02:34:15Z

sdgx/models/LLM/single_table/SingleTableGPT.py

We should use snake_case for the filename.

Wh1isper · 2024-02-08T02:34:54Z

tests/models/test_singletableGPT.py

We should use snake_case for the filename, maybe test_singletable_gpt.py?

This makes sense, I'll change the filename.


Split part of the single table gpt model as a base class


Z712023 · 2024-02-20T08:21:03Z

sdgx/models/LLM/base.py

+    the metadata.
+    """
+
+    off_table_features = []


Introducing the off_table_features variable is an interesting idea. :)

Wh1isper · 2024-02-21T10:25:36Z

How about releasing 0.2.0 for it?

MooooCat · 2024-02-22T07:07:54Z

> How about releasing 0.2.0 for it?

Good idea. I need to make a few more changes (Google Colab examples, README updates, ...) before releasing.

MooooCat and others added 17 commits January 26, 2024 20:32

Introduce SingleTableGPTModel

26b5c71

Still Draft

Update init

08480df

Update SingleTableGPT.py

90d6604

Still Draft

Update SingleTableGPT.py

8767c08

Create test_singletableGPT.py

0ab296f

Update SingleTableGPT.py

65b098d

Update SingleTableGPT.py

5d2990c

Update SingleTableGPT.py

a2e4fdc

Update test_singletableGPT.py

5d6d73f

Improve code formatting

0dd724c

update testcases

61a5888

Update SingleTableGPT.py

fd361d2

Update SingleTableGPT.py

43d6d6b

add more test cases

e6eb018

add testcases

9870b0a

bugfix: pd.df initialize

6b7681f

[pre-commit.ci] auto fixes from pre-commit.com hooks

9627c13

for more information, see https://pre-commit.ci

hitsz-ids deleted a comment from sweep-ai bot Jan 31, 2024

MooooCat and others added 2 commits January 31, 2024 15:59

Merge branch 'main' into feature-LLM-models

7e3b8d3

add dependencies

73d4992

hitsz-ids deleted a comment from sweep-ai bot Jan 31, 2024

MooooCat and others added 6 commits January 31, 2024 16:23

update dependency

2931250

fix typo

ac5a64f

[pre-commit.ci] auto fixes from pre-commit.com hooks

495a4d4


add comments

5664215

[pre-commit.ci] auto fixes from pre-commit.com hooks

54abf71


add test cases

fc7e8ed

pre-commit-ci bot and others added 4 commits February 1, 2024 03:44

[pre-commit.ci] auto fixes from pre-commit.com hooks

449e512


enable get API_key from env

d84ecb2

add testcases

44ae29e

[pre-commit.ci] auto fixes from pre-commit.com hooks

b259a33


Wh1isper reviewed Feb 8, 2024


Wh1isper reviewed Feb 8, 2024


MooooCat and others added 9 commits February 12, 2024 10:36

Merge branch 'main' into feature-LLM-models

9721936

use snakecase on filename

a41d1b3

use snakecase for filenames

ce1b7bb

add logs. correct function name.

c7fb325

[pre-commit.ci] auto fixes from pre-commit.com hooks

5abf052


Merge branch 'main' into feature-LLM-models

ca7d247

Split single_table.gpt model to base class

a5a69db

Split part of the single table gpt model as a base class

[pre-commit.ci] auto fixes from pre-commit.com hooks

4f024a7


fix typo

a043c5c

MooooCat marked this pull request as ready for review February 20, 2024 08:00

MooooCat requested a review from Z712023 February 20, 2024 08:00

MooooCat enabled auto-merge (squash) February 20, 2024 08:05

Z712023 approved these changes Feb 20, 2024


MooooCat merged commit 269063d into main Feb 20, 2024
11 checks passed

MooooCat deleted the feature-LLM-models branch February 20, 2024 08:58
