Forest DiffusionModel RC commit
Diyago committed Sep 30, 2023
1 parent 2789612 commit adf8c65
Showing 3 changed files with 62 additions and 90 deletions.
68 changes: 10 additions & 58 deletions README.md
@@ -2,7 +2,7 @@
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Downloads](https://pepy.tech/badge/tabgan)](https://pepy.tech/project/tabgan)

# GANs for tabular data
# GANs and Diffusions for tabular data

<img src="./images/tabular_gan.png" height="15%" width="15%">
Generative Adversarial Networks (GANs) are well known for their success in realistic image generation. However, they can also be applied to generate tabular data. This library gives you the opportunity to try some of them.
@@ -29,9 +29,10 @@ test = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD
# generate data
new_train1, new_target1 = OriginalGenerator().generate_data_pipe(train, target, test, )
new_train2, new_target2 = GANGenerator().generate_data_pipe(train, target, test, )
new_train3, new_target3 = ForestDiffusionGenerator().generate_data_pipe(train, target, test, )

# example with all params defined
new_train3, new_target3 = GANGenerator(gen_x_times=1.1, cat_cols=None,
new_train4, new_target4 = GANGenerator(gen_x_times=1.1, cat_cols=None,
bot_filter_quantile=0.001, top_filter_quantile=0.999, is_post_process=True,
adversarial_model_params={
"metrics": "AUC", "max_depth": 2, "max_bin": 100,
@@ -41,7 +42,10 @@ new_train3, new_target3 = GANGenerator(gen_x_times=1.1, cat_cols=None,
test, deep_copy=True, only_adversarial=False, use_adversarial=True)
```

Both samplers, `OriginalGenerator` and `GANGenerator`, have the same input parameters:
All samplers (`OriginalGenerator`, `ForestDiffusionGenerator`, and `GANGenerator`) have the same input parameters.

1. **GANGenerator** is based on **CTGAN**
2. **ForestDiffusionGenerator** is based on **Forest Diffusion**

* **gen_x_times**: float = 1.1 - how much data to generate, output might be less because of postprocessing and
adversarial filtering
@@ -132,43 +136,14 @@ To run experiment follow these steps:
add more datasets, adjust validation type and categorical encoders.
5. Observe metrics across all experiments in the console or in `./Research/results/fit_predict_scores.txt`

**Task formalization**

Let's say we have **T_train** and **T_test** (the train and test sets, respectively). We need to train the model on **T_train**
and make predictions on **T_test**. However, we will enlarge the training set by generating new data with a GAN, in some way similar
to **T_test**, without using ground truth labels.

**Experiment design**

When **T_train** is small and its data distribution differs from that of **T_test**, we can use CTGAN to generate additional data **T_synth**. First, we train CTGAN on **T_train** with ground truth labels (step 1), then generate additional data **T_synth** (step 2). Second, we train boosting in an adversarial way on the concatenation of **T_train** and **T_synth** (target set to 0) with **T_test** (target set to 1) (steps 3 & 4). The goal is to use the newly trained adversarial boosting model to find the rows most similar to **T_test**. Note that the initial ground truth labels aren't used for adversarial training. As a result, we take the top rows from **T_train** and **T_synth**, sorted by their similarity to **T_test** (steps 5 & 6), train a new boosting model on them, and check the results on **T_test**.

![Experiment design and workflow](./images/workflow.png?raw=true)

**Picture 1.1** Experiment design and workflow

Of course, for benchmark purposes we will also test ordinary training without these tricks, as well as the original pipeline
without CTGAN (in step 3 we won't use **T_synth**).
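
The adversarial filtering in steps 3 to 6 can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: scikit-learn's `GradientBoostingClassifier` stands in for the boosting model, the function name `adversarial_filter`, the `keep_frac` parameter, and the random data are all hypothetical, and the pool rows are scored by the same model that was fitted on them.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier


def adversarial_filter(pool: pd.DataFrame, test: pd.DataFrame,
                       keep_frac: float = 0.5, seed: int = 0) -> pd.DataFrame:
    """Keep the rows of `pool` that a classifier finds most test-like."""
    # Label pool rows 0 and test rows 1; the original targets are not used.
    features = pd.concat([pool, test], ignore_index=True)
    is_test = np.concatenate([np.zeros(len(pool), dtype=int),
                              np.ones(len(test), dtype=int)])

    # Stand-in boosting model (an assumption, not the library's model).
    clf = GradientBoostingClassifier(n_estimators=50, max_depth=2,
                                     random_state=seed)
    clf.fit(features, is_test)

    # Score each pool row by its predicted probability of being a test row.
    scores = clf.predict_proba(pool)[:, 1]
    n_keep = int(len(pool) * keep_frac)
    top_idx = np.argsort(-scores)[:n_keep]
    return pool.iloc[top_idx]


# Hypothetical data: T_test is shifted away from T_train and T_synth.
rng = np.random.default_rng(0)
t_train = pd.DataFrame(rng.normal(0.0, 1.0, size=(200, 2)), columns=["A", "B"])
t_synth = pd.DataFrame(rng.normal(0.5, 1.0, size=(200, 2)), columns=["A", "B"])
t_test = pd.DataFrame(rng.normal(2.0, 1.0, size=(200, 2)), columns=["A", "B"])

pool = pd.concat([t_train, t_synth], ignore_index=True)
selected = adversarial_filter(pool, t_test, keep_frac=0.25)
```

The selected rows would then be used, together with their original labels, to train the final model that is evaluated on **T_test**.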

**Datasets**

All datasets come from different domains. They differ in the number of observations and in the number of categorical and
numerical features. The objective for all datasets is binary classification. Preprocessing of the datasets was simple:
all time-based columns were removed. The remaining columns were either categorical or numerical.

**Table 1.1** Used datasets

| Name | Total points | Train points | Test points | Number of features | Number of categorical features | Short description |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| [Telecom](https://www.kaggle.com/blastchar/telco-customer-churn) | 7.0k | 4.2k | 2.8k | 20 | 16 | Churn prediction for telecom data |
| [Adult](https://www.kaggle.com/wenruliu/adult-income-dataset) | 48.8k | 29.3k | 19.5k | 15 | 8 | Predict whether a person's income exceeds 50k |
| [Employee](https://www.kaggle.com/c/amazon-employee-access-challenge/data) | 32.7k | 19.6k | 13.1k | 10 | 9 | Predict an employee's access needs, given his/her job role |
| [Credit](https://www.kaggle.com/c/home-credit-default-risk/data) | 307.5k | 184.5k | 123k | 121 | 18 | Loan repayment |
| [Mortgages](https://www.crowdanalytix.com/contests/propensity-to-fund-mortgages) | 45.6k | 27.4k | 18.2k | 20 | 9 | Predict whether a house mortgage is funded |
| [Taxi](https://www.crowdanalytix.com/contests/mckinsey-big-data-hackathon) | 892.5k | 535.5k | 357k | 8 | 5 | Predict the probability of an offer being accepted by a certain driver |
| [Poverty_A](https://www.drivendata.org/competitions/50/worldbank-poverty-prediction/page/99/) | 37.6k | 22.5k | 15.0k | 41 | 38 | Predict whether or not a given household for a given country is poor or not |

## Results

To determine the best sampling strategy, the ROC AUC scores for each dataset were min-max scaled and then averaged
across datasets.
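
As a rough illustration of this aggregation, the sketch below min-max scales scores within each dataset and then averages them per strategy. The score values, dataset names, and strategy names are made up for the example and are not results from the experiments.

```python
import pandas as pd

# Hypothetical ROC AUC scores: rows are datasets, columns are sampling strategies.
scores = pd.DataFrame(
    {
        "strategy_a": [0.80, 0.70],
        "strategy_b": [0.82, 0.69],
        "strategy_c": [0.84, 0.72],
    },
    index=["dataset_1", "dataset_2"],
)

# Min-max scale within each dataset (row), so the worst strategy on a
# dataset maps to 0 and the best one maps to 1.
row_min = scores.min(axis=1)
row_range = scores.max(axis=1) - row_min
scaled = scores.sub(row_min, axis=0).div(row_range, axis=0)

# Average the scaled scores across datasets to rank the strategies.
mean_scaled = scaled.mean(axis=0).sort_values(ascending=False)
print(mean_scaled)
```

Scaling per dataset keeps a dataset with a wide spread of scores from dominating the average.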

@@ -224,35 +199,12 @@ arxiv publication:
primaryClass={cs.LG}
}
```
library itself:
```bibtex
@misc{Diyago2020tabgan,
author = {Ashrapov, Insaf},
title = {GANs for tabular data},
howpublished = {\url{https://github.com/Diyago/GAN-for-tabular-data}},
year = {2020}
}
```

## References

[1] Jonathan Hui. GAN — What is Generative Adversarial Networks GAN? (2018), medium article

[2] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville,
Yoshua Bengio. Generative Adversarial Networks (2014). arXiv:1406.2661

[3] Lei Xu LIDS, Kalyan Veeramachaneni. Synthesizing Tabular Data using Generative Adversarial Networks (2018). arXiv:
[1] Lei Xu LIDS, Kalyan Veeramachaneni. Synthesizing Tabular Data using Generative Adversarial Networks (2018). arXiv:
1811.11264v1 [cs.LG]

[4] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular Data using Conditional
GAN (2019). arXiv:1907.00503v2 [cs.LG]

[5] Denis Vorotyntsev. Benchmarking Categorical Encoders. Medium post

[6] Insaf Ashrapov. GAN-for-tabular-data. Github repository.

[7] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, Timo Aila. Analyzing and Improving the
Image Quality of StyleGAN (2019) arXiv:1912.04958v2 [cs.CV]

[8] ODS.ai: Open data science, https://ods.ai/
[2] Alexia Jolicoeur-Martineau, Kilian Fatras, Tal Kachman. Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees (2023). https://github.com/SamsungSAILMontreal/ForestDiffusion [cs.LG]

[3] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular Data using Conditional GAN. NeurIPS (2019)
65 changes: 34 additions & 31 deletions src/tabgan/sampler.py
@@ -307,11 +307,11 @@ def generate_data(
self.TEMP_TARGET = None
logging.info("Fitting ForestDiffusion model")
if self.cat_cols is None:
forest_model = ForestDiffusionModel(train_df.to_numpy(), label_y=self.TEMP_TARGET, n_t=50,
forest_model = ForestDiffusionModel(train_df.to_numpy(), label_y=None, n_t=50,
duplicate_K=100,
diffusion_type='flow', n_jobs=-1)
else:
forest_model = ForestDiffusionModel(train_df.to_numpy(), label_y=self.TEMP_TARGET, n_t=50,
forest_model = ForestDiffusionModel(train_df.to_numpy(), label_y=None, n_t=50,
duplicate_K=100,
# todo fix bug with cat cols
#cat_indexes=self.get_column_indexes(train_df, self.cat_cols),
@@ -393,39 +393,42 @@ def get_columns_if_exists(df, col) -> pd.DataFrame:
logging.info(train)
target = pd.DataFrame(np.random.randint(0, 2, size=(train_size, 1)), columns=list("Y"))
test = pd.DataFrame(np.random.randint(0, 100, size=(train_size, 4)), columns=list("ABCD"))
# _sampler(OriginalGenerator(gen_x_times=15), train, target, test)
# _sampler(
# GANGenerator(gen_x_times=10, only_generated_data=False,
# gen_params={"batch_size": 500, "patience": 25, "epochs": 500, }), train, target, test
# )
#
# _sampler(OriginalGenerator(gen_x_times=15), train, None, train)
# _sampler(
# GANGenerator(cat_cols=["A"], gen_x_times=20, only_generated_data=True),
# train,
# None,
# train,
# )
_sampler(OriginalGenerator(gen_x_times=15), train, target, test)
_sampler(
GANGenerator(gen_x_times=10, only_generated_data=False,
gen_params={"batch_size": 500, "patience": 25, "epochs": 500, }), train, target, test
)

_sampler(OriginalGenerator(gen_x_times=15), train, None, train)
_sampler(
GANGenerator(cat_cols=["A"], gen_x_times=20, only_generated_data=True),
train,
None,
train,
)
_sampler(
ForestDiffusionGenerator(cat_cols=["A"], gen_x_times=1, only_generated_data=True),
train,
None,
train,
)
_sampler(
ForestDiffusionGenerator(gen_x_times=10, only_generated_data=False,
gen_params={"batch_size": 500, "patience": 25, "epochs": 500, }), train, target, test
)

min_date = pd.to_datetime('2019-01-01')
max_date = pd.to_datetime('2021-12-31')

d = (max_date - min_date).days + 1

train['Date'] = min_date + pd.to_timedelta(np.random.randint(d, size=train_size), unit='d')
train = get_year_mnth_dt_from_date(train, 'Date')

#
# min_date = pd.to_datetime('2019-01-01')
# max_date = pd.to_datetime('2021-12-31')
#
# d = (max_date - min_date).days + 1
#
# train['Date'] = min_date + pd.to_timedelta(np.random.randint(d, size=train_size), unit='d')
# train = get_year_mnth_dt_from_date(train, 'Date')
#
# new_train, new_target = GANGenerator(gen_x_times=1.1, cat_cols=['year'], bot_filter_quantile=0.001,
# top_filter_quantile=0.999,
# is_post_process=True, pregeneration_frac=2, only_generated_data=False). \
# generate_data_pipe(train.drop('Date', axis=1), None,
# train.drop('Date', axis=1)
# )
# new_train = collect_dates(new_train)
new_train, new_target = GANGenerator(gen_x_times=1.1, cat_cols=['year'], bot_filter_quantile=0.001,
top_filter_quantile=0.999,
is_post_process=True, pregeneration_frac=2, only_generated_data=False). \
generate_data_pipe(train.drop('Date', axis=1), None,
train.drop('Date', axis=1)
)
new_train = collect_dates(new_train)
19 changes: 18 additions & 1 deletion tests/test_sampler.py
@@ -9,7 +9,7 @@
import numpy as np
import pandas as pd

from src.tabgan.sampler import OriginalGenerator, Sampler, GANGenerator
from src.tabgan.sampler import OriginalGenerator, Sampler, GANGenerator, ForestDiffusionGenerator


class TestOriginalGenerator(TestCase):
@@ -94,3 +94,20 @@ def test_generate_data(self):
self.assertEqual(np.max(self.target.nunique()), np.max(new_target.nunique()))
self.assertTrue(gen_train.shape[0] > new_train.shape[0])
self.assertEqual(np.max(self.target.nunique()), np.max(new_target.nunique()))

class TestSamplerForestDiffusion(TestCase):
    def setUp(self):
        # Small random frames: 50 training rows, 100 test rows, binary target
        self.train = pd.DataFrame(np.random.randint(-10, 150, size=(50, 4)), columns=list('ABCD'))
        self.target = pd.DataFrame(np.random.randint(0, 2, size=(50, 1)), columns=list('Y'))
        self.test = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
        self.gen = ForestDiffusionGenerator(gen_x_times=15)
        self.sampler = self.gen.get_object_generator()

    def test_generate_data(self):
        new_train, new_target, test_df = self.sampler.preprocess_data(self.train.copy(),
                                                                      self.target.copy(), self.test)
        gen_train, gen_target = self.sampler.generate_data(new_train, new_target, test_df)
        # Generated features and targets must stay aligned row for row
        self.assertEqual(gen_train.shape[0], gen_target.shape[0])
        self.assertEqual(np.max(self.target.nunique()), np.max(new_target.nunique()))
        # Generation should yield more rows than the preprocessed input
        self.assertTrue(gen_train.shape[0] > new_train.shape[0])
        self.assertEqual(np.max(self.target.nunique()), np.max(new_target.nunique()))
