GANs are well known for their success in realistic image generation. However, they can also be applied to tabular data generation. We will review and examine some recent papers on tabular GANs in action.
We will show that a GAN might be an option when the data is highly skewed between train and test, by applying it together with adversarial training.
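A common way to quantify such train/test skew is adversarial validation: train a classifier to distinguish train rows from test rows, where a ROC AUC well above 0.5 signals a shifted distribution. Below is a minimal sketch; the synthetic data and the model choice are illustrative assumptions, not the exact setup of the experiment.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Illustrative data: test features are deliberately shifted relative to train.
cols = [f"f{i}" for i in range(5)]
train = pd.DataFrame(np.random.randn(1000, 5), columns=cols)
test = pd.DataFrame(np.random.randn(500, 5) + 0.5, columns=cols)

# Label each row by its origin: 0 = train, 1 = test.
X = pd.concat([train, test], ignore_index=True)
y = np.r_[np.zeros(len(train)), np.ones(len(test))]

# If the classifier separates the two sets well (AUC >> 0.5),
# the train and test distributions are skewed.
auc = cross_val_score(RandomForestClassifier(n_estimators=100), X, y,
                      cv=5, scoring="roc_auc").mean()
print(f"adversarial validation ROC AUC: {auc:.3f}")
```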
Tabular GANs might be used in a variety of settings.
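As a concrete illustration of generating tabular data, here is a minimal sketch using the `ctgan` package, which implements the conditional GAN from [3]; the toy DataFrame, column names, and epoch count are assumptions for the example only, not the experiment's configuration.

```python
import pandas as pd
from ctgan import CTGAN  # pip install ctgan

# Toy tabular dataset with one categorical column.
data = pd.DataFrame({
    "age": [23, 45, 31, 52, 38, 27, 41, 36],
    "income": [30000, 80000, 52000, 95000, 61000, 34000, 72000, 58000],
    "segment": ["a", "b", "a", "c", "b", "a", "c", "b"],
})

# Categorical columns must be declared explicitly; epochs kept tiny for the demo.
model = CTGAN(epochs=10)
model.fit(data, discrete_columns=["segment"])

# Sample synthetic rows drawn from the learned joint distribution.
synthetic = model.sample(100)
print(synthetic.head())
```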
Picture 1.1 Experiment design and workflow
Running experiment
To run the experiment, follow these steps:
1. All required datasets are stored in the `./data` folder.
2. Install the requirements: `pip install -r requirements.txt`.
3. Run all experiments: `python run_experiment.py`. You may add more datasets, adjust validation types, and categorical encoders.
4. Metrics for all experiments are written to `./results/fit_predict_scores.txt`.
To determine the best sampling strategy, the ROC AUC scores for each dataset were min-max scaled and then averaged across datasets.
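A minimal sketch of this aggregation, assuming the raw scores live in a DataFrame with one row per (dataset, sampling type) pair; the names and values are illustrative:

```python
import pandas as pd

# Illustrative raw ROC AUC scores, one row per (dataset, sampling type).
scores = pd.DataFrame({
    "dataset": ["credit"] * 3 + ["adult"] * 3,
    "sample_type": ["None", "gan", "sample_original"] * 2,
    "roc_auc": [0.789, 0.790, 0.788, 0.912, 0.886, 0.915],
})

# Min-max scale within each dataset so strategies are comparable across
# datasets, then average the scaled scores per sampling type.
def min_max(s: pd.Series) -> pd.Series:
    return (s - s.min()) / (s.max() - s.min())

scores["scaled"] = scores.groupby("dataset")["roc_auc"].transform(min_max)
print(scores.groupby("sample_type")["scaled"].agg(["mean", "std"]))
```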
We can see that GAN outperformed the other sampling types on 2 of the 7 datasets, whereas sampling from the original data outperformed the other methods on 3 of the 7. Of course, there isn't much difference, but these types of sampling might be an option.
| dataset_name | None | gan | sample_original |
|---|---|---|---|
| credit | 0.997 | 0.998 | 0.997 |
| employee | 0.986 | 0.966 | 0.972 |
| mortgages | 0.984 | 0.964 | 0.988 |
| poverty_A | 0.937 | 0.950 | 0.933 |
| taxi | 0.966 | 0.938 | 0.987 |
| adult | 0.995 | 0.967 | 0.998 |
| telecom | 0.995 | 0.868 | 0.992 |
Table 1.1 Results for different sampling types across datasets; higher is better (100% = the maximum per-dataset ROC AUC)
| sample_type | mean | std |
|---|---|---|
| None | 0.980 | 0.036 |
| gan | 0.969 | 0.060 |
| sample_original | 0.981 | 0.032 |
Table 1.2 Aggregated results for different sampling types; a higher mean is better (ROC AUC), a lower std is better (100% = the maximum per-dataset ROC AUC)
Let's define `same_target_prop` as equal to 1 when the target rates for train and test differ by no more than 5%. When the target rate is almost the same in train and test, None and sample_original perform better. However, gan starts to perform noticeably better as the target distribution changes.
| sample_type | same_target_prop | prop_test_score |
|---|---|---|
| None | 0 | 0.964 |
| None | 1 | 0.985 |
| gan | 0 | 0.966 |
| gan | 1 | 0.945 |
| sample_original | 0 | 0.973 |
| sample_original | 1 | 0.984 |
Table 1.3 `same_target_prop` equals 1 when the target rates for train and test differ by no more than 5%. Higher is better.
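To make this flag concrete, here is a minimal sketch of computing `same_target_prop` for binary targets; treating the 5% threshold as an absolute difference in target rates is an assumption about the original setup.

```python
import numpy as np

def same_target_prop(y_train: np.ndarray, y_test: np.ndarray,
                     tol: float = 0.05) -> int:
    """Return 1 if the train and test target rates differ by no more than tol."""
    return int(abs(y_train.mean() - y_test.mean()) <= tol)

# Example: 10% positives in train vs 30% in test -> rates differ by 20%.
y_train = np.array([0] * 90 + [1] * 10)
y_test = np.array([0] * 70 + [1] * 30)
print(same_target_prop(y_train, y_test))  # -> 0
```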
The author would like to thank the Open Data Science community [6] for many valuable discussions and educational help in the growing field of machine and deep learning. Also, special thanks to Sber [7] for the opportunity to work on such tasks and for providing computational resources.
[1] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio. Generative Adversarial Networks (2014). arXiv:1406.2661
[2] Lei Xu, Kalyan Veeramachaneni. Synthesizing Tabular Data using Generative Adversarial Networks (2018). arXiv:1811.11264v1 [cs.LG]
[3] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular Data using Conditional GAN (2019). arXiv:1907.00503v2 [cs.LG]
[4] Denis Vorotyntsev. Benchmarking Categorical Encoders (2019). Medium post
[5] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, Timo Aila. Analyzing and Improving the Image Quality of StyleGAN (2019). arXiv:1912.04958v2 [cs.CV]
[6] ODS.ai: Open data science (2020), https://ods.ai/
[7] Sber (2020), https://www.sberbank.ru/