TY - JOUR
T1 - Generating Synthetic Dataset for ML-Based IDS Using CTGAN and Feature Selection to Protect Smart IoT Environments
AU - Alabdulwahab, Saleh
AU - Kim, Young Tak
AU - Seo, Aria
AU - Son, Yunsik
N1 - Publisher Copyright:
© 2023 by the authors.
PY - 2023/10
Y1 - 2023/10
N2 - Networks within the Internet of Things (IoT) have some of the most targeted devices due to their lightweight design and the sensitive data exchanged through smart city networks. One way to protect a system from an attack is to use machine learning (ML)-based intrusion detection systems (IDSs), significantly improving classification tasks. Training ML algorithms require a large network traffic dataset; however, large storage and months of recording are required to capture the attacks, which is costly for IoT environments. This study proposes an ML pipeline using the conditional tabular generative adversarial network (CTGAN) model to generate a synthetic dataset. Then, the synthetic dataset was evaluated using several types of statistical and ML metrics. Using a decision tree, the accuracy of the generated dataset reached 0.99, and its lower complexity reached 0.05 s training and 0.004 s test times. The results show that synthetic data accurately reflect real data and are less complex, making them suitable for IoT environments and smart city applications. Thus, the generated synthetic dataset can further train models to secure IoT networks and applications.
AB - Networks within the Internet of Things (IoT) have some of the most targeted devices due to their lightweight design and the sensitive data exchanged through smart city networks. One way to protect a system from an attack is to use machine learning (ML)-based intrusion detection systems (IDSs), significantly improving classification tasks. Training ML algorithms require a large network traffic dataset; however, large storage and months of recording are required to capture the attacks, which is costly for IoT environments. This study proposes an ML pipeline using the conditional tabular generative adversarial network (CTGAN) model to generate a synthetic dataset. Then, the synthetic dataset was evaluated using several types of statistical and ML metrics. Using a decision tree, the accuracy of the generated dataset reached 0.99, and its lower complexity reached 0.05 s training and 0.004 s test times. The results show that synthetic data accurately reflect real data and are less complex, making them suitable for IoT environments and smart city applications. Thus, the generated synthetic dataset can further train models to secure IoT networks and applications.
KW - CTGAN
KW - IoT
KW - advanced persistent threat
KW - information security
KW - intrusion detection system
KW - machine learning
UR - http://www.scopus.com/inward/record.url?scp=85174190917&partnerID=8YFLogxK
U2 - 10.3390/app131910951
DO - 10.3390/app131910951
M3 - Article
AN - SCOPUS:85174190917
SN - 2076-3417
VL - 13
JO - Applied Sciences (Switzerland)
JF - Applied Sciences (Switzerland)
IS - 19
M1 - 10951
ER -