New Hybrid Data Preprocessing Technique for Highly Imbalanced Dataset

Esraa Faisal Malik

School of Management, Universiti Sains Malaysia, 11800 Gelugor, Penang, Malaysia

Khai Wah Khaw

School of Management, Universiti Sains Malaysia, 11800 Gelugor, Penang, Malaysia

XinYing Chew

School of Computer Science, Universiti Sains Malaysia, 11800 Gelugor, Penang, Malaysia

keywords: Cost-sensitive learning, hybrid, imbalanced dataset, resampling techniques

One of the most challenging problems in real-world datasets is the growing prevalence of imbalanced data. When the majority class far outnumbers the minority class, conventional machine learning algorithms, which were designed on the assumption of an equal class distribution, tend to produce misleading results. The purpose of this study is to build a hybrid data preprocessing approach that addresses the class imbalance issue by combining resampling approaches with cost-sensitive learning (CSL) for fraud detection on a real-world dataset. The proposed hybrid approach consists of two steps. The first step compares several resampling approaches to find the technique with the highest performance on the validation set. The second step applies CSL with the optimal weight ratio to the resampled data from the first step. The hybrid technique was found to have a positive impact, yielding F2-measures of 0.987, 0.974, 0.847, and 0.853 for random forest (RF), decision tree (DT), XGBoost, and LightGBM (LGBM), respectively. Additionally, it obtained the highest prediction performance relative to the conventional methods.
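To make the abstract's terms concrete, the following minimal Python sketch illustrates two of its ingredients: the F2-measure used to report results, and a simple form of resampling. It is an illustration under stated assumptions, not the authors' pipeline. The toy labels, the helper names (`f_beta`, `confusion_counts`, `random_oversample`), and the choice of random oversampling are all hypothetical; the abstract does not say which resampling technique was selected.

```python
import random


def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from confusion-matrix counts.

    With beta=2 (the F2-measure), recall is weighted more heavily than precision.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def confusion_counts(y_true, y_pred, positive=1):
    """Return (true positives, false positives, false negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, fn


def random_oversample(X, y, minority=1, seed=0):
    """Duplicate random minority rows until both classes are the same size.

    Assumes a binary problem. This is one common resampling approach, used
    here only as an example; it is not necessarily the one the paper selects.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_count = len(y) - len(minority_idx)
    extra = [rng.choice(minority_idx)
             for _ in range(majority_count - len(minority_idx))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]


# Hypothetical toy labels: 990 legitimate (0) and 10 fraudulent (1) records.
y_true = [0] * 990 + [1] * 10

# A trivial model that always predicts the majority class.
always_majority = [0] * 1000
accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
f2_majority = f_beta(*confusion_counts(y_true, always_majority))

# A model that catches 8 of the 10 frauds at the cost of 4 false alarms.
catches_fraud = [0] * 986 + [1] * 4 + [1] * 8 + [0] * 2
f2_fraud = f_beta(*confusion_counts(y_true, catches_fraud))

# Balance the toy data; features are placeholder row indices.
X = list(range(len(y_true)))
X_bal, y_bal = random_oversample(X, y_true)

print(f"accuracy (always majority): {accuracy:.3f}")
print(f"F2 (always majority):       {f2_majority:.3f}")
print(f"F2 (catches fraud):         {f2_fraud:.3f}")
print(f"class counts after oversampling: {y_bal.count(0)} vs {y_bal.count(1)}")
```

On this toy data the always-majority model reaches high accuracy while its F2-measure is zero, which is the kind of misleading result the abstract refers to. In a cost-sensitive step, the per-class misclassification weights would be tuned rather than fixed; that search is not shown here.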

reference: Vol. 41, 2022, No. 4, pp. 981–1001

doi: 10.31577/cai_2022_4_981

Computing and Informatics

formerly Computers and Artificial Intelligence
