Research Article

Data Preprocessing to Improve Accuracy in Classification Methods (Case Study: Credit Risk Analysis Dataset Classification)

by Baiq Nurul Azmi, Arief Hermawan
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Number 5
Year of Publication: 2024
Authors: Baiq Nurul Azmi, Arief Hermawan
10.5120/ijca2024923385

Baiq Nurul Azmi, Arief Hermawan. Data Preprocessing to Improve Accuracy in Classification Methods (Case Study: Credit Risk Analysis Dataset Classification). International Journal of Computer Applications. 186, 5 (Jan 2024), 22-29. DOI=10.5120/ijca2024923385

@article{ 10.5120/ijca2024923385,
author = { Baiq Nurul Azmi and Arief Hermawan },
title = { Data Preprocessing to Improve Accuracy in Classification Methods (Case Study: Credit Risk Analysis Dataset Classification) },
journal = { International Journal of Computer Applications },
issue_date = { Jan 2024 },
volume = { 186 },
number = { 5 },
month = { Jan },
year = { 2024 },
issn = { 0975-8887 },
pages = { 22-29 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume186/number5/33069-2024923385/ },
doi = { 10.5120/ijca2024923385 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T01:29:49.741864+05:30
%A Baiq Nurul Azmi
%A Arief Hermawan
%T Data Preprocessing to Improve Accuracy in Classification Methods (Case Study: Credit Risk Analysis Dataset Classification)
%J International Journal of Computer Applications
%@ 0975-8887
%V 186
%N 5
%P 22-29
%D 2024
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This research analyzes the use of several data pre-processing methods for credit risk analysis with Support Vector Machine (SVM) classification models. The background describes the challenges the banking industry faces in evaluating credit risk and how data pre-processing can improve model accuracy. The research method comprises four experimental scenarios, each using a different combination of data pre-processing methods and each designed to evaluate the performance of SVM models on a credit risk dataset. The steps are data preparation; missing-data handling, which removes features with a missing-data rate above 50% and applies MICE (Multivariate Imputation by Chained Equations) imputation to features with a rate below 50%; feature selection based on a correlation matrix to address high-dimensional data; and resampling with SMOTE (Synthetic Minority Over-sampling Technique) to address class imbalance. The test results show that combining data pre-processing methods can significantly improve the accuracy of SVM models on the credit risk dataset. The highest accuracy, 99.4%, is obtained in the scenario that handles missing data with feature removal and MICE imputation.
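Two of the steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy data, and the choice to keep a feature whose missing rate is exactly 50% are assumptions made for illustration, and the SMOTE routine is a simplified version of the technique (random base sample, interpolation toward one of its k nearest minority neighbours). MICE imputation, correlation-based feature selection, and SVM training are not shown.

```python
import math
import random


def drop_sparse_features(rows, threshold=0.5):
    """Drop columns whose fraction of missing values (None) exceeds threshold.

    Returns the reduced rows and the indices of the columns that were kept.
    Assumption: a column at exactly the threshold is kept.
    """
    n_rows = len(rows)
    keep = [
        j for j in range(len(rows[0]))
        if sum(row[j] is None for row in rows) / n_rows <= threshold
    ]
    return [[row[j] for j in keep] for row in rows], keep


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples, SMOTE style.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data (hypothetical): the second column is 75% missing and is dropped.
rows = [[1.0, None, 3.0], [2.0, None, None], [None, None, 1.0], [4.0, 5.0, 2.0]]
reduced, kept = drop_sparse_features(rows)

# Toy minority class (hypothetical) with three samples in two dimensions.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority, n_new=5, k=2)
```

The remaining missing values in `reduced` are the ones an imputation step such as MICE would fill. In practice, library implementations (for example scikit-learn's `IterativeImputer`, which is inspired by MICE, and the `SMOTE` class in imbalanced-learn) would normally replace hand-written code like this.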

Index Terms

Computer Science
Information Sciences

Keywords

Credit Risk Analysis, MICE, SMOTE, Support Vector Machine, Correlation Matrix