Addressing Class Imbalance in Software Defect Prediction Using Two-Step Clustering-Based Undersampling and the Bagging Technique

Muhammad Faittullah Akbar, Ilham Kurniawan, Ahmad Fauzi

Abstract


Class imbalance is a frequent problem in real-world datasets, where one class (the minority class) contains a small number of data points and the other (the majority class) contains a large number. It is very difficult to build an effective model with data mining and machine learning algorithms unless the data are first preprocessed to balance the imbalanced dataset. Random undersampling and oversampling have been used in many studies to ensure that the classes contain the same number of data points. In this study, we propose combining two-step clustering-based random undersampling with the bagging technique to improve the accuracy of software defect prediction (SDP). The proposed method is evaluated on five datasets from the NASA Metrics Data Program repository, with the area under the curve (AUC) as the primary evaluation measure. The results show that the proposed method performs very well on all datasets (AUC > 0.9). In terms of sensitivity (SN), the second experiment outperformed the first on most datasets (3 of 5). Meanwhile, in terms of specificity (SP), the first experiment did not outperform the second on all datasets. Overall, the second experiment is better than the first, because AUC is the primary evaluation measure for imbalanced classification problems such as SDP. It can therefore be concluded that the proposed method yields optimal performance on both small-scale and large-scale datasets.
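To make the idea of clustering-based undersampling followed by bagging more concrete, the following is a minimal Python sketch and not the authors' implementation. It rests on several assumptions that do not come from the paper: plain k-means stands in for two-step clustering, a nearest-centroid rule serves as a toy base learner, the data are synthetic, and all function names (`kmeans`, `cluster_undersample`, `bagging_fit`, `bagging_predict`) are hypothetical.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, rng, iters=20):
    """Plain k-means (ASSUMPTION: stand-in for two-step clustering).
    Returns one cluster id per point."""
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members (unchanged if empty).
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return assign


def cluster_undersample(majority, n_keep, k, rng):
    """Randomly keep roughly n_keep majority points, drawing from each
    cluster in proportion to its size."""
    assign = kmeans(majority, k, rng)
    kept = []
    for c in range(k):
        members = [p for p, a in zip(majority, assign) if a == c]
        quota = round(n_keep * len(members) / len(majority))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept


def fit_centroids(X, y):
    """Toy base learner (ASSUMPTION): one mean vector per class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        model[label] = tuple(sum(col) / len(rows) for col in zip(*rows))
    return model


def bagging_fit(X, y, n_models, rng):
    """Train each base learner on a bootstrap sample (drawn with replacement)."""
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_centroids([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Majority vote over the base learners' predictions."""
    votes = Counter(min(m, key=lambda lab: sq_dist(x, m[lab])) for m in models)
    return votes.most_common(1)[0][0]


# Synthetic, imbalanced toy data: 200 non-defective vs. 20 defective modules.
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority = [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(20)]

kept = cluster_undersample(majority, n_keep=len(minority), k=4, rng=rng)
X = kept + minority
y = [0] * len(kept) + [1] * len(minority)
models = bagging_fit(X, y, n_models=11, rng=rng)
```

Drawing from every cluster, rather than from the majority class as a whole, is meant to keep the retained majority points spread across the regions the clustering finds. Because each cluster quota is rounded, the retained set is only approximately the size of the minority class.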


DOI: https://doi.org/10.31294/ji.v6i1.5448

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Published by the Department of Research and Public Service (LPPM), Universitas Bina Sarana Informatika, with support from Relawan Jurnal Indonesia

Jl. Kramat Raya No.98, Kwitang, Kec. Senen, Kota Jakarta Pusat, DKI Jakarta 10450
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License