APPLICATION OF THE BAGGING TECHNIQUE TO CLASSIFICATION ALGORITHMS TO ADDRESS CLASS IMBALANCE IN MEDICAL DATASETS

Rizki Tri Prasetio, Pratiwi Pratiwi

Abstract


ABSTRACT – Class imbalance has been reported to severely hinder the classification performance of many standard learning algorithms and has attracted a great deal of attention from researchers in different fields. A number of methods, such as sampling methods, cost-sensitive learning methods, and bagging- and boosting-based ensemble methods, have therefore been proposed to solve this problem. Some medical datasets with two classes (binominal) are imbalanced, which reduces classification accuracy. This research proposes combining the bagging technique with classification algorithms to improve classification accuracy on medical datasets. The bagging technique is used to address the class imbalance problem. The proposed method is applied to three classification algorithms: naïve bayes, decision tree and k-nearest neighbor. This research uses five medical datasets obtained from the UCI Machine Learning Repository: breast-cancer, liver-disorder, heart-disease, pima-diabetes and vertebral-column. The results indicate that the proposed method yields a significant improvement for two classification algorithms, decision tree (t-test p value of 0.0184) and k-nearest neighbor (t-test p value of 0.0292), but not for naïve bayes (t-test p value of 0.9236). After the bagging technique is applied to the five medical datasets, naïve bayes has the highest accuracy on the breast-cancer dataset (96.14%, AUC 0.984), the heart-disease dataset (84.44%, AUC 0.911) and the pima-diabetes dataset (74.73%, AUC 0.806), while k-nearest neighbor has the best accuracy on the liver-disorder dataset (62.03%, AUC 0.632) and the vertebral-column dataset (82.26%, AUC 0.867).
Keywords: ensemble technique, bagging, imbalanced class, medical dataset.
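
The bagging idea summarized in the abstract (train several copies of a base classifier on bootstrap samples, then combine them by majority vote) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `nearest_neighbor_fit` and `bagging_fit`, the choice of a 1-nearest-neighbor base learner, the number of estimators, and the tiny imbalanced toy dataset are all assumptions made for illustration only.

```python
import random
from collections import Counter


def nearest_neighbor_fit(X, y):
    """Hypothetical base learner: a 1-nearest-neighbor classifier."""
    data = list(zip(X, y))

    def predict(x):
        # Pick the stored point with the smallest squared Euclidean distance.
        _, label = min(
            data, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x))
        )
        return label

    return predict


def bagging_fit(X, y, fit_base, n_estimators=25, seed=0):
    """Train n_estimators base models on bootstrap samples of (X, y)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        # Bootstrap sample: draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_base([X[i] for i in idx], [y[i] for i in idx]))

    def predict(x):
        # Combine the base models by majority vote.
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]

    return predict


# Toy, made-up imbalanced data: six points of class 0, two of class 1.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (1, 0.5), (5, 5), (5, 6)]
y = [0, 0, 0, 0, 0, 0, 1, 1]

model = bagging_fit(X, y, nearest_neighbor_fit)
```

Calling `model((0.2, 0.3))` or `model((5.2, 5.4))` returns the majority-vote label for a new point. Any base learner that follows the same `fit(X, y) -> predict` shape, such as a decision tree or naïve bayes, could be passed in place of `nearest_neighbor_fit`.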



DOI: https://doi.org/10.31294/ji.v2i2.118

Published by the Department of Research and Public Service (LPPM), Universitas Bina Sarana Informatika, with support from Relawan Jurnal Indonesia

Jl. Kramat Raya No.98, Kwitang, Kec. Senen, Kota Jakarta Pusat, DKI Jakarta 10450
Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License