Handling Data Imbalance in Customer Churn Prediction Using a Combination of SMOTE and Boosting
Abstract
The telecommunications industry faces stiff competition among service providers. This competition leads to customer churn, that is, customers moving from one service to another. Churn is a major problem because it can affect a company's revenue, profitability, service quality, and survival. Identifying customers who are likely to churn at an early stage is therefore an effective measure, because it helps companies plan how to retain them. Companies usually hold only a small number of records of customers who have left their current service, so churners form a minority class. This scarcity of churn data makes customer churn difficult to predict and is a challenging issue in machine learning. The general purpose of this research is to predict which customers will move to another service or leave their current one. Its specific purpose is to handle data imbalance in customer churn prediction through data-level optimization using a sampling method, namely the Synthetic Minority Over-sampling Technique (SMOTE), combined with algorithm-level optimization using Boosting. Several prediction algorithms, namely random forest, naïve Bayes, decision tree, k-nearest neighbor, and deep learning, are implemented to determine which performs best after optimization with SMOTE and Boosting. The research method is CRISP-DM (Cross-Industry Standard Process for Data Mining), a data mining framework for cross-industry research. The results show that random forest achieves the highest accuracy after optimization with SMOTE and Boosting, at 89.19%.
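The data-level step named in the abstract can be illustrated with a minimal sketch of SMOTE's core idea: each synthetic minority sample is placed at a random point on the line segment between an existing minority sample and one of its k nearest minority neighbours. This is not the authors' implementation, and the abstract does not say which tool they used. The function name `smote`, its parameters, and the toy data are assumptions made for illustration, and the Boosting step is not shown.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other samples
    if k < 1:
        raise ValueError("SMOTE needs at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (assumed): 3 churners (minority) versus 9 loyal customers (majority).
churners = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4]]
loyal = [[float(i), float(i % 3)] for i in range(5, 14)]

# Oversample the minority class until both classes have the same size.
new_points = smote(churners, n_new=len(loyal) - len(churners), k=2)
balanced_minority = churners + new_points
```

In a full pipeline, oversampling of this kind would normally be applied to the training split only, and the balanced data would then be passed to a boosted classifier.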
DOI: https://doi.org/10.31294/ijcit.v6i1.9545
P-ISSN: 2527-449X E-ISSN: 2549-7421