Performance Evaluation of Classification Methods on Big Data: Decision Trees, Naive Bayes, K-Nearest Neighbors, and Support Vector Machines
DOI:
https://doi.org/10.20956/j.v20i3.32970
Keywords:
Performance Evaluation, Classification Method, Big Data
Abstract
Evaluating the performance of classification methods on big data is increasingly important for addressing the challenges of data analysis at scale. This study compares four classification methods, namely Decision Trees (DT), Naive Bayes (NB), k-Nearest Neighbors (KNN), and Support Vector Machines (SVM), on big data, using both simulated data and real data available in the R package ISLR. The simulated data consist of two datasets whose predictor variables are normally distributed with different means and variances, and whose response variables are generated in classes adjusted to the characteristics of the predictor variables, with different class proportions. The real data are drawn from two types of numeric variables and predictor variables available in the package. Each method is evaluated at sample sizes of n = 500, n = 1000, and n = 5000. For the real data, the samples are drawn randomly to keep the data representative. Performance is measured using accuracy. In the simulation on Dataset 1, the methods that show an influence on classification quality when applied to Big Data are DT and KNN. In Dataset 2, however, the results for the DT method change because of the influence of the number of classes and the class proportions in the data. The simulation results are confirmed on the real data, where the same methods show an influence on classification quality when applied to Big Data, while the NB and SVM methods do not show a consistent influence. These observations indicate that the DT and KNN methods have several advantages that make them suitable for application to Big Data.
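The evaluation design described in the abstract (simulate normally distributed predictors per class, train at several sample sizes, score by accuracy) can be illustrated with a minimal sketch. This is not the authors' code: the study used R, while the sketch below uses only the Python standard library. It covers only NB and KNN for brevity. The class means, standard deviations, class proportion, k = 5, and the fixed test-set size are all hypothetical values chosen for illustration.

```python
import heapq
import math
import random
from collections import Counter


def simulate(n, rng, means=((0.0, 0.0), (2.0, 2.0)), sds=(1.0, 1.5), p1=0.3):
    """Draw n labelled points with two normal predictors per class.

    The means, standard deviations and class-1 proportion p1 are
    illustrative assumptions, not values from the study.
    """
    data = []
    for _ in range(n):
        y = 1 if rng.random() < p1 else 0
        x = tuple(rng.gauss(m, sds[y]) for m in means[y])
        data.append((x, y))
    return data


def fit_gaussian_nb(train):
    """Estimate the prior, feature means and feature variances of each class."""
    model = {}
    for label in {y for _, y in train}:
        rows = [x for x, y in train if y == label]
        cols = list(zip(*rows))
        mu = [sum(c) / len(c) for c in cols]
        var = [sum((v - m) ** 2 for v in c) / len(c) + 1e-9
               for c, m in zip(cols, mu)]
        model[label] = (len(rows) / len(train), mu, var)
    return model


def predict_nb(model, x):
    """Return the class with the highest log posterior under Gaussian NB."""
    def log_post(prior, mu, var):
        total = math.log(prior)
        for v, m, s2 in zip(x, mu, var):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=lambda c: log_post(*model[c]))


def predict_knn(train, x, k=5):
    """Majority vote among the k training points nearest to x (Euclidean)."""
    nearest = heapq.nsmallest(k, train, key=lambda row: math.dist(row[0], x))
    return Counter(y for _, y in nearest).most_common(1)[0][0]


def accuracy(preds, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)


def run(sizes=(500, 1000, 5000), seed=0, n_test=300):
    """Train at each sample size and score both methods on one shared test set."""
    rng = random.Random(seed)
    test = simulate(n_test, rng)
    truth = [y for _, y in test]
    results = {}
    for n in sizes:
        train = simulate(n, rng)
        nb = fit_gaussian_nb(train)
        results[n] = {
            "NB": accuracy([predict_nb(nb, x) for x, _ in test], truth),
            "KNN": accuracy([predict_knn(train, x) for x, _ in test], truth),
        }
    return results


if __name__ == "__main__":
    for n, acc in run().items():
        print(n, acc)
```

Scoring every training size against the same held-out test set keeps the accuracies comparable across n. The abstract does not say how the study split its data, so this choice is also an assumption.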
License
Copyright (c) 2024 Author and publisher
This work is licensed under a Creative Commons Attribution 4.0 International License.
Jurnal Matematika, Statistika dan Komputasi is an Open Access journal. All articles are distributed under the terms of the Creative Commons Attribution License, which allows third parties to copy and redistribute the material in any medium or format, and to transform and build upon it, provided the original work is properly cited and its license is stated. This license allows authors and readers to use all articles, data sets, graphics, and appendices in data mining applications, search engines, web sites, blogs, and other platforms, provided appropriate reference is given.