RFKNN: ROUGH-FUZZY KNN FOR BIG DATA CLASSIFICATION

Mohamed A. Mahfouz

Abstract


K-nearest neighbors (kNN) is a lazy-learning method for classification and regression that has been applied successfully in several application domains. It is simple and directly applicable to multi-class problems; however, it has high memory and computational complexity. Several studies have tried to scale kNN to very large datasets using crisp partitioning. In this paper, we propose integrating the principles of rough sets and fuzzy sets into a clustering algorithm that separates the whole dataset into several parts, and kNN classification is then performed on each part. The concept of a crisp lower bound and a fuzzy boundary for each cluster, applied in the proposed algorithm, allows accurate selection of the data points involved in classifying an unseen data point. The data points used are a mix of core and border data points of the clusters created in the training phase. Experimental results on standard datasets show that the proposed kNN classification is more effective than related recent work, with a slight increase in classification time.
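The idea of classifying against a cluster's core and border points can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the clustering procedure, so this sketch assumes plain k-means for the crisp core, a fuzzy c-means style membership for the boundary, and hypothetical names and parameters (`fit_partitions`, `predict`, `boundary_threshold`, fuzzifier `m`).

```python
import numpy as np

def fit_partitions(X, y, n_clusters=3, boundary_threshold=0.3, m=2.0,
                   n_iter=20, seed=0):
    """Cluster X, then give each cluster a crisp core plus a fuzzy boundary."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)].astype(float)
    for _ in range(n_iter):  # plain k-means (Lloyd) iterations, an assumption
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    # Distances to the final centers; the epsilon avoids division by zero.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    labels = d.argmin(axis=1)
    # Fuzzy c-means style membership of every point in every cluster.
    inv = d ** (-2.0 / (m - 1.0))
    u = inv / inv.sum(axis=1, keepdims=True)
    parts = []
    for c in range(n_clusters):
        core = labels == c                       # crisp lower bound
        border = (~core) & (u[:, c] >= boundary_threshold)  # fuzzy boundary
        idx = np.flatnonzero(core | border)
        parts.append((X[idx], y[idx]))
    return centers, parts

def predict(x, centers, parts, k=3):
    """Run kNN only on the core and border points of the nearest cluster."""
    c = np.linalg.norm(centers - x, axis=1).argmin()
    Xc, yc = parts[c]
    nearest = np.argsort(np.linalg.norm(Xc - x, axis=1))[:k]
    values, counts = np.unique(yc[nearest], return_counts=True)
    return values[counts.argmax()]  # majority vote among the k neighbors
```

In this sketch, a point near the edge of two clusters is stored in both partitions, so a query that falls close to a cluster border still sees neighbors from the adjacent cluster. Lowering `boundary_threshold` widens the boundary, trading classification time for accuracy.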

Keywords


classification, kNN, big data, clustering, fuzzy sets, rough sets






DOI: https://doi.org/10.26483/ijarcs.v9i2.5667





Copyright (c) 2018 International Journal of Advanced Research in Computer Science