Document Clustering Using Cosine Similarity
Abstract
Clustering, or cluster analysis, is the process of grouping objects so that the objects within a group (cluster) are more similar to each other than to the objects in other groups (clusters). Clustering is an unsupervised machine learning technique: only the input data is supplied to the system (unlike supervised learning, where a set of sample input and output pairs is provided), and the system recognizes patterns in that data and predicts the output automatically, so the process is fully automated. Document clustering, the subject of this work, organizes text files into clusters of files with similar content. High-precision clustering algorithms such as K-means play an important role in data storage, data manipulation, and information retrieval systems. Search engines such as Google, Yahoo, and Bing use document clustering, in addition to high-end processors and servers, to retrieve information in response to search queries. The most commonly used clustering technique is K-means, in which the objects are divided into ‘k’ clusters, each containing similar objects. The present work focuses on document clustering using cosine similarity, with the pre-processing carried out by an existing Java library, Apache Lucene. The text of each document is broken down into strings, and the extracted strings are fed to Apache Lucene, which pre-processes the data, counts the number of occurrences of each word, and returns the output as JSON objects. The cosine similarity is then calculated from these indexed words. The final output identifies the documents that are similar to each other, the documents that are exactly alike (copies), and the documents that are unique (outliers). Applications of document clustering include mining useful data from large datasets, web page clustering, search engines, and anti-plagiarism checkers.
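The pipeline described in the abstract (count words per document, compare documents by cosine similarity, then report copies, similar documents, and outliers) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the regular-expression tokenizer is only a stand-in for the Apache Lucene pre-processing step, and the function names (`term_counts`, `cosine_similarity`, `classify`), the sample file names, and the thresholds `similar_at` and `copy_at` are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text, split it into word tokens, and count each token.

    Assumption: this simple tokenizer stands in for the Lucene pre-processing.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors stored as Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def classify(docs, similar_at=0.5, copy_at=0.999):
    """Label each document by its best match among the other documents.

    The two thresholds are illustrative assumptions, not values from the paper.
    """
    vectors = {name: term_counts(text) for name, text in docs.items()}
    labels = {}
    for name, vec in vectors.items():
        best = max(
            (cosine_similarity(vec, other)
             for other_name, other in vectors.items() if other_name != name),
            default=0.0,
        )
        if best >= copy_at:
            labels[name] = "copy"
        elif best >= similar_at:
            labels[name] = "similar"
        else:
            labels[name] = "outlier"
    return labels


# Hypothetical sample documents.
docs = {
    "a.txt": "the cat sat on the mat",
    "b.txt": "the cat sat on the mat",
    "c.txt": "the cat lay on the rug",
    "d.txt": "stock prices fell sharply today",
}
print(classify(docs))
```

With these sample documents, a.txt and b.txt are labeled as copies, c.txt is labeled similar (its cosine similarity to a.txt is 0.75), and d.txt shares no words with the others and is labeled an outlier. Because the vectors hold only word counts, a score of 1 means two documents use the same words in the same proportions, so a reordered copy would also be reported as a copy.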