CLASSIFICATION TECHNIQUES USING SPAM FILTERING EMAIL

A general data mining model applied to complex sample data addresses the problem of data classification, and a preprocessing step for complex data addresses the loss of accuracy caused by mass data.
 The growing volume of spam mail annoys people and significantly reduces work efficiency. This work focuses on developing a spam filtering algorithm that uses statistical or data mining approaches to derive precise spam rules. The main purpose is an anti-spam approach that combines data mining with statistical testing. To keep the spam rules efficient, only significant rules are used to classify emails; the remaining rules can be eliminated to improve performance.
 Decision tree classifiers are used to classify whether a mail is spam or ham. Various filtering techniques are used to find and filter spam mails, but the accuracy and performance of the algorithms differ from one another. Two decision tree algorithms are used as classifiers, namely J48 (C4.5) and RndTree. The algorithms are studied and analyzed, and test results are obtained with the WEKA tool for efficient spam filtering. The results are compared, and the RndTree algorithm shows almost 99% accuracy in filtering spam mails, the best result among the classifiers.


What is a Spam
The task of spam filtration is to identify automatically whether a mail is spam or not. [3] Decision tree filters are easy to understand and provide satisfactory performance for this task. Several classifiers were analyzed for spam filtration. Among the approaches, the most prominent are SVM [12] (Support Vector Machines) and the well-known Naive Bayes classifier.

Muthukaruppan
Weka, an open source, GUI-based, portable workbench, has been used to analyze various email spam filtering techniques on a rigorous data set. The data set of emails was created using attributes and relations from spam mails received in a mailbox over six months. It comprises 105 attributes and 300 instances, and 10-fold cross validation was used to test and compare the results. The decision tree algorithms run in Weka are NBTree, the C4.5 decision tree classifier and the Logistic Model Tree classifier. They are analyzed in terms of time, result efficiency and accuracy, as well as other criteria such as the false positive and false negative rates of the decisions taken by the classifiers.
Catarina Silva et al (2012) used a hybrid system for text classification based on an ensemble of Artificial Immune Systems (AIS) and SVM approaches. [6] Its advantage is a non-evolutionary implementation that produced remarkable results with text classification and showed gains in classification performance.
Manjusha et al (2013) used a Binary Decision Tree Multi Class Support Vector Machine approach that combines the advantages of SVM and decision trees [11]: Decision Trees (DTs) are much faster than SVMs at classifying new instances, while SVMs perform better than DTs in terms of classification accuracy. To obtain both advantages, the size of the record set fed to the SVM is reduced. Normal data points are classified by the decision tree, while crucial data points that are difficult for the decision tree to classify are passed to the multiclass SVM.
Malti Sarangal (2014) proposed a classification algorithm based on K Means clustering and Support Vector Machine (SVM) to classify the spam base dataset [10]. Its main advantages are improved classification accuracy and reduced false positives and time cost. K Means is a numerical algorithm and a hard clustering method, which means that a data point can belong to only one cluster.
Decision tree classifiers provide great results as far as spam detection is concerned. Of the three classifiers compared, the best yields 90% accuracy. However, that algorithm takes more processing time than the other classifiers, which is its main disadvantage. It is still better than earlier algorithms such as Naive Bayes and many other spam detection techniques.

EXISTING SYSTEM
The Immune System has evolved into an extremely complex resistance system that is able to identify foreign substances and to differentiate between harmless and harmful ones. It is decomposed into two main layers of resistance, innate and adaptive. The innate layer recognizes specific substances, and its behavior is similar in all individuals of the same species. The adaptive layer is able to learn to identify new forms of anomalous pathogens that regularly change over time; hence it provides an extremely sophisticated adaptive form of identification.
In the Immune System, pathogens are divided into small peptides by Antigen Presenting Cells (APCs). The peptides are then accessible to the lymphocytes known as T cells. T cells have a particular set of receptors that bind, with a certain degree of affinity, the peptides offered by the Antigen Presenting Cells. An Artificial Immune System (AIS) is an adaptive system inspired by the biological immune system and based on theoretical immunology.

K MEANS CLUSTERING
An automated mechanism can use unsupervised learning for classification purposes. Unsupervised learning means that no supervisor is needed to train the mechanism. [4] Clustering is one type of unsupervised learning. It aims to group similar data together: in the clustering process, data is divided into groups, called clusters, where each group contains the data items that are most similar to each other. K Means clustering is the most useful method for finding natural groups of similar data. In a classification technique the objects are assigned to predefined categories, whereas in clustering the classes are formed from the data. Clustering methods fall into two categories, depending on the character of the data and the purpose for which the clusters are used: fuzzy clustering and hard clustering. In fuzzy clustering every data element can belong to more than one cluster, and a mathematical model is used to resolve the memberships. In hard clustering every data element is assigned to exactly one cluster.
K Means clustering is a hard clustering method and can be applied to spam filtering. [14] The research utilized the K Means clustering algorithm to classify incoming emails as spam or legitimate on the basis of similar attributes or features. In K Means clustering, K is a positive number that is initialized at the start and that the algorithm uses as the number of clusters required for classification. The algorithm inspects the feature vector of each incoming email and groups the emails such that the items within every cluster are similar to each other. On the basis of this inspection it forms two clusters, one for spam and another for legitimate mail. It is an iterative process that starts from an initial set of clusters; the clusters are repeatedly updated until no further improvement is possible or the number of iterations reaches a specified limit.
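The iterative K Means procedure described above can be illustrated with a minimal Python sketch. It is not the implementation used in the cited research: the function names, the random choice of initial centroids, the squared Euclidean distance and the default iteration limit are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iters=100, seed=0):
    """Hard-cluster `points` (tuples of floats) into k clusters.

    Returns (centroids, assignments), where assignments[i] is the index of
    the single cluster that points[i] belongs to.
    """
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct points picked at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Cluster assignment step: each point goes to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop when no assignment changes (no further update is possible).
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Move centroid step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


if __name__ == "__main__":
    # Hypothetical two-dimensional feature vectors for six emails.
    emails = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15),
              (0.1, 0.9), (0.2, 0.8), (0.15, 0.85)]
    print(k_means(emails, k=2))
```

With K = 2 the two resulting clusters still have to be mapped to "spam" and "legitimate", for example by inspecting a few labelled emails in each cluster; that mapping is outside this sketch.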

Figure 1.1 An overview of Local Concentration Based K Means Clustering
The local concentration based feature extraction method with an artificial immune system involves five processing stages to generate the final results. Each of them is discussed below. Preprocessing of an incoming email is an essential task before classifying it. When the setup works as a real-time spam filter, each incoming email is preprocessed; when working in an experimental environment, sample datasets are preprocessed. A string tokenizer is used in this phase to generate a dictionary of words. Irrelevant words are discarded, and the processed data is then passed to the term selection stage of the model. Information Gain is used as the term selection strategy for our model. [5] The detector set (DS) generation and term selection algorithm is given below.
Step 1 : Initialize the preselected set and DS as empty sets.
Step 2 : For every term in the terms set, calculate the weight of the term according to a certain term selection strategy.
Step 3 : Arrange the terms in decreasing order of weight.
Step 4 : Add the chosen percentage of top-ranked terms to the preselected set.
Step 5 : For every term t_k in the preselected set, calculate Tendency(t_k) = P(t_k | c_l) - P(t_k | c_s). If |Tendency(t_k)| > θ, with θ >= 0, then add the term to DS_s if Tendency(t_k) < 0, else add it to DS_l. Otherwise discard the term.
Here P(t_k | c_l) is the probability of t_k occurring in legitimate mail, P(t_k | c_s) is the probability of t_k occurring in spam, and θ is the tendency threshold. DS_s is the spam detector set and DS_l is the legitimate detector set.
The model uses a local concentration based feature extraction approach with an artificial immune system. The feature extraction algorithm is given below.
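Before the feature extraction steps, the detector set generation procedure above can be illustrated with a short Python sketch. It rests on several assumptions that are not taken from the text: each message is represented as a set of tokens, Information Gain is computed on the binary "term present" feature, the percentage of kept terms is a parameter called top_fraction, and the tendency threshold is a parameter called theta. Terms with a negative tendency (more frequent in spam) go to the spam detector set.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero probabilities contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def information_gain(term, spam_docs, legit_docs):
    """Information Gain of the binary feature 'term occurs in the message'."""
    n_s, n_l = len(spam_docs), len(legit_docs)
    n = n_s + n_l
    s_with = sum(term in d for d in spam_docs)
    l_with = sum(term in d for d in legit_docs)
    remainder = 0.0
    # Two partitions: messages containing the term, and messages without it.
    for s_cnt, l_cnt in ((s_with, l_with), (n_s - s_with, n_l - l_with)):
        total = s_cnt + l_cnt
        if total:
            remainder += total / n * entropy([s_cnt / total, l_cnt / total])
    return entropy([n_s / n, n_l / n]) - remainder


def build_detector_sets(spam_docs, legit_docs, top_fraction=0.5, theta=0.0):
    """Return (DS_s, DS_l) following Steps 1 to 5 of the listing above."""
    terms = set().union(*spam_docs, *legit_docs)
    # Steps 2 and 3: weight every term, then sort by decreasing weight
    # (ties are broken alphabetically to keep the result deterministic).
    ranked = sorted(
        terms, key=lambda t: (-information_gain(t, spam_docs, legit_docs), t)
    )
    # Step 4: keep the top-ranked fraction of the terms.
    preselected = ranked[: max(1, int(len(ranked) * top_fraction))]
    ds_spam, ds_legit = set(), set()
    # Step 5: split the preselected terms by the sign of their tendency.
    for t in preselected:
        p_legit = sum(t in d for d in legit_docs) / len(legit_docs)
        p_spam = sum(t in d for d in spam_docs) / len(spam_docs)
        tendency = p_legit - p_spam
        if abs(tendency) > theta:
            (ds_spam if tendency < 0 else ds_legit).add(t)
    return ds_spam, ds_legit


if __name__ == "__main__":
    # Hypothetical toy corpus, for illustration only.
    spam = [{"free", "win", "now"}, {"free", "offer"}]
    legit = [{"meeting", "now"}, {"meeting", "report"}]
    print(build_detector_sets(spam, legit))
```

The resulting detector sets are the input to the feature extraction steps that follow.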
Step 1 : Move a sliding window of w_n terms in length over a given message, with a step of w_n terms.
Step 2 : For every position of the sliding window, calculate the spam gene concentration in the window as SC_j = N_s / N_t and the legitimate gene concentration in the window as LC_j = N_l / N_t.
Step 3 : Construct the feature vector (<SC_1, LC_1>, <SC_2, LC_2>, ..., <SC_n, LC_n>).
Here SC_j is the spam gene concentration in the j-th window and LC_j is the legitimate gene concentration in the j-th window. N_t is the number of distinct terms in the window, N_s is the number of distinct terms in the window that correspond to detectors in DS_s, and N_l is the number of distinct terms in the window that correspond to detectors in DS_l.
The work applies K Means clustering for classification, which is the fourth and a very important stage of spam filtering. The final stage measures the effectiveness of the entire system by evaluating the classification result. The K Means clustering algorithm used at the classification phase is given below.
Step 1 : Initialize the spam and legitimate centroids.
Step 2 : centroids = kMeansInitCentroids(X, k)
Step 3 : for iter = 1 to iterations. Cluster assignment step: assign each data point to the closest centroid. idx(i) corresponds to c^(i), the index of the centroid assigned to example i.
Step 4 : idx = findNearestCentroids(X, centroids). Move centroid step: compute means based on centroid assignments.
Step 5 : centroids = computeMeans(X, idx, K)
Step 6 : end
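Returning to the sliding-window feature extraction steps above, the following Python sketch shows how the pairs of spam and legitimate gene concentrations could be computed for one message. The function and parameter names are hypothetical, the message is assumed to be already tokenized, and the window moves by its own length so that windows do not overlap.

```python
def local_concentration_features(tokens, ds_spam, ds_legit, window_len):
    """Return the list of (SC_j, LC_j) pairs for one tokenized message.

    ds_spam and ds_legit are the spam and legitimate detector sets;
    window_len is the window length w_n, also used as the step size.
    """
    features = []
    for start in range(0, len(tokens), window_len):
        # Distinct terms in the current window (the last window may be shorter).
        distinct = set(tokens[start:start + window_len])
        n_t = len(distinct)              # N_t: distinct terms in the window
        n_s = len(distinct & ds_spam)    # N_s: terms matching spam detectors
        n_l = len(distinct & ds_legit)   # N_l: terms matching legitimate detectors
        features.append((n_s / n_t, n_l / n_t))
    return features


if __name__ == "__main__":
    # Hypothetical message and detector sets, for illustration only.
    message = ["free", "win", "now", "meeting", "report", "now"]
    print(local_concentration_features(
        message, {"free", "win"}, {"meeting", "report"}, window_len=3))
```

In this form the number of pairs depends on the message length. A clustering step that expects fixed-length vectors would need an additional convention, such as a fixed number of windows per message, which the text does not specify.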

VARIOUS CLASSIFIERS IN EMAIL SPAM FILTERING PROBLEM DEFINITION
Various decision tree classifiers are taken for evaluation. Apart from other types of data mining classifiers, the emphasis is specifically on decision tree classifiers for the particular application of spam filtration. The main task of spam filtration is to identify whether a mail is spam or not. Decision tree filters are easy to implement and easy to understand, and they provide overall satisfactory performance as far as spam mail detection is concerned. The dataset is used to train and test various decision trees, and the classifiers are evaluated on precision, accuracy and the time taken by the classifier. The classifier that is evaluated as best is further enhanced to provide more accuracy, and the algorithm is implemented in the WEKA tool.
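As an illustration of the evaluation criteria, the sketch below computes accuracy, error rate, precision and false positive rate from true and predicted labels. It is a minimal Python example and not WEKA output; it assumes that "spam" is the positive class, and the function name and label strings are hypothetical.

```python
def spam_metrics(true_labels, predicted_labels, positive="spam"):
    """Compute basic evaluation metrics, treating `positive` as the spam class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive:
            if truth == positive:
                tp += 1   # spam correctly flagged
            else:
                fp += 1   # legitimate mail wrongly flagged as spam
        else:
            if truth == positive:
                fn += 1   # spam that slipped through
            else:
                tn += 1   # legitimate mail correctly passed
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


if __name__ == "__main__":
    truth = ["spam", "spam", "ham", "ham", "ham"]
    predicted = ["spam", "ham", "spam", "ham", "ham"]
    print(spam_metrics(truth, predicted))
```

The false positive rate is worth tracking separately in spam filtering, because a legitimate mail classified as spam is usually more costly than a spam mail that reaches the inbox.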

SPAM DATASET
The spam dataset was taken from UCI machine learning

FEATURE REDUCTION TECHNIQUES
Complex data analysis and mining on huge amounts of data take a very long time, making practical analysis infeasible.
[12] Feature reduction techniques are helpful for analyzing a reduced representation of the dataset without compromising the integrity of the original data.

CONCLUSION AND FUTURE ENHANCEMENT
Email spam is a serious threat in the corporate world and in business. Reducing spam mails and preventing them from accumulating in the user's mailbox is a great challenge. Identifying the best algorithm to classify spam mails is therefore an important task. Decision tree algorithms are used to filter spam mails because the main task is to classify whether a mail is spam or ham. The algorithms are trained and tested before and after applying filters. The results of the different algorithms are evaluated on Accuracy, Error rate, Precision and False positive rate. The comparison of the algorithms shows that the Random forest classifier exhibits the best results compared to the other classifiers, both before and after applying Weka filters.
The bugs identified when this classification algorithm was built concern the handling of missing values. A split point is the point at which the tree splits the instances into two groups, assigning weights to the branches at the splitting point. When an attribute has missing values, the attribute still carries some information after the split point, which results in additional branches in the tree. Sometimes a split has an entropy reduction of 0, or only a small positive value, which also leads to additional branches in the tree. The algorithm can be further enhanced by improving the Out of Bag (OOB) estimate; it also supports multithreading.
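The entropy reduction of a split, mentioned above, can be made concrete with a small Python sketch. It assumes a numeric attribute without missing values and a binary split at a threshold; the function names are hypothetical, and the weighting of instances with missing values is not modelled.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def entropy_reduction(values, labels, split_point):
    """Entropy reduction obtained by splitting at `values <= split_point`."""
    left = [lab for v, lab in zip(values, labels) if v <= split_point]
    right = [lab for v, lab in zip(values, labels) if v > split_point]
    if not left or not right:
        return 0.0  # nothing is separated, so nothing is gained
    n = len(labels)
    weighted = (len(left) / n) * label_entropy(left) \
        + (len(right) / n) * label_entropy(right)
    return label_entropy(labels) - weighted


if __name__ == "__main__":
    values = [1, 2, 3, 4]
    # A split that separates the classes perfectly.
    print(entropy_reduction(values, ["ham", "ham", "spam", "spam"], 2.5))
    # A split that leaves both branches as mixed as the parent node.
    print(entropy_reduction(values, ["ham", "spam", "ham", "spam"], 2.5))
```

A tree builder that refuses to split when this value is zero or below a small tolerance would avoid the additional branches described above; whether that matches the fix intended here is an assumption.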