COMPARATIVE ANALYSIS OF MULTIPLE CLASSIFIERS FOR HEART DISEASE CLASSIFICATION

Over the last decade, heart disease has remained the leading cause of death worldwide. Researchers have applied several data mining techniques and analyses to help health care professionals diagnose heart disease, and these techniques can reduce the number of tests required. With the death rate from heart disease growing rapidly worldwide, a quick and efficient detection technique is clearly needed. Supervised machine learning is one of the effective data analysis methods in use. This research compares Logistic Regression (LR), Artificial Neural Network (ANN), K-Nearest Neighbor (KNN), Naïve Bayes (NB), Random Forest (RF), and Stochastic Gradient Descent (SGD) classifiers, seeking better performance in heart disease diagnosis. The algorithms are tested on the Anaconda platform (J-Python). An existing dataset of heart disease patients, obtained from the Google Scholar database, is used to test and justify the performance of all the algorithms. This dataset (Framingham) consists of 23138 instances and 16 attributes. Subsequently, the classification algorithm with the best potential will be suggested for use on sizeable data. The aim of this work is to design a model that takes a patient record as input and accurately predicts, using machine learning techniques, whether the patient has heart disease.


I. INTRODUCTION
The incidence of chronic diseases is increasing day by day as living standards advance. According to a McKinsey report [1], 50% of Americans suffer from one or more chronic diseases, and 80% of American medical care spending goes to the treatment of these diseases. Statistics show that the US spends on average 2.7 trillion USD annually to treat chronic diseases. Healthcare problems related to chronic diseases are also significant in many other countries, which makes risk assessment for chronic diseases and heart diseases necessary.
The human body comprises many vital organs, and the heart is among the most important. The heart pumps blood to all parts of the body, and if it does not function properly it can cause the death of the person. Coronary artery diseases, heart rhythm problems (cardiac arrhythmias), and congenital heart defects all affect the heart and cause heart disease.
With the growth in medical data [2], collecting electronic health records (EHR) will reduce the cost of treating chronic diseases. An efficient flow estimation algorithm for a telehealth cloud system was presented in [3], together with a data coherence protocol for a PHR (Personal Health Record)-based distributed system. In healthcare, Bates et al. [4] proposed six applications of big data, one of which is identifying high-risk patients; this can help minimize medical cost, since patients at high risk of developing a chronic disease often require expensive healthcare. Traditional disease risk models can be highly accurate, but they accommodate only a limited amount of supervised data, whereas big data analysis can handle larger volumes of structured, unstructured, and supervised data.
Data mining, which examines large datasets to extract hidden and previously unidentified patterns, is another tool that researchers have developed to assist doctors, nurses, and pharmacists in the diagnosis of heart disease [5]. Medical organizations all around the world now assemble a wide variety of health-related data [6], which can be exploited with machine learning techniques to gain useful insights. The acquired data are too large for the human mind to comprehend but can be analyzed readily with machine learning techniques. The algorithms used in this paper are useful for anticipating the occurrence of heart-related diseases accurately. This paper presents a comparison of six classifier models for big data analysis, with the aim of improving the accuracy with which heart disease patients are distinguished from non-heart-disease patients and of estimating more reliably the probability that a patient will be diagnosed with heart disease.

II. LITERATURE SURVEY
• Akram Pasha and P. H. Latha [7] investigated a range of machine learning classification models trained on the optimal subset of features of a Parkinson's disease data set for efficient Parkinson's disease classification. They used Genetic Algorithm and Binary Particle Swarm Optimization with different machine learning classifiers and found that Genetic Algorithm produced the greatest dimensionality reduction together with the highest classification accuracy.
• Shah, and G. Escobar: This paper focuses on big data in health care, identifying and managing high-risk and high-cost patients through the adoption of EHR as the number of disease outbreaks increases.
• Monika Gandhi et al. [8] used Naive Bayes, decision tree, and neural network algorithms to analyze a medical dataset. Because a huge number of features are involved, the number of features needs to be reduced, which can be done by feature selection. They report that doing so reduces the time required. They made use of decision trees and neural networks.
• Helma, C., E. Gottmann, and S. Kramer, "Knowledge discovery and data mining in toxicology": The techniques in this paper mostly focus on symbolic machine learning developed for toxicological applications, mainly for detecting structure relationships.

• Dhomse Kanchan B. and Mahale Kishor M. et al., "Study of Machine Learning Algorithms for Disease Prediction using PCA Analysis": The healthcare industry collects large amounts of data which, unfortunately, are not "extracted" to discover insights for effective decision making. This paper studies PCA, which finds the minimum number of attributes required to enhance the precision of various supervised machine learning algorithms.
• M. Nikhil Kumar et al. [9] used various algorithms, such as decision tree, random forest, Naive Bayes, KNN, and support vector machine; logistic model tree and Naive Bayes tended to give better performance than the other algorithms. They used the heart disease dataset from the UCI repository. Their results showed that the J48 algorithm took less time to build than the others and performed better.
• Min Chen, Yixue Hao, Kai Hwang, Lu Wang, and Lin Wang, "Predicting heart diseases in machine learning over big data Communities": In this paper, they designed machine learning algorithms for efficient prediction of disease outbreaks in various communities and experimented with modified prediction models on real-life hospital data collected from central China. They applied machine learning algorithms to various medical datasets to automate the analysis of large and complex data. The paper presents various supervised models, such as SVM, KNN, Naïve Bayes, Decision Trees (DT), Random Forest (RF), and ensemble models, to assess the risk of cardiovascular diseases.

III. APPROACH AND METHODOLOGY
A. Steps followed for the present work:
6. Then we used the SMOTE library to convert the imbalanced data set into a balanced data set.
7. Then, using the confusion matrix, a well-trained data set was obtained.
8. Lastly, the kappa value and the accuracy were calculated for the data set using the different algorithms.
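As a minimal sketch of how the accuracy and the kappa value in step 8 can be derived from a confusion matrix, the plain-Python function below computes both. The function name and the matrix values are hypothetical and chosen only for illustration; they are not results from this study. In practice, scikit-learn provides `cohen_kappa_score` in `sklearn.metrics` for the same purpose.

```python
def accuracy_and_kappa(cm):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    cm[i][j] is the number of samples whose true class is i and whose
    predicted class is j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    # Observed agreement: fraction of samples on the diagonal (the accuracy).
    observed = sum(cm[i][i] for i in range(n_classes)) / total
    # Chance agreement: sum over classes of (row total * column total) / total^2.
    expected = sum(
        sum(cm[i]) * sum(row[i] for row in cm) for i in range(n_classes)
    ) / (total * total)
    # Kappa is undefined when the chance agreement is exactly 1.
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical 2x2 matrix (rows: true class, columns: predicted class).
example_cm = [[40, 10],
              [5, 45]]
acc, kappa = accuracy_and_kappa(example_cm)
print(f"accuracy = {acc:.2f}, kappa = {kappa:.2f}")
```

For this illustrative matrix the accuracy is 0.85 and the kappa value is 0.70, which shows that kappa discounts the agreement expected by chance and is therefore lower than the raw accuracy.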

B. For all algorithms we have used the libraries below.
• Scikit-learn library: A Python library that provides implementations of many unsupervised and supervised learning algorithms.
• Pandas library: A software library written for the Python programming language for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series.
• NumPy library: It adds support for large, multidimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
For all algorithms we use a correlation matrix as a preprocessing step, and the same preprocessing techniques are applied for every algorithm. After that, we balance the dataset using SMOTE (Synthetic Minority Over-sampling Technique). The different machine learning algorithms are explained below:
• K-Nearest Neighbour (KNN): KNN is a nonparametric, supervised machine learning algorithm. This means that all the data are labeled and the algorithm learns to predict the output from the input data. The data are divided into training and test sets. The training set is used for model building and training. A value of k is chosen, often the square root of the number of observations. The test data are then predicted with the model built [10]. The formula for the Euclidean distance is as follows: $d = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$
• Naïve Bayes (NB): This is a classification algorithm that is used when the dimensionality of the input is very high. A Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. It is based on Bayes' theorem [11], which is as follows: $P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$. This calculates the probability of Y given X, where X is the prior event and Y is the dependent event. The classifier needs little training data, can be used for binary classification problems, and is very simple.

• Random Forest (RF): It is a supervised machine learning algorithm that is an ensemble of multiple decision trees. The technique can be used for both regression and classification tasks, and it performs better in classification tasks; in our work we have used it for classification only [12]. For classification, it uses a voting system among the trees to decide the class. It works well with large, high-dimensional datasets, which our study also supports.
• Artificial neural network (ANN): As the name indicates, artificial neural networks are computational models designed like an animal's central nervous system (in particular the brain) that are capable of machine learning and pattern recognition. They are usually presented as systems of interconnected "neurons" that can compute values from inputs by feeding information through the network [13]. Through a learning process, an ANN can be used for pattern recognition or data classification. Its main advantages are the capacity to find complex relations among variables, a high tolerance to data uncertainty, and the ability to provide predicted variable patterns in real time. An ANN can also perform tasks that a linear model cannot, and if one of the neurons fails, the others continue to work in parallel [14].
• Logistic Regression (LR): It is a supervised learning classification algorithm used to estimate the probability of a target variable. In this method, the nonlinear regression is transformed into a linear regression, and an S-shaped function constrains the estimated probabilities to lie between 0 and 1 [15]. Logistic regression addresses problems in which one or more independent variables determine a dependent variable, which is the outcome. It has many applications in the medical field.
• Stochastic Gradient Descent (SGD): Stochastic gradient descent is a type of gradient descent used to find the values of the parameters (coefficients) of a function that minimize a cost function over a large amount of data. Gradient descent is best used when the parameters cannot be calculated analytically.
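The last two items above can be illustrated together. The following is a minimal sketch, not the implementation used in this study: it fits a one-feature logistic regression by stochastic gradient descent using only the Python standard library. The function names, the learning rate, the number of epochs, and the toy data are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    """S-shaped function that maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_sgd(xs, ys, lr=0.5, epochs=200, seed=0):
    """Fit p(y = 1 | x) = sigmoid(w * x + b) by stochastic gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            # Derivative of the log-loss with respect to (w * x + b) for one sample.
            error = sigmoid(w * xs[i] + b) - ys[i]
            # Update the parameters from this single sample only.
            w -= lr * error * xs[i]
            b -= lr * error
    return w, b


def predict(w, b, x):
    """Predict class 1 when the estimated probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Hypothetical toy data: negative inputs belong to class 0, positive to class 1.
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic_sgd(xs, ys)
predictions = [predict(w, b, x) for x in xs]
print(predictions)
```

Each update uses one sample at a time rather than the whole data set, which is what makes this form of gradient descent practical for large amounts of data.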

IV. RESULT AND DISCUSSION
After applying the six methods, we obtained the accuracy and Kappa values for all the models. The machine learning models are evaluated using the AUC-ROC metric and by comparing the Kappa values of the different models. This metric is used to understand the performance of a model. The ROC curve is the Receiver Operating Characteristic curve, and the AUC is the area under that curve. A model with a high Kappa value will generally also have high accuracy, and vice versa. To measure performance, the experiments were carried out with K-Nearest Neighbor, Naïve Bayes, Random Forest, Artificial Neural Network, Logistic Regression, and Stochastic Gradient Descent. These models are very popular in big data analytics for classification in the health, banking, and e-commerce sectors, which makes them readily available to use.
The accuracy and Kappa values produced by KNN, NB, RF, ANN, LR, and SGD for heart disease patients are shown in Table III. Comparing the models in Table III, Random Forest has the highest Kappa value (0.9804), and the accuracy of the Random Forest model is likewise higher than that of all other models. This is represented graphically in Fig. 4.
If the AUC value is high, the model performance is high, and vice versa. Comparing the AUC-ROC graphs, the AUC value of RF is higher than that of ANN, LR, and KNN, which again indicates the higher accuracy of the RF algorithm. Achieving the highest classification accuracy with the mentioned classifiers will help determine more accurately which patients may have heart disease.
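To make the AUC comparison concrete, the sketch below computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name, labels, and scores are hypothetical and are not taken from Table III; scikit-learn offers `roc_auc_score` in `sklearn.metrics` for real use.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (0 or 1) and real-valued scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Count positive/negative pairs ranked correctly; a tie counts as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted scores for four patients.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(example_labels, example_scores)
print(f"AUC = {auc:.2f}")
```

In this example three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75; a value of 1.0 would mean every positive case is scored above every negative case.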

V. CONCLUSION
According to the work done in this paper, it can be concluded that there is a huge opportunity for machine learning algorithms. Random Forest gives the most accurate results compared to the other algorithms, and Logistic Regression gives the least accurate. Random Forest builds multiple decision trees from the data and combines the prediction from each of them, whereas the regression model is prone to outliers and noise, which must be removed once they are identified on a graph, and it tends to overfit when there are fewer observations than features. The ANN model is slightly better than KNN because an ANN stores information across the entire network and can work with incomplete information, which is not possible with KNN; an ANN also provides distributed memory. KNN does not work well with large data sets, and it does nothing with the training data until prediction time, which is why it is also called a lazy learning algorithm. Models based on machine learning algorithms and techniques have been very precise in predicting heart-related diseases, but there is still a lot of scope for researchers to work on handling high-dimensional data, outliers, and overfitting. Further research can also be done on finding the right ensemble of algorithms for a particular type of data.