Distinguishing Cardiac Arrhythmia: A Data Mining Study

This research aims to distinguish between the presence and absence of cardiac arrhythmia and to classify each case into one of 16 groups. The research questions concern the accuracy of heartbeat classification. The project follows the CRISP-DM procedure, a well-proven methodology known for its robustness. The provided data set is analysed using KNN, Naïve Bayes, SVM, Gradient Boosting, Model tree and Random Forest. The database comprises 279 attributes, of which 206 are linear valued and the remaining are nominal.

Class 01 denotes ‘normal’, classes 02 to 15 denote different classes of arrhythmia, and class 16 denotes the remaining unclassified cases. Currently, a computer program makes all these classifications, but there are some differences between the cardiologist's classification and the program's. The cardiologist's classification is taken as the gold standard, and the aim is to decrease this difference with the help of machine learning tools.

Overall, this project aims to apply the methods learnt in the course module in a substantial data mining study, using the provided data set as its source.


The objective of this project is to distinguish between the presence and absence of cardiac arrhythmia and to classify each case into one of the 16 groups. The purpose is thus to decrease the differences between the cardiologist's classification and the program's.

According to [1], several models are used to predict a project's risk of failure, and one of them is Naïve Bayes. In that study, the Naïve Bayes model was used to create the confusion matrix. When the results were compared, they showed a different count from the population used for scoring the set, which indicated that the validation had to be corrected. The main reason for selecting Naïve Bayes is its capacity to handle missing data, which is useful for projects like CSI, where data goes missing every now and then. Selecting Naïve Bayes therefore serves two purposes: the missing information is handled, and a simple probabilistic classifier based on the Bayes assumptions is obtained. The results of Naïve Bayes are often inaccurate and can be poorly calibrated in certain cases, but its performance in practice is good. The performance is satisfactory because the estimated probabilities remain closely tied to the actual outcomes. Where this holds, large samples can be computed quickly and data with a missing element can be handled indirectly.


As per [2], KNN is one of the simplest machine learning algorithms. It rests on the idea that objects that are close to each other share similar characteristics, so the class of an object is predicted from the characteristic features of its nearest neighbors. KNN generally deals with continuous attributes, but it can also work with discrete attributes. For a discrete attribute, if the values for two instances a2 and b2 differ, the difference between them is taken as one; otherwise it is zero. The results reported for KNN cover sensitivity, specificity, and accuracy in the diagnosis of patients with heart disease. K ranges from 1 to 13, and the achieved accuracy ranges from 94 percent to 97.4 percent depending on the value of K. K = 7 achieved the highest accuracy and specificity (97.4% and 99% respectively). The paper shows that KNN is a widely used data mining method for classification problems. Its simplicity is regarded as its main strength, and its relatively high convergence speed makes it a popular choice. The major drawback of KNN classifiers is the large memory required to store the complete sample; when the sample is large, the response time on a sequential computer is also long. KNN is used to find the closest neighbors of the given data among all the available training data. In the paper, if a label is found the algorithm stops; otherwise the system classifier is applied. The proposed algorithm was used to recognize the object, and the results were compared with those obtained with the single system classifier and with KNN.
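The distance rule for discrete attributes described above can be sketched in a few lines. This is a minimal illustration and not the implementation from [2]; the function names `mixed_distance` and `knn_predict`, the choice of Euclidean distance for numeric attributes, and the toy data are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def mixed_distance(a, b):
    """Distance over mixed attributes: squared difference for numeric
    values, 0/1 mismatch for discrete (non-numeric) values."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += (x - y) ** 2
        else:
            # Discrete attribute: difference is 1 if the values differ, else 0.
            total += 0.0 if x == y else 1.0
    return sqrt(total)


def knn_predict(train, query, k=3):
    """train is a list of (features, label) pairs; returns the majority
    label among the k training instances closest to query."""
    neighbours = sorted(train, key=lambda item: mixed_distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy data (made up): (heart rate scaled to [0, 1], sex) -> label
train = [
    ((0.20, "F"), "normal"),
    ((0.25, "M"), "normal"),
    ((0.30, "F"), "normal"),
    ((0.80, "M"), "arrhythmia"),
    ((0.85, "M"), "arrhythmia"),
]
prediction = knn_predict(train, (0.28, "F"), k=3)
```

Because every training instance is kept and scanned for each query, the sketch also makes the memory and response-time drawback mentioned above visible.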

According to [3], the authors analyze the effectiveness of SVM in classifying a medical dataset (heart disease classification). The performance of the Naïve Bayes classifier, the RBF network and the SVM classifier is also analyzed. The observations indicate that SVM achieves an effective level of classification accuracy. The authors used the WEKA environment to obtain the results. For medical datasets in particular, the SVM classifier proved to be robust and effective.

The authors in [4] concluded that bagging works well for most decision tree types but needs some tuning, whereas neural nets and SVMs need careful selection of parameters. Random Forest, boosting, SVM and other methods showed significant performance on the STATLOG data set.

Literature Review

It is stated in [5] that model trees are a type of decision tree that has linear regression functions at the leaves. They are considered a successful method for predicting continuous numeric values.

The general methodology used for this project is the CRISP-DM (Cross-Industry Standard Process for Data Mining) procedure. This process provides a structured method for planning a data mining project. It is effective, robust and well proven. The model starts by setting the objectives, producing a project plan and laying out the business success criteria. The figure below represents the CRISP-DM model [6].

Figure: CRISP-DM model

The CRISP-DM procedure includes six main stages [6]:

  • Business understanding

This stage aims to understand the business's requirements and vision. It then translates these into a definition of the data mining questions and develops a plan to meet the organization's aim.

  • Data understanding

In this stage, the unstructured data is examined to understand the information hidden in it, to form hypotheses, and to assess the quality of the data.

  • Data preparation

This stage contains data selection for analysis, cleaning the data, constructing the required data and integrating the data.

  • Modeling

Once the existing subgroups are understood, the model is ready for construction, which involves table formation, reporting, understanding characteristics, and data preprocessing.

  • Evaluation

This stage assesses the degree to which the business objectives are met and determines whether there is any business reason for the model to be deficient. The data mining results are evaluated in this phase, and the challenges and future direction are also identified.

  • Deployment

This stage summarizes the deployment strategy along with its important steps and how they are carried out, for instance plan monitoring and maintenance. Finally, a final report is written based on the deployment plan, documenting the project's experiences and summary.

The Naive Bayes classification algorithm is a probabilistic classifier based on probability models that incorporate strong independence assumptions. It is named after Thomas Bayes. In reality the independence assumptions often do not hold, which is why they are regarded as naive. The probability models can be derived with the help of Bayes’ theorem. Depending on the nature of the probability model, the Naive Bayes algorithm can be trained in a supervised learning setting. The Naive Bayes model comprises a large cube with the following dimensions [7]:

  • Name of the input field.
  • Value of the input field for discrete fields, or value range for continuous fields; the Naive Bayes algorithm divides continuous fields into discrete bins.
  • Value of the target field; the model records how many times each target field value appears together with each input field value.
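The count cube described above can be illustrated with a small sketch. It is only a toy reading of the description in [7], not the code of any particular product; the names `bin_value` and `build_count_cube`, the bin edges, and the sample rows are hypothetical.

```python
from collections import defaultdict


def bin_value(value, edges):
    """Map a continuous value to the index of its discrete bin."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


def build_count_cube(rows, targets, bin_edges):
    """Count how often each (field, value-or-bin, target) triple occurs.

    rows: list of dicts mapping input field name -> value
    bin_edges: dict mapping continuous field name -> sorted bin edges
    """
    cube = defaultdict(int)
    for row, target in zip(rows, targets):
        for field, value in row.items():
            if field in bin_edges:
                # Continuous field: replace the value by its bin index.
                value = bin_value(value, bin_edges[field])
            cube[(field, value, target)] += 1
    return cube


# Made-up sample: one discrete field and one continuous field.
rows = [
    {"sex": "F", "heart_rate": 62},
    {"sex": "M", "heart_rate": 110},
    {"sex": "F", "heart_rate": 71},
]
targets = ["normal", "arrhythmia", "normal"]
cube = build_count_cube(rows, targets, {"heart_rate": [60, 100]})
```

Dividing a cell of the cube by the number of training rows with the same target value gives the conditional frequency that a Naive Bayes classifier multiplies across input fields.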


Data Mining Methodology

KNN stands for K Nearest Neighbors. It is a simple algorithm that stores all the available cases and classifies new cases based on a similarity measure (e.g., distance functions). It was already in use in the early 1970s as a non-parametric technique in statistical estimation and pattern recognition [8].

KNN can be used for both classification and regression predictive problems, but in industry it is mostly used for classification [9]. To evaluate any technique, the following aspects are checked [10]:

  1. Ease of interpreting the output.
  2. Calculation time.
  3. Predictive power.

Short-term traffic prediction is important for intelligent traffic systems, and it is influenced by neighboring traffic conditions. Gradient boosting decision trees (GBDT), an ensemble learning method, can make short-term traffic predictions from a traffic volume data set. In that setting, GBDT was shown to achieve higher prediction accuracy and smaller prediction errors than SVM, while multi-step-ahead models were less accurate than 1-step-ahead models [11].

In this work the SVM algorithm is used for classification. SVM is a machine learning method that has established itself as a powerful tool in many classification problems. Simply stated, the SVM identifies the best separating hyperplane between the two classes of training samples within the feature space by focusing on the training cases placed at the edge of the class descriptors. In this way, an optimal hyperplane is fitted and fewer training samples are effectively used, so high accuracy is achieved with small training sets. Although SVM separates the data into only two classes, classification into additional classes is possible by applying either the one against all (OAA) or the one against one (OAO) technique [12].

  • The one against all method uses a set of binary classifiers, each of which separates one class from all the other classes; each data object is assigned to the class with the largest decision value.
  • The one against one method constructs a set of SVMs in parallel, each trained on the data from two classes. It uses a voting strategy to predict the class of a data object.
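The two multi-class strategies above differ only in how the binary outputs are combined, which the following sketch tries to make concrete. It assumes the binary SVMs are already trained and represents them by plain numbers and a stand-in function; `predict_one_against_all`, `predict_one_against_one` and the example values are hypothetical.

```python
from collections import Counter
from itertools import combinations


def predict_one_against_all(decision_values):
    """decision_values maps each class to the decision value of its
    'this class vs. the rest' classifier; pick the largest."""
    return max(decision_values, key=decision_values.get)


def predict_one_against_one(classes, pairwise_winner):
    """pairwise_winner(a, b) returns the class preferred by the binary
    classifier trained on classes a and b; the most-voted class wins."""
    votes = Counter(pairwise_winner(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


# Hypothetical decision values from three one-against-all classifiers.
oaa_label = predict_one_against_all({1: -0.4, 2: 0.9, 5: 0.1})

# Hypothetical pairwise classifiers: each pair prefers the smaller class
# label here, only so that the example has a deterministic outcome.
oao_label = predict_one_against_one([1, 2, 5], lambda a, b: min(a, b))
```

With the 16 classes of this data set, one against all needs 16 binary classifiers, while one against one needs 16 × 15 / 2 = 120, each trained on a smaller subset of the data.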

Random forest is a flexible, easy-to-use data mining algorithm that produces good results. Because of its simplicity it is one of the most widely used algorithms, and it can be used for both regression and classification tasks. It can handle missing values and model categorical values. It adds additional randomness to the model while growing the trees. It has the following advantages compared with other classification techniques.

  1. It helps avoid the overfitting problem in classification problems.
  2. The same random forest algorithm can be used for both regression and classification tasks.
  3. It can be used to identify the most important features in the training data set [13].
  4. It increases the predictive power.
  5. It increases the model's speed.
  6. In the medical domain, it can be used both to identify the right combination of components in medicine and to identify diseases by analyzing the patient’s medical records.
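A random forest classifier combines its trees by majority vote: each tree predicts a class and the forest returns the most frequent one. The sketch below shows only this voting step and leaves out tree growing and the added randomness; the three "trees" are hand-written stand-ins, and the feature names and thresholds are made up for illustration.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Return the mode of the class labels output by the individual trees."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]


# Three hypothetical decision stumps standing in for trained trees.
trees = [
    lambda s: "arrhythmia" if s["heart_rate"] > 100 else "normal",
    lambda s: "arrhythmia" if s["qrs_duration"] > 120 else "normal",
    lambda s: "arrhythmia" if s["heart_rate"] > 90 else "normal",
]
label = forest_predict(trees, {"heart_rate": 95, "qrs_duration": 130})
```

In this example the first stump votes "normal" and the other two vote "arrhythmia", so a single tree's error is outvoted by the rest of the ensemble.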

Model trees are a type of decision tree that has linear regression functions at the leaves. They are considered a successful method for predicting continuous numeric values [5]. They can also be used for classification problems by means of a standard method of transforming a classification problem into a function approximation problem.

Result Interpretation

In this section, we present the results of our experiments with the random forest and support vector machine algorithms [14]. Experimental results are given for the resampled arrhythmia dataset. We also use benchmark datasets such as thyroid, cardiotocography, and audiology to test the efficiency of our classification algorithm, and brief results for these datasets are given as well. The results of the analysis are summarized as a comparison of accuracy and learning time on the dataset between random forest, KDD and SVM.

SVM is effective in high-dimensional spaces such as the arrhythmia data set. First, mRMR feature selection was performed. The data set was then split 70%-30% between training and test sets respectively. Since we are dealing with a skewed data set with a small number of rows, we used bootstrapping to improve the performance of the algorithm. The training data was doubled in size using random sampling, while ensuring that every data point in the original training data was represented at least once. To determine the most appropriate type of kernel, the SVM model was built using polynomial kernels of varying degrees and a Gaussian kernel. The quadratic kernel resulted in a good model fit, minimizing the generalization error, as can be seen in Figure 1 [15].
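One possible reading of the resampling step described above is sketched below: keep every original training point once, then add the same number of points drawn at random with replacement. This is an assumption about the procedure, not the code used in [15]; the function name `double_with_bootstrap`, the fixed seed, and the toy records are hypothetical.

```python
import random


def double_with_bootstrap(train, seed=0):
    """Return a training set twice the original size: every original
    point once, plus an equal number drawn at random with replacement."""
    rng = random.Random(seed)
    extra = [rng.choice(train) for _ in range(len(train))]
    resampled = list(train) + extra
    rng.shuffle(resampled)
    return resampled


# Made-up training records: (identifier, class label).
original = [("x%d" % i, i % 3) for i in range(10)]
doubled = double_with_bootstrap(original)
```

Only the training portion of the 70%-30% split is resampled in this sketch, so that the test set stays untouched and the generalization error is not measured on duplicated points.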

This led us to assume that there were significant second-order interactions among the feature variables in the design matrix. Figure 2 plots the generalization accuracy on the test set against the number of top features selected. It can be seen that the best accuracy is obtained with around 254 features. Moreover, because of the highly skewed distribution of classes, the model proved inefficient in predicting classes with low density. In particular, two classes had just a single tuple in the test set. The sheer lack of data meant that there was no way to build meaningful distributions of the features needed to classify classes one and two. To address the issue of misclassifying class 0 (sinus tachycardia) as class 1 (normal), we used an anomaly detector. We treated the SVM as a one-class classifier and separated all the data points from the origin (in feature space F) so as to maximize the distance from this hyperplane to the origin [16]. This results in a binary function that captures the regions of the data space where the probability density of the data lives. Thus the function returns +1 in the region capturing the training data points and -1 elsewhere. To find anomalies in the data set, we used our intuition from the SVM confusion matrix, namely that class 5 was mostly misclassified as class 1. We then found the anomalies that lay far from the data cluster, closest to the origin, and identified the points predicted by our SVM model as class 5. We relabeled these instances of possible sinus tachycardia as normal. This improved the accuracy to 70%. The images below display the accuracy for each algorithm.

A basic decision tree gives good predictions when there is a large number of predictor variables, as in this data set. Early methods of building decision trees were unstable, with small perturbations in the data resulting in large changes in predictions. A random forest is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees [17]. Consequently, an RF ensemble classifier performs better than a single tree.

A protocol exists for the testing and evaluation of arrhythmia classification methods, and it also stipulates which databases should be used. However, it does not specify which patients or heartbeats should be used to build the classification model (the training phase) and which should be used to evaluate the methods (the testing phase), which may produce biased results. For example, it has been shown that using heartbeats from the same patient for both training and testing biases the evaluation. This is because the models tend to learn the particularities of the patient’s heartbeat during training and then obtain impressive figures during testing (close to 100%). As already mentioned, this heartbeat division scheme is called the intra-patient scheme in the literature [18].

In a clinical setting, however, a fully automatic algorithm or method will encounter heartbeats from patients other than those used for learning in the training phase. To standardize the protocol, a division of the heartbeats from the database into two sets has been proposed so that the evaluation becomes more consistent with the real world. In addition, several models that had reported an overall accuracy of almost 100% without regard to the heartbeat selection scheme were re-implemented [19], and their results were re-evaluated with the aim of reporting tests that follow the protocol. Note that the methods selected for this test are reasonably recent and consider the use of different classifiers and different forms of feature extraction. Analyzing the values, it can be seen that the results obtained by the same classification method using a random selection scheme are significantly better than the values obtained with tests that follow the protocol. To perform a fair evaluation of ECG-based heartbeat classification methods, heartbeats of the same patient should not be present in both the training and testing sets, since that is not a realistic scenario. Otherwise, the classifiers learn nuances of the patients in the training set, and the evaluation of a method on a testing set that contains heartbeats of a patient whose heartbeats are also present in the training set is biased, even if the individual heartbeats of that patient are different [20].
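The requirement that no patient contributes heartbeats to both sets can be enforced by splitting on patient identifiers instead of on individual heartbeats. The sketch below is one simple way to do so under stated assumptions and is not the division proposed in the cited work; the record layout `(patient_id, features, label)`, the function name `split_by_patient` and the 30% test fraction are hypothetical.

```python
import random


def split_by_patient(beats, test_fraction=0.3, seed=0):
    """beats is a list of (patient_id, features, label) tuples. Whole
    patients are assigned to either the training or the testing set."""
    patients = sorted({patient_id for patient_id, _, _ in beats})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [b for b in beats if b[0] not in test_patients]
    test = [b for b in beats if b[0] in test_patients]
    return train, test


# Made-up records: 5 patients with 4 heartbeats each.
beats = [(p, [p * 0.1, i * 0.01], "N") for p in range(5) for i in range(4)]
train, test = split_by_patient(beats)
```

A random split over heartbeats would instead scatter each patient's beats across both sets, which is the intra-patient situation that inflates the reported accuracy.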

The overall generalization accuracy was 72.3%. We therefore used a serial classifier consisting of RF and a linear kernel SVM, which gave us a generalization error of 22.6%, or an accuracy of 77.4%. This is a marginal improvement in the generalization error. In this study, the prediction ability of the random forest classifier in determining minority test classes is increased by training the algorithm with simple random sampled data. Furthermore, this sampling strategy did not reduce the performance of the classifier in predicting the major classes. A resampling strategy may therefore be considered for datasets with imbalanced class distributions, to improve the prediction ability of classifiers for classes represented by a small number of instances. One of the main advantages of the proposed algorithm compared with the ECG-based approaches in the literature is that it is based entirely on the HRV signal, which can be extracted from the original ECG signal with high accuracy even for noisy and/or complicated recordings [21]. By contrast, most ECG-based methods use the morphological features of the ECG, which are seriously affected by noise. As a final point, owing to its short processing time and relatively high accuracy, the proposed method can be used as a real-time arrhythmia classification system.


The aims of this project are met: the methods learnt in the course module are applied and a data mining study is carried out using the provided data set. The objective is also met: the presence and absence of cardiac arrhythmia are distinguished and cases are classified into one of the 16 groups. The literature review provides a background for understanding the different methods, their uses and their effects in various studies. The methodology used is the CRISP-DM procedure. It is observed that the heartbeat classification accuracy across the classes is 77.3%. The methods KNN, Naïve Bayes, SVM, Gradient Boosting, Model tree and Random Forest are discussed.

Future work will look at how to improve the accuracy of heartbeat classification using an online detection method instead of the other available methods, along with improvements in time and budget. Moreover, it is believed that performance can be improved, so future work will also focus on performance.

Therefore, the purpose of decreasing the differences between the cardiologist's classification and the program's is achieved.


