Heart Disease Data

A support vector machine (SVM) is a Machine Learning method used for classification. The SVM is based on the Maximum Margin Classifier, a geometric approach to finding the best plane to separate two groups of data points. When the data are separable (there is no overlap between the two classes), this plane should maximise its distance from the closest points of each class (the margin). The SVM generalises this technique to data that are not separable. This supervised classification method has become increasingly popular in recent years, since it is said to combine the geometric separation abilities of neural networks with the speed of logistic regression.

But the real power of SVMs lies in the use of kernels. Sometimes the data are not linearly separable at all. The main idea behind kernels is that if we transform the data (using some kernel) so that they are spread across additional dimensions, they may become linearly separable in that higher-dimensional space. For example, the following data in one-dimensional space are not linearly separable (that is, they cannot be separated by a single point):

But after we apply the transformation that corresponds to a polynomial kernel:

(φ(x) = (x, x^2))

the data are transformed and, as shown, become linearly separable in the newly defined space.
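This idea can be sketched in a few lines of Python. The points below are made up for illustration: two outer points and three inner points that no single threshold on the line can separate, but which the map (x, x^2) lifts into a 2-D space where a horizontal line separates them.

```python
# Made-up 1-D points that a single threshold cannot separate:
outer = [-3, -2, 2, 3]   # one class sits at both ends of the line
inner = [-1, 0, 1]       # the other class sits in the middle

def phi(x):
    """Map a 1-D point into 2-D: x -> (x, x^2)."""
    return (x, x * x)

# In the new space the line x^2 = 2 separates the two classes:
assert all(phi(x)[1] > 2 for x in outer)   # outer class lies above the line
assert all(phi(x)[1] < 2 for x in inner)   # inner class lies below the line
```

No threshold on the original axis works, because the outer class surrounds the inner one; after the transformation the second coordinate alone separates them.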

The data used for Machine Learning are often high-dimensional, and hence not easy for humans to visualise, perceive and interpret. One method used to visualise these types of data is dimensionality reduction. The basic idea behind dimensionality reduction is that we can project our data onto some lower-dimensional space in a way that preserves as much information as possible. Dimensionality reduction can be applied as a preprocessing step for Machine Learning, to improve performance or even the quality of the results. It can also be used for visualisation, where the dimensionality of the data has to be reduced to two or three. Although this is a nice method for visualising the raw data themselves, if the original data were high-dimensional, dimensionality reduction can lead to a large loss of information.

Principal Component Analysis (PCA) is the most commonly used technique for dimensionality reduction. PCA is essentially an orthogonal linear transformation that maps the data to a new coordinate system, such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. PCA relies on a matrix factorisation technique called Singular Value Decomposition (SVD), which is the equivalent of eigendecomposition for non-square matrices.

According to the SVD, any (n × m) matrix (X) can be factorised as follows:

(X = USW^T)

where (U) and (W) contain the left and right singular vectors, and (S) is a diagonal matrix of positive values, the singular values.

We can now compute the projected data of dimensionality (L) using the first (L) singular values and their left singular vectors:

(T = U_LS_L)
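The projection above can be sketched with NumPy (a stand-in for however the mini-app computes it internally; the small data matrix is made up for illustration):

```python
import numpy as np

# Toy data matrix: n = 6 samples, m = 3 features (values are made up).
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.8],
              [1.9, 2.2, 0.6],
              [3.1, 3.0, 0.4],
              [2.3, 2.7, 0.9]])

Xc = X - X.mean(axis=0)                             # centre the data first
U, s, Wt = np.linalg.svd(Xc, full_matrices=False)   # X = U S W^T

L = 2                       # target dimensionality
T = U[:, :L] * s[:L]        # projected data: T = U_L S_L

# Projecting with the right singular vectors gives the same result,
# since X W = U S:
assert np.allclose(T, Xc @ Wt[:L].T)
```

Centring the data before the SVD matters: without it the first component tends to capture the mean rather than the direction of greatest variance.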

Evaluating the quality of a Machine Learning technique can be a difficult task. One way to evaluate supervised learning techniques is to compute the accuracy: the ratio of correctly predicted responses on new data. Although this gives a good intuition of how well a classifier would perform on unknown data, it can easily lead to overfitting (finding a model for the available data that describes them very well, but does not generalise well enough).

The confusion matrix is a table layout that allows us to see the performance of a classifier. The confusion matrix has the following layout:

| | Actual positive | Actual negative |
| --- | --- | --- |
| Predicted positive | TP | FP |
| Predicted negative | FN | TN |

Although a confusion matrix still only describes the correctly and incorrectly classified responses, it gives a better picture of what is happening with the classifier (for example, are there many more predicted positive responses than there should be?).

Based on the confusion matrix we can calculate the True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) of the classifier. True Positives are simply the counts of correctly predicted positive responses, while False Positives are negative responses wrongly predicted as positive. The same applies to True Negatives and False Negatives.

A much better pair of metrics for the quality of a classifier, helping to avoid or reduce overfitting, is precision and recall. Both values are calculated from TP, TN, FP and FN, as follows:

(Precision = TP / (TP + FP)) (Recall = TP / (TP + FN))

If we want just one metric, for easier comparisons and visualisations, we can combine these two into the F-measure, which is their harmonic mean:

(F-measure = 2 · Precision · Recall / (Precision + Recall))

The SVM Example mini-app was created as an example of how we can use and tune an SVM classifier on any data. We can select the dataset to be used with the SVM classifier, as well as the features and the response variable. Next, we split the data into training and testing subsets and select the kernel to be used by the SVM classifier. After we have configured the SVM classifier we can browse through the rest of the mini-app and evaluate how well the classifier performed.

In this test case we are using the Heart Disease Data described earlier. As mentioned, we have to configure the SVM classifier before we evaluate its quality, so we select the heart_disease_data, using all the fields provided and the field num as the variable to predict:

We will split the dataset into a 75% training subset and a 25% testing subset, and we are going to use the linear kernel to begin with (which basically means no transformation is applied to the data):
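The mini-app performs the split internally; as a rough sketch of what a 75/25 random split involves, in plain Python (the row indices here are stand-ins for the real dataset):

```python
import random

random.seed(42)                  # fix the seed so the split is reproducible
rows = list(range(100))          # stand-in for 100 labelled data rows

random.shuffle(rows)             # shuffle so the split is random, not ordered
cut = int(len(rows) * 0.75)      # 75% of the rows go to training
train, test = rows[:cut], rows[cut:]

print(len(train), len(test))     # 75 25
```

Shuffling before cutting matters: if the rows were ordered by class, a straight cut would put some classes entirely in one subset.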

In the first tab, “Principal Component Analysis”, we can see the original data projected onto their first two principal components. There are two plots side by side: one coloured by the existing, correct labels of the data points, and one coloured by the labels predicted by the SVM:

Although at a brief look it might seem that the classifier missed a lot of information, if we observe a bit more closely we can see that Class 0 is identified quite well in the test subset, meaning both that most of the Class 0 data points are predicted to belong to Class 0, and that not many points from other classes are classified as Class 0. Class 2 is also identified relatively well, while Class 1 and Class 3 are identified much worse. There are also no data points predicted to belong to Class 4. It is clear that this happens because of the difference in the number of examples per class: since Class 4 has the fewest examples, we are unable to describe it efficiently with the current setting.

We also have the “Confusion Matrix” tab, displayed in a one-vs-all style of classification. Since the target variable has five classes, we have five different confusion matrices that describe how well each of these classes is identified by our classifier, in comparison to the rest of the classes. Example confusion matrices:
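One-vs-all means each class in turn is treated as "positive" and all other classes as "negative". A minimal sketch in Python, with made-up labels for three classes (the mini-app computes this internally for the five heart-disease classes):

```python
from collections import Counter

# Made-up actual and predicted labels for classes 0..2:
actual    = [0, 0, 1, 1, 2, 2, 0, 1]
predicted = [0, 1, 1, 0, 2, 2, 0, 1]

def one_vs_all(actual, predicted, cls):
    """Confusion counts treating `cls` as positive, every other class as negative."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == cls and p == cls:
            counts["TP"] += 1
        elif a == cls:
            counts["FN"] += 1
        elif p == cls:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts

for cls in sorted(set(actual)):
    print(cls, dict(one_vs_all(actual, predicted, cls)))
```

Each class therefore gets its own 2 × 2 confusion matrix, which is what makes per-class precision, recall and F-measure possible in the multi-class setting.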

Finally, we also have the F-measure plot, which gives us a clear picture of the quality of our results:

We can observe that, as expected from our comments on the PCA plots, the F-measure for Class 0 is really high (around 75%), Class 2 follows with 40%, and then we have Class 1 and Class 3 with F-measures of 15–20%. Finally, the F-measure for Class 4 is uncomputable, since no instances are classified as Class 4, so its recall is 0 and its precision is undefined. These numbers verify our first observations: Class 0 is identified very well, Class 2 somewhat worse, followed by Class 1 and Class 3. Class 4 is not identified at all.

To continue, we can keep changing the SVM classifier configuration and see how the performance of the SVM improves or worsens, so that we can understand it better and use it more efficiently to produce better results.

- Machine Learning book used
- Heart disease reference

Elpida joined Aridhia in September 2014 after completing an MSc in Artificial Intelligence at the University of Edinburgh. Her MSc thesis was on “Social network models for relationships between Java classes”. Elpida holds a BSc in Computer Science from the National and Kapodistrian University of Athens (NKUA). Her final year dissertation was entitled “Evaluation of Dimensionality Reduction Algorithms using the MapReduce programming paradigm”. Prior to joining Aridhia, Elpida worked with NKUA as a Research Associate, where her main responsibilities included early-stage data analysis tasks in telecommunications, such as ETL, data cleaning and reporting on KPIs. She has a keen interest in machine learning and big data technologies and their application in the healthcare domain. Elpida has been heavily involved in developing AnalytiXagility’s mini-apps, our web-based interactive data visualisations based on the R Shiny framework.
