SVM Example

September 29, 2015 | Elpida

Data source

Heart Disease Data (link and data description to be added).

The SVM modelling mini-app

The SVM Example mini-app was created as an example of how to use and tune an SVM classifier on any data. We can select the dataset to be used with the SVM classifier, as well as the features and the response variable. Next, we split the data into training and testing subsets and select the kernel to be used by the classifier. Once the SVM classifier is configured, we can browse through the rest of the mini-app and evaluate how well it performed.

A test case with the Heart Data

In this test case we use the Heart Disease Data described earlier. As noted above, we have to configure the SVM classifier before we can evaluate its quality. We select the heart_disease_data dataset, using all the fields provided as features and the field num as the variable to predict:

data_config

We split the dataset into a 75% training subset and a 25% testing subset, and we use the linear kernel to begin with (which means the classifier works on the original features, with no non-linear transformation applied to the data):

data_config2
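The 75%/25% split described above can be sketched in a few lines. This is only an illustration, not the mini-app's own code: the function name `train_test_split`, the fixed seed, and the stand-in list of row indices are all assumptions made for the example.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                  # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for the dataset: 200 row indices.
rows = list(range(200))
train, test = train_test_split(rows)
print(len(train), len(test))
```

A plain random split like this does not guarantee that every class appears in both subsets, which matters when some classes have very few examples.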

In the first tab, “Principal Component Analysis”, we can see the original data projected onto their first two principal components. There are two plots side by side: one coloured by the existing, correct labels of the data points, and one coloured by the labels predicted by the SVM:

PCA

Although at first glance it might seem that the classifier missed a lot of information, a closer look shows that Class 0 is identified quite well in the test subset: most of the Class 0 data points are predicted to belong to Class 0, and not many points from other classes are classified as Class 0. Class 2 is also identified relatively well, while Class 1 and Class 3 are identified much worse. There are also no data points predicted to belong to Class 4. This is most likely due to the difference in the number of examples per class: since Class 4 has the fewest examples, we are unable to describe it effectively using the current settings.

We also have the “Confusion Matrix” tab, which displays the results as a one-vs-all classification. Since the target variable has five classes, we have five different confusion matrices, each showing how well one class is described by our classifier in comparison to the rest of the classes. Example confusion matrices:

confusion_matrix
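To make the one-vs-all idea concrete, here is a minimal sketch that builds one 2x2 confusion matrix per class by treating that class as "positive" and all other classes as "negative". The function name `one_vs_all_confusion` and the label lists are made up for illustration; they are not taken from the mini-app or from the Heart Disease Data.

```python
def one_vs_all_confusion(actual, predicted, classes):
    """Return {cls: {"tp", "fp", "fn", "tn"}}, treating each class as positive in turn."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    matrices = {}
    for cls in classes:
        counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
        for a, p in zip(actual, predicted):
            if p == cls and a == cls:
                counts["tp"] += 1   # correctly predicted as cls
            elif p == cls:
                counts["fp"] += 1   # predicted as cls, actually another class
            elif a == cls:
                counts["fn"] += 1   # actually cls, predicted as another class
            else:
                counts["tn"] += 1   # neither actual nor predicted is cls
        matrices[cls] = counts
    return matrices


# Made-up labels, for illustration only.
actual = [0, 0, 0, 1, 2, 2, 3, 4]
predicted = [0, 0, 1, 0, 2, 2, 2, 0]
matrices = one_vs_all_confusion(actual, predicted, classes=range(5))
for cls, counts in matrices.items():
    print(cls, counts)
```

In this toy example no point is ever predicted as class 4, so its matrix has zero true positives and zero false positives, the same situation described for Class 4 in the test case.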

Finally, we have the f-measure plot, which gives us a clear picture of the quality of our results:

f_measure

We can observe that, as expected from our comments on the PCA plots, the f-measure for Class 0 is high (around 75%), Class 2 follows with 40%, and Class 1 and Class 3 have f-measures of 15-20%. Finally, the f-measure for Class 4 is uncomputable: since no instances are classified as Class 4, precision is undefined (0/0) and recall is 0. These numbers confirm our first observations: Class 0 is identified well, Class 2 somewhat worse, followed by Class 1 and Class 3, and Class 4 is not identified at all.
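For reference, the f-measure (here the F1 score, the harmonic mean of precision and recall) can be computed from the counts of a one-vs-all confusion matrix. The sketch below is an assumption about how such a value might be calculated, not the mini-app's implementation; the helper name `f_measure` and the counts are hypothetical. It returns `None` when the score cannot be computed, which is the case for a class that is never predicted.

```python
def f_measure(tp, fp, fn):
    """Return the F1 score, or None when it cannot be computed."""
    predicted_positive = tp + fp
    actual_positive = tp + fn
    if predicted_positive == 0 or actual_positive == 0:
        return None  # precision or recall would be 0/0
    precision = tp / predicted_positive
    recall = tp / actual_positive
    if precision + recall == 0:
        return None  # the harmonic mean would be 0/0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a well-identified class.
print(f_measure(tp=30, fp=10, fn=10))

# A class that is never predicted (tp = 0 and fp = 0) has no f-measure.
print(f_measure(tp=0, fp=0, fn=5))
```

Some tools report 0 instead of an undefined value in these edge cases; returning `None` here is a design choice that keeps "not computable" distinct from "computed and equal to zero".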

From here, we can keep changing the SVM classifier configuration and see how the performance improves or worsens, so that we can understand the classifier and use it more effectively to produce better results.

Further reading

  • Machine Learning book used
  • Heart disease reference



 


Elpida joined Aridhia in September 2014 after completing an MSc in Artificial Intelligence from the University of Edinburgh. Her MSc thesis was on the “Social network models for relationships between Java classes”. Elpida holds a BSc in Computer Science from the National and Kapodistrian University of Athens (NKUA). Her final year dissertation was entitled “Evaluation of Dimensionality Reduction Algorithms using the MapReduce programing paradigm”. Prior to joining Aridhia, Elpida worked with NKUA as a Research Associate, where her main responsibilities included some early stage data analysis tasks on telecommunication, such as ETL, data cleaning and reporting on KPIs. She has a keen interest in machine learning and big data technologies and their application in the healthcare domain. Elpida has been heavily involved in developing AnalytiXagility’s mini-apps, our web-based interactive data visualisations based on the R Shiny framework.
