September 29, 2015 | Elpida
Heart Disease Data
The SVM modelling mini-app
The SVM Example mini-app was created as an example of how to use and tune an SVM classifier on any dataset. We can select the dataset to be used with the SVM classifier, as well as the features and the response variable. Next, we split the data into training and testing subsets and select the kernel to be used by the classifier. Once the SVM classifier is configured, we can browse through the rest of the mini-app and evaluate how well the classifier performed.
A test case with the Heart Data
In this test case we use the Heart Disease Data described earlier. As noted above, we have to configure the SVM classifier before we can evaluate its quality. We therefore select the heart_disease_data dataset, use all the fields provided as features, and choose the field num as the variable to predict:
We split the dataset into a 75% training subset and a 25% testing subset, and we start with the linear kernel (which basically means that no transformation is applied to the data):
In the first tab, “Principal Component Analysis”, we can see the original data projected onto their first two principal components. There are two plots side by side: one coloured by the existing, correct labels of the data points, and another coloured by the labels predicted by the SVM:
At first glance it might seem that the classifier missed a lot of information, but a closer look shows that Class 0 is identified quite well in the test subset: most of the Class 0 data points are predicted to belong to Class 0, and few points from other classes are classified as Class 0. Class 2 is also identified relatively well, while Class 1 and Class 3 are identified much worse. No data points at all are predicted to belong to Class 4. The most likely cause is the difference in the number of examples per class: since Class 4 has the fewest examples, the classifier is unable to describe it effectively with the current settings.
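The split and the kernel choice can be sketched in a few lines of plain Python. This is only an illustration and not the mini-app's own code: the function names `split_train_test` and `linear_kernel`, the fixed random seed, and the toy rows are all assumptions made for the example.

```python
import random


def split_train_test(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def linear_kernel(x, y):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


# Toy rows, made up for illustration: each row is (features, label).
rows = [([float(i), float(i % 3)], i % 5) for i in range(20)]
train, test = split_train_test(rows)

print(len(train), len(test))                  # 15 5
print(linear_kernel([1.0, 2.0], [3.0, 4.0]))  # 11.0
```

Because the linear kernel is just the dot product in the original feature space, the classifier separates the classes with hyperplanes in that space; other kernels replace this function with one that compares the points after an implicit transformation.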
We also have the “Confusion Matrix” tab, which displays the results in a one-vs-all fashion. Since the target variable has five classes, we have five different confusion matrices, each describing how well one class is distinguished by our classifier from the rest of the classes. Example confusion matrices:
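To make the one-vs-all idea concrete, here is a minimal sketch of how the four cells of each per-class confusion matrix can be counted. The helper name `one_vs_all_confusion` and the label lists are made up for illustration; they are not the mini-app's code or its actual results.

```python
def one_vs_all_confusion(actual, predicted, target_class):
    """Count TP, FP, FN and TN for target_class against all other classes."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == target_class and a == target_class:
            tp += 1  # correctly predicted as the target class
        elif p == target_class:
            fp += 1  # predicted as the target class, but belongs elsewhere
        elif a == target_class:
            fn += 1  # belongs to the target class, but predicted elsewhere
        else:
            tn += 1  # neither actual nor predicted label is the target class
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


# Made-up labels for illustration only.
actual = [0, 0, 0, 1, 1, 2, 2, 3, 4, 0]
predicted = [0, 0, 1, 0, 1, 2, 2, 2, 0, 0]

for cls in range(5):
    print(cls, one_vs_all_confusion(actual, predicted, cls))
```

In this toy data no point is ever predicted as class 4, so its matrix has zero true positives and zero false positives, which mirrors the situation seen for Class 4 in the test case.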
Finally, we also have the f-measure plot, which gives us a clear picture of the quality of our results:
As expected from our comments on the PCA plots, the f-measure for Class 0 is quite high (around 75%), Class 2 follows with 40%, and Class 1 and Class 3 come next with an f-measure of 15-20%. Finally, the f-measure for Class 4 is not computable: since no instances are classified as Class 4, precision is undefined (there are no predicted positives) and recall is 0. These numbers confirm our first observations: Class 0 is identified very well, Class 2 is identified somewhat worse, followed by Class 1 and Class 3, and Class 4 is not identified at all.
To continue, we can keep changing the SVM classifier configuration and see how its performance improves or worsens, so that we can understand the classifier and use it more effectively to produce better results.
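As a rough sketch of how such a score can be computed from one-vs-all counts, the snippet below assumes the plotted f-measure is the balanced F1 score (the harmonic mean of precision and recall); the post does not state which variant the mini-app uses. The function name `f_measure` and the counts passed to it are hypothetical.

```python
def f_measure(tp, fp, fn):
    """Return the F1 score, or None when it cannot be computed."""
    # Precision is undefined when nothing was predicted as this class.
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    # Recall is undefined when the class has no actual examples.
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    if precision is None or recall is None or precision + recall == 0:
        return None
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: a class that is found fairly often...
print(f_measure(tp=3, fp=2, fn=1))
# ...and a class that is never predicted, as with Class 4 above.
print(f_measure(tp=0, fp=0, fn=1))  # None
```

Returning `None` instead of 0 keeps the "not computable" case visibly distinct from a class that is predicted but always wrongly.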