Sunday, May 27, 2018

Finding the optimal value of k in kNN classifier


ABSTRACT: A major issue in k-nearest neighbour classification is how to choose the optimum value of the neighbourhood parameter 'k'. Popular cross-validation techniques often fail to guide us well in selecting k, mainly due to the presence of multiple minimizers of the estimated misclassification rate. In this study we discuss the effect of changing the value of 'k' on different data sets and find its optimal value.


Keywords: Cross-validation; Misclassification rate; Neighbourhood parameter; Non-informative prior;

1. Introduction

The k-Nearest-Neighbors (kNN) is a non-parametric classification method, which is simple but effective in many cases. For a data record t to be classified, its k nearest neighbors are retrieved, and this forms a neighborhood of t. Majority voting among the data records in the neighborhood is usually used to decide the classification for t with or without consideration of distance-based weighting. However, to apply kNN we need to choose an appropriate value for k, and the success of classification is very much dependent on this value.
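
To make that voting mechanism concrete, here is a minimal sketch in Python (written for this post; the study itself uses Weka's built-in kNN). It retrieves the k nearest neighbours of a record by Euclidean distance and decides the class by majority vote, with an optional distance-weighted variant:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5, weighted=False):
    """Classify a single record x by a vote among its k nearest neighbours."""
    # Euclidean distance from x to every training record.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]           # indices of the k closest records
    if not weighted:
        # Plain majority vote among the k neighbours.
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # Distance-weighted vote: closer neighbours count more (weight 1/d).
    votes = {}
    for i in nearest:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (dists[i] + 1e-9)
    return max(votes, key=votes.get)

# Tiny usage example with made-up points.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array(["a", "a", "b", "b"])
print(knn_predict(X, y, np.array([0.05, 0.1]), k=3))   # -> "a"
```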



kNN has a high cost when classifying new instances. This is mainly because nearly all computation takes place at classification time rather than when the training examples are first encountered. Being a lazy learning method that builds no model in advance, it is therefore ill-suited to applications that need fast, dynamic classification over large repositories.


2. Problem Definition

The performance of the kNN method is strongly influenced by the choice of k. There are many ways of choosing the k value, but a simple one is to run the algorithm many times with different k values and choose the one with the best performance, as sketched below. In this study we do the same.
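
A minimal sketch of that search, assuming scikit-learn and a placeholder dataset rather than the Weka setup used in this study; the candidate k values match those tried in Section 4:

```python
from sklearn.datasets import load_breast_cancer   # placeholder dataset, not one from this study
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

best_k, best_acc = None, -1.0
for k in [1, 3, 5, 7, 9, 13]:                     # same candidate values as in Table 2
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k:2d}  mean 5-fold accuracy = {acc:.4f}")
    if acc > best_acc:
        best_k, best_acc = k, acc

print(f"best k = {best_k} (accuracy {best_acc:.4f})")
```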

3. Experimental Design

For the study, we have taken four benchmark training datasets from the UCI machine learning repository, namely Heart Disease, Dermatology, Diabetes and Haberman's Survival, to examine the changes in the correctly classified instances of the kNN classifier.

The chosen approach is implemented in Weka, an open-source software tool that provides a collection of machine learning algorithms for data mining tasks.
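
The experiments here were run through the Weka GUI. For readers who prefer scripting, the ARFF files that Weka uses can also be read in Python; this sketch is not part of the original workflow, and the file name and class-attribute name in it are placeholder assumptions:

```python
import pandas as pd
from scipy.io import arff

# 'dermatology.arff' is a placeholder name for a locally saved UCI/Weka ARFF file.
data, meta = arff.loadarff("dermatology.arff")
df = pd.DataFrame(data)

# Nominal attributes are read as bytes; decode them to plain strings.
for col in df.select_dtypes([object]).columns:
    df[col] = df[col].str.decode("utf-8")

print(meta)                         # attribute names and types from the ARFF header
print(df["class"].value_counts())   # class distribution (attribute name assumed to be 'class')
```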

Attribute information:

No. | Name of the Data Set | No. of Attributes | No. of Instances | Class Information
1   | Heart Disease        | 13                | 301              | {'<50', '>50_1', '>50_2', '>50_3', '>50_4'}
2   | Dermatology          | 34                | 360              | {1, 2, 3, 4, 5, 6}
3   | Diabetes             | 8                 | 763              | {tested negative, tested positive}
4   | Haberman's Survival  | 4                 | 301              | {'survived 5yr', 'not survived 5yr'}

[Table 1: Summary of the training datasets as loaded in Weka, with their class attributes]

4. Results and Discussion

Experiments using 5-fold cross-validation have been carried out to evaluate the prediction accuracy of the kNN model and to compare the results across datasets. The accuracies obtained for different values of k are listed in Table 2.

Data Set             | k=1   | k=3   | k=5   | k=7   | k=9   | k=13
Heart Disease        | 76.49 | 79.47 | 82.45 | 82.45 | 83.11 | 83.44
Dermatology          | 93.35 | 95.56 | 96.12 | 95.84 | 96.39 | 96.12
Diabetes             | 68.67 | 73.26 | 74.04 | 72.73 | 74.04 | 74.44
Haberman's Survival  | 67.1  | 70.09 | 70.76 | 72.75 | 72.75 | 72.42

[Table 2: Change in correctly classified instances (CCI, %) with respect to the value of k]


In Table 2, each column corresponds to a value of k and each row gives the classification accuracy (CCI, %) for one dataset. From this we can conclude that the optimum value of k depends strongly on the training dataset.

Table 3 compares the expected class values with the predicted results for the different data sets at their optimum k values. Since the accuracies for Heart Disease and Dermatology are very high, their test-set instances are predicted very precisely. The other two data sets also performed well, in line with their accuracies.
We also observed that the accuracy, i.e. the number of correctly classified instances, increases with the number of attributes in the training data set. A brief comparison is provided in Chart 1.

5. Conclusion

The optimum k will always vary depending on your data set. It should be large enough that noise does not strongly affect the prediction, and small enough that points from other classes do not dominate the neighbourhood. Some claim that the square root of n is a good choice, but from this study I think the best method is to try many values of k and use cross-validation to see which one gives the best result.
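
As a rough illustration of why the square-root rule is only a heuristic here, the following snippet compares sqrt(n), using the instance counts from Table 1, with the best k found by 5-fold cross-validation; the "best k" values are simply read off Table 2:

```python
import math

# Instance counts (Table 1) and the best k observed with 5-fold CV (Table 2).
datasets = {
    "Heart Disease": (301, 13),
    "Dermatology": (360, 9),
    "Diabetes": (763, 13),
    "Haberman's Survival": (301, 7),   # 72.75% at both k=7 and k=9; the smaller k is shown
}

for name, (n, best_k) in datasets.items():
    print(f"{name:20s}  sqrt(n) ~ {round(math.sqrt(n)):2d}   best k from CV = {best_k}")
```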

6.  References

  1. Pınar U.: Haberman's Survival in Artificial Neural Networks, EECS-589 Introduction to Artificial Neural Networks, May 2014.

  2. Yang S., Jian H., Ding Z., H. Zha, and C. Lee Giles: IKNN: Informative K-Nearest Neighbor Pattern Classification, in J.N. Kok et al. (Eds.): PKDD 2007, LNAI 4702, pp. 248–264, 2007.

  3. Gongde G., Hui W., David B., Yaxin B., and Kieran G.: KNN Model-Based Approach in Classification, European Commission project ICONS, project no. IST-2001-32429.

  4. Ahmad Basheer H., Mohammad Ali A., Ahmad Ali A.: Solving the Problem of the K Parameter in the KNN Classifier Using an Ensemble Learning Approach, (IJCSIS) International Journal of Computer Science and Information Security, Vol. 12, No. 8, August 2014.
