ABSTRACT: A major issue in k-nearest neighbor classification is how to choose the optimum value of the neighbourhood parameter 'k'. Popular cross-validation techniques often fail to guide the selection of k well, mainly due to the presence of multiple minimizers of the estimated misclassification rate. In this study we discuss the effect of changing the value of 'k' on different data sets and find the optimal value of 'k' for each.
Keywords:
Cross-validation; Misclassification rate; Neighbourhood parameter;
Non-informative prior;
1. Introduction
k-Nearest-Neighbors (kNN) is a non-parametric classification method which is simple but effective in many cases. For a data record t to be classified, its k nearest neighbors are retrieved, and these form a neighborhood of t. Majority voting among the data records in the neighborhood is usually used to decide the classification of t, with or without distance-based weighting. However, to apply kNN we need to choose an appropriate value for k, and the success of the classification depends very much on this value.
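To make the neighborhood-and-voting step concrete, here is a minimal, self-contained Java sketch of unweighted kNN on a toy two-attribute data set. The class name, toy data and helper methods are illustrative assumptions only, not part of this study.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Minimal kNN sketch: classify a query record by majority vote among its
// k nearest training records under Euclidean distance (unweighted voting).
public class KnnSketch {

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    static String classify(double[][] trainX, String[] trainY, double[] query, int k) {
        // Order training records by their distance to the query record t.
        Integer[] idx = new Integer[trainX.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> distance(trainX[i], query)));

        // Majority vote among the k nearest neighbours.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) {
            votes.merge(trainY[idx[i]], 1, Integer::sum);
        }
        return votes.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .get().getKey();
    }

    public static void main(String[] args) {
        // Toy two-class training set with two numeric attributes per record.
        double[][] trainX = {{1, 1}, {1, 2}, {2, 1}, {6, 6}, {6, 7}, {7, 6}};
        String[]   trainY = {"A", "A", "A", "B", "B", "B"};
        System.out.println(classify(trainX, trainY, new double[]{2, 2}, 3)); // prints A
    }
}
```

A distance-weighted variant would simply replace the unit vote with a weight such as 1/(d + ε) for each neighbour.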
kNN has a high cost of classifying new instances. This is largely because nearly all computation takes place at classification time rather than when the training examples are first encountered. Since it is a lazy learning method that builds no model in advance, this inefficiency prevents kNN from being applied in areas where dynamic classification is needed over large repositories.
2. Problem Definition
The performance of the kNN method is strongly influenced by the choice of k. There are many ways of choosing the value of k, but a simple one is to run the algorithm many times with different k values and choose the one with the best performance. In this study we follow the same approach, as sketched below.
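As a rough sketch of this sweep (not the exact experimental setup reported below), the following Java snippet uses Weka's IBk classifier (Weka, described in the next section, is the tool used in this study), evaluates it with 5-fold cross-validation for each candidate k, and keeps the value with the highest percentage of correctly classified instances. The file name diabetes.arff is a placeholder for a local ARFF copy of one of the data sets.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sweep several candidate values of k and keep the one whose 5-fold
// cross-validated accuracy (CCI, %) is highest.
public class ChooseK {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");    // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);         // class is the last attribute

        int bestK = 1;
        double bestCci = 0.0;
        for (int k : new int[]{1, 3, 5, 7, 9, 13}) {
            IBk knn = new IBk(k);                              // Weka's kNN classifier
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(knn, data, 5, new Random(1));
            double cci = eval.pctCorrect();                    // correctly classified instances, %
            System.out.printf("k = %2d   CCI = %.2f%%%n", k, cci);
            if (cci > bestCci) {
                bestCci = cci;
                bestK = k;
            }
        }
        System.out.println("Best k: " + bestK);
    }
}
```

In the Weka Explorer the same sweep can be carried out interactively by varying the KNN parameter of the IBk classifier.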
3. Experimental Design
For the study, we have taken four benchmark training datasets from the UCI machine learning repository, namely Heart Disease, Dermatology, Diabetes and Haberman's Survival, to examine the changes in the correctly classified instances (CCI) of the kNN classifier.
The chosen approach is implemented in the Weka tool. Weka is an open-source software package that provides a collection of machine learning algorithms for data mining tasks.
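For completeness, here is a small sketch of how the four training sets can be inspected through Weka's Java API; the ARFF file names are placeholders for local copies of the UCI data sets, and the loop simply reports the attribute counts, instance counts and class labels summarised in Table 1 below.

```java
import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Print basic attribute information for each training set (placeholder file names).
public class InspectDatasets {
    public static void main(String[] args) throws Exception {
        String[] files = {"heart-disease.arff", "dermatology.arff",
                          "diabetes.arff", "haberman.arff"};
        for (String f : files) {
            Instances data = DataSource.read(f);
            data.setClassIndex(data.numAttributes() - 1);      // class as the last attribute
            Attribute cls = data.classAttribute();

            StringBuilder labels = new StringBuilder();
            for (int i = 0; i < cls.numValues(); i++) {
                if (i > 0) labels.append(", ");
                labels.append(cls.value(i));
            }
            System.out.printf("%-22s attributes=%d instances=%d classes={%s}%n",
                    data.relationName(), data.numAttributes() - 1,
                    data.numInstances(), labels);
        }
    }
}
```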
Attribute information:
| No. | Name of the Data Set | No. of Attributes | No. of Instances | Class Information |
|-----|----------------------|-------------------|------------------|-------------------|
| 1 | Heart Disease | 13 | 301 | {'<50', '>50_1', '>50_2', '>50_3', '>50_4'} |
| 2 | Dermatology | 34 | 360 | {1, 2, 3, 4, 5, 6} |
| 3 | Diabetes | 8 | 763 | {tested negative, tested positive} |
| 4 | Haberman's Survival | 4 | 301 | {'survived 5yr', 'not survived 5yr'} |
[Table 1: Visualization of the training datasets in Weka with respect to the class attribute]
4. Results and Discussion
An experiment using 5-fold cross-validation has been carried out to evaluate the prediction accuracy of the kNN model and to compare the experimental results across the datasets. The results are listed in Table 2.
| Data Set | k = 1 | k = 3 | k = 5 | k = 7 | k = 9 | k = 13 |
|----------|-------|-------|-------|-------|-------|--------|
| Heart Disease | 76.49 | 79.47 | 82.45 | 82.45 | 83.11 | 83.44 |
| Dermatology | 93.35 | 95.56 | 96.12 | 95.84 | 96.39 | 96.12 |
| Diabetes | 68.67 | 73.26 | 74.04 | 72.73 | 74.04 | 74.44 |
| Haberman's Survival | 67.10 | 70.09 | 70.76 | 72.75 | 72.75 | 72.42 |
[Table 2: Change in CCI (%) with respect to the value of k]
In Table 2, the column headings correspond to different values of k and the rows give the classification accuracy on each dataset. From this we can conclude that the optimum value of k depends entirely on the training dataset.
Table 3 demonstrates the comparison between the expected class values and the predicted results for the different data sets at their optimum 'k' values. As the accuracies on Heart Disease and Dermatology are very high, the test-set instances for these datasets are predicted very precisely. The other two data sets also perform well, in line with their accuracies.
We have also observed that the accuracy, i.e. the Correctly Classified Instances, tends to increase with the number of attributes in the training data set. A brief comparison is provided in Chart 1.
5. Conclusion
The optimum k will always vary depending on the data set. It should be large enough that noise does not strongly affect the prediction, and small enough that points from other classes do not dominate the neighbourhood. Some claim that the square root of n is a good choice. However, from this study, the best method is to try many values of k and use cross-validation to see which value gives the best result.