K-NEAREST NEIGHBORS ALGORITHM

K-Nearest Neighbors (or KNN for short)

In this blog, we will talk about a widely used classification technique called K-Nearest Neighbors (KNN).
KNN belongs to the supervised learning domain and has broad application in pattern recognition, data mining, and intrusion detection.
It is non-parametric: it makes no assumptions about the underlying distribution of the data.
We are given some prior data (also called training data), which classifies points into groups identified by a class attribute.
The model for KNN is the entire training dataset. When a prediction is required for an unseen data instance, the algorithm searches the training dataset for the k most similar instances. The prediction attribute of those most similar instances is summarized and returned as the prediction for the unseen instance.
The similarity measure depends on the type of data. For real-valued data, the Euclidean distance can be used.
Euclidean distance is calculated as the square root of the sum of the squared differences between a new point x and an existing point xi across all input attributes j:
EuclideanDistance(x, xi) = sqrt( sum_j ( (x_j - xi_j)^2 ) )
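
To make the formula concrete, here is a minimal Python sketch of this distance computation (the function name and the use of plain Python lists are illustrative choices, not part of the original post):

```python
import math

def euclidean_distance(x, xi):
    # Square root of the sum of squared differences across all attributes j
    return math.sqrt(sum((xj - xij) ** 2 for xj, xij in zip(x, xi)))

# Example: the distance between (1, 2) and (4, 6) is 5.0
print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))
```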

Let m be the number of training data samples. Let p be an unknown point.

1. Store the training samples in an array of data points arr[]. Each element of this array represents a tuple (x, y), where x is the feature vector and y is the class label.

2. For i = 0 to m-1:

3.     Calculate the Euclidean distance d(arr[i], p).

4. Make a set S of the K smallest distances obtained. Each of these distances corresponds to an already classified data point.

5. Return the majority label among S.
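
The steps above translate almost line for line into code. The following is a minimal Python sketch of the procedure, assuming the training data is stored as (feature-vector, label) tuples; the function names are illustrative:

```python
from collections import Counter
import math

def euclidean_distance(x, xi):
    return math.sqrt(sum((xj - xij) ** 2 for xj, xij in zip(x, xi)))

def knn_predict(arr, p, k):
    # Steps 2-3: compute the distance from p to every training sample
    distances = [(euclidean_distance(x, p), y) for x, y in arr]
    # Step 4: S holds the K smallest distances, each tied to a known label
    s = sorted(distances)[:k]
    # Step 5: return the majority label among S
    labels = [y for _, y in s]
    return Counter(labels).most_common(1)[0][0]

# Usage: two clusters, labeled 'A' and 'B'
train = [((1, 1), 'A'), ((2, 1), 'A'), ((8, 9), 'B'), ((9, 8), 'B')]
print(knn_predict(train, (1.5, 1.2), k=3))  # prints 'A'
```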

Another Example



[Figure: classifying a query point with different values of k]

We can see from the picture that as k increases, the result becomes more stable: the prediction depends on more neighbors, so it is less affected by any single noisy training point.
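
One way to observe this effect is to vary k on the same data. The sketch below uses scikit-learn's KNeighborsClassifier; the toy dataset, in which one mislabeled 'B' point sits inside the 'A' cluster, is made up for illustration:

```python
from sklearn.neighbors import KNeighborsClassifier

# 'A' cluster near the origin, with one noisy 'B' point inside it,
# and a real 'B' cluster far away
X = [[1, 1], [1, 2], [2, 1], [2, 2], [1.5, 1.5], [8, 8], [8, 9], [9, 8]]
y = ['A',    'A',    'A',    'A',    'B',        'B',    'B',    'B']

query = [[1.4, 1.4]]
for k in (1, 3, 5):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, clf.predict(query))
# k=1 returns 'B' (the noisy neighbor wins); k=3 and k=5 return the
# stable 'A' label, since the majority of nearby points are 'A'.
```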

Advantages

 • Can be applied to data from any distribution
 • For example, the data does not have to be separable by a linear boundary
 • Very simple and intuitive
 • Gives good classification accuracy if the number of samples is large enough

Disadvantages

 • Choosing k may be tricky
 • The test stage is computationally expensive
 • There is no training stage; all the work is done during the test stage
 • This is the opposite of what we usually want: we can afford a slow training step, but we want the test step to be fast
 • Needs a large number of samples for accuracy
