K-Nearest Neighbors (or KNN for short)
In this blog post, we will talk about a widely used classification technique called K-Nearest Neighbors (KNN). KNN belongs to the supervised learning domain and is widely applied in pattern recognition, data mining, and intrusion detection. It is non-parametric: it makes no underlying assumptions about the distribution of the data.
We are given some prior data (also called training data), in which each point has already been assigned to a group identified by an attribute (the class label).
The model for KNN is the entire training dataset. When a prediction is required for an unseen data instance, the KNN algorithm searches the training dataset for the k most similar instances. The prediction attribute of these most similar instances is summarized (for example, by majority vote) and returned as the prediction for the unseen instance.
The similarity measure depends on the type of data. For real-valued data, the Euclidean distance can be used.
Euclidean distance is calculated as the square root of the sum of the squared differences between a new point x and an existing point xi, taken across all input attributes j:

EuclideanDistance(x, xi) = sqrt( sum_j (x_j - xi_j)^2 )
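As a quick illustration, here is a minimal Python sketch of this distance computation (the function name and the sample points are ours, purely for demonstration):

import math

def euclidean_distance(x, xi):
    # Square root of the sum of squared differences across all attributes j.
    return math.sqrt(sum((x_j - xi_j) ** 2 for x_j, xi_j in zip(x, xi)))

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0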
Let m be the number of training data samples, and let p be an unknown point. The algorithm then proceeds as follows (a code sketch is given after the steps):

1. Store the training samples in an array of data points arr[], where each element represents a tuple (x, y).
2. For i = 0 to m - 1:
3. Calculate the Euclidean distance d(arr[i], p).
4. Make a set S of the K smallest distances obtained. Each of these distances corresponds to an already classified data point.
5. Return the majority label among S.
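Here is a minimal Python sketch of these steps, assuming the training samples are stored as (x, y) tuples of attribute values and a class label; the function and variable names are ours, not from any particular library:

import math
from collections import Counter

def knn_predict(training_data, p, k):
    # Steps 2-3: compute the Euclidean distance from p to every training point.
    distances = []
    for x, y in training_data:
        d = math.sqrt(sum((x_j - p_j) ** 2 for x_j, p_j in zip(x, p)))
        distances.append((d, y))

    # Step 4: keep the K smallest distances, each tied to a known label.
    k_nearest = sorted(distances)[:k]

    # Step 5: return the majority label among the K nearest neighbors.
    labels = [y for _, y in k_nearest]
    return Counter(labels).most_common(1)[0][0]

# Example usage with toy 2-D data:
train = [([1.0, 1.0], 'A'), ([1.2, 0.8], 'A'),
         ([5.0, 5.0], 'B'), ([5.2, 4.8], 'B')]
print(knn_predict(train, [1.1, 0.9], k=3))  # 'A'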
Another Example
(The picture referenced here is not reproduced.) The takeaway is that as k increases, the result becomes more stable: the prediction is averaged over more neighbors, so it is less sensitive to individual noisy points.
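To illustrate this effect, here is a short sketch using scikit-learn's KNeighborsClassifier on synthetic data (the clusters and the query point are ours, purely for demonstration):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two noisy clusters: class 0 around (0, 0), class 1 around (3, 3).
X = np.vstack([rng.normal(0.0, 1.2, (50, 2)), rng.normal(3.0, 1.2, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

query = np.array([[1.5, 1.5]])  # a point near the class boundary
for k in (1, 5, 25):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, clf.predict(query))

With a small k, the prediction is at the mercy of whichever single noisy point happens to be closest; a larger k smooths this out.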
Advantages
• Can be applied to data from any distribution
  • For example, the data does not have to be separable by a linear boundary
• Very simple and intuitive
• Gives good classification if the number of samples is large enough
Disadvantages
• Choosing k may be tricky
• The test stage is computationally expensive
  • There is no training stage; all the work is done during the test stage
  • This is actually the opposite of what we want: we can usually afford a slow training step, but we want a fast test step (see the timing sketch below)
• Needs a large number of samples for accuracy
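To make that cost asymmetry concrete, here is a rough timing sketch on synthetic data (illustrative only; the exact numbers will vary by machine and library version):

import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50_000, 20))
y_train = rng.integers(0, 2, size=50_000)
X_test = rng.normal(size=(5_000, 20))

clf = KNeighborsClassifier(n_neighbors=5)

t0 = time.perf_counter()
clf.fit(X_train, y_train)   # "training" is little more than storing the data
t1 = time.perf_counter()
clf.predict(X_test)         # the real work: a neighbor search per test point
t2 = time.perf_counter()

print(f"fit:     {t1 - t0:.3f} s")
print(f"predict: {t2 - t1:.3f} s")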