
We have seen the K-Means algorithm using R & Python before; now let me explain a very basic classification algorithm with R. Briefly, I will introduce the K-Nearest Neighbour concept, the various steps involved in it, and how to implement those steps in R.

**K-NN Algorithm:**

K-NN is one of the simplest supervised learning algorithms, based on a similarity function. In KNN there is a target categorical variable, partitioned into pre-determined classes/categories. The procedure follows a simple and easy way to classify and/or predict a target variable for a new observation based on its most similar cases (smallest distances).

The nearest-neighbour idea also appears in the travelling salesman problem. A salesman needs to cover all the cities to sell his product, and a simple heuristic is based on how far apart the cities are: he looks at every city's distance and always travels to the closest unvisited city next. Similarly, KNN stores all available cases and classifies a new case by the most similar stored cases (i.e., using a distance function).

**K-NN Algorithm Understandings:**

K-nearest neighbour algorithm is most often used for classification, although it can be used for estimation and prediction. It is an example of instance-based learning, in which the training dataset is stored, so that a classification for a new unclassified record may be found simply by comparing it to the most similar records in the training set.

**How to classify?**

As mentioned above, for a new record KNN assigns the classification of the most similar records. But how do we define similar? For example, if we have a new 24-year-old customer ‘X’, intuitively which existing customer is more similar: a 25-year-old or a 50-year-old?

‘‘Data analysts define distance metrics to measure similarity. **A distance metric or distance function is a real-valued function d to measure the similarity, such that for any coordinates x, y, z
1. d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y
2. d(x, y) = d(y, x)
3. d(x, z) ≤ d(x, y) + d(y, z)**

Property 1 assures us that distance is always nonnegative, and the only way for distance to be zero is for the coordinates (e.g., in the scatter plot) to be the same. Property 2 indicates commutativity, so that, for example, the distance from New York to Los Angeles is the same as the distance from Los Angeles to New York. Property 3 is the triangle inequality, which states that introducing a third point can never shorten the distance between two other points.’’

The most common distance function is the Euclidean distance, the straight-line distance between two points: d(x, y) = √(Σᵢ (xᵢ − yᵢ)²).
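The Euclidean distance can be sketched in R as a one-line helper (the function name `euclid` is my own, for illustration):

```r
# Euclidean distance between two numeric vectors:
# square the coordinate differences, sum them, take the square root
euclid <- function(x, y) sqrt(sum((x - y)^2))

euclid(c(0, 0), c(3, 4))  # 5, the classic 3-4-5 right triangle
```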

When measuring distance for machine learning models, every variable must be on the same scale, which can be achieved with normalization. This ensures that no variable dominates the others when calculating distance. For example, shoe price (with a large numeric range) would otherwise swamp a variable with a small range like shoe size. To avoid this, the data analyst must make sure to normalize the attribute values.

For continuous variables, min-max normalization or Z-score standardization may be used as the normalization method.
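As a quick illustrative sketch (the function names `minmax` and `zscore` are my own), the two methods can be written in R as:

```r
# Min-max normalization: rescales a numeric vector to the range [0, 1]
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

# Z-score standardization: rescales to mean 0 and standard deviation 1
zscore <- function(x) (x - mean(x)) / sd(x)

ages <- c(24, 25, 50)
minmax(ages)  # smallest age maps to 0, largest to 1
zscore(ages)  # values expressed in standard deviations from the mean
```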

1. Load the data into R (the data should be ready for modelling after the mining and cleaning process)
2. Load the class library, which supports different classification algorithms
3. Normalize the data and calculate the distance function
4. Split the data into training and testing datasets (generally 70% training data and 30% testing data)
5. Model the data using KNN
6. Cross-validate the results to find the accuracy

Let's take the inbuilt ‘‘iris flower’’ dataset to understand classification using KNN. The dataset includes 150 observations, 50 from each of three iris species named ‘‘setosa’’, ‘‘virginica’’ and ‘‘versicolor’’, and four characteristics, the length and width of the sepals and petals, measured for the classification analysis. Using these four characteristics we can classify which species a flower belongs to.

```r
rm(list = ls())

data("iris")
View(iris)

library(class)  # provides the knn() classifier

## Define a min-max normalization function
mmnorm <- function(x) {
  return((x - min(x, na.rm = TRUE)) /
         (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)))
}

training_norm <- data.frame(apply(iris[, -5], 2, mmnorm))
training_norm$Species <- iris$Species
# View(training_norm)

# Split into 70% training and 30% testing data
idx <- sample(nrow(training_norm), as.integer(0.70 * nrow(training_norm)))
train <- training_norm[idx, ]
test  <- training_norm[-idx, ]

# ?knn gives an idea of how the knn() function works:
#   train[, -5]  data cases used as the classification reference
#   test[, -5]   data cases to be classified
#   train[, 5]   target categorical variable on which classification is applied

# KNN algorithm: model the data using the 4 nearest neighbours
predict <- knn(train[, -5], test[, -5], train[, 5], k = 4)
table(predict, test[, 5])
```

The KNN has classified the 45 test observations into the following categories:

```
predict      setosa versicolor virginica
  setosa         14          0         0
  versicolor      0         14         1
  virginica       0          2        14
```

From the cross validation, we can say that KNN classified all the ‘‘setosa’’ observations correctly, but three observations were misclassified: two ‘‘versicolor’’ observations were wrongly classified as ‘‘virginica’’, and one ‘‘virginica’’ observation was wrongly classified as ‘‘versicolor’’.
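As a small follow-up sketch, the overall accuracy can be read off the confusion matrix as the number of correct classifications (the diagonal) divided by the total (the values below are copied from the table above):

```r
# Confusion matrix from the output above: rows are predictions, columns actuals
cm <- matrix(c(14, 0, 0,
               0, 14, 1,
               0, 2, 14),
             nrow = 3, byrow = TRUE,
             dimnames = list(predict = c("setosa", "versicolor", "virginica"),
                             actual  = c("setosa", "versicolor", "virginica")))

accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 42/45, roughly 0.933
```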

How should one go about choosing the value of k?

There may not be an obvious best solution. Consider choosing a small value for k. Then it is possible that the classification or estimation may be unduly affected by outliers or unusual observations (‘‘noise’’). With small k (e.g., k = 1), the algorithm will simply return the target value of the nearest observation, a process that may lead the algorithm toward overfitting, tending to memorize the training data set at the expense of generalizability.

On the other hand, choosing a value of k that is not too small will tend to smooth out any idiosyncratic behaviour learned from the training set. However, if we take this too far and choose a value of k that is too large, locally interesting behaviour will be overlooked. The data analyst needs to balance these considerations when choosing the value of k. It is possible to allow the data itself to help resolve this problem, by following a cross-validation procedure. *
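As a minimal sketch of such a cross-validation procedure (it rebuilds the iris split from scratch so it runs standalone; the seed of 42 and the candidate range 1 to 15 are arbitrary choices of mine), one can compare test accuracy across candidate values of k:

```r
library(class)

set.seed(42)  # make the random split reproducible
data(iris)

# Min-max normalize the four numeric columns
norm <- as.data.frame(lapply(iris[, -5],
                             function(x) (x - min(x)) / (max(x) - min(x))))
norm$Species <- iris$Species

# 70/30 train/test split
idx   <- sample(nrow(norm), as.integer(0.70 * nrow(norm)))
train <- norm[idx, ]
test  <- norm[-idx, ]

# Test accuracy for each candidate k from 1 to 15
acc <- sapply(1:15, function(k) {
  pred <- knn(train[, -5], test[, -5], train[, 5], k = k)
  mean(pred == test[, 5])
})

best_k <- which.max(acc)  # k with the highest test accuracy
```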

Classification is one of the most useful tasks in Data Science as well as in Machine Learning, and it is very helpful in many different sectors for solving complex business problems.

(*Book: Discovering Knowledge in Data, by Daniel T. Larose)


https://en.wikipedia.org/wiki/Iris_flower_data_set