K means Learning Algorithms Using R & Python

• Home
• /
• Blog List
• /
• K means Learning Algorithms Using R & Python on 18 Feb 2017 04:01 AM
• Rang Technologies
• Data Science

We have seen Gradient decent using R & Python before now let us try a clustering algorithm with R & Python. I will briefly introduce the K means and the steps involved in it and implement these steps in both R & Python.

K - means algorithm

K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem.The procedure follows a simple and easy way to group a given observations set into a certain number of clusters.

K-means algorithm consists of following steps:

1. Place K random points into the space represented by the objects that are being clustered. These points represent initial group centroids.

2. Assign each object to the group that has the closest centroid.

3. When all objects have been assigned, recalculate the positions of the K centroids.

4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

R implementation:

Importing the unlabeled data set for performing the algorithm:

#distance calculation function
setwd("S:\\ANALYTICS\\R_Programming\\Clustering")

## x y
## 1 70.76015 30.86922
## 2 53.06926 29.58294
## 3 44.23282 27.13823
## 4 46.07397 23.75309
## 5 60.35504 34.88655
## 6 56.06321 29.20010

dim(k_Data[1,])

##  1 2

x = k_Data

#Random samples for centers
centers <- k_Data[sample(nrow(k_Data), 2),]

#centers <- k_Data[18:19,]

In the above steps, we are considering "x" as our data and "centers" as the randomly initializing centroid. Here in this case we are randomly selecting two data points from the data set itself as the initial centroid.

Now our next step is to calculate Euclidean distances from these initial centroids to the data points. Using the below function, we are going to find the distances. We are storing the distances between the points and the initial centroid in distance Matrix, which is 100 * 2 matrix. First row consists of values of distances from the first centroid and second row consists of values of distance from the second initial centroid. This step is scalable even if we initialize more than 2 centroids.

x is the data matrix with points as rows and dimensions as columns. centers are the matrix of centers (points as rows again). The distance Matrix is just defines the answer matrix (which will have as many rows as there are rows in the data matrix and as many columns as there re centers). So the point (i,j) in the result matrix will be the distance from the ith point to the jth center.

Then the for loop iterates over all centers. For each center, it computes the Euclidean distance from each point to the current center and returns the result.

euclidean distance: sqrt(rowSums(t(t(x)-centers[i,])^2))
euclidnew <- function(x, centers) {
distanceMatrix <- matrix(NA, nrow=nrow(x), ncol=nrow(centers))
for (j in 1:nrow(centers) ){
for(i in 1:nrow(x) ) {
distanceMatrix[i,j] <- sqrt(rowSums((x[i,]-centers[j,])^2))
}
}
distanceMatrix
}

Now after calculating the distances from all the points next step is to identify the minimum distances among them and assign with the closet centroid.

In the below function, we are creating two empty vectors for storing cluster and centers history for all the iteration we are going to perform.

To identify the minimum distances among the distance matrix (row wise), we are using the "apply" function by which we can apply any function to dataset. Here we are using "which.min" function to identify the minimum distances.

Here, we perform the last and the fourth step that I mentioned in introduction. We repeat the two steps assigning the closest centroids and calculating the centroids of new clusters. This process should be repeated until the centroid no longer move its position.

K_means <- function(x, centers, distFun, nItter) {
clusterHistory <- vector(nItter, mode="list")
centerHistory <- vector(nItter, mode="list")

for(i in 1:nItter) {
distsToCenters <- distFun(x, centers)
clusters <- apply(distsToCenters, 1, which.min)
centers <- apply(x, 2, tapply, clusters, mean)

# Saving history
clusterHistory[[i]] <- clusters
centerHistory[[i]] <- centers
}

list(clusters=clusterHistory, centers=centerHistory)
}

In the above function the major step is to assign the new centroids to the data points which are closed to them. To do this again i used "apply" function to apply the mean to the data points with the new cluster assignment in the previous step.

The steps followed in this function:

1. Finding the euclidean distance with the initial centroids.

2. Identifying the minimum distances and assigning clusters to them.

3. Now repeating the process of finding the centroids for all the data points through all the iterations untill the centroid is no moving.

Finally, we are calling the cluster and center history of number of iteration we are performing as list.

The clusters are defined by assigning the closest center for each point. And centers are updated as a mean of the points assigned to that center.

Reading the function output into results. Here we can look at the centers and clusters for specific iteration. I am presenting the final iteration results below.

Results <- K_means(x, centers, euclidnew, 8)

Results\$clusters[]

##  1 1 2 2 1 1 2 1 1 2 1 2 2 2 1 2 1 2 2 2 2 2 2 1 2 2 2 2 2 1 2 1 2 1 2
##  2 1 2 2 2 2 1 2 2 2 2 2 1 1 2 1 1 2 1 1 1 2 2 2 2 1 2 2 1 1 1 2 1 2 2
##  1 2 2 1 2 1 2 2 2 2 2 1 1 1 2 2 2 2 2 1 2 2 1 2 2 2 1 1 1 2

Results\$centers[]

## x y
## 1 59.75652 32.9848
## 2 45.12800 27.1765

Above centers and clusters are the results of the final iteration.

Python Implementation:

Now implementing the k-means algorithm in python. When compared to R implementation there is no logical difference and the steps performed only syntactical difference which I'm going to explain all the functions and steps I used.
Importing the required libraries and same data set used for R implementation. Here we are using the data frames to implement the algorithm.

from numpy import *
import numpy as np
import random
from scipy import stats
import pandas as pd
import os

path = "S:\ANALYTICS\Python_learn\Clustering"
os.chdir(path)

k_Data = genfromtxt("Data.csv", delimiter=",")

#k_Data = np.matrix(k_Data)

k_Data = pd.DataFrame(k_Data)

# As in the R implementation we are representing the variables and data set names same. "x" is the data set and "centers" are the initial centroids for which we are using the random. Choice function to select the random data points from the data set as centroids.

x = (k_Data)

centers = k_Data.ix[np.random.choice(k_Data.index, 2)]

#centers = k_Data.iloc[17:19,0:]

# Next step is to determining the euclidean distances between the initial centroids and the data points.

# To slice the data from the initial centroids and the data set using iloc as it is a data frame. Implementing the same euclidean distance formula in below function.

def euclidnew(x, centers):
distanceMatrix = pd.DataFrame(np.nan * np.ones(shape=(x.shape,centers.shape)))
for j in range(0, centers.shape):
for i in range(0, x.shape):
distanceMatrix.iloc[i][j] = sqrt((centers.iloc[j] - x.iloc[i])**2 + (centers.iloc[j] - x.iloc[i])**2)
#print (distanceMatrix)
return distanceMatrix

#euclidnew(x, centers)

# In the final step of python implementation I came up with the "lambda" function(similar to apply function in R) to identify the minimum distances and assigning the clusters to the closest data points. Argmin to identify the minimum distances.

# The additional step I used in below function is adding the newly assigned clusters to the data points through every iteration which will help in the next step to determine the new centers this is because of the lambda functionality.
# Using group by option with in the lambda function to group the data points on closest cluster point and finding the average to identify new centroids.

def kmeans(x, centers, euclidnew, niter):
clusterHistory = [] * 10
centerHistory = [] * 10

for i in range (1, niter):
distsToCenters = euclidnew(x, centers)
clusters = distsToCenters.apply(lambda x: x.argmin(), axis=1)
x.loc[:,2] = clusters
centers = x.groupby(x.loc[:,2]).apply(lambda x:np.average(x.loc[:,0:1], axis=0))
clusterHistory[i] = clusters
centerHistory[i] = centers
return(clusterHistory, centerHistory)

kmeans(x, centers, euclidnew, 8)

Centers

0 [59.7565210374, 32.9847970918]

1 [45.1280007, 27.176495799]