on 18 Feb 2017 04:01 AM

We have seen Gradient decent using R & Python before now let us try a clustering algorithm with R & Python. I will briefly introduce the K means and the steps involved in it and implement these steps in both R & Python.

K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem.The procedure follows a simple and easy way to group a given observations set into a certain number of clusters.

- Place K random points into the space represented by the objects that are being clustered. These points represent initial group centroids.
- Assign each object to the group that has the closest centroid.
- When all objects have been assigned, recalculate the positions of the K centroids.
- Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

In the above steps, we are considering "x" as our data and "centers" as the randomly initializing centroid. Here in this case we are randomly selecting two data points from the data set itself as the initial centroid.

Now our next step is to calculate Euclidean distances from these initial centroids to the data points. Using the below function, we are going to find the distances. We are storing the distances between the points and the initial centroid in distance Matrix, which is 100 * 2 matrix. First row consists of values of distances from the first centroid and second row consists of values of distance from the second initial centroid. This step is scalable even if we initialize more than 2 centroids.

x is the data matrix with points as rows and dimensions as columns. centers are the matrix of centers (points as rows again). The distance Matrix is just defines the answer matrix (which will have as many rows as there are rows in the data matrix and as many columns as there re centers). So the point (i,j) in the result matrix will be the distance from the ith point to the jth center.

Then the for loop iterates over all centers. For each center, it computes the Euclidean distance from each point to the current center and returns the result.

Now after calculating the distances from all the points next step is to identify the minimum distances among them and assign with the closet centroid.

In the below function, we are creating two empty vectors for storing cluster and centers history for all the iteration we are going to perform.

To identify the minimum distances among the distance matrix (row wise), we are using the "apply" function by which we can apply any function to dataset. Here we are using "which.min" function to identify the minimum distances.

Here, we perform the last and the fourth step that I mentioned in introduction. We repeat the two steps assigning the closest centroids and calculating the centroids of new clusters. This process should be repeated until the centroid no longer move its position.

In the above function the major step is to assign the new centroids to the data points which are closed to them. To do this again i used "apply" function to apply the mean to the data points with the new cluster assignment in the previous step.

The steps followed in this function:

- Finding the euclidean distance with the initial centroids.
- Identifying the minimum distances and assigning clusters to them.
- Now repeating the process of finding the centroids for all the data points through all the iterations untill the centroid is no moving.

Finally, we are calling the cluster and center history of number of iteration we are performing as list.

The clusters are defined by assigning the closest center for each point. And centers are updated as a mean of the points assigned to that center.

Reading the function output into results. Here we can look at the centers and clusters for specific iteration. I am presenting the final iteration results below.

Above centers and clusters are the results of the final iteration.

Now implementing the k-means algorithm in python. When compared to R implementation there is no logical difference and the steps performed only syntactical difference which I'm going to explain all the functions and steps I used.

Importing the required libraries and same data set used for R implementation. Here we are using the data frames to implement the algorithm.

Headquartered in New Jersey, Rang Technologies has dedicated over a decade delivering innovative solutions and best talent to help businesses get the most out of the latest technologies in their digital transformation journey.