Tuesday, April 19, 2016

Clustering using Simulated Annealing

For our next Data Mining project, we're supposed to implement a clustering algorithm that uses simulated annealing. I could, of course, just look one up but, I decided it might be more interesting to come up with my own. Since my interest in clustering is actually partitioning, I'm going to look at cutting clusters along dimensional boundaries. Therefore, we'll use a distance metric that simply looks at whether a point is in or outside of the dimensional bounds for a cluster. If it's in, the distance is obviously zero. If it's out, the distance is how much needs to be added to the rectangular (or hyper-rectangular, in the case of more than 2 dimensions) perimeter of the cluster. The idea is to minimize the total perimeters, while capping the number of rows that can be in any one cluster.

Here's the basic algorithm:

clusters = null list
pool = all points
t = starting temperature (parameter strictly between 0 and 1)
while t > final temperature
   select all but n * t points for insertion into clusters
   assign points to (possibly new) clusters that minimizes total perimeter growth
   shrink perimeter of each cluster to force out edge points
   drop clusters with too few points
   t = t * cooling constant

I'm also considering coding it entirely in R, though I may change my mind on that if it gets unwieldy. While it's an interesting exercise, I don't want to spend too much time on it.
   

No comments:

Post a Comment