Wednesday, April 20, 2016

Wag the dog

Writing this clustering assignment in R has had some unintended, but interesting, consequences. Namely, the "need" to vectorize everything. I put that in quotes because you can do things iteratively in R, but why would you want to? The vector capabilities are what make the language powerful.

That actually changes the algorithm more than I would have thought. Normally, I'd not be a big fan of changing functionality to accommodate an implementation. If an implementation can't handle it, use something else. However, in this case it's been both instructive and quite possibly beneficial (though I have no intention of putting the time in to verify the latter assertion - this is just a homework assignment).

Things that have to change to get the vectorized version to work:

  • I can't start with an empty set of clusters. The first pass will try to map all points simultaneously so I have to have something to map them to. We'll need to start with a seed set of clusters and they may overlap. The next item addresses that.
  • Don't create new clusters during the assignment phase. Just tag everything to the closest one. Since there will be ties in the case of overlaps, always choose the earlier cluster in the list. This way, redundant clusters will get killed off due to lack of points.
  • Pick a few points that are far from their assigned clusters to create new clusters. The first step is a special case of this where every point is considered far at initialization.
  • Rather than shrinking the boundaries to jettison points, do the assignments based on a probabilistic check of how far they are outside the boundary. Points inside will also have to meet a probabilistic threshold based on the temperature (as it cools, points inside become more likely to stay). Then recompute the boundary to cover just the points that got assigned.
So, the updated algorithm looks something like this:

set all points' cluster distance to a non-zero initial value
t = starting temperature (between 0 and 1)
while t > final temperature
   select new cluster points (function of t, and point's cluster distance)
   recompute cluster distances and nearest cluster for all points
   assign cluster if probability threshold met (function of t, and cluster distance)
   drop clusters with too few points
   t = t * cooling constant
   
If all this looks like a lot more fuss than simply coding up a known SA Clustering algorithm, it is. Sometimes you just have to indulge your curiosity. Fortunately, it's still not that big of a deal. The hardest part is that I'm still not really fluent in R so coding takes longer than it should. Then again, that's the point. I won't ever get fluent if I don't write some stuff.


No comments:

Post a Comment