Monday, October 2, 2017

Dual space

One thing I got fairly good at when I was at Cornell was looking at a problem from the opposite direction. Mathematically, it's known as the dual space, and it's a standard trick when dealing with optimization problems (which are pretty much what you do in Operations Research, which was my major at Cornell). Essentially, it means that your set of feasible answers is bounded by a set of infeasible answers, and if you can find a point right on that boundary, you've found the best one.

For example, suppose you're trying to figure out the decimal expansion of 3/8. Consider all the decimal numbers less than or equal to 3/8. Start with a number you know is less, say, 3/10 = .3. What about .4? Nope, that's greater; it's in the dual space. Now you know that the answer has to be somewhere between those two. Try .35. Less. Great, now you're down to a difference of 0.05. Keep going until you're close enough (or, in this case, until you hit .375 exactly, because this is a ridiculously easy example).
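If you wanted to mechanize that squeeze, it's just bisection. Here's a minimal sketch in Python; the tolerance and starting bounds are illustrative choices, nothing more.

```python
def bisect_decimal(target, lo, hi, tol=1e-6):
    """Narrow in on `target` by keeping `lo` feasible (<= target)
    and `hi` infeasible (> target), i.e. in the dual space."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid <= target:
            lo = mid   # still feasible: move the lower bound up
        else:
            hi = mid   # infeasible: tighten the dual bound
    return lo

# The example from above: 3/8, starting from .3 (feasible) and .4 (infeasible).
print(bisect_decimal(3 / 8, 0.3, 0.4))  # converges to ~0.375
```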

That's a purely numeric problem, but the same thing comes up in algorithms, heuristics, and even general approaches. Sometimes, the best way to solve a problem is to consider all the solutions that don't solve it, and ask why they don't.

All this brings me to Huber estimators. Huber estimators fall into a broader class of estimators known as robust statistics. The idea is that these are statistics that don't get creamed just because one or two outliers make it into the data. The median is much more robust than the mean because throwing a crazy number into the mix doesn't change it much. The mean can get completely whacked by one big number. That's why you see things like median household income being reported as opposed to the mean. The fact that one person got rich doesn't really change the outlook for the general population, so you want the statistic to reflect that.
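A quick numeric illustration (the income numbers are made up for the example):

```python
import statistics

incomes = [40_000, 45_000, 50_000, 55_000, 60_000]
print(statistics.mean(incomes), statistics.median(incomes))   # 50000 50000

# One person gets rich: the mean jumps, the median barely notices.
incomes.append(10_000_000)
print(statistics.mean(incomes), statistics.median(incomes))   # ~1708333 52500
```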

Huber estimators clamp observations. That is, any observation within a "normal" range is treated as itself. If it's outside of that range, it's treated as if it were at the edge of the normal range. Huber estimators behave much like the mean when the data are well behaved, but they hold up like the median when outliers creep in.
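Here's a rough one-step sketch of the clamping idea. The "normal" range here (median plus or minus 1.5 MADs) is an arbitrary illustrative choice, not the tuning Huber actually prescribed.

```python
import statistics

def clamped_mean(xs, k=1.5):
    """One-step clamped estimate: observations inside a 'normal' range
    count as themselves; anything outside is pulled in to the boundary."""
    center = statistics.median(xs)
    spread = statistics.median([abs(x - center) for x in xs]) or 1.0
    lo, hi = center - k * spread, center + k * spread
    return statistics.mean(min(max(x, lo), hi) for x in xs)

data = [9, 10, 10, 11, 12, 200]   # one wild outlier
print(statistics.mean(data))      # 42.0, dragged way off by the outlier
print(clamped_mean(data))         # ~10.7, stays near the bulk of the data
```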

This is, of course, the exact opposite of what I want for CISS. Those outliers are crucial. We need to find them. The middle of the distribution isn't particularly interesting. So, look at it as a dual problem. Instead of clamping the outliers and focusing on the middle, grab all the outliers and don't worry too much about the middle. That's really what CISS does. It gets just enough data from the central strata that the estimates aren't totally bogus. But, the further you get from the mean, the more it grabs.
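Just to make the inversion concrete, here's a toy sketch of that flavor of sampling, where the chance of grabbing an observation grows with its distance from the center. This is purely my illustration of the idea, not the actual CISS sampler, and `base_rate` and `scale` are made-up knobs.

```python
import random
import statistics

def tail_heavy_sample(xs, base_rate=0.05, scale=0.1):
    """Toy dual of clamping: keep only a small fraction of the middle,
    but grab observations more and more aggressively as they get farther
    from the center."""
    center = statistics.median(xs)
    spread = statistics.median([abs(x - center) for x in xs]) or 1.0
    picked = []
    for x in xs:
        p = min(1.0, base_rate + scale * abs(x - center) / spread)
        if random.random() < p:
            picked.append(x)
    return picked
```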

Because we don't really know much about the underlying distribution, we can't completely ignore the middle. But in the theoretical case where the distribution is actually known, the sampling algorithm that returns the closest estimate of the full sample sum is the one that samples blocks strictly in order of their variance.
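A small sketch of that idealized case, assuming the block variances are knowable up front and that we have a prior estimate of each unread block's mean to fall back on (the `prior_means` argument and the read budget are my own framing for the illustration):

```python
import statistics

def estimate_sum(blocks, prior_means, budget):
    """Idealized sampler: read blocks in descending order of variance until
    the row budget runs out; estimate the unread blocks from their prior means.
    Reading in variance order knocks out the biggest uncertainty first."""
    order = sorted(range(len(blocks)),
                   key=lambda i: statistics.pvariance(blocks[i]),
                   reverse=True)
    total, rows_read = 0.0, 0
    for i in order:
        if rows_read < budget:
            total += sum(blocks[i])                    # exact sum for blocks we read
            rows_read += len(blocks[i])
        else:
            total += prior_means[i] * len(blocks[i])   # rough estimate for the rest
    return total
```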
