Saturday, April 7, 2018

My point is...

This is not the progress my adviser was likely hoping for today, since I said I'd have actual results on Monday. But while the morning has not produced any such results (or even done much to advance that effort), it has provided a significant revelation that should help in wrapping all this up.

People have been asking me what my paper is about. I've been having a lot of trouble answering that because the topic has shifted so many times. Frankly, when I read the first part on uniform sampling and then look at the second part with the generalizations, it looks a bit scattered. This was brought home to me when my adviser suggested we just use the empirical distribution of the partition parameters to construct the sum directly and then get the variance of that, rather than using the block sums. For reasons I couldn't put my finger on, that seemed horribly inconsistent with the rest of the paper. Not inconsistent in the sense that it contradicted anything; just that it didn't fit.

This morning, while making a few false starts on implementing the general block variance computation, I thought more about why I wasn't keen on using the total sum. There are some technical ramifications that I'd have to work through, but they aren't a big deal. I then realized why it doesn't fit.

This paper really isn't about the sum. It's about the block variance. If we know the block variance, we can always derive the variance of an estimator for the population. More importantly, in the finite sampling case, this is true for any estimator. The sum is just a convenient choice to demonstrate the technique.

Viewing the paper through this lens, the title becomes a bit more obvious (something like: Estimating Query Values from Finite Multi-dimensional Populations using Blocked Sampling). The comparison with the BMH and BLB algorithms to estimate the block variance is suddenly a sane thing to do. The introduction explaining why one would even use blocks rather than true iid sampling fits. And, though I haven't written it yet, the conclusion can talk about future research where we don't just consider sums, but generic population parameters estimated by repeatedly sampling blocks.

Most importantly, this points the way toward the real problem I want to solve: self-tuning multi-dimensional data stores. The objective is to restructure the data so that the block variances are minimized. A secondary objective would be to make them easy to compute. I've already got quite a bit of work done on both of those.
