Tuesday, August 28, 2018

The variable in question

The blog is back for another school year. I did do a fair bit of research this summer, as well as a bunch of other interesting stuff at work (and took a nice vacation, too). However, I'm going to skip all that for the moment and get right to the question of the day because I'm discussing it with my adviser this evening.

We've been proceeding through all this analysis somewhat blindly, taking it as a given that we should use the normed sample sum as our estimator. I give a brief (and, as it turns out, less than compelling) proof that this is the Uniformly Minimum Variance Unbiased Estimator (UMVUE) and then move on to the variance of the estimator, which is the real subject of the paper.

Not so fast on that first point. It's true that the normed sample average would be the UMVUE for the mean of an infinite population, but that doesn't translate as naturally as you might think to the population sum when things are finite. In fact, given what we know about the composition of the blocks, the estimator can actually be improved upon.
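
To pin down what I mean by the "normed sample sum" (the notation here is mine, for this post, not necessarily what's in the paper): it's the sample mean of the measure, scaled up to the whole population, and under simple random sampling of n rows out of N it's unbiased for the population total:

```latex
% The "normed sample sum": the mean of a simple random sample of n rows,
% scaled up to the full population of N rows; unbiased for the total tau.
\hat{\tau} \;=\; N\bar{y} \;=\; \frac{N}{n}\sum_{i=1}^{n} y_i,
\qquad
\mathbb{E}\!\left[\hat{\tau}\right] \;=\; N\,\mathbb{E}[\bar{y}] \;=\; N\mu \;=\; \tau .
```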

First, though, let's pretend we don't know anything about the composition of the blocks other than that they may contain partitions of correlated data. In that case, the sampled pairs of observations, that is, both the dependent and independent variables, are a sufficient statistic. We'll call this D; the realization of a sample is the vector d = (x, y), the individual rows from all the sampled blocks. One important item to note here is that d is considered unordered. That is, we don't care which blocks got sampled first or where a row is placed within a block.

Now, we pick an arbitrary unbiased estimator. The first row in the sample times the population size will do; call that S = Ny1. We now use the Rao-Blackwellization process to make a better estimator by conditioning this on the observed sample, that is, S' = E(S|D=d). I'll spare most of the algebra because it's simple, but the key step is that, if ordering isn't important, the expected value of the first item in a finite sample, given the sample, is the sample mean. So, S' is the sample mean times the total number of rows in the population, which gets us right back to where we started. In fact, it's not particularly hard to prove that conditioning any unbiased estimator on the sample will give this result. So, our estimator is fine in this case.
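
In symbols (again, notation mine), the step being skipped is just:

```latex
% Conditioning the crude estimator S = N y_1 on the unordered sample d:
% given d, the "first" row is equally likely to be any of the n sampled rows.
S' \;=\; \mathbb{E}\!\left[\,N y_1 \mid D = d\,\right]
   \;=\; N \sum_{i=1}^{n} y_i \,\Pr\!\left(\text{the first draw is sampled row } i \mid D = d\right)
   \;=\; N \cdot \frac{1}{n}\sum_{i=1}^{n} y_i
   \;=\; N\bar{y} .
```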

HOWEVER...

That's not really our case. We have a lot more information available to us. When loading the data, we can pre-compute all sorts of things. We know the true sum for each partition (that is, the sum when all rows from that partition meet the query criteria). We know how many rows are in each partition. We know which blocks those rows wound up in. We probably know even more than that if this isn't the first query we've ever received: we can track historical hit rates, hit rates correlated to specific attributes in the query, and which partitions are particularly susceptible to variation based on certain query attributes. Heck, we might even know the actual answer; query processors frequently cache the last few queries they've returned because people have a tendency to hit "refresh" and there's no reason to go to the trouble of recomputing a result you just produced.
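
To give a feel for it (these names are purely hypothetical, not anything in the actual system), the per-partition metadata we could precompute at load time looks something like this:

```python
# Hypothetical sketch of per-partition metadata precomputed at load time.
# None of these names come from the real system; they're just illustrative.
from dataclasses import dataclass, field

@dataclass
class PartitionStats:
    partition_id: int
    true_sum: float          # sum of the measure over ALL rows in the partition
    row_count: int           # total rows in the partition
    block_ids: set[int] = field(default_factory=set)          # blocks holding its rows
    hit_rate_history: list[float] = field(default_factory=list)  # past query hit rates

    def record_hit_rate(self, rows_matched: int) -> None:
        """Track how often this partition's rows satisfy query predicates."""
        self.hit_rate_history.append(rows_matched / self.row_count)
```

Everything except the history comes for free while loading; the historical fields just accumulate as queries arrive.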

Why is this a problem? Well, it's not if you're happy with the naive estimator, but the point of this exercise is to return the tightest confidence intervals in the least time, so if we can produce a better estimator using that extra information, we should.

Here's one quick example: suppose we know that the data has only two partitions and that one of those partitions sits entirely in a single block. We could randomly sample blocks, but we might wind up with a sample that excludes the single-block partition, and that introduces extra variance into our estimator. On the other hand, we could always include that one block in the sample and then proceed with the rest of the sampling as in the uniformly-distributed case (which yields much tighter bounds). The resulting estimator will have less variance.
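
Sketched in code (toy names, and assuming for simplicity that the remaining blocks are equal-sized, so their rows behave like a simple random sample of the remaining population):

```python
import random

def stratified_estimate(forced_block, other_blocks, n_blocks_to_sample):
    """Toy sketch: always scan the block holding the single-block partition,
    then estimate the rest from a random sample of the remaining blocks.

    forced_block and each entry of other_blocks are lists of y-values
    (the measure for rows satisfying the query, 0 for rows that don't).
    Assumes equal-sized remaining blocks so the sampled rows act like a
    simple random sample of the remaining rows.
    """
    exact_part = sum(forced_block)  # no sampling error in this stratum

    sampled = random.sample(other_blocks, n_blocks_to_sample)
    sampled_rows = [y for block in sampled for y in block]
    rest_rows = sum(len(block) for block in other_blocks)

    estimated_rest = rest_rows * (sum(sampled_rows) / len(sampled_rows))
    return exact_part + estimated_rest
```

The weighting needs more care when the remaining blocks aren't equal-sized, but the shape of the argument is the same: the forced block contributes no sampling error at all.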

Yes, that's contrived, but the point is that the extra information can help. A lot. I'd like to leave that out of this paper because the point of this one was simply to demonstrate that we can create reliable confidence intervals for highly-correlated data. I think we've proven that both mathematically and empirically.

However, this plays really well into the research path I had set out for the rest of the dissertation. I now have a template for showing that we can not only assess the quality of our estimator but actually improve on it by leveraging the structure of the data. More importantly, we can change the structure of the data to make that lever even longer. As I argued when creating DBESt (which, by the way, is still a thing and I intend to use it to restructure the data), finding the optimal structure is an NP-complete problem, so there's no point in trying to solve it exactly. But it can be approximated using either the Evolutionary Strategy of DBESt, or by layering some sort of Dirichlet analysis on top of that to try to get rows from like distributions into the same blocks.
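
To make that last idea concrete, here's a toy mutate-and-keep-if-better search over row-to-block assignments. This is not DBESt's actual Evolutionary Strategy, and it ignores block-size constraints; it's just an illustration of approximating the NP-complete layout problem instead of solving it:

```python
import random
import statistics

def within_block_variance(assignment, values, n_blocks):
    """Sum of within-block variances: lower means blocks hold more homogeneous rows."""
    total = 0.0
    for b in range(n_blocks):
        members = [values[i] for i, blk in enumerate(assignment) if blk == b]
        if len(members) > 1:
            total += statistics.pvariance(members)
    return total

def evolve_blocking(values, n_blocks, iterations=10_000):
    """Toy (1+1)-style evolutionary search over row-to-block assignments."""
    best = [random.randrange(n_blocks) for _ in values]
    best_fit = within_block_variance(best, values, n_blocks)
    for _ in range(iterations):
        candidate = best.copy()
        candidate[random.randrange(len(values))] = random.randrange(n_blocks)  # move one row
        fit = within_block_variance(candidate, values, n_blocks)
        if fit < best_fit:
            best, best_fit = candidate, fit
    return best
```

The fitness function here is just the obvious stand-in for "rows from like distributions end up together"; a real version would respect block-size limits and whatever the Dirichlet layer suggests about which rows belong to which distribution.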