I'm not particularly worried about the distribution of the query results from each of the strata. Simply dividing the sum of the data sampled so far by the proportion of rows sampled gives an unbiased point estimate. However, I can't very well call this algorithm "Confidence-based" if I don't compute real confidence intervals. So, that's where I need to go.
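To make the point estimate concrete, here's a minimal sketch. The function and argument names are illustrative only, and it assumes the rows sampled so far are representative of the rest of the stratum:

```python
def point_estimate(sampled_sum, rows_sampled, total_rows):
    """Scale the sum observed so far up to the whole stratum.

    Dividing by the proportion of rows sampled (rows_sampled / total_rows)
    is the same as multiplying by total_rows / rows_sampled.
    """
    if rows_sampled <= 0:
        raise ValueError("need at least one sampled row")
    return sampled_sum * total_rows / rows_sampled


# Hypothetical numbers: 100 of 1000 rows sampled, summing to 50.
estimate = point_estimate(50.0, 100, 1000)
```

The estimate for the whole query would then just be the sum of these per-stratum estimates.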
I'm not going to do the derivations right now, but here's the general plan:
First off, note that the actual magnitude of the measurement in each stratum is fairly tightly constrained. As such, that's not a big source of variance. So, we'll just assume that the absolute value of every measure in stratum k is roughly 2k.
That leaves just the number of measures that meet the query criteria as the big source of variance. We know the total row count for the stratum going in. What we don't know is how many of those rows are included in the query. Nor do we know the sign of each included row. So, what we really have is a sum of identically distributed random variables which can take on values of 0, -2k, and 2k. They aren't really independent, but I don't think the independence assumption is violated in a way that matters.
Knocking out the zeroes is easy. Let θ be the proportion of rows in the query. We start with a prior that nearly all the rows are included in the query result (we can't set it all the way to 100% or the posterior becomes degenerate). As blocks get sampled, we replace that prior with the posterior given the rate at which rows have actually been included. There are several possible distributions that could be used for the prior on θ. The Beta distribution is the obvious choice, since it forms a conjugate prior, but it may not be the best one.
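If the Beta prior is the one that gets used, the update is just a matter of adding counts. Here's a rough sketch under that assumption; the Beta(9, 1) starting point and the function names are placeholders, not values I've settled on, and it treats each sampled row as an independent Bernoulli trial, which is only approximately true:

```python
def update_beta(alpha, beta, rows_included, rows_excluded):
    """Conjugate update of a Beta(alpha, beta) prior on theta.

    Each sampled row is treated as a Bernoulli trial (an assumption):
    included rows add to alpha, excluded rows add to beta.
    """
    return alpha + rows_included, beta + rows_excluded


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior: mean 0.9, i.e. "nearly all rows are included",
# but not 100%, so the posterior can still move.
alpha, beta = 9.0, 1.0

# Hypothetical block of 100 rows, 20 of which met the query criteria.
alpha, beta = update_beta(alpha, beta, rows_included=20, rows_excluded=80)
posterior_mean = beta_mean(alpha, beta)
```

A weak prior like this gets swamped after the first block or two, which is the behavior we want when the optimistic starting guess turns out to be wrong.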
The positive/negative split, which we'll call λ, is more problematic. I think the most defensible position is to set the prior to the overall +/- proportion of the stratum. Again, there are a number of prior distributions to choose from and only some empirical testing will indicate which is best.
So, each value in stratum k is now a ternary random variable Xki where
P(Xki = 0) = (1-θ),
P(Xki = -2k) = θ(1-λ), and
P(Xki = 2k) = θλ.
We then just compute the variance for the sum of all the unsampled values across all strata (obviously there's no variability in the estimate for the rows we have already sampled).
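As a first pass at what that variance looks like, here's a sketch that conditions on fixed values of θ and λ and treats the rows as independent. Both are simplifying assumptions: the real calculation also has to account for the uncertainty in θ and λ themselves. Here m stands for the assumed magnitude in the stratum, and the function names are illustrative:

```python
def ternary_variance(m, theta, lam):
    """Variance of X where P(X=0) = 1-theta, P(X=-m) = theta*(1-lam),
    and P(X=m) = theta*lam.

    E[X]   = m * theta * (2*lam - 1)
    E[X^2] = m^2 * theta
    """
    mean = m * theta * (2.0 * lam - 1.0)
    second_moment = m * m * theta
    return second_moment - mean * mean


def unsampled_sum_variance(strata):
    """Variance of the sum of all unsampled values across strata.

    `strata` is an iterable of (n_unsampled, m, theta, lam) tuples.
    Assumes independence, so the variances simply add.
    """
    return sum(n * ternary_variance(m, theta, lam)
               for n, m, theta, lam in strata)
```

Note that when θ = 1 and λ is 0 or 1 the variance drops to zero, as it should: every unsampled row is then known to contribute the same value.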