Despite a couple of drawbacks, the Beta distribution is the obvious choice for my prior on the proportion of rows returned by a query. For those who haven't done much with Bayesian stats, the Beta is a favorite for proportions because it's a "conjugate prior": the posterior is also a Beta, obtained by simply adding the observed hit and miss counts to the prior's parameters. That makes iterative application easy.
It's more than just a mathematical convenience; it does a nice job of representing uncertainty. Don't have a clue? Use Beta(1,1) and your posterior simply reflects the proportion in your data, with intervals close to "traditional" frequentist ones. Somewhat sure, but willing to have your mind changed? Use Beta(a, b) where a/(a+b) is your believed proportion and a+b indicates how strongly you believe it. If your data size is greater than a+b, the posterior will be weighted towards the data; if it's smaller, the prior will show through more strongly. The graph below illustrates:
Here, the prior indicates a belief that we should get 16 hits for every 36 misses. We sample the data and get 57 hits and 43 misses. The posterior now predicts (16+57) hits for every (36+43) misses. The confidence interval has been tightened appropriately. We could now apply more data using the posterior as the new prior and get the same final result as if we had applied all the data at once. Convenient, intuitive, and flexible.
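To make that update concrete, here's a quick sketch with SciPy using the numbers from the graph (the 95% level is just for illustration):

```python
from scipy.stats import beta

# Prior belief: 16 hits for every 36 misses -> Beta(16, 36)
a_prior, b_prior = 16, 36

# Observed sample: 57 hits, 43 misses
hits, misses = 57, 43

# Conjugacy: updating is just adding the observed counts to the parameters
a_post, b_post = a_prior + hits, b_prior + misses   # Beta(73, 79)

for label, a, b in [("prior", a_prior, b_prior), ("posterior", a_post, b_post)]:
    lo, hi = beta.interval(0.95, a, b)   # central 95% interval
    print(f"{label}: mean = {beta.mean(a, b):.3f}, 95% interval = ({lo:.3f}, {hi:.3f})")
```

The posterior mean lands between the prior's 16/52 and the data's 57/100, pulled towards the data because the 100 observations outweigh the 52 pseudo-observations in the prior, and the interval tightens accordingly.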
But the real world is often messier than that. Because these aren't truly random samples, but samples of blocks of correlated rows, I need the posterior to reflect a pessimistic view of the uncertainty. Specifically, I want it to assume that there's a lot of relevant data out there that we simply haven't gotten to yet (until the data we have gotten to overwhelmingly suggests otherwise). That would suggest using Beta(a, b) where a is very large and b is very small, even though I don't really believe that. In fact, I'd like to assume that EVERY row is included. The problem with setting a prior to 0% or 100% is that it's no longer a distribution at all, simply a statement of fact. The prior puts zero probability everywhere else, so no amount of contradictory data can move the posterior; it just degenerates back to the prior. (There's a Bayesian "proof" of God's existence that counts on people missing this detail.)
There is an out. I don't actually need a prior on θ for the first iteration. I only need the point estimate to compute the variance of the observations. I can just set it to 1 and compute the confidence interval for the stratum (which will obviously be quite wide). Then, after the first block has been sampled, I can set the posterior to Beta(N+h, m), where h and m are the number of hits and misses from that block and N is the total number of rows.
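A minimal sketch of that bookkeeping (the function name is mine, and it only covers the Beta parameters, not the stratum-sum interval itself):

```python
def after_first_block(N, hits, misses):
    """Pseudo-counts after the first sampled block, under the pessimistic
    'assume every row is a hit until shown otherwise' stance.

    N      -- total number of rows in the stratum
    hits   -- hits observed in the first block
    misses -- misses observed in the first block
    """
    # Before any sampling, the point estimate of theta is simply 1.
    # The pessimistic stance contributes N pseudo-hits and no pseudo-misses,
    # so after the first block the posterior is Beta(N + h, m).
    return N + hits, misses

# e.g. a 1,000-row stratum whose first block yields 240 hits and 10 misses
print(after_first_block(1000, 240, 10))   # (1240, 10)
```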
The rub is if the first block has no misses: then I'm back to a degenerate prior. That's not really terrible. If I just keep adding hits and misses to the parameters, the updating still works; it just doesn't make a lot of intuitive sense.
The real problem is that the uncertainty isn't going to collapse fast enough. Remember, I don't really believe this prior; I'm just setting it that way to keep the algorithm sampling. After sampling all the data, I'll have a distribution on θ of Beta(N+H, M), where H and M are the total hits and misses for the stratum. That's clearly nonsense: everything has been sampled, so there's no uncertainty left, and it should be just θ = H/(M + H). So, I'm going to use a little sleight of hand. Instead of treating the sampling as cumulative, I'm going to treat it as replacing. That is, a sampled block now replaces its share of the prior's pseudo-counts rather than piling on top of them. Under that scheme, the prior becomes Beta(N-(h+m), 0) and the posterior is Beta(N-m, m). Repeating this over all subsequent samplings results in a final state of Beta(H, M), which is exactly what we're looking for.
Mathematically, this is a bit bogus, and I wouldn't feel good about it if I were giving a confidence interval on θ. However, I'm not. I'm just using a point estimate of θ to compute a confidence interval on the stratum sum. So, at each step, I'll just use θ = (N-M)/N, where M is the cumulative number of misses so far. This gives a well-defined point estimate at each step and converges to the true answer if we end up sampling the entire stratum. Given that, I don't need to stress over the actual distribution.
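Here's a quick sketch of that replacement bookkeeping on a made-up 1,000-row stratum (the block sizes and counts are invented purely for illustration):

```python
def replacement_scheme(N, blocks):
    """Walk the replacement scheme over sampled blocks.

    N      -- total rows in the stratum
    blocks -- (hits, misses) pairs, one per sampled block
    Yields the Beta pseudo-counts and the point estimate after each block.
    """
    a, b = N, 0                 # start by pretending every row is a hit: "Beta(N, 0)"
    cum_misses = 0
    for hits, misses in blocks:
        a -= hits + misses      # the sampled block replaces unsampled pseudo-hits...
        a += hits               # ...then the observed hits go back in
        b += misses             # ...and the observed misses accumulate
        cum_misses += misses
        theta = (N - cum_misses) / N   # the point estimate actually used
        yield (a, b), theta

# Example: a 1,000-row stratum sampled in four 250-row blocks.
N = 1000
blocks = [(200, 50), (240, 10), (180, 70), (230, 20)]
for (a, b), theta in replacement_scheme(N, blocks):
    print(f"Beta({a}, {b}), theta = {theta:.3f}")
# The final state is Beta(850, 150) -- that is, Beta(H, M) -- and
# theta = 850/1000, which is exactly H/(M + H) once the stratum is fully sampled.
```

Note that the intermediate θ values are just the pessimistic stance applied to whatever hasn't been sampled yet: every remaining row is counted as a hit until it's actually seen.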