Tuesday, October 18, 2016

Less common distributions

Sorry I missed the off-day post; I know some of my readers rather like that one. However, there was a good reason for that: I didn't take an off day. Well, I sort of did; Saturday was pretty light. Sunday, I dug into my thesis research, and a bunch more followed on Monday. I'm still finishing up the details on the paper I started last spring. In particular, I've been looking at distributions to model the sampling of block sums when the number of relevant rows is a small fraction of the total rows in a block (but not zero).

In these cases, I need to model the block sum as something other than uniform, since lower values are much more likely than higher values. The obvious choice is exponential. The memoryless property is particularly relevant here: it doesn't matter how many hits I've found so far; the next row is just as likely (or not) to be included in the sum. The exponential distribution is well understood, but I need to model the parameter that determines the mean of the distribution. Since the algorithm is iterative, a conjugate prior is obviously desirable.
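For reference, memorylessness is the statement that if \(X \sim \text{Exponential}(\lambda)\), then

\[
P(X > s + t \mid X > s) = P(X > t) = e^{-\lambda t}, \qquad s, t \ge 0.
\]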

The conjugate prior for the parameter of the exponential (λ) is the Gamma distribution. That is, if the prior on λ is Gamma(α, β), and n observations are made, the posterior is Gamma(α + n, β + Σx). Basically, what we're saying is that we go into it as if α prior trials had resulted in a total of β. We simply add the new number of trials and the new total to that.
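As a quick illustration (a minimal sketch of the conjugate update only, not the thesis code; the class and names here are mine):

```python
from dataclasses import dataclass

@dataclass
class GammaPrior:
    """Gamma(alpha, beta) prior on the exponential rate lambda
    (rate parameterization: the prior mean of lambda is alpha / beta)."""
    alpha: float  # acts like a count of prior trials
    beta: float   # acts like the prior total of those trials

    def update(self, observations):
        """Conjugate update: add the new count and the new total."""
        return GammaPrior(self.alpha + len(observations),
                          self.beta + sum(observations))

# A weak prior equivalent to 1 trial totaling 2.0, then three observed sums.
posterior = GammaPrior(alpha=1.0, beta=2.0).update([0.5, 1.2, 0.8])
print(posterior)  # GammaPrior(alpha=4.0, beta=4.5)
```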

The next question is when we should abandon the uniform distribution for the exponential. (Recall that the uniform converged quite nicely for queries where large numbers of rows were relevant; we don't want to give that up.) To decide, we need to perform a likelihood test between the uniform and exponential means. Of course, both will converge to normal (via the Central Limit Theorem) given enough samples, but it's not obvious how many samples that takes. So, we need the exact distributions.
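A crude version of that test (a sketch of the general idea only, not the procedure from the paper; the stratum bound b and the maximum-likelihood plug-in are my assumptions) might just compare log-likelihoods directly:

```python
import math

def loglik_uniform(xs, b):
    """Log-likelihood under Uniform(0, b); impossible if any sample exceeds b."""
    return float("-inf") if max(xs) > b else -len(xs) * math.log(b)

def loglik_exponential(xs):
    """Log-likelihood under Exponential(lam) at the MLE lam = n / sum(xs)."""
    lam = len(xs) / sum(xs)
    return len(xs) * math.log(lam) - lam * sum(xs)

# Block sums clustered near zero favor the exponential model.
xs = [0.10, 0.30, 0.05, 0.20, 0.15]
print(loglik_uniform(xs, b=1.0), loglik_exponential(xs))  # 0.0 vs ~4.16
```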

The mean of a series of uniform observations follows a Bates distribution. The pdf is a bit unwieldy; for the mean of n draws from Uniform(0, b):

\[
f(x; n, b) = \frac{n}{2b\,(n-1)!} \sum_{k=0}^{n} (-1)^k \binom{n}{k} \left(\frac{nx}{b} - k\right)^{n-1} \operatorname{sgn}\!\left(\frac{nx}{b} - k\right), \qquad 0 \le x \le b.
\]
That doesn't make it wrong, but we'll have to look at ways to compute it quickly; otherwise, we kill the whole point of this reduced sampling thing. At least that one has no unknown parameters (b is just the upper bound of the stratum we're estimating and n is how many blocks we've sampled; the rest is closed-form).
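To get a feel for the computation (a minimal sketch, assuming the scaled-Bates form above; note that the alternating sum loses floating-point precision as n grows, which is part of the "compute it quickly" problem):

```python
import math

def bates_pdf(x, n, b):
    """pdf of the mean of n draws from Uniform(0, b)."""
    if not 0 <= x <= b:
        return 0.0
    z = n * x / b  # rescale to the standard Bates support
    total = sum((-1) ** k * math.comb(n, k)
                * (z - k) ** (n - 1) * math.copysign(1, z - k)
                for k in range(n + 1))
    return n * total / (2 * b * math.factorial(n - 1))

print(bates_pdf(0.3, 1, 1.0))   # 1.0: n = 1 reduces to Uniform(0, 1)
print(bates_pdf(0.5, 12, 1.0))  # ~4.79, near the normal approximation's peak
```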

On the exponential side, the mean is somewhat simpler. The sum of n observations follows an Erlang distribution:

\[
f(x; n, \lambda) = \frac{\lambda^n x^{n-1} e^{-\lambda x}}{(n-1)!}, \qquad x \ge 0.
\]
The rub is that we now have to integrate that over our prior on λ. In short, lots of math still to come.
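For reference, that integral is the standard compound-gamma computation (whether the paper ends up using it this way is another matter); against the Gamma(α, β) prior it comes out in closed form:

\[
\int_0^\infty \frac{\lambda^n x^{n-1} e^{-\lambda x}}{(n-1)!} \cdot \frac{\beta^\alpha \lambda^{\alpha-1} e^{-\beta \lambda}}{\Gamma(\alpha)} \, d\lambda
= \frac{\Gamma(n+\alpha)}{\Gamma(n)\,\Gamma(\alpha)} \cdot \frac{\beta^\alpha\, x^{n-1}}{(\beta + x)^{n+\alpha}}.
\]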
