Saturday, March 31, 2018

Searching for the lost

The lost piece of my variance formula, that is. (Those who know me personally know that I take this whole Jesus died and rose again thing fairly seriously. That doesn't mean, however, that one can never make a lighthearted comment about it. If you disagree with that statement, you might want to skip both today's and tomorrow's posts).

Anyway, we're told that between dying on Friday and rising sometime early Sunday, Christ hung out in hell. Exactly what he did there is a point that some folks like to debate. Some claim he went there looking for lost souls. Whatever he was up to, it probably sucked worse than my last two days, which have been spent looking for what part of my variance was lost when we approximate the vector of hit rates.

But, maybe not by much. Searching for computation errors is pretty much my version of hell.

The formula worked fine when calculated directly and I wasn't seeing anything wrong with the proof.

Substituting in the moments for a random variable is a pretty standard trick, so I was really miffed that the formula was not only off, it was outright wrong. Breaking the blocks into smaller partitions was increasing the variance. The opposite should be true.

So I spent yesterday afternoon and evening going through the derivations really carefully to see what was missed. I did find a reversed sign (as I've noted before, I do that a lot), but nothing that explained the real problem.

So today I did what I only do as a last resort because I'm really bad at computation. I sat down and started working through the whole thing by hand. Well, just in time for Easter, I've found the problem. And, it was pretty insidious.

The formula is actually right, but it contains the fairly innocuous-looking term:



The nasty bit is that that term is almost always zero when you are computing each partition individually. That's because most batches don't span more than two blocks, so the variance of any particular block partition given a batch m will be zero because there's at most one entry. When you blend that, you wind up with



and that guy is almost never zero because now you're looking at the variance of partition sizes across all blocks. So, the more you chop up the data, the bigger the approximation gets even though the true value is going down.
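
To make the distinction concrete, here's a sketch of the kind of decomposition involved (my best reading of the mechanism, not the exact term as it appears in the paper):

    \operatorname{Var}(b) \;=\; \operatorname{E}_m\!\big[\operatorname{Var}(b \mid m)\big] \;+\; \operatorname{Var}_m\!\big(\operatorname{E}[b \mid m]\big)

Computed block by block, the conditional piece is zero whenever a batch contributes at most one partition entry. Substitute pooled sample moments, though, and the conditioning quietly disappears, leaving the across-block variance of partition sizes, which is anything but zero.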

The fix is easy enough - just drop the term. Like I said, it's almost always zero and even when it's not, it's not very big compared to the other terms in the equation. Having done that, my sampler quickly rose from the dead and began producing some decent results. Sorry, no graphs yet, but at least I've got stuff to work with again. And, I got the paper updated so my adviser can continue to review it.

I can now go to Easter Vigil and think about what really matters. I'll pick up my research again tomorrow evening.

Thursday, March 29, 2018

Enough theory

Yes, that's a direct quote from my adviser! Let's get some stinking results and put this thing to bed. As I have a 3-day weekend ahead of me (yes, I will take Sunday off and I'll also go to church tomorrow), I should be able to get the empirical results we're after. Well, at least the first set.

The full-general 3-stage MCMC results that I've proposed for the grand finale might not come together quite so quickly, mainly because I'm not entirely sure how to do it. Still, the target is in sight and, since I can't spend the weekend running, there's no reason not to set expectations high.

Tuesday, March 27, 2018

Oh, snap.

And not in a good way.

It looks like I've got a stress fracture. I went for my normal morning run Sunday. As is my custom, except when the ground is frozen, I ran barefoot. Since we had cold rain all night, my feet were pretty cold, but otherwise there was no discomfort.

When I got done I wiped the mud off my feet and noticed the telltale bruise of a stress fracture above the third metatarsal. If it is a stress fracture (it's pretty hard to know for sure for a few weeks), this will be the third time I've had one. It's been a different metatarsal each time.

The first time, I thought it was just a bruise and kept running until it became an overt fracture. There wasn't much mistaking it when that happened. I was in a race when it snapped and had to hobble the remaining mile or so to the finish. The second time, I recognized the pattern and let it heal before things got really bad.

Hopefully, that will be the case again this go-round but, at the very least, it means a few weeks off at a time when I was hoping to start ramping up my mileage. Bummer. The amount of stress in my life right now will not be easy to handle without running as a coping mechanism.

Monday, March 26, 2018

About that todo list...

16 hours at work today. Not a whole lot of research going on.

I knew this was coming; we always ramp up in April and May. I was hoping to have the paper done before it hit. The next month will be telling. I'll either get the paper done, despite the growing demands at work, or I won't.

What I'll do in the latter case is an open question that I'm not going to seriously consider unless the predicate becomes a reality.

Sunday, March 25, 2018

Edged out

I think I have a reasonable progression of logic that gets r/(r-1) in front of my sample variance. It's not a formal proof by any means, but I don't think that's necessary since any competent statistician knows you have to do that to remove the bias of the sample moments.
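
For the record, the adjustment itself is just the usual Bessel-type correction applied to the plug-in moment estimator (writing r for the number of sampled units, which is my reading of what r is here):

    s^2 \;=\; \frac{r}{r-1}\,\big(m_2 - m_1^2\big),
    \qquad m_1 = \frac{1}{r}\sum_{i=1}^{r} x_i,
    \quad m_2 = \frac{1}{r}\sum_{i=1}^{r} x_i^2

The bare moment substitution gives the biased (population) form; the r/(r-1) factor undoes that.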

Meanwhile, I shored up the arguments quite a bit and streamlined the discussion. I now have a section that is somewhat readable.

There are still some holes:

  • I've pointed out that the edge case where the sampled blocks return no rows is pathological when using the most straightforward estimators. I haven't shown that my enhanced estimators do any better.
  • I don't have any empirical results (or even a data set to get such results) for the most general case. That's my priority for tomorrow.
  • There are still a few gaps in the logic that may or may not need closing - I'll let my adviser make that call.
  • My list of references is still way shorter than I'd like.
So, there's work to do but, dang, this is starting to look like a real paper.

Saturday, March 24, 2018

Edge cases

I haven't been posting because, frankly, working out these distributions (and typesetting them) is consuming pretty much every free moment. I did derive a somewhat interesting result tonight that deserves some mention.

Because I don't want to make any assumptions about the underlying distributions, I'm doing all my variance calculations using method of moments (compute the first two moments from the sample and plug them back into the variance equation). I'm doing this on just about everything: the observation amounts, the hit rate, the number of blocks in a partition... For the most part, it is then just a matter of cranking out the formulas. Of course, it doesn't hurt to check your work by plugging in some numbers you know and seeing that you get what you expect.

So, I did that. I set the number of partitions per block exactly equal to 1 and the variance of the block hit rate came out to exactly the variance of the partition hit rate. Except that it didn't. It came out to the population variance. But, if this is a sample, it really should come out to the sample variance. This isn't too terribly surprising. After all, I'm just matching up the moments, so there's no distinction made between a sample or population moment. Still, it's unsettling.

Then, I set the number of partitions to be exactly the row count for the block; basically my uniform case. And, again, it came out just right except that it was matching as a population rather than a sample.
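
A quick toy check of what's going on (hypothetical numbers, not my actual sampler): plug the sample moments straight into the variance formula and you get numpy's population variance, not the sample variance.

    import numpy as np

    x = np.array([2.0, 5.0, 7.0, 11.0])   # toy data, n = 4
    m1 = x.mean()                          # first sample moment
    m2 = (x ** 2).mean()                   # second sample moment

    print(m2 - m1 ** 2)                    # 10.6875  (method-of-moments plug-in)
    print(np.var(x))                       # 10.6875  (population variance, ddof=0)
    print(np.var(x, ddof=1))               # 14.25    (sample variance, bigger by n/(n-1))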

This doesn't much matter when the sample gets large, but at small values it can throw things off quite a bit. I can adjust for it easily enough, but coming up with a mathematically sound way of doing that (as opposed to just messing with priors until it works) might be a bit of a trick. This seems like a good question for my adviser. It's entirely possible there's a standard trick that I just don't know about.

Thursday, March 22, 2018

What are the chances?

Somewhat off-topic post tonight.

I pulled up my usual weather app this morning and noted an interesting anomaly.

Why would the probability of rain be exactly 65% for such an extended period? (I'm guessing it's really 66.7%, or 2/3, but that the app rounds to the nearest 5%.)

The "chance of rain" is actually complete fiction. Pin any meteorologist down on this if you don't believe me. Is it the likelihood that it will be raining at that very instant? Is it the probability that some precipitation will be measured within some fixed time interval? Is it the odds you would take or give if betting? Is it the historical percentage of observed rain when similar pre-conditions are observed?

No, it's none of those. Why? Because most people don't have a clue how probability works. Look at the last presidential election. Nearly all the polls (and certainly the average of the polls) indicated that there was a 20-30% chance that Trump would win. Yet, people still marvel about how "wrong" the pollsters were. They weren't wrong. A 20-30% chance is the same chance a major league baseball player has of getting a hit when they come to the plate. Do people think that getting a hit in baseball is some sort of bizarre rare event? Of course not. It's just less likely than getting out.

But, if you tell someone that the chance of rain is 75% and they cancel their picnic plans, they'll be mad if it turns out to be a nice day. So, the weather folks play this silly game of trying to come up with a "probability" that conveys what they really mean, even though it has nothing to do with actual probability.

I can't know for sure, but my guess is that the algorithm that generates probabilities for the weather app I use has some sort of harness that prevents the chance of rain from going beyond 2/3 in certain conditions, because people will interpret anything higher as "It's definitely going to rain."

Interestingly, a couple hours later, they had changed it to look a lot more like a real super-accurate prediction (any honest meteorologist will also admit that the difference between 50% rain and 60% is meaningless more than a few hours out). Clearly, the same algorithm is being used as the shape below the bound is identical, but the bound has been removed.


Tuesday, March 20, 2018

Notation

Half the battle in mathematics is the invention of a good notation. - Pierre-Simon Laplace.

Hopefully, I'll do better in whatever the other half is.

I spent the bulk of last night filling the whiteboards in one of our larger conference rooms with the terms of the block variance. I was so excited that the really problematic terms factored out and could be pre-computed that I didn't take much time to note what they really were. Instead, I just immediately dropped them into a function and noted that the value could be obtained by a simple lookup at query time. That's true, but not super helpful to the reader who's trying to understand what we're doing.

This morning, I looked again at my work and was rather dismayed that I had missed expressing the formula in terms that everybody understands.

The original term was this:



and I just went with the rather dubious



(The "1" in the subscript indicates that we're looking at the first moment of b. There was a similar term for the second moment.) That's fine, you can define a function to mean whatever you want, but these terms aren't really that mysterious. We're basically just taking the probability of an event m and multiplying it by the expectation of the partition row count given m. These sorts of terms come up all the time in distributions, which is why we have standard notation for them:



Anybody with even a passing knowledge of statistics will know what the right hand side of those equations means. The left hand side gives no indication whatsoever as to what we're doing. Granted, we'll store the product as a single value that will get looked up but, for expository purposes, the right hand side is a much better representation.

Monday, March 19, 2018

Some crazy long formulas

I worked out the exact block variance formula for the general case, taking into account the differences in the partition sizes. It's pretty nutty, but it actually reduces quite nicely due to the fact that we know the partition sizes up front. Therefore, much of the formula can be computed at load time and just looked up in a table at query time. I'm calling this a big win on two fronts: 1) lots of math; 2) it actually computes quickly.
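
As a sketch of what "compute at load time, look up at query time" amounts to (hypothetical names; the real thing lives inside the database, not a Python dict):

    import numpy as np

    def precompute_block_terms(partition_sizes_by_block):
        """Load time: per-block first and second moments of the (known) partition sizes."""
        table = {}
        for block_id, sizes in partition_sizes_by_block.items():
            b = np.asarray(sizes, dtype=float)
            table[block_id] = (b.mean(), (b ** 2).mean())
        return table

    # Query time: no re-scan of the partition layout, just a lookup per sampled block.
    table = precompute_block_terms({0: [40, 25, 35], 1: [100]})
    m1_b, m2_b = table[0]
    print(m1_b, m2_b)    # 33.33..., 1150.0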

Yay for that.

Sunday, March 18, 2018

Nothing recognizable

I have a distribution for the number of rows in each block partition. It's a bit messy, so I won't post it here. It's also a recurrence relation, though I could convert it to closed form if there were a good reason to do that. Anyway, since it wasn't obvious by looking at it what the distribution actually looked like, I plotted the partition sizes for my sample. At only 1000 blocks, I didn't expect it to be a great match, but I was a little surprised by the shape.

The partitions that make up the sample are exponential (well, actually geometric, since these are discrete values, but the block size is large enough to use the continuous approximation). That is, when I'm generating them, I simply use a uniform random variable at each row to decide whether it should be the last observation in the partition. Since the mean partition size is equal to the mean block size, that also means that about 370 of them should be longer than the block (which turns out to be the case). The graph doesn't look very exponential to me, though (I've removed the x=1000 point since it has so many more observations than the rest of the distribution).
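
For reference, the generator is essentially this (a minimal sketch; I'm assuming a block size of 1000 rows to match the x=1000 truncation mentioned above):

    import numpy as np

    rng = np.random.default_rng(0)
    BLOCK_SIZE = 1000            # rows per block (assumed)
    N_BLOCKS = 1000
    p_end = 1.0 / BLOCK_SIZE     # each row ends its partition with this probability

    # Geometric partition lengths: rows until the uniform draw says "stop".
    first_partition = rng.geometric(p_end, size=N_BLOCKS)

    print(np.mean(first_partition))               # ~1000, the mean block size
    print(np.sum(first_partition > BLOCK_SIZE))   # ~368, i.e. about e^-1 of the blocks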


The second partition is a bit more complicated because the truncation point depends on the size of the first partition. So, this one is not only showing the interior of the distribution, but also all the partitions that hit the max row count. Anyway, it's also not particularly recognizable as anything to me. The same goes for partitions three and four, though the fourth actually does look somewhat exponential or, at least, like some member of the gamma family.

The fifth and sixth partitions didn't have enough non-zero values to be useful (and no block contained more than six partitions).

So, that exercise was pretty non-enlightening. What is important is that the covariance terms are quite significant (as expected) when looking at things this way. I'm hoping to have all this quantified tomorrow, though none of this has gone as quickly as I would have thought, so I don't know why this piece would be any different.

Friday, March 16, 2018

Not quite so exact variance

As I was typesetting yesterday's derivation into my paper, I was struck by the sudden realization that I had an extra random variable in that equation, and it's a really important one. In case it's not obvious, extra random variables tend to increase randomness and that increases variance.

I had been a bit bugged by the fact that the covariance term between the various partitions should have been more significant. Yet, the results from the simulations were matching pretty closely without it, so I figured I'd come back to that later. Well, that's because I had baked the randomness out of my variance estimator as well.

The "missing" random variable is the number of rows in each partition of the batch (bk in the equations from yesterday). For a fixed set of partition sizes, the formula is spot on but it doesn't account for the fact that they are not fixed. Every block is chopped up a little differently. The reason I got away with it is because my estimator also fixed them. I just took the average value across all blocks. That's not a bad estimate when partitions tend to be at least as large as a block. When you have a mix of small and large partitions, the variance can be understated. If the small blocks have one distribution and the large blocks have another that is quite different, the variance can be significantly understated.

More importantly, from the standpoint of understanding the problem, by making bk a random variable, the variance term goes way up and the covariance term brings it back down (the covariance is always negative because, in a fixed-size block, if one partition is bigger, at least one other partition has to be smaller). This dynamic helps explain what's driving the overall variance. Some samples have a high variance because the observations are very volatile. Others are high because the partitions are really different. The latter case is a much bigger problem from a sampling perspective, so it's helpful to know that's what you're dealing with.
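
In symbols (a sketch of the mechanism, not the full derivation): because the block's row count n is fixed, the partition sizes have to trade off against one another, which is what drives the covariance piece negative:

    0 \;=\; \operatorname{Var}\!\Big(\sum_k b_k\Big)
      \;=\; \sum_k \operatorname{Var}(b_k) \;+\; \sum_{j \neq k} \operatorname{Cov}(b_j, b_k)
    \quad\Longrightarrow\quad
    \sum_{j \neq k} \operatorname{Cov}(b_j, b_k) \;=\; -\sum_k \operatorname{Var}(b_k)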

How much the covariance term brings it down depends very much on the distribution of the hit rates for each partition. If there are a lot of partitions with no hits, the covariance will be zero between those, so the overall variance stays pretty high. This is absolutely correct. In the extreme case, where all the hits are concentrated in one partition, your estimator is basically useless (analogous to an infinite variance) until you find that partition.

The rub is that this does not simplify the variance equations. It's not that I can't do the derivations, that's just more work. (Tedious and error-prone to be sure, but just work). It's that, the more complicated the expression, the more difficult it is to build a decent estimator. More importantly, the more difficult it is to get a distribution on that estimator.

At least I've found some more math.

Thursday, March 15, 2018

Exact variance

The real trick, of course, will be estimating this thing, but here's the real block variance in the general case. I'm not going to attempt to convert this to HTML; it's enough of a pain to typeset it in the paper.


Wednesday, March 14, 2018

Full general

Up until now, I've been focusing on special cases in my sampler. I started with iid data, simply sampled in blocks rather than as individual rows. That moved on to the hit rate being correlated within blocks, and then to the hit rate being correlated within partitions within blocks. It's time to finally take on the most general case: each partition gets its own distribution and hit rate. No caveats.

On the one hand, it's easier. Since we are refusing to make assumptions, we're pretty much left with empirical methods. On the other hand, the point of this paper is to rigorously state things that most people just wave their hands at. (This emphasis is not merely academic; when I take these ideas to the actuaries at work, they are willing to listen, but they absolutely want proof. This company has 6 trillion - yes, 12 zeros -  dollars at risk; they aren't going to stake that on something that appears to work).

Monday, March 12, 2018

Tighten up

It's taking me a pretty long time to learn a pretty simple lesson: stop dropping terms.

Dropping the partition terms from the block variance was definitely the difference. I put them in (estimating the squares of the partition sizes from the observed partitions) and things look quite a bit better:
The true block variance is just under 16000 and we can see that, even for samples of just 5 blocks, we're not off by much. That said, the observed miss rate of 6.8% is high and that's what sent us to the kernel sampler in the first place, so I put the weaker posterior back in. Basically, we discount the data by the percentage we've sampled, so when we've only sampled 5 of 1000 blocks, the posterior kernel distribution is still pretty diffuse.
It's still not perfect, but it's a lot closer. And, as you can see, after only 50 blocks are sampled, the high bias on the estimate is all but gone. I think I could even get the acceptance curve flatter by deriving the true distribution of the sample variance. That's still on my todo list, though it might not be quite as urgent given these results. I think I'll move on to the more general case where both the hit rate and measure distribution depend on the partition.

Sunday, March 11, 2018

Time to stop guessing

I've been trying to sort out the behavior of the correlated block variance for a couple weeks now and haven't had much success. I get that the correlation prevents it from being Chi-squared (if that wasn't true, this whole line of research would be rather pointless). The question is, what is it? Nothing I recognize.

I was hoping that the kernel distribution would either fix it or suggest an easy workaround via sensitivity analysis. No such luck on either. Here's the sample and kernel distribution of the block variance when five blocks are sampled:
They're both bimodal, and neither yields a reliable confidence interval with either a t- or Normal bound.

Well, maybe 5 is just too small. Nobody would stop sampling that early, anyway. What's it look like when we get to 20?
Now we have a different problem. The sample variance is basically unimodal, but the skew is in the wrong direction. The kernel distribution has the same problem, plus it appears to be converging to a biased mean (and this was after pulling out my "heavy" prior that slowed convergence, so the bias isn't coming from that).

By 100 blocks sampled, we have a pretty good sample variance distribution but the kernel is still way off.
The problem here is that I'm using the homogeneous bound on the block variance which is always greater than the partitioned variance. I did that because, well, because the distribution was really messy and it looked like the bound would be tight enough. It's obviously not, so I need the real answer and not an estimate.

I really don't know what we're dealing with here, but I think I'm going to have to roll up my sleeves and grind out the real distribution of the variance and test statistic to make any further progress.

Saturday, March 10, 2018

Killin' it

Apologies to anyone coming here in search of heroics in math or ultrarunning. Today, you get proud papa.

Yaya's middle school jazz band played the MAC Jazz Festival today. Yaya had solos in three of the four songs and nailed all of them. Yaya also decided on a whim to take the finish of one of the tunes up an octave, which elicited an audible gasp from the audience, many of whom apparently didn't know that middle school kids can play that high. The rest of the kids played great as well. As a result, the band was recognized as the top middle school act at the festival.

We didn't even know that Parkway North had a decent music program when we moved here. Yaya was not even two and hadn't shown any early interest in music, so we didn't bother to ask. In elementary school, when the interest did start to show, we were happy to find that the district in general, and North in particular, had a very strong music program. Yaya is all over it now and looking forward to performing with the High School groups next year. Definitely a lucky break.

But, as has often been said, "luck" is the intersection of opportunity and preparation. Yaya is earning this. We're just happy that the opportunity is there.

Friday, March 9, 2018

Who cares (part 2)

I think I just figured out who cares: I do. More importantly, my boss does. And several layers of management above that.

As some may recall, I've been a bit concerned that my research, while improving in academic merit, has risked becoming irrelevant. Today, I realized that's not the case (or, at least, doesn't have to be the case).

Up until now, I've been framing this as a sampling problem (uh, oh, there he goes reframing things again). The sampling problem is not without merit and before my adviser, who reads these posts, gets a heart attack, I'm not straying from that for the immediate paper. However, the real power is in what it allows from a data organization standpoint.

A brief anecdote may help here. Back in 2012, the actuaries asked us to build an Analytic Cube based on policy attributes. That sounds like an easy thing and, from a data architecture standpoint, it is. From a data engineering standpoint, not so much. The problem wasn't that we'd have 60 billion fact rows. We had built cubes that size before. The problem was that the policy attribute dimension would have around 4 million rows. That was in addition to all the existing dimensions, several of which have a cardinality close to 1 million.

We spent no small amount of time trying to figure out how to partition the data so that we weren't doing ridiculously large "joins" (multi-dimensional data doesn't technically get joined, but the operation of attaching dimensions to facts is similar in terms of computation). Wouldn't it be nice if we could just indicate the structure and let the database tune itself?

That's sort of what D-BESt does, but that's really focused on just maximizing partition pruning. What if we took a step beyond that? To answer that, we'd have to have some idea of what that step would be. I sort of stumbled onto that while coding my sensitivity analysis stuff this week (yes, I have actually got some work done; I've just been too busy to post about it). Sensitivities on the hit rate are pretty easy, but what about the distribution of the measures given a hit? That's a lot more open ended.

Recall that our basic framework for the block distribution is driven by three parameters: the hit rate, the mean of the included observations, and the variance of the included observations. D-BESt does a great job of handling the hit rate. It tends to be zero or fairly high, and we generally don't need to read the block to know that. But, what about the other two? Having them relatively homogeneous within a block might also be of great value. If we knew the distribution for the whole data set, and we knew how those related to hit rate, we could do a much better job of not only sampling and performing sensitivities but also of recognizing what data should be pre-aggregated and where the anomalies lie (that last bit is really what the actuaries are after).

For example, if we already have data somewhat stratified by mean and variance, we can bring sampling algorithms like CISS into play.

We can also create good sub-samples of the data by taking a slice from each block.

Aggregates can be built based on how often a block gets "split", that is, how often a query produces a high hit rate for some of the block and a low hit rate for the rest.

We can ask why the data gets sorted the way it does. Are there patterns in the organization that are useful to machine learning? If one particular attribute doesn't conform to the pattern, is that a problem?

Most importantly, this leads to pre-computation. If we can anticipate a query, it doesn't so much matter that it takes a while to answer it because we can start doing so before the query is even issued. Sorting the data into pre-computed blocks of like distribution is the first step in that.

Doing this within the context of rigorous statistics (rather than the more Data-Sciency approach of just trying stuff and seeing what generally works) is not a small task. But, that's why I'm thinking this line of inquiry may still have merit, even if it has strayed a bit from the original question. It's hard, feasible, and valuable. That's a pretty stout combination.

Tuesday, March 6, 2018

Prove it

I have a kernel method that works. Intuitively, it makes sense. Empirically, it yields correct results (at least, asymptotically; it's a bit conservative early on, which isn't a bad thing). But, I don't yet have a solid mathematical basis for it. It's really the result of just messing with the simulation.

There are two problems with that. First off, the simulation is just that: a simulation. You can fairly easily cook up a simulation to support just about anything. Simply bake your assumptions into the sim. The real world may not be so accommodating. The other problem is that "proofs" by simulation don't give people good insight into the assumptions of the method. No method works everywhere. Only by knowing what situations are supported can one know when NOT to use a method.

So, I'm now working on the mathematical basis for my method. It's a fairly heavy lift. These correlated data sets have been largely ignored for the simple reason that they don't yield easy answers and one can usually just work around the problem with heuristics.

However, this is one thing my adviser and I agree on: there really isn't much to this paper without a solid theoretical grounding. The results in and of themselves aren't that interesting. It's only with the theory in place that we will have an adequate foundation for a dissertation.

More late nights to follow.

Saturday, March 3, 2018

Some thinking time

Ultramarathons are a good time to sort things out. I ran the Jefferson County Endurance Trials today. For what it's worth, I won my age group. I'll get to that with a race report sometime in the near future.

Meanwhile, I did formalize my approach to the sensitivity analysis.

Since I'm already building a distribution on the hit rate, ρ, from a kernel distribution, the easy thing would be to simply add to the kernel a weighted distribution derived from a fictitious block. That, of course, raises the question of which distribution and what weighting.

The root question we're trying to answer is "what if what we haven't sampled is way different from what we have?" Put another way, how much does our estimate change if the very next block could be pretty much anything?

Since ρ is itself a probability, it's bounded between zero and one. For all the sampled partitions, we're generating a Beta distribution weighted by the number of rows in the partition. The distribution of ρ for our "out of the blue" block is uniform(0,1) - it could be anything that passes as a probability. So, chop the interval into several buckets and see what happens if the next block has a hit rate coming from one of those buckets. The next block has to come from one of them. So, if you have m buckets, just create m estimates of the variance, where each one picks one of the buckets and adds a uniform distribution on that bucket, weighted by the block size, to the composite distribution.

If the sample is stable, these m variances will all be pretty similar. If not, that means that the variance is very sensitive to the next block sampled and it would be a really good idea to sample that block before drawing any conclusions.
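
Here's roughly what I intend to code up (a sketch with made-up function and variable names; rho_draws stands in for a sample from the current kernel distribution of ρ):

    import numpy as np

    def sensitivity_variances(rho_draws, block_rows, rows_sampled, n_buckets=10, n_sim=20000):
        """For each bucket of [0,1], add a fictitious block with a uniform hit rate on
        that bucket to the kernel (weighted by block size) and recompute the variance
        of the blended distribution of rho."""
        rng = np.random.default_rng(1)
        w_new = block_rows / (rows_sampled + block_rows)   # weight of the invented block
        edges = np.linspace(0.0, 1.0, n_buckets + 1)
        variances = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            use_new = rng.random(n_sim) < w_new            # mixture: fictitious vs. sampled
            blended = np.where(use_new,
                               rng.uniform(lo, hi, size=n_sim),
                               rng.choice(rho_draws, size=n_sim, replace=True))
            variances.append(blended.var())
        return np.array(variances)

    # If these m variances sit close together, the estimate is stable; if they fan out,
    # sample another block before trusting the interval.
    print(sensitivity_variances(np.random.default_rng(2).beta(2, 5, 5000),
                                block_rows=1000, rows_sampled=20000))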

Friday, March 2, 2018

The big miss

The reason the actuaries are leery of using sampling techniques in their analysis is one backed up by fact: you might miss something really big. (Most of the habits of actuaries are backed up by facts; you'd be hard pressed to find a more rational group of folks. I guess spending 2000 hours a year putting a price tag on death has that effect).

I can't say my research has done anything to assuage that fear. For example, here's the kernel distribution of the hit rate from a sample that represents the overall population:


Everything lines up nicely. The sample average is very close to the true average and the estimate of the block variance is spot on. Notice the bump at 0.65. That is coming from two of the ten sampled blocks. Because the hit rate for those blocks is much higher than for most of the blocks, it's really important that they be represented. Not only does it pull up the average, but it also widens the variance. Say that, by dumb luck, instead of sampling one of those blocks, we had gotten a different one with a lower hit rate.


To the naked eye, this distribution doesn't look that much different, but it is. The estimate is significantly low for both the sum and the variance. This coupling is particularly problematic. Samples that produce bad estimates are also the ones that return the tightest confidence intervals. In other words, the worse the sample is, the better it thinks it is. Not a great combination.

It's not easy to adjust for this because, in the absence of the extreme values, there's nothing in the sample to indicate that the variability is as high as it is. Bootstrapping won't help because we'd still be looking at just the elements in the sample.

So, what if we reversed the process? Instead of resampling the data we've already seen, we invent different data and ask how much things would change if that data were introduced. The actuaries do this all the time; it's called sensitivity analysis. It's basically measuring how badly a missed assumption (or sample, in our case) hurts you. If it doesn't make much difference, you can be pretty confident that your estimates are good. If introducing a single new data point really shakes things up, that's a clue you might want to keep sampling. I'm going to code this up real quick and see how it performs.

Well, not super real quick. I've got a six-hour race tomorrow and I also promised Yaya I'd give her a driving lesson. But, I will get it done this weekend.

Thursday, March 1, 2018

The kernel

I'm pretty sure I've figured out why the confidence intervals are coming out wrong on the correlated bootstrap. I'm not sure what to do about it. My adviser said he was going to take a look at it, so I put it down for a bit and went to work on the kernel method for simple block sampling.

The concept is simple enough. For every block sampled, you separate out the partitions in that block. Then, for each partition, you generate a distribution for rho (the hit rate). Lots of distributions work for this, and they all converge to the same thing, so I just picked the easiest one to defend: the likelihood function for the data given rho. Using such a distribution essentially yields a Bayesian posterior given a non-informative prior. In the specific case of estimating a hit rate, that posterior distribution is simply the Beta distribution, and the parameters are the number of hits and misses observed.

The kernel distribution is just the average of all the individual partition distributions weighted by how many rows were observed in each partition. If the data was iid, that would be all there is to it. Since it's not, we have to also account for the fact that our sample is correlated, so it will vary more than we might otherwise expect. This can be done either by giving more weight to the prior or less weight to the data. Since the result is pretty much the same either way, I took the latter approach and scale my hits and misses by how much of the sample has been read so far.
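
A minimal sketch of that kernel (hypothetical names, and I'm assuming a uniform prior to keep each Beta proper; the real weighting scheme is tuned a bit more carefully):

    import numpy as np
    from scipy import stats

    def kernel_rho_density(partitions, frac_read, grid=None):
        """partitions: list of (hits, misses) observed per sampled partition.
        frac_read: fraction of the population sampled so far, used to discount the data.
        Returns the kernel density of rho: a row-weighted average of per-partition
        Beta posteriors, with the counts scaled down by frac_read."""
        if grid is None:
            grid = np.linspace(0.0, 1.0, 501)
        dens = np.zeros_like(grid)
        total_rows = 0
        for hits, misses in partitions:
            rows = hits + misses
            total_rows += rows
            a = 1.0 + hits * frac_read       # uniform prior (assumed) + discounted hits
            b = 1.0 + misses * frac_read     # uniform prior (assumed) + discounted misses
            dens += rows * stats.beta.pdf(grid, a, b)
        return grid, dens / total_rows

    grid, dens = kernel_rho_density([(0, 400), (120, 80), (35, 300)], frac_read=1 / 200)
    print(grid[np.argmax(dens)])             # rough mode of the blended distribution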

This yields the following expected distribution for rho when 1/200 of the population has been sampled:


Not surprisingly, the spike at zero gets picked up very quickly. If a partition is of any size at all and returns no rows, that's a pretty good indication that rho is zero for that partition. The rest of the distribution, however, is pretty fuzzy. Doubling the sample size helps.


It's starting to look like there are four underlying populations, but it's still not clear how much of the population is in each part. At 5% of the population sampled, we've pretty much got it. We may have to keep sampling to further reduce the variance on our estimator, but our estimate of the block variance is pretty solid at this point.


There is, however, a catch. This is the "expected" distribution for rho. Different samples will yield different distributions. This is the same thing that's messing up the correlated bootstrap. Anything estimated from the sample will have additional volatility because of the correlation in the sample. If we knew up front just how correlated the data was, we could bake that in. Since we don't, it's a bit more dicey.

In the simple block sum case, the best we can do is try to weight the data such that the convergence reflects that uncertainty. This can be tuned based on experience with the data but, since every query is different, it's always going to be a bit of a guessing game. At any rate, since the distribution of the sample sum is a function of the distribution of rho, we can use this estimate of the distribution to construct a distribution and confidence interval for our estimator.
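
As a sketch of that last step (made-up names again; rho_draws would come from the kernel above, and I'm treating the mean measure among hits as fixed for simplicity):

    import numpy as np

    def sum_interval(rho_draws, total_rows, mean_measure, level=0.95):
        """Push draws of rho through to the estimated sum and read off a percentile interval."""
        sums = rho_draws * total_rows * mean_measure
        tail = (1.0 - level) / 2.0 * 100.0
        lo, hi = np.percentile(sums, [tail, 100.0 - tail])
        return sums.mean(), (lo, hi)

    rng = np.random.default_rng(3)
    est, ci = sum_interval(rng.beta(3, 40, 10000), total_rows=1_000_000, mean_measure=12.5)
    print(est, ci)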