Saturday, January 20, 2018

New abstract

I'm jumping the gun on this a bit as there are a few results that I don't have backing for just yet. However, it is helpful to spell out what you intend to show, even if you haven't got there yet. So, here's a cut at the new abstract:

When sampling from a finite population, the statistic of interest is often the sum (rather than the mean) of observations from a sub-population. When the finite population is large (we will consider n > 10^9 as being "large"), estimators for this statistic can be derived from the asymptotic convergence of the mean of a random sample. However, the quality of such estimators must be evaluated in light of the finiteness of the population. Further, the realities of physical storage of large data sets introduce correlations in the data that can significantly impact the quality of the estimate when a truly random sample is replaced by sampling by physical block (the latter being less efficient in the statistical sense, but far more efficient in terms of computational resources).

In this paper we examine the convergence of estimators using block sampling rather than true random sampling for three different methods: random block sampling, the Bag of Little Bootstraps (BLB), and Bootstrap Metropolis-Hastings. We derive adjustments to the estimator quality based on the organization of the data in the physical database, and then demonstrate the correctness of those adjustments by applying the methods to a 40-billion-row data set of cash flow projections from life insurance policies.

The gist is this:

  • In infinite populations, unless the mean is zero, the sum drifts off to infinity, so there's no point in estimating it. In finite populations, the sum is a real thing and it can be estimated.
  • Sure, you can estimate the mean and then just multiply that by the number of observations to get an estimator of the sum. That's actually a good strategy.
  • Problem is, the variance on your estimate of the mean does NOT simply scale up to the variance of the sum, because the usual variance formula assumes you were estimating the mean of an infinite population. If you're trying to assess the estimate of the sum of a finite population, you have to account for how much of the population is unsampled. (Simplest example: a population of size 1. You sample that one observation and call it your estimator of the mean. If your population were infinite, you'd have no way of knowing how good your estimator is, because there's no variance from a sample of 1. In this case, though, you know exactly how good your estimator is: it's perfect. You sampled the entire population.) The first sketch after this list shows the standard finite population correction that handles this.
  • Furthermore, since we're not living in some fantasy world where observations just drop out of the sky, we need to think about how the data is stored. On any computer system capable of handling a large data set, the data is spread across multiple physical storage blocks. We could ignore that reality and just randomly grab rows using a random number generator, but that would be super duper inefficient. So much so that we'd actually be better off just reading the entire data set and getting the real number (I can prove this). So, we grab a whole block at a time and build an estimator from that.
  • All well and good if our observations are iid. Guess what? They're not. Not even close. So, we need to adjust for that; the second sketch after this list shows how much block-level correlation can inflate the variance.
  • And on top of that, we're not after the entire population sum. We want the sum from a sub-population. This sub-population may or may not be spread evenly among the blocks. That drives up the variance of block sampling even further.
  • So far, all of this is pretty self-evident and amounts to hand-wringing. Now comes the big question: what are you going to do about it? We need to roll up our sleeves and include these factors in any assessment of our estimator.
  • We pick three fairly well-accepted estimators and do exactly that.
  • Then we check our work by applying what we've done to a moderately large data set. We'll run it a bunch of times and show that our estimator is within our predicted confidence interval about as often as we'd expect (the last sketch below mocks up that kind of coverage check).
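To make the finite population correction concrete, here's a minimal sketch in Python. The function name and the toy lognormal population are my own inventions for illustration, not anything from the paper, and it assumes simple random sampling without replacement (the textbook case that the block-sampling adjustments build on).

```python
import numpy as np

def estimate_population_sum(sample, N):
    """Estimate the sum of a finite population of size N from a simple random
    sample drawn without replacement, applying the finite population
    correction (FPC) to the variance. Returns (point estimate, standard error)."""
    n = len(sample)
    ybar = np.mean(sample)
    s2 = np.var(sample, ddof=1) if n > 1 else 0.0   # sample variance

    total_hat = N * ybar                    # scale the mean up to a sum
    fpc = 1.0 - n / N                       # goes to 0 as the sample approaches the population
    var_total = (N ** 2) * (s2 / n) * fpc   # variance of the estimated sum
    return total_hat, np.sqrt(var_total)

# When n == N, the FPC is zero and so is the standard error: the
# "you sampled the entire population, so it's perfect" case from the bullet above.
rng = np.random.default_rng(0)
population = rng.lognormal(mean=8.0, sigma=1.5, size=100_000)
sample = rng.choice(population, size=5_000, replace=False)
print(estimate_population_sum(sample, N=len(population)))
```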
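The block-sampling penalty is the standard cluster-sampling story: if rows within a block are correlated with intraclass correlation rho and each block holds b rows, the variance of a block-sampled estimate is inflated by roughly the design effect 1 + (b - 1) * rho. This toy simulation, again my own construction rather than the paper's derivation, compares row-level and block-level sampling under the same I/O budget.

```python
import numpy as np

rng = np.random.default_rng(1)

# A population stored in blocks, with rows correlated within a block
# (each block gets its own mean), which is what physical storage tends to produce.
n_blocks, block_size = 2_000, 500
block_means = rng.normal(100.0, 20.0, size=n_blocks)
population = block_means[:, None] + rng.normal(0.0, 10.0, size=(n_blocks, block_size))
N = population.size

rows_to_read = 10 * block_size   # same number of rows read under both strategies
reps = 1_000
flat = population.ravel()

# (a) true random sampling of individual rows
row_estimates = [N * rng.choice(flat, size=rows_to_read, replace=False).mean()
                 for _ in range(reps)]

# (b) grabbing whole blocks and scaling the block-sample mean up to a sum
block_estimates = []
for _ in range(reps):
    picked = rng.choice(n_blocks, size=rows_to_read // block_size, replace=False)
    block_estimates.append(N * population[picked].mean())

print("true sum:         ", population.sum())
print("row-sampling SE:  ", np.std(row_estimates))
print("block-sampling SE:", np.std(block_estimates))
# The ratio of the two variances is (approximately) the design effect,
# 1 + (b - 1) * rho, with rho the intraclass correlation and b the block size.
```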
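And the check in the last bullet could look something like this: repeat the estimate many times against a small stand-in population and count how often the nominal 95% interval covers the true sum. I'm using plain random sampling here to keep the sketch short; the same harness applies to the block-based estimators.

```python
import numpy as np

rng = np.random.default_rng(2)

# Small stand-in population; in the paper this role is played by the 40-billion-row table.
N = 200_000
population = rng.gamma(shape=2.0, scale=500.0, size=N)
true_sum = population.sum()

n, z, reps, hits = 5_000, 1.96, 1_000, 0
for _ in range(reps):
    sample = rng.choice(population, size=n, replace=False)
    total_hat = N * sample.mean()
    se = N * sample.std(ddof=1) / np.sqrt(n) * np.sqrt(1.0 - n / N)   # FPC again
    hits += abs(total_hat - true_sum) <= z * se

print(f"empirical coverage of the nominal 95% interval: {hits / reps:.3f}")
```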

