Wednesday, February 21, 2018

Fixing the methods

So, how do we actually go about "fixing" the three approaches to evaluating our estimator in the correlated case? There are many ways of course, but here's what I'm going to try.

In the case of simple block sampling, the problem is that the distribution of the sample variance doesn't match the χ2 distribution. That means the studentized estimator doesn't follow the t-distribution. So, we need a way to estimate the actual distribution of the variance so we can then estimate the distribution of the estimator. (Lots of estimation!) To estimate the variance distribution, I am going to use kernel density estimation (KDE). This is a technique where the data points are used as the centers of some kernel density; you then average all these kernels together to get an estimate of the real density. It's like a histogram, except it can be made much smoother. (Actually, a histogram is a special case of KDE where the kernel maps the observation to the center of the histogram bucket containing it with probability 1.) The smoothness is useful if you want to evaluate the probability of an arbitrary interval. If you don't need that, a histogram works pretty well.
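Here's a minimal sketch of the KDE part, using scipy's off-the-shelf Gaussian-kernel estimator as a stand-in. The hit rates are made up and the Gaussian kernel is just a placeholder (more on the Beta kernel below); the point is only that the smoothed density lets you evaluate arbitrary intervals.

```python
# Minimal KDE sketch (not the simulator code): estimate the density of a
# per-block statistic from observed values and evaluate the probability of an
# arbitrary interval, which a histogram handles less gracefully.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
observed = rng.beta(2, 20, size=200)      # hypothetical per-block hit rates

kde = gaussian_kde(observed)              # one Gaussian kernel per observation
grid = np.linspace(0, 1, 501)
density = kde(grid)                       # smooth density estimate on a grid

# Probability mass over an arbitrary interval, e.g. P(0.05 < rate < 0.15)
print(kde.integrate_box_1d(0.05, 0.15))
```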

Since I already have a derivation of the variance as a function of the hit rate, I'm going to use some variant of the Beta distribution as my kernel to estimate not the variance, but the hit rate. That can be run through the variance formula to get a distribution on the block variance, which in turn gives an estimate of the distribution of Sr, the estimator of the total sum. There's actually not a lot of math involved in all that. It's simpler to just resort to numeric methods for the integrals.
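To make that concrete, here's a hedged sketch: smooth the observed per-block hit rates with Beta kernels, then push draws through the variance formula numerically instead of doing the change of variables by hand. The block_variance() function is just a placeholder for the real derivation, and the kernel concentration parameter is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
observed_rates = rng.beta(2, 20, size=50)     # hypothetical per-block hit rates
concentration = 50.0                          # Beta-kernel smoothing parameter (assumed)

def block_variance(p):
    # Placeholder for the derived variance-as-a-function-of-hit-rate formula.
    return p * (1.0 - p)

# Sampling from the Beta-kernel mixture: pick an observed rate, then draw from
# the Beta kernel centered on it.
picks = rng.integers(0, observed_rates.size, size=10_000)
a = observed_rates[picks] * concentration + 1.0
b = (1.0 - observed_rates[picks]) * concentration + 1.0
rate_draws = rng.beta(a, b)

# Numeric push-forward through the variance formula; these draws approximate
# the block-variance distribution without any closed-form integral.
variance_draws = block_variance(rate_draws)
print(np.percentile(variance_draws, [2.5, 50.0, 97.5]))
```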

Next up is bootstrap. This one is fairly easy given the above work. Rather than bother computing the kernel density, just resample using the raw hit rates. This is way more efficient than traditional oversampling because we're only saving the sufficient statistics (hit rate and the first two moments of the observations) from each partition and how many rows were in each partition. We can then generate a multinomial vector with dimension equal to the number of partitions sampled and generate a random sum for each element of the vector rather than generating millions of individual observations and adding them up.
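A sketch of what I have in mind is below. The per-partition statistics are made up, and using a normal approximation to generate each block sum from its saved moments is my simplifying assumption, not a settled choice.

```python
# Bootstrap from sufficient statistics: keep only each partition's row count,
# hit rate, and first two moments, then simulate block sums from those instead
# of resampling millions of individual rows.
import numpy as np

rng = np.random.default_rng(2)

n_partitions = 40
rows = rng.integers(50_000, 150_000, size=n_partitions)   # rows per partition
hit_rate = rng.beta(2, 20, size=n_partitions)              # per-partition hit rate
m1 = rng.uniform(5, 15, size=n_partitions)                 # mean of a hit's value
m2 = m1**2 + rng.uniform(1, 4, size=n_partitions)          # second moment

def bootstrap_totals(n_boot=5_000):
    totals = np.empty(n_boot)
    probs = np.full(n_partitions, 1.0 / n_partitions)
    for i in range(n_boot):
        # Resample partitions with replacement via a single multinomial draw.
        counts = rng.multinomial(n_partitions, probs)
        # Simulate each partition's block sum from its sufficient statistics
        # (normal approximation to the sum of hits -- an assumption).
        hits = rng.binomial(rows, hit_rate)
        mean_sum = hits * m1
        var_sum = hits * (m2 - m1**2)
        block_sums = rng.normal(mean_sum, np.sqrt(var_sum))
        totals[i] = np.dot(counts, block_sums)
    return totals

print(np.percentile(bootstrap_totals(), [2.5, 97.5]))   # bootstrap interval
```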

Finally, we have the MCMC idea, which is basically where the Dirichlet stuff comes in. We gather the same sufficient statistics as above and run them through a Dirichlet process to get a distribution on the distribution for each partition. Yes, a distribution of distributions. One could derive the distribution of the estimator from that, but here's the wrinkle I want to throw in to tie all this stuff together: use that distribution to run the bootstrap from above. So, you have a Dirichlet process to boil the data down to a non-parametric distribution, and then you use a bootstrap process to generate a distribution of the estimator. Meanwhile, the estimator itself is still the leveraged sample sum.
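Roughly, the combination might look something like this. With a small concentration parameter, a posterior draw from the Dirichlet process is approximately a Dirichlet-weighted mixture of point masses at the observed hit rates (the Bayesian bootstrap limit); I'm ignoring the base measure and drastically simplifying the estimator here, so treat this strictly as a sketch.

```python
# Outer loop: draw a distribution over partitions from the (approximate) DP
# posterior. Inner step: run the sufficient-statistics bootstrap under that
# drawn distribution to get a draw of the estimator.
import numpy as np

rng = np.random.default_rng(3)

observed_rates = rng.beta(2, 20, size=40)   # hypothetical per-partition hit rates
rows_per_partition = 100_000                # assumed constant for simplicity
value_mean = 10.0                           # assumed mean value of a hit

n_outer = 2_000
n_partitions = observed_rates.size
totals = np.empty(n_outer)

for i in range(n_outer):
    # One posterior draw of the hit-rate distribution: Dirichlet weights over
    # the observed rates (base measure ignored -- an assumption).
    weights = rng.dirichlet(np.ones(n_partitions))
    # Bootstrap step: resample partitions according to the drawn distribution,
    # then form a (simplified) scaled sample sum.
    picks = rng.choice(n_partitions, size=n_partitions, p=weights)
    hits = rng.binomial(rows_per_partition, observed_rates[picks])
    totals[i] = hits.sum() * value_mean

print(np.percentile(totals, [2.5, 50.0, 97.5]))
```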

Tomorrow, I'm going to modify the simulator to produce all this stuff. If it works, I'm going to have A LOT of writing to do this weekend.
