Tuesday, January 30, 2018

Who cares?

All through this sampling project, there's been an underlying concern: who cares? Sure, sampling large databases by block is a great thing. So do it. Why all the fuss? Just sample by block, use the block sum sample variance to estimate the spread and you're done. What's more to say?
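For concreteness, here is a minimal sketch of that "just sample by block" recipe, assuming blocks are drawn uniformly without replacement and each block's sum is already known. The function name `estimate_total_from_blocks` and its parameters are made up for illustration; this is not code from the project.

```python
import math
import random
import statistics


def estimate_total_from_blocks(block_sums, n_sampled, rng):
    """Estimate the grand total from a simple random sample of block sums.

    Hypothetical helper for illustration. Assumes blocks are sampled
    uniformly without replacement and n_sampled >= 2.
    """
    n_blocks = len(block_sums)
    sample = rng.sample(block_sums, n_sampled)  # draw blocks without replacement

    mean = statistics.fmean(sample)
    var = statistics.variance(sample)  # sample variance, n - 1 denominator

    # Scale the mean block sum up to the whole population of blocks.
    total_hat = n_blocks * mean

    # Standard error of the total, with the finite population correction.
    fpc = 1.0 - n_sampled / n_blocks
    std_err = n_blocks * math.sqrt(fpc * var / n_sampled)
    return total_hat, std_err


if __name__ == "__main__":
    rng = random.Random(2018)
    # Made-up block sums standing in for a real database.
    blocks = [rng.gauss(100.0, 15.0) for _ in range(1000)]
    total_hat, std_err = estimate_total_from_blocks(blocks, 50, rng)
    print(f"estimate: {total_hat:.1f} +/- {1.96 * std_err:.1f}")
    print(f"actual:   {sum(blocks):.1f}")
```

Note that the only thing this estimator ever sees is one number per sampled block.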

We've been trying to show that there is more to say and, while I knew that to be true, I was not finding a good way to articulate it. Showing a bunch of convergence graphs is helpful, but not a substitute for a concise argument.

My mind wandered back to the statement that put me onto this line of research before I even went back to school. It was made by one of our senior actuaries in response to my suggestion of using sampling: "We tried that and we kept missing important stuff."

Why? The actuaries all got A's in statistics. They know how to construct a confidence interval from a sample. Why was it not working? My original thought (and still my belief) was that the sampling wasn't adequately accounting for correlation in the data. But the more I worked with BLB and BMH, the more they started looking like really expensive ways to arrive at the sample block variance, which would put us right back where we started. Why did I think any of this was going to be better? Why should anybody care?

Then, while grinding out the exact formula for the block sum variance this morning, I realized why it wasn't working. And it's so simple that it's a little embarrassing. The sample block sums are not sufficient statistics for the total sum (they are in the iid case, but not in general). When observations are correlated, rolling things up to blocks throws away information that turns out to be really important. We need some other statistics that summarize the data without losing information.
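For reference, the textbook identity for the variance of a sum shows where the correlation comes in. This is the general identity, not necessarily the exact formula worked out for the paper; the symbols $S$, $X_i$ and $m$ (block sum, observations in the block, block size) are introduced here only for illustration.

```latex
S = \sum_{i=1}^{m} X_i
\qquad\Longrightarrow\qquad
\operatorname{Var}(S)
  = \sum_{i=1}^{m} \operatorname{Var}(X_i)
  + 2 \sum_{1 \le i < j \le m} \operatorname{Cov}(X_i, X_j)
```

When the observations are independent, the covariance terms vanish and only the first sum remains. When they are correlated, a single block sum mixes both pieces together, and the separate contributions cannot be read back out of $S$ alone.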

Such statistics are readily available; the trick is then using them to reverse engineer the overall distribution. Straight random sampling is not well suited to this task. BLB and BMH are (particularly BMH, which accommodates arbitrarily complex models). So, now I have an easily stated justification for breaking out methods that at first glance appear to be overkill.

This gives a cognitive flow to the theory section that was completely missing up until now. There's still a fair bit of writing to go, but the path is clear and that's usually half the battle.
