Monday, January 22, 2018

Semi-annual outline update

Don't bother saying you've heard this one before. It seems that reorganizing the theory section twice a year is the way this thing is going to roll. Here's my outline. I stress my. This is not a bipartisan effort just yet, and I expect my adviser will want to change it (mainly because it doesn't include the very thing we said it should when we met on Thursday; sounds a lot like the Senate right now). Nonetheless, for those of us trying to actually reach solutions rather than grandstanding for the public, having a draft in hand is better than starting from scratch.

Introduction doesn't change much. My adviser suggested a slight rearrangement of the ideas, which is easy to accommodate.

Theory section.
1. Formulation assuming iid observations - Lay out the basic notation, introduce the query function, make some quick comments about how the sum is approximately Normal (straight CLT), and derive the exact distribution of the estimator from a random block sample. Might throw in a graph from simulated results (a rough sketch of that simulation follows this list). Argue that using physical blocks in this case as a selection procedure for BMH and BLB sub-samples yields the same results as true random sampling.

2. Formulation assuming a correlated mixture distribution on I_Q - Add the notation for partitioning and note how partitions tend to be grouped in contiguous physical storage. Derive the blocksum variance. Demonstrate quick convergence to normality of the estimator even with heavily skewed blocksum distributions. Derive the exact distribution of the estimator from a semi-random block sample, using radix sampling to avoid correlation between adjacent blocks (a sketch of what I mean by radix sampling also follows the list). Show how radix sampling can also save the BMH and BLB methods.

3. Formulation assuming a true heterogeneous mixture (both the distribution of X_i and I_Q depend on the partition pi). This is where I believe my adviser wants to insert some methods based on Dirichlet priors. I'm only somewhat familiar with the techniques, so I'm going to have to study up on this. Again, get the exact distribution of the estimator from the semi-random block sample. Also demonstrate how BMH and BLB hold up under radix sampling.
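
Since I'm planning on simulated results anyway, here's roughly what the simulation for the first formulation looks like. Everything in it (block counts, the exponential values, the stand-in query) is a placeholder I made up for this sketch; the point is just the sampling distribution of the estimator built from a simple random sample of blocks.

```python
import numpy as np

rng = np.random.default_rng(2018)

N_BLOCKS, BLOCK_SIZE = 1_000, 500       # hypothetical physical layout
SAMPLE_BLOCKS, N_REPS = 50, 5_000
P_QUERY = 0.1                           # stand-in query: keep a row w.p. 0.1 and sum its value

# one fixed data set of iid observations, laid out in physical blocks
x = rng.exponential(scale=2.0, size=(N_BLOCKS, BLOCK_SIZE))
keep = rng.random((N_BLOCKS, BLOCK_SIZE)) < P_QUERY
block_sums = (x * keep).sum(axis=1)     # query blocksums
true_total = block_sums.sum()

def estimate():
    # estimator from a simple random sample of blocks, scaled up to the full data
    chosen = rng.choice(N_BLOCKS, size=SAMPLE_BLOCKS, replace=False)
    return block_sums[chosen].mean() * N_BLOCKS

est = np.array([estimate() for _ in range(N_REPS)])
print(true_total, est.mean(), est.std())   # compare against a Normal fit / QQ plot for the graph
```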

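And while I'm at it, here's the gist of what I mean by radix sampling, at least as I've sketched it so far: fix a stride, draw a random residue class (offset), and only sample blocks from that class, so no two sampled blocks are ever physically adjacent. The stride and sizes below are placeholders, not anything final.

```python
import numpy as np

rng = np.random.default_rng(2018)

def radix_block_sample(n_blocks, stride, n_sample):
    """Semi-random block sample: pick a random residue class modulo `stride`,
    then take a simple random sample of blocks within that class. With
    stride >= 2, no two selected blocks are physically adjacent, which is the
    whole point of dodging the block-to-block correlation."""
    offset = rng.integers(stride)                      # random residue class
    candidates = np.arange(offset, n_blocks, stride)   # every stride-th block
    return rng.choice(candidates, size=n_sample, replace=False)

blocks = radix_block_sample(n_blocks=1_000, stride=16, n_sample=40)
# `blocks` then feeds the same blocksum estimator as the iid case, or serves as
# the block-selection step for the BLB / BMH sub-samples
```
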
The claims in the theory sections will be backed by simulated results.
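
For the third formulation, the data-generating side of those simulations will probably look something like the sketch below: partition weights drawn from a Dirichlet, with both the value distribution and the query-inclusion rate varying by partition. To be clear, this only fakes the data; it says nothing about the Dirichlet-prior estimation methods my adviser has in mind, which I still need to study.

```python
import numpy as np

rng = np.random.default_rng(2018)

N_PARTITIONS, N_ROWS = 20, 500_000

# partition weights from a Dirichlet; partition sizes follow the weights
weights = rng.dirichlet(np.full(N_PARTITIONS, 0.5))
sizes = rng.multinomial(N_ROWS, weights)

# both the distribution of X_i and the query indicator I_Q depend on the partition
part_scale = rng.gamma(2.0, 3.0, size=N_PARTITIONS)   # value scale per partition
part_qrate = rng.beta(1.0, 9.0, size=N_PARTITIONS)    # query-inclusion rate per partition

chunks = []
for n, scale, q in zip(sizes, part_scale, part_qrate):
    x = rng.exponential(scale=scale, size=n)           # X_i given the partition
    i_q = rng.random(n) < q                            # I_Q given the partition
    chunks.append(x * i_q)
data = np.concatenate(chunks)   # partitions stored contiguously, like on disk
```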

Application section.
We bust out the big data set and run it against the third formulation.

Discussion and future research.
Probably best to actually get the results before writing too much discussion, but I expect we're going to find that BLB with radix sampling does pretty well. BMH should also do fine, but it's computationally expensive; enough so that we have to factor that in (generally, we are claiming that I/O dominates run time). Random block sampling won't work particularly well without stratification, but it's still worth showing, if for no other reason than to point out why we're not just pursuing the simple path. Future research is focused on imposing organizational constraints on the physical data so that the methods converge faster. The two obvious choices (both of which I have conveniently already coded) are stratification and dynamic partitioning, which ratchets the correlation up to the point where each block contains exactly one partition. Both of these allow us to ignore a lot of blocks because we know from the outset that they won't have much impact on the estimator (a toy sketch of that block skipping is below).
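
To make the block-skipping point concrete, here's a toy version (the names and the cutoff are placeholders, not the real code): keep a cheap per-block summary as metadata, drop any block whose summary says it can't move the estimator, and sample from what's left.

```python
import numpy as np

rng = np.random.default_rng(2018)

N_BLOCKS = 10_000
# cheap per-block summary kept as metadata, e.g. an upper bound on the blocksum
block_bound = rng.exponential(scale=50.0, size=N_BLOCKS)   # stand-in metadata
CUTOFF = 5.0                                               # placeholder threshold

# ignore blocks that we know up front can't contribute much to the estimator
eligible = np.flatnonzero(block_bound > CUTOFF)
sample = rng.choice(eligible, size=min(100, eligible.size), replace=False)
print(f"skipped {N_BLOCKS - eligible.size} blocks, sampled {sample.size} of the rest")
```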

This isn't the total re-write that it appears to be. It's more a shift in focus. Before, we had BLB and BMH along just for the sake of comparison. Now they are actually the subjects of our research. We're showing how the correlated data messes them up, but also showing how to adjust for that. Most of the code is written, and I've been pretty good about writing it in a way that lets me make the adjustments quickly. As for the paper itself, good writing always takes time but, fortunately, I actually have some time right now. With all our initiatives at work finally in production (the last one being the source of the 40-billion-row data set that I'm going to use), work life is moving back to reasonably predictable 40-hour weeks.
