Ultramarathons are a good time to sort things out. I ran the Jefferson County Endurance Trials today. For what it's worth, I won my age group. I'll get to that with a race report sometime in the near future.
Meanwhile, I did formalize my approach to the sensitivity analysis.
Since I'm already building a distribution on the hit rate, ρ, as a kernel mixture, the easy thing would be to simply add to that kernel a weighted distribution derived from a fictitious block. That of course poses the question: what distribution, and what weighting?
The root question we're trying to answer is "what if what we haven't sampled is way different from what we have?" Put another way, how much does our estimate change if the very next block could be pretty much anything?
Since ρ is itself a probability, it's bounded between zero and one. For each sampled partition, we generate a Beta distribution weighted by the number of rows in that partition. The distribution of ρ for our "out of the blue" block is uniform(0,1): it could be anything that passes as a probability. So, chop the interval into several buckets and see what happens if the next block has a hit rate coming from one of those buckets; the next block has to come from one of them. With m buckets, just create m estimates of the variance, where each one picks a bucket and adds a uniform distribution over that bucket, weighted by the block size, to the composite distribution.
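To make that concrete, here's a minimal sketch in Python. It assumes each sampled partition contributes a Beta(hits+1, misses+1) component weighted by its row count (those Beta parameters and the function names are my own placeholders, not anything fixed above), and it uses the standard fact that a mixture's variance is the weighted mean of (variance + mean²) over components minus the square of the mixture mean.

```python
import numpy as np

def mixture_variance(means, variances, weights):
    """Variance of a weighted mixture of components."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    m = np.asarray(means, dtype=float)
    v = np.asarray(variances, dtype=float)
    grand_mean = np.dot(w, m)
    # E[X^2] - (E[X])^2, where E[X^2] per component = var + mean^2
    return np.dot(w, v + m ** 2) - grand_mean ** 2

def bucket_sensitivity(hits, rows, next_block_rows, n_buckets=10):
    """For each of n_buckets sub-intervals of (0,1), add a uniform
    component for the fictitious next block, weighted by its row count,
    and return the variance of each composite distribution."""
    hits = np.asarray(hits, dtype=float)
    rows = np.asarray(rows, dtype=float)
    a, b = hits + 1.0, (rows - hits) + 1.0          # assumed Beta parameters
    beta_means = a / (a + b)
    beta_vars = a * b / ((a + b) ** 2 * (a + b + 1.0))

    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        u_mean = 0.5 * (lo + hi)                    # uniform(lo, hi) moments
        u_var = (hi - lo) ** 2 / 12.0
        means = np.append(beta_means, u_mean)
        variances = np.append(beta_vars, u_var)
        weights = np.append(rows, next_block_rows)  # weight by row counts
        out.append(mixture_variance(means, variances, weights))
    return np.array(out)
```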
If the sample is stable, these m variances will all be pretty similar. If not, that means the estimate is very sensitive to the next block sampled, and it would be a really good idea to sample that block before drawing any conclusions.
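Continuing the sketch above, the stability check is just a comparison of the m variances; the input numbers and the "small relative spread" cutoff here are purely illustrative.

```python
variances = bucket_sensitivity(hits=[120, 45, 300],
                               rows=[1000, 400, 2500],
                               next_block_rows=800,
                               n_buckets=10)
spread = variances.max() - variances.min()
if spread / variances.mean() < 0.1:   # illustrative threshold, not a rule
    print("estimate looks stable; the next block can't move it much")
else:
    print("variance is sensitive to the next block; go sample it")
```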