Remember, I said this process is easy to model. I never said it was easy to analyze. So, I'm going to start with the easy part: the model.
We have a series of observations {(Xi, Yi)}, i = 1...n. The distribution of Xi depends on Yi. More importantly, Yi also determines whether we should even be adding the value of Xi to our query.
We also have a partitioning function which divides the observations into groups, where any two observations in the same partition tend to be closer in distribution and about equally likely to be included in a given query. This matters because, rather than being randomly scattered throughout the data set, observations in the same partition tend to be co-located in the same physical block of storage. In fact, they are likely to be adjacent. This last bit is what allows us to model the observations as a Markov chain: if our current observation is from partition p, that's all we need to know to assign a probability that the next one is in p as well. If we knew only that they were in the same general vicinity, we'd need to either keep a longer history or sort the block by partition (an expensive operation) to get them adjacent. If the next observation isn't in p, then all bets are off; it could be in any of the others.
So, the simplest form of the model is just the sequence of partitions {pi}, i = 1...n. The transition matrix is pretty simple: the main diagonal is the probability that the partition doesn't change, and every other entry is the probability that it does change, divided evenly among the remaining partitions.
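To make that concrete, here's a quick Python sketch. The stay probability, and names like transition_matrix and simulate_partitions, are mine for illustration, not anything standard:

```python
import numpy as np

def transition_matrix(k, stay_prob):
    """Transition matrix as described above: the diagonal is the
    probability the partition doesn't change; the leftover probability
    is split evenly across the other k - 1 partitions."""
    P = np.full((k, k), (1.0 - stay_prob) / (k - 1))
    np.fill_diagonal(P, stay_prob)
    return P

def simulate_partitions(P, n, seed=None):
    """Simulate the partition sequence {pi} for n observations."""
    rng = np.random.default_rng(seed)
    k = P.shape[0]
    seq = np.empty(n, dtype=int)
    seq[0] = rng.integers(k)
    for i in range(1, n):
        # Markov property: the next partition depends only on the current one.
        seq[i] = rng.choice(k, p=P[seq[i - 1]])
    return seq

P = transition_matrix(4, 0.9)  # 4 partitions, 90% chance of staying put
print(P)
print(simulate_partitions(P, 20, seed=42))
```

Each row sums to 1, which is the basic sanity check for any transition matrix.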
The problem with this model is that it only tells us about the sequence of partitions. We don't really care about partitions; we want to know things about the sequence {Xi·IQ(Yi)}, where IQ(Yi) is the indicator that observation i actually satisfies the query. Specifically, we want to estimate the first and second moments so we have a point estimate and a confidence interval for the sum. This is where the model can get very complex.
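In code, the target quantity looks something like this (in_query is a hypothetical stand-in for the predicate behind IQ):

```python
def query_sum(xs, ys, in_query):
    """The quantity we want to estimate: the sum of Xi * IQ(Yi) over the
    full data set. in_query(y) returns True when the row passes the
    query predicate, i.e. it plays the role of the indicator IQ."""
    return sum(x for x, y in zip(xs, ys) if in_query(y))
```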
For starters, we're going to assume that the distribution of Xi doesn't change that much from one partition to the next. We're also going to assume that the conditional distribution of Xi given IQ(Yi)=1 is relatively constant. I don't think either of those is true, but I'm willing to go with them until the empirical evidence tells me that violating those assumptions is a big problem.
That leaves us with simply estimating, by partition, how many rows get kicked out of the query. Then we look at estimating the variance, in light of the fact that the hit rate for our block may or may not be representative of the overall hit rate. Once the hit rate is estimated, we can combine it with a simple sample variance on the included rows to assess the quality of our estimate of the total sum.
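Here's a rough sketch of that combination step, under the naive assumption that the sampled rows are i.i.d., which is exactly the assumption the Markov structure breaks, so treat the interval as optimistic. The names are mine:

```python
import numpy as np

def estimate_sum(sample_xs, sample_in, n_total, z=1.96):
    """Point estimate and rough confidence interval for the total sum,
    from one sampled block.

    sample_xs -- X values in the sampled block
    sample_in -- IQ(Yi) for each sampled row (True/False)
    n_total   -- total number of rows in the data set
    """
    xs = np.asarray(sample_xs, dtype=float)
    inc = np.asarray(sample_in, dtype=bool)
    m = xs.size

    # Each row contributes Xi * IQ(Yi); the mean of these contributions
    # already folds the hit rate and the conditional mean of the
    # included rows into one number.
    contrib = xs * inc

    # Scale the block's mean contribution up to the whole data set.
    point = n_total * contrib.mean()

    # Naive i.i.d. standard error; the Markov dependence between
    # adjacent rows likely makes this too narrow.
    se = n_total * contrib.std(ddof=1) / np.sqrt(m)
    return point, (point - z * se, point + z * se)

# Example: 100 sampled rows out of 10,000, with roughly a 30% hit rate.
rng = np.random.default_rng(0)
xs = rng.normal(10.0, 2.0, size=100)
inc = rng.random(100) < 0.3
print(estimate_sum(xs, inc, n_total=10_000))
```

The real work, per the above, is replacing that naive standard error with one that accounts for the block's hit rate possibly not matching the overall hit rate.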
That's still not particularly simple, but at least it's reasonably well defined. I've got a three-day weekend to make something of it.