Tuesday, May 1, 2018

Targeted solution

So, if you've got data where the query relevance is highly correlated by partition, but it's otherwise uncorrelated, I've got the sampling algorithm for you. The "rho"-kernel algorithm works great in that case.

And not anytime else.


The results are scored as the number of times out of 1000 tries that the sample confidence interval didn't cover the real value. Alpha risk was set at 5%, so there should be around 50 misses per 1000.
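For a sense of what "around 50" means, here's a minimal sketch of that kind of coverage check. It is not the rho-kernel sampler; it just builds an ordinary 95% interval for the mean of independent normal draws and counts the misses. The function names, the sample size of 50, and the seed are all assumptions made for illustration.

```python
import math
import random
import statistics

def count_misses(trials=1000, n=50, true_mean=0.0, sigma=1.0, z=1.96, seed=0):
    """Count how often a nominal 95% interval for the mean fails to cover true_mean."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        mean = statistics.fmean(sample)
        stderr = statistics.stdev(sample) / math.sqrt(n)
        # A miss is a trial where the true value falls outside mean +/- z * stderr.
        if abs(mean - true_mean) > z * stderr:
            misses += 1
    return misses

def expected_miss_band(trials=1000, alpha=0.05, z=1.96):
    """Normal-approximation band for the miss count if the true miss rate is alpha."""
    expected = trials * alpha
    sd = math.sqrt(trials * alpha * (1 - alpha))
    return expected - z * sd, expected + z * sd

low, high = expected_miss_band()
print(f"misses: {count_misses()}, plausible range: {low:.1f} to {high:.1f}")
```

The miss count is binomial, so at 1000 tries and a 5% rate its standard deviation is about 6.9. Counts anywhere from the high 30s to the low 60s are unremarkable; counts far outside that range mean the interval is not doing what it claims.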

I expected the fully correlated set to generate a lot more misses. The rho-kernel algorithm makes no attempt to adjust for that type of correlation, so it would be alarming if the results were good for that case. However, the fact that the uncorrelated data never produced a miss was a bit of a shock. Why does a correlation coefficient of zero break the algorithm? I had to think on that one for a while.

What's happening is that the kernel will never assign zero variance to rho. But if the variance really is zero, the only variability that will be observed is random noise. That noise is already baked into the variance computation, and now it also gets attributed to rho. So, it basically gets double counted.
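A toy version of the double counting shows why the misses dry up. This is not the rho-kernel computation; it takes the same plain interval for a mean and simply counts the sampling variance twice, which is an assumption standing in for the extra variance the kernel attributes to rho. All names and parameters are made up for the illustration.

```python
import math
import random
import statistics

def misses_with_inflation(inflation, trials=1000, n=50, z=1.96, seed=0):
    """Count interval misses when the variance estimate is multiplied by `inflation`."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = statistics.fmean(sample)
        variance_of_mean = statistics.variance(sample) / n
        # inflation=2.0 mimics counting the same noise twice.
        half_width = z * math.sqrt(inflation * variance_of_mean)
        if abs(mean) > half_width:  # the true mean is 0.0
            misses += 1
    return misses

nominal = misses_with_inflation(1.0)
doubled = misses_with_inflation(2.0)
print(f"nominal: {nominal} misses, double counted: {doubled} misses")
```

Doubling the variance widens the interval by a factor of the square root of 2, so the effective cutoff moves from 1.96 to roughly 2.77 standard errors, and the miss rate falls from about 5% to well under 1%. An interval that almost never misses is just as miscalibrated as one that misses too often; it's wider than it needs to be.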

Unfortunately, there's no good way to know from the data sampled whether you're in a correlated or uncorrelated situation. You could put some heuristics around it, like, if the differences between blocks were above some threshold, but it would have no mathematical basis. I'm trying very hard to avoid heuristics in this paper.

So, I just have to conclude that the rho-sampler sucks. That's not really a bad result. Eliminating inferior solutions is part of finding the good ones. I do have the same problem with the full kernel approach. In that case though, the effect is smaller and it's a bit easier to adjust for it. More on that to come.

