OK, let's try this variance derivation again, this time remembering that probabilities need to sum to 1.
Recall that
μ = (nbk/2)∫θP(θ)dθ = E(X|block contains data)E(θ)
and that θ ~ Beta(a, b) where a is the number of blocks sampled that contained rows for the query and b is the number of blocks sampled that didn't.
Var(X|θ) = σ(θ)2 = E(|X - μ|2) = (1-θ)μ2 + θ[(nbk)2/3 - nbk μ + μ2] = (1-θ)p + θq
Var(X) = σ2 = ∫σ(θ)2P(θ)dθ
OK, we've been here before. Let's do the integrals right this time:
σ2 = ∫σ(θ)2P(θ) dθ = ∫[(1-θ)p + θq]P(θ) dθ
= ∫[(1-θ)p + θq]θa-1(1 - θ)b-1 dθ / B(a,b)
(forgot about that denominator last time - it's kinda important, especially when a and b start getting big.
= ∫(1-θ)bpθa-1 + θaq(1 - θ)b-1 dθ / B(a,b)
= [p∫θa-1(1-θ)b dθ + q∫θa(1 - θ)b-1 dθ] / B(a,b)
= [pΒ(a,b+1) + qB(a+1,b)] / B(a,b)
Here's the best part. Look what happens when you sub in B(a,b) = Γ(a)Γ(b) / Γ(a+b) and remember that Γ(z+1) / Γ(z) = z for all z ≥ 1:
= [pΓ(a)Γ(b+1) / Γ(a+b+1) + qΓ(a+1)Γ(b) / Γ(a+b+1)] / [Γ(a)Γ(b) / Γ(a+b)]
= (pb + qa) / (a + b)
The real world never comes out that cleanly. But, hey, you can check out the convergence graph yourself. Notice how the confidence bounds wander right up to the true value, but nearly always contain it. It really is that slick. There's much more testing to be done. My guess is that certain types of queries will obey these limits better than others. But, I don't think many will be too far off. The theory is sound and the implementation is correct. At any rate, it's coded and it's working on my test data. That's good enough for me to write this thing up.
No comments:
Post a Comment