Wednesday, March 30, 2016

Closed form estimates

There's a reason Bayesian's have been using the Beta distribution for Binomial and Bernoulli priors for so long: it does tie up into a closed form rather nicely. It may turn out to be not the best choice, but if it is workable, I don't have to do any numerical estimation to get the variance on CISS results.

Recall from yesterday that, if X is a random variable representing the query sum from a block:

E(X|θ) = μ(θ) = θnbk/2

E(X) = μ = ∫μ(θ)P(θ)dθ    where P(θ) is the posterior on θ

Factoring out the constant from μ(θ), this simply becomes:

μ = (nbk/2)∫θP(θ)dθ = E(X|block contains data)E(θ)

Well, duh. As long as E(X|block contains data) is independent of θ, that's always going to be true. Unfortunately, that's a flimsy assumption. If the query attributes are highly correlated within a block, θ drops and E(X|block contains data) rises. But, we're deferring that for the moment. We will have to revisit it when we get to dynamic partitioning because then we'll be intentionally increasing the negative correlation between the two. (We may have to look at it sooner; I don't trust any assumptions until I run real data past them).

At any rate, as long as we're using a distribution with a closed-form mean (Beta certainly qualifies), we can move on to the variance:

Var(X|θ) = σ(θ)2 = E(|X - μ|2) = (1-θ)μ2 + θ[(nbk)2/3 - nbk μ + μ2]

Var(X) = σ2 = ∫σ(θ)2P(θ)dθ

Before we go plowing ahead with the integration, let's simplify our symbols a bit by letting p = μ2 and q = (nbk)2/3 - nbk μ + μ2. This helps us see that these terms are just constant coefficients and can be moved in and out of the integral giving:

σ(θ)2 = (1-θ)p + θq

σ2 = ∫σ(θ)2P(θ) dθ = ∫[(1-θ)p + θq]P(θ) dθ

Substituting in the density for θ ~ Beta(a,b) gives

σ2 = ∫[(1-θ)p + θq]θa-1(1 - θ)b-1 dθ

   = ∫(1-θ)bpθa-1 + θaq(1 - θ)b-1 dθ

   = pθa-1(1-θ)b dθ + qθa(1 - θ)b-1 dθ

   = pΒ(a,b+1) + qB(a+1,b)   where B(a,b) is the Beta function.

Now, before you accuse me of cheating and simply burying the integrals inside a predefined function, recall that:

B(a,b) = Γ(a)Γ(b) / Γ(a+b)

and that the Γ function is simply an extension of factorial to the real number line:

Γ(i) = (i-1)!

So, if a and b are integer:

B(a,b) = (a-1)!(b-1)! / (a+b-1)!

and

σ2 = p[(a-1)!b! / ((a+b)!] + q[a!(b-1)! / (a+b)!]

OK, factorial isn't technically closed form, but since we're dealing with block counts rather than row counts, we're not talking about massive looping to get answers. In fact, while it's no problem to set the prior so I only get integer parameters a and b on the posterior, any decent math software library has a closed form approximation of the Gamma function, so I could just call that.

No comments:

Post a Comment