Saturday, December 16, 2017

Insufficient

One of the things I discussed with my adviser yesterday was whether we were dealing with sufficient statistics (better yet, minimal sufficient statistics). Because the individual rows are correlated, it is more convenient to work with block sums and use the (mostly true) assumption that block sums are independent.

If individual rows follow a distribution with pdf f(x) and mean μ, and they are included in the query with probability λ, then we have two parameters to estimate (λ, μ). Knowing both of these (along with f) gives us the distribution of the sum. The problem is, the block sum is not a sufficient statistic for these. If we get a block sum of 50 on a block of 100 rows, there's no way to know if that was (λ = 1, μ = 0.5) or (λ = 0.5, μ = 1). Examining the individual rows would give us more insight into that, so the block sum is not sufficient.

Of course, the block sum is sufficient for the value we really want, the product of λ and μ, which is the expected value of an individual row. That can then be multiplied by the total number of rows to estimate the population sum.

However, if we want a confidence interval on that sum, we also need to know the variance, and that very much depends on the individual parameters, λ and μ.
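
To make that concrete, here is a rough sketch of the two cases from the block-of-100 example. It rests on assumptions that are mine, not established above: each row contributes I * X, where I is a Bernoulli(λ) inclusion indicator independent of the row value X, X has mean μ and variance sigma2 (set to 0 here for simplicity), and rows within a block are independent, which the real data does not satisfy. The function names are made up for illustration.

```python
def row_moments(lam, mu, sigma2=0.0):
    """Mean and variance of one row's contribution to a block sum.

    Assumption: the row contributes I * X, where I ~ Bernoulli(lam) is the
    inclusion indicator and X (independent of I) has mean mu and variance
    sigma2.
    """
    mean = lam * mu
    second_moment = lam * (sigma2 + mu ** 2)  # E[(I*X)^2] = lam * E[X^2]
    return mean, second_moment - mean ** 2


def block_sum_moments(n_rows, lam, mu, sigma2=0.0):
    """Mean and variance of a block sum, ASSUMING independent rows."""
    mean, var = row_moments(lam, mu, sigma2)
    return n_rows * mean, n_rows * var


# The two parameter settings that a single block sum of 50 cannot tell apart.
case_a = block_sum_moments(100, lam=1.0, mu=0.5)
case_b = block_sum_moments(100, lam=0.5, mu=1.0)
print("lambda=1.0, mu=0.5 -> (mean, variance) =", case_a)
print("lambda=0.5, mu=1.0 -> (mean, variance) =", case_b)
```

Both cases give the same expected block sum, but under these assumptions the first has no variance at all while the second does, so the product λμ alone can't produce the confidence interval.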

All hope is not lost. Again, we don't really care about λ and μ; we want the variance (which happens to be a function of those two). The variance is also a function of the vector of block sums. In the infinite population case, the block sums so far are a sufficient statistic for the variance of the mean. So, in the finite case, we should have a statistic which is asymptotically sufficient. I think if I can establish that, plus some reasonable comments about the rate of convergence, that should be good enough.
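
As a sketch of what working from the vector of block sums might look like, here is the textbook estimator for a total under simple random sampling of equal-sized blocks without replacement. This is a stand-in for illustration, not the estimator I'll necessarily end up with: it assumes block sums are independent and that blocks are sampled uniformly, and the name `estimate_population_sum` is made up.

```python
import statistics


def estimate_population_sum(block_sums, total_blocks):
    """Estimate the population sum and the variance of that estimate.

    Assumptions: equal-sized blocks, sampled uniformly without replacement,
    with independent block sums. Needs at least two sampled blocks.
    """
    m = len(block_sums)
    mean_block = statistics.mean(block_sums)
    s2 = statistics.variance(block_sums)  # sample variance (m - 1 denominator)

    estimate = total_blocks * mean_block
    # Finite population correction: the variance shrinks to zero once
    # every block has been read.
    fpc = 1.0 - m / total_blocks
    variance = total_blocks ** 2 * s2 / m * fpc
    return estimate, variance


est, var = estimate_population_sum([40, 50, 60], total_blocks=10)
print("estimated sum:", est)
print("approximate standard error:", var ** 0.5)
```

Nothing here needs λ or μ individually; everything comes from the block sums, and the correction factor is where the finite case departs from the infinite one.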
