Thursday, January 11, 2018

Confounding variables

A confounding variable is one that impacts both the independent and dependent variable in a relationship. If the confounding variable is not accounted for, a false causal relationship may be deduced. A common example would be to note that prison populations are disproportionately minority and conclude that minorities must be inherently criminal. The relationship is irrefutable, but the causation may have nothing to do with race. Minorities also suffer from lower economic status, less access to good education and general racial bias in the criminal justice system. All these things make it more likely for a person to turn to and be convicted of a crime. Studies that account for such things have shown that the environmental variables, not race itself, that are much better predictors.

We have something of an inverse problem with our current formulation. We know that rows within a block have correlation in the attributes. Those attributes are the independent variables that determine the distribution for the measures we are after. More importantly, those attributes may exclude the measure from the query altogether if they don't meet the query criteria. Thus, the mean and variance of any block sum is higher than if the rows were scattered randomly throughout the database.

The attributes are correlated because people tend to batch up data by like attributes. The batch doesn't cause the attributes to be correlated. Rather, it's the correlation that causes the data to be put in the same batch.

So, we're turning the confounding thing upside down as well and using batch id as a surrogate for "these things go together." This is generally valid, but it does leave open the possibility of having some sort of background analysis performed on the data to really determine which things go together. I have some vague notions on how one might do that based on analyzing the results of query patterns (similar to what D-BESt does, but with an eye to the measures as well as the attributes). However, that is not today's problem to solve.

But I don't want to box myself in by deferring it out of hand. So, I'm going to change my notion of a "batch" slightly and instead talk of a "partition". A partition is simply a map from the attribute vector to the positive integers. As batch id is part of the attribute vector, using it as a partitioning variable is simply a special case of the more general framework. By using partitions rather batches, I make it easier to generalize the results.

I am giving up a little bit with this. With batches, there are assumptions about contiguous rows that won't be generally true with partitions (unless the data is re-blocked by partition as with D-BESt). I talked with my adviser today about how the contiguity of batches might be leveraged and neither of us came up with anything. Therefore, I'm not considering that a loss.

No comments:

Post a Comment