UMSL held a 3-Minute Thesis competition today. You get one static slide and 3 minutes to pitch your topic. After giving the talk in a preliminary round, I was invited to the finals. I didn't place in the top three. Here is my slide and what I said.
Hello, my name is Eric Buckley and I am pursuing a PhD in Math.
As it’s an election year, I’ve chosen to introduce my subject with one of the most famous pictures ever taken in St. Louis. Contrary to popular belief, I was not alive at the time, but I’m told the 1948 election was almost as crazy as what we have going on now. For me, the most interesting fact is that EVERY SINGLE POLL got it wrong. Why? Technology bias.
I am old enough to recall when working-class people would have “party lines”, a single phone number shared by several families. Pollsters would sample the number once and count it once, but it actually represented several votes. As a result, Truman’s blue-collar base was under-represented.
Cell phones and caller ID have given pollsters plenty of new problems to wrestle with, but I am more interested in what happens when the source isn’t a person at all, but a database. Or, many databases combined into what’s known as a “data lake”.
In such cases, one needs to consider not only source bias but also the fact that records are highly correlated. So, it may appear you’re converging on an answer when in fact your sampling has missed a whole lot of contradictory information tucked away in a tiny corner of the lake. Finally, in the case of financial data, which is the focus of my research, the distributions are not your typical Normal (or “bell”) curve. They are “heavy-tailed”, which means they are spread out, with more observations very far from the mean. Random samples from such distributions converge very slowly, if at all.
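To make that last point concrete, here is a rough Python sketch, an illustration only and not the dissertation code, using a Pareto distribution as a stand-in for heavy-tailed financial data and watching how slowly its running sample mean settles down compared with a Normal:

```python
# Illustration: sample means from a heavy-tailed (Pareto) distribution
# settle down far more slowly than means from a Normal distribution.
import numpy as np

rng = np.random.default_rng(42)

def running_means(draws):
    """Cumulative sample means after 1, 2, ..., n draws."""
    return np.cumsum(draws) / np.arange(1, len(draws) + 1)

n = 100_000
normal_means = running_means(rng.normal(loc=1.0, scale=1.0, size=n))
# Pareto with shape a=1.5 has a finite mean (3.0) but infinite variance.
pareto_means = running_means(rng.pareto(a=1.5, size=n) + 1.0)

for k in (100, 1_000, 10_000, 100_000):
    print(f"n={k:>7}  normal mean={normal_means[k-1]:.3f}  "
          f"pareto mean={pareto_means[k-1]:.3f}")
```

The Normal running mean is close to its true value almost immediately; the Pareto running mean is still jumping around after tens of thousands of draws.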
My dissertation takes a three-pronged approach to these issues. First, we stratify the data to ensure we get adequate representation of the distribution tails. This part of the work has been very productive and will be submitted for publication next month.
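As a toy sketch of what stratification buys you (the cutoff, stratum sizes, and data here are made up for illustration, not taken from the dissertation):

```python
# Split a heavy-tailed population into a "body" stratum and a "tail" stratum,
# sample each stratum explicitly so the tail is guaranteed representation,
# then re-weight the strata by their population shares.
import numpy as np

rng = np.random.default_rng(0)
population = rng.pareto(a=1.5, size=1_000_000) + 1.0   # heavy-tailed toy data

cutoff = np.quantile(population, 0.99)                 # top 1% is the "tail"
body = population[population <= cutoff]
tail = population[population > cutoff]

n_body, n_tail = 900, 100                              # deliberately over-sample the tail
sample_body = rng.choice(body, size=n_body, replace=False)
sample_tail = rng.choice(tail, size=n_tail, replace=False)

w_body = len(body) / len(population)
w_tail = len(tail) / len(population)
estimate = w_body * sample_body.mean() + w_tail * sample_tail.mean()
print(f"stratified estimate: {estimate:.3f}   true mean: {population.mean():.3f}")
```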
Next, we adjust for correlation in the data so we can accurately estimate the variability of our results. More correlation generally means higher variability. Therefore, we have to dig deeper into the data to get a reliable sample that doesn’t miss pockets of significant information.
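One standard textbook way to see why correlation forces you to dig deeper (a general rule of thumb, not necessarily the adjustment used in the dissertation): with an average pairwise correlation rho, n correlated observations carry only about n / (1 + (n - 1) * rho) observations' worth of independent information.

```python
# Effective sample size under an average pairwise correlation rho:
# Var(sample mean) = (sigma^2 / n) * (1 + (n - 1) * rho), so n correlated
# observations behave like n_eff = n / (1 + (n - 1) * rho) independent ones.
def effective_sample_size(n: int, rho: float) -> float:
    return n / (1 + (n - 1) * rho)

for rho in (0.0, 0.01, 0.05, 0.2):
    print(f"rho={rho:4.2f}  n=10000  n_eff={effective_sample_size(10_000, rho):8.1f}")
```

Even a modest rho of 0.05 shrinks ten thousand records to the information content of about twenty independent ones, which is why sampling has to reach into many distinct pockets of the lake.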
Finally, we put these two together to produce not just an estimate, but a distribution for the estimate. We can then give a range known as a “Highest Density Interval”. This is the narrowest interval that contains the real value with a given certainty. The more we sample, the narrower the interval becomes. We control the sampling to stop when the width reaches acceptable limits.
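A toy sketch of the HDI-plus-stopping idea follows; the data source, the bootstrap step used to get a distribution for the estimate, and the stopping width are all placeholders for illustration, not the dissertation's actual procedure:

```python
import numpy as np

def hdi(draws, cred=0.95):
    """Narrowest interval containing a `cred` fraction of the draws."""
    x = np.sort(np.asarray(draws))
    k = int(np.ceil(cred * len(x)))            # points the interval must cover
    widths = x[k - 1:] - x[: len(x) - k + 1]   # width of every candidate interval
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]

rng = np.random.default_rng(7)
data = np.array([])
target_width = 0.1

while True:
    data = np.append(data, rng.normal(loc=1.0, scale=1.0, size=1_000))  # toy data source
    # Bootstrap the sample mean to get a distribution *for the estimate*.
    boot_means = rng.choice(data, size=(1_000, len(data))).mean(axis=1)
    lo, hi = hdi(boot_means)
    if hi - lo <= target_width:
        break

print(f"stopped at n={len(data)}: 95% HDI for the mean is [{lo:.3f}, {hi:.3f}]")
```

Each pass pulls another batch of data, recomputes the interval, and stops as soon as the interval is narrow enough, which is the "control the sampling" idea in miniature.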
This rigorous quantification of results has been sorely lacking in the emerging field of Data Science. I am excited to be among those addressing this need.