Tuesday, January 26, 2016

Missing observations

Data cleansing was the subject of today's lecture for Data Mining. We spent most of the time looking at what to do in the case of missing observations. Granting that filling in a few values here and there probably simplifies things without biasing results, I'd say the strategies offered were pretty bogus. I've held for some time that Data Science is just Statistics without the rigor, and this bag of unsupportable tricks didn't do much to dissuade me.

I'm pretty sure that part of the problem is simply that the techniques that have some merit are beyond the level of this course (it's a mixed grad/undergrad course so, while we PhD types will be expected to produce some meaty theoretical work, the lectures are tuned to the undergrads). That said, there's at least one strategy for missing observations that I think a lot of CS folks are too quick to dismiss: go get the data.

Yes, it's a pain. I spent hundreds of hours poring over motor vehicle records and mortgage papers preparing the data set I used for my master's work. Not because I needed a clean data set for the thesis, but because the work was actually being used by the New York State Department of Health to direct cancer investigations. Sure, it would have been easy enough to just fill in some data, but there are times when getting the answer right really matters.

You could certainly argue that many of today's data sets are too large for that level of effort (my research set was a mere 592 cases of leukemia). But detecting and correcting missing data is becoming a whole lot easier thanks to the tremendous growth in computational power. At work, we spend quite a bit of our time writing tools to detect gaps in data. When gaps are found, we report them to Corporate Modeling, who then chase after the providers (typically, actuaries in our remote offices) to fill them in. The result is that we have an incredibly robust set of projections data. That allows us to confidently trim the buffer between our actual and required capital and price more competitively. In the context of 4 trillion dollars of insurance in force, the million or so we spend doing this seems like effort well directed.
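
To give a flavor of what a gap-detection tool might look like, here is a minimal sketch in Python. It is purely illustrative: the field names, the sample records, and the find_gaps helper are all made up for this post and are not the tools or data described above. The idea is simply to scan each record for empty required fields and group the gaps by provider, so that someone can go ask for the real values instead of inventing them.

```python
from collections import defaultdict

# Hypothetical required fields; real projection data would have its own schema.
REQUIRED_FIELDS = ("policy_id", "office", "projected_claims", "reserve")


def find_gaps(records, required=REQUIRED_FIELDS):
    """Return {office: [(record_index, missing_field), ...]} for follow-up."""
    gaps = defaultdict(list)
    for i, rec in enumerate(records):
        for field in required:
            value = rec.get(field)
            # Treat absent keys, None, and empty strings as missing.
            if value is None or value == "":
                # Records with no office are filed under "UNKNOWN".
                gaps[rec.get("office") or "UNKNOWN"].append((i, field))
    return dict(gaps)


# Made-up sample records for illustration only.
records = [
    {"policy_id": "A1", "office": "Albany", "projected_claims": 120.0, "reserve": 300.0},
    {"policy_id": "A2", "office": "Albany", "projected_claims": None, "reserve": 310.0},
    {"policy_id": "B1", "office": "", "projected_claims": 95.0, "reserve": ""},
]

for office, missing in find_gaps(records).items():
    print(office, missing)
```

Note that the sketch only reports gaps; it deliberately makes no attempt to fill them in. That part is still a phone call.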
