Wednesday, February 24, 2016

Sampling harness complete

In a perfect world, I'd be further along, but I did get to a tangible milestone today and sleep is important, too. My sampling harness is complete. The architecture is pretty straightforward. As it's basically a command-line app, there's no presentation layer, just a model sitting on top of a data layer.

I've only implemented CSV files in the data layer as that's all I'm working with right now. When I start using the full dataset, I'll swap that out for HDFS and host it in the cloud. Aside from basic read/write, the data layer also stratifies the data before presenting it to the model. In the CSV case, this effectively re-blocks the data. When I go to Hadoop, I'll do the stratification as part of the partitioning of the Parquet files so the data layer can leverage the parallelism inherent in the file system.
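To make the re-blocking idea concrete, here is a minimal sketch in Python. It is not the harness code. It assumes that stratifying means grouping rows by the value of one column and then cutting each group into fixed-size blocks; the function name `stratified_blocks`, the `region` column, and the block size are all made up for illustration.

```python
import csv
import io
from collections import defaultdict


def stratified_blocks(csv_file, stratum_column, block_size):
    """Group rows by stratum, then cut each stratum into fixed-size blocks.

    Hypothetical helper: the stratification rule (one column, fixed block
    size) is an assumption, not something taken from the real data layer.
    """
    strata = defaultdict(list)
    for row in csv.DictReader(csv_file):
        strata[row[stratum_column]].append(row)

    blocks = []
    for key in sorted(strata):  # deterministic stratum order
        rows = strata[key]
        for start in range(0, len(rows), block_size):
            blocks.append(rows[start:start + block_size])
    return blocks


# Tiny in-memory CSV standing in for a real file.
sample = io.StringIO("region,amount\nA,1\nB,2\nA,3\nB,4\nA,5\n")
blocks = stratified_blocks(sample, "region", block_size=2)
for block in blocks:
    print([(row["region"], row["amount"]) for row in block])
```

After this step every block holds rows from a single stratum, so the model never has to know how the rows were laid out in the original file.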

The model currently implements two sampling methods: a sequential scan of all blocks and a random sampling of blocks. Both run through the whole data set right now. I probably won't bother putting a stopping rule on either since I'm only using them as a baseline for convergence. If I do put a user interface on this thing, I'll factor the sampling methods out into a controller layer. For now, I'm just being careful not to build sampling rules into the true model classes. That separation will likely be important when I go to parallelize things.
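A rough sketch of that separation, again in Python and again not the actual harness: the two samplers are plain generators that only decide the order of blocks, and the model only knows how to absorb a block. The names `sequential_scan`, `random_scan`, and `RunningMean` are hypothetical, the running mean is just a stand-in estimator, and the random sampler is assumed to draw blocks without replacement since it runs through the whole data set.

```python
import random


def sequential_scan(blocks):
    """Visit every block in storage order."""
    yield from blocks


def random_scan(blocks, seed=None):
    """Visit every block exactly once, in a shuffled order.

    Sampling without replacement is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    order = list(range(len(blocks)))
    rng.shuffle(order)
    for index in order:
        yield blocks[index]


class RunningMean:
    """Stand-in model: knows how to absorb a block, nothing about sampling."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, block):
        self.total += sum(block)
        self.count += len(block)

    @property
    def estimate(self):
        return self.total / self.count if self.count else float("nan")


def run(sampler, blocks):
    """Feed blocks to a fresh model and record the estimate after each one."""
    model = RunningMean()
    trace = []
    for block in sampler(blocks):
        model.update(block)
        trace.append(model.estimate)
    return trace


blocks = [[1, 2], [3, 4], [5]]
sequential_trace = run(sequential_scan, blocks)
random_trace = run(lambda b: random_scan(b, seed=0), blocks)
```

With no stopping rule, both traces end at the same full-data estimate; only the path they take to get there differs, which is what makes them usable as a convergence baseline.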

It's just a start, but it works fine and the whole thing has unit test coverage. That's actually pretty important because once I start running large data sets through this thing, it's going to be very difficult to validate that each step is working right. Having those test cases to fall back on for regression when making changes will likely save a ton of time.

Next up, obviously, is to code my algorithm (need to come up with a better name than "my algorithm"). Most of it is pretty simple; the only difficulty is in computing the posteriors. I'll probably use a heuristic for starters just to get it working and then work through the math with my adviser when I get back from Texas.
