- Display consistent results for each of the methods. Right now the graphs are a little scattered. Each tells a good story on its own, but I need either a single consistent graph that allows comparisons across methods, or some other kind of table that puts them on equal footing (one plotting option is sketched after this list).
- There are a few sections of text that my adviser thinks are less than clear, and I haven't been able to come up with significant improvements. We may just have to work through them together, word by word. Fortunately, it's a short list.
- The conclusion is, well, missing. I need to write up a conclusion of some sort.
- Run all the samplers on the empirical data set. This is the biggest lift, but at least the data set is on the HDFS cluster at work now. It's 17 billion fact rows, and it's easy to construct queries against it that take several minutes to complete (they used to take hours before we rehosted to the cluster this year), so it's a big enough data set to make the case that sampling makes sense. The catch is that I haven't properly parallelized the samplers, so they will probably take about as long as the full query engine (which we have spent the last year making very fast). I guess I don't need to worry about that, though, since we're just demonstrating that things work and not making a bunch of performance claims. (A sketch of one easy parallelization route also follows this list.)
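On the first item, one way to put the methods on equal footing is a single chart with every method on the same axes, say relative error against sample fraction. Here is a minimal sketch of that idea; the method names and numbers are placeholders for illustration, not results from the thesis.

```python
# One "consistent graph" option: every sampling method plotted on
# shared axes (relative error vs. sample fraction), so the methods
# can be compared directly. All data below is made up.
import matplotlib.pyplot as plt

# Hypothetical results: sample fraction -> relative error, per method.
results = {
    "bernoulli": {0.001: 0.12, 0.01: 0.04, 0.1: 0.012},
    "stratified": {0.001: 0.05, 0.01: 0.02, 0.1: 0.006},
    "reservoir": {0.001: 0.10, 0.01: 0.035, 0.1: 0.011},
}

fig, ax = plt.subplots(figsize=(6, 4))
for method, points in results.items():
    fractions = sorted(points)
    errors = [points[f] for f in fractions]
    ax.plot(fractions, errors, marker="o", label=method)

# Log-log axes keep widely spaced sample fractions readable.
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("sample fraction")
ax.set_ylabel("relative error")
ax.legend(title="method")
fig.tight_layout()
fig.savefig("method_comparison.png", dpi=150)
```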
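On the parallelization point, at least one sampler family is embarrassingly parallel: Bernoulli sampling splits cleanly across partitions, because including each row independently with probability p in every partition is equivalent to Bernoulli-sampling the whole table. The sketch below assumes flat partition files and line-oriented rows, which is purely illustrative; the actual samplers in the thesis may work differently, and on HDFS the reads would go through a client library rather than open().

```python
# A minimal sketch of per-partition Bernoulli sampling with a
# worker pool. File names and row format are assumptions.
import random
from multiprocessing import Pool

def sample_partition(args):
    """Bernoulli-sample one partition file; return the kept rows."""
    path, p, seed = args
    rng = random.Random(seed)  # per-partition seed keeps runs reproducible
    kept = []
    with open(path) as f:
        for line in f:
            if rng.random() < p:
                kept.append(line)
    return kept

def parallel_sample(paths, p, workers=8):
    """Sample all partitions in parallel and concatenate the results."""
    args = [(path, p, i) for i, path in enumerate(paths)]
    with Pool(workers) as pool:
        chunks = pool.map(sample_partition, args)
    return [row for chunk in chunks for row in chunk]

if __name__ == "__main__":
    # Hypothetical partition files.
    rows = parallel_sample(["part-0000.csv", "part-0001.csv"], p=0.01)
```

Samplers that carry state across rows (reservoir or stratified variants) need a merge step after the per-partition pass, but the partition-parallel structure is the same.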
That's probably a solid week's worth of work and I'm sure there's another round of edits, but this is looking like something that really will happen quite soon.