Wednesday, February 7, 2018

Reining it in

You may recall that the original thought (well, original for this iteration, which actually makes it about the five hundredth thought) was to have the theory section build up in stages: start with iid, then move to observations correlated in the hit rate, then to correlation in the general distribution of the observations. Given what we've got now, I think I'm going to suggest that we focus just on the hit rate problem. It still gives us three basic steps.

The first is, again, iid. There's nothing remotely interesting about sampling iid observations, but it does allow us to establish the framework and methods without all the complexities that follow.
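
Just to pin down the framework, here's a minimal sketch of the iid baseline (Python, with made-up numbers; "hit" just means the sampled record matches whatever predicate we're estimating the rate of). The point is that with independent draws, the plain proportion and its normal-approximation interval are all you need:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical setup: each sampled record is a hit or a miss,
    # and the draws are independent.
    true_hit_rate = 0.12
    n = 1_000
    hits = rng.random(n) < true_hit_rate

    p_hat = hits.mean()
    # The iid standard-error formula is honest here precisely
    # because the observations are independent.
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    print(f"estimate {p_hat:.3f} +/- {1.96 * se:.3f}")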

The next is sequential correlation, which typically arises because records that are loaded together tend to be related and also tend to end up next to each other in the storage system. This was also part of the plan.
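
To see what that correlation does to the estimates, here's a toy simulation (the two-state Markov chain is mine, purely for illustration, not the model from the theory section): hits arrive in runs, and the iid standard-error formula badly understates the actual sampling variability of the hit-rate estimate.

    import numpy as np

    rng = np.random.default_rng(1)

    def correlated_stream(n, p=0.12, stay=0.95):
        # Two-state Markov chain: each record repeats its neighbor
        # with probability `stay`, else is a fresh Bernoulli(p) draw.
        # Hits therefore cluster into runs, mimicking related records
        # sitting next to each other in storage.
        x = np.empty(n, dtype=bool)
        x[0] = rng.random() < p
        for i in range(1, n):
            x[i] = x[i - 1] if rng.random() < stay else (rng.random() < p)
        return x

    n, reps = 2_000, 500
    p_hats = np.array([correlated_stream(n).mean() for _ in range(reps)])
    naive_se = np.sqrt(p_hats.mean() * (1 - p_hats.mean()) / n)
    print(f"iid-formula SE {naive_se:.4f} vs actual SD {p_hats.std():.4f}")

On a run like this the actual spread comes out several times what the iid formula claims, which is exactly the gap the correlated-case methods have to close.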

The third step ends up flowing pretty simply from that. Suppose things aren't strictly sequential, but there's still a lot of correlation. This happens on massively parallel systems where writes get sent to multiple nodes according to some hash algorithm. It has the effect of breaking up batches, but the little chunks are still sequential and highly correlated. This case is actually a little cleaner mathematically because you don't have the annoying boundary conditions for the first or last batch in a block (which typically spill into adjacent blocks).
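
Here's a toy sketch of that mechanism (md5-mod-node routing is just a stand-in for whatever hash the real system uses, and the record names are invented): each batch gets shredded across the nodes, but within any single node the pieces of a batch still land as contiguous, correlated runs.

    import hashlib
    from collections import defaultdict

    NODES = 4

    def node_for(key: str) -> int:
        # Stand-in router: hash the key, take it mod the node count.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % NODES

    # Three batches of related records, written in order.
    batches = [[f"b{b}_r{r}" for r in range(6)] for b in range(3)]

    storage = defaultdict(list)
    for batch in batches:
        for rec in batch:
            storage[node_for(rec)].append(rec)

    # Each node ends up holding small, still-sequential chunks of
    # every batch.
    for node, recs in sorted(storage.items()):
        print(node, recs)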

The actual methods work the same in all three cases (though you'd be a fool to use the complicated methods if you knew the data was iid); the results, however, vary quite a bit. Plus, in the context of a full dissertation, the third case sets me up nicely to follow with D-BESt.

My meeting with my adviser got pushed back to tomorrow, so I should know by tomorrow's post whether that's the way we're going.
