Thursday, December 15, 2016

Way forward

I met with a member of the Computer Science faculty yesterday. She seemed a lot more interested in my topic than most folks I pitch it to. Her research (looking for genetic patterns in Alzheimer's patients) is only marginally related, but the techniques have some overlap. She agreed to critique my CISS paper. Just the fact that she understands working with high-dimensional data makes her opinion useful. Compared to a DNA sequence, my set of 1000 attributes is quite tiny.

So, feeling somewhat better that this topic is not a dead end, I'm giving some thought to where the road might lead.

First, the practical problem we're trying to solve (there's no rule that says a PhD topic has to address a practical problem, but it's somewhat easier to stay on course if it does): we want to quickly retrieve sums of measures from a large data set with attribute values matching an arbitrary set of criteria. Important words here are "quickly", "large", and "arbitrary". The lack of quantification of these terms would prompt any competent Business Analyst to reject this as a statement of requirements. However, we're not really interested in requirements, we're interested in direction, so we'll leave them vague.
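To make the query shape concrete, here's a minimal Python sketch. The row format, attribute names, and measure name are all made up for illustration; criteria are just sets of allowed values per attribute, and the baseline answer is a full scan that sums the measure over matching rows. Everything that follows is about beating this baseline.

```python
# Minimal sketch of the query shape (all names hypothetical).
# Each row is a dict of attribute values plus a numeric measure field.
# Criteria map an attribute name to the set of values it may take.
from typing import Iterable

def naive_query(rows: Iterable[dict], criteria: dict[str, set], measure: str) -> float:
    """Full-scan baseline: sum the measure over rows matching every criterion."""
    total = 0.0
    for row in rows:
        if all(row.get(attr) in allowed for attr, allowed in criteria.items()):
            total += row.get(measure, 0.0)
    return total

# Example: sum "claim_amount" over rows from two offices and one product line.
criteria = {"office": {"NYC", "CHI"}, "product": {"term_life"}}
# result = naive_query(rows, criteria, "claim_amount")
```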

If pressed, I'd say sub-second response to a query involving a half dozen attributes (out of 1000 possible) applied to a trillion rows. My Hewlett Packard rep would read that and say, "Vertica already does that!" True, provided the data is sufficiently structured, the model is properly optimized, the criteria specify a rectangular-solid region of the dimensional space, and I'm willing to write a check for a million dollars to cover hardware and software licenses. I'd like to relax at least a few of those constraints.

If pressed on that one, I'd say that any observation can have any collection of attributes or measures, as long as they are in the data dictionary and mean the same thing across every observation that carries them. I'd say that the model should optimize itself in response to usage. I'd say that the shape of the region can be pretty much any combination of unions and intersections, including non-contiguous value lists (I might even admit arbitrary convex hulls, though I'd need to think about how that statement generalizes to unordered, categorical attributes). Finally, I'd say it should be hosted on open-source software running on a couple hundred grand worth of hardware (or some similar setup in the cloud that could be leased for a few thousand a month).
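To pin down what I mean by an arbitrarily shaped region, here's a rough sketch, again with invented names: a region is a union of clauses, and each clause is an intersection of per-attribute value sets, which covers non-contiguous value lists directly.

```python
# Hedged sketch of the "arbitrary region" idea (names illustrative only).
Clause = dict[str, set]   # attribute -> allowed values (ANDed together)
Region = list[Clause]     # union (OR) of clauses

def in_region(row: dict, region: Region) -> bool:
    """True if the row satisfies at least one clause completely."""
    return any(
        all(row.get(attr) in allowed for attr, allowed in clause.items())
        for clause in region
    )

# Example: (office in {NYC, CHI} AND product = term_life)
#          OR (client = Acme AND year in {2014, 2016})
region = [
    {"office": {"NYC", "CHI"}, "product": {"term_life"}},
    {"client": {"Acme"}, "year": {2014, 2016}},
]
```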

One of the reasons I'm being vague is that, when I present this in two years, all those criteria will sound silly because costs will have come down so much. However, that won't make the problem less relevant; data volumes will grow at least as fast. Despite the huge decrease in unit costs over the last 50 years, companies spend more than ever on computation and storage. So, those are my feelings in "today's dollars", so to speak. Two years from now, I'll restate them, adjusted appropriately.

My thought going into all this was that the way to pull this off was to write a good sampler. The problem with "random" sampling a database is that doing anything randomly in a large database is really inefficient. You always want to be working on sets of rows, not individual entries. Ideally, your unit of work is the physical block, which typically contains a few thousand rows. Writing processing directly against the physical storage is feasible, but fraught with implementation details that don't translate from one host to another. So, the next best thing is usually the logical block. Either way, you're looking at groups of records that are typically very highly correlated. I figured the trick would be to adjust for that correlation.
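For reference, the block-level estimator I had in mind is something like the sketch below: equal-probability block sampling with a simple expansion factor. The within-block correlation doesn't bias this kind of estimate, but it does inflate its variance, which is what I originally set out to adjust for. The function and its arguments are invented for illustration.

```python
# Sketch of block-level sampling, assuming the table is already split into
# blocks (lists of row dicts). Each sampled block contributes its block total,
# scaled by the inverse sampling fraction (a plain expansion estimator).
import random

def block_sample_estimate(blocks: list[list[dict]], measure: str,
                          sample_fraction: float, seed: int = 0) -> float:
    rng = random.Random(seed)
    n_sample = max(1, round(sample_fraction * len(blocks)))
    sampled = rng.sample(blocks, n_sample)               # whole blocks, not rows
    sampled_total = sum(row.get(measure, 0.0)
                        for block in sampled for row in block)
    return sampled_total * len(blocks) / n_sample        # scale up to all blocks
```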

As it turns out, correlation is the least of my concerns. The bigger problem was that, whether sampled by block or by row, the estimates of the sums didn't converge. Because the measures come from heavy-tailed distributions, you could be way off even with 90% of the data sampled. So, I've spent the last 9 months developing CISS to deal with that problem. That's great, but it still leaves the larger question open. CISS is reasonably fast and it does give decent estimates, but in putting it together it occurred to me that the same methods that create a good sampler also do a fair job of organizing the data for a full search.
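Here's a quick illustration of the convergence problem, using a simulated Pareto-distributed measure rather than my actual data, and plain sample expansion rather than CISS: a handful of enormous values carry most of the sum, so whether those values land in the sample largely decides how far off the estimate is, even at high sampling fractions.

```python
# Toy demonstration (not the CISS method): estimate the total of a
# heavy-tailed measure by sampling and expanding, at several fractions.
import numpy as np

rng = np.random.default_rng(42)
measures = rng.pareto(a=1.1, size=1_000_000)   # heavy tail, alpha just above 1
true_total = measures.sum()

for frac in (0.1, 0.5, 0.9):
    sample = rng.choice(measures, size=int(frac * measures.size), replace=False)
    estimate = sample.sum() / frac             # plain expansion estimator
    rel_err = abs(estimate - true_total) / true_total
    print(f"sampled {frac:.0%}: relative error {rel_err:.1%}")
```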

So, that's where we're going. Rather than reduce correlation within blocks, I'm going to try to maximize it. In the process, I'll also tag each block with a description of the attribute values it represents. So far, this isn't much different from what a relational database does with indices. The difference is that the "index" changes with each block. Some blocks might be identified by what office the data came from. Others might be identified by the client company. Others might have rows coming from a single product. The query engine then tries to determine which blocks contain the rows in question without actually reading the blocks. The goal is that, any time a block is actually read, it returns a lot of relevant rows.
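A rough sketch of the tagging and pruning idea, in the spirit of zone maps but with the tagged attributes chosen per block. All names are illustrative, not the actual design.

```python
# Each block carries a tag: the set of values present for a few chosen
# attributes. The engine reads a block only if its tag could possibly match.

def tag_block(block: list[dict], tag_attrs: list[str]) -> dict[str, set]:
    """Summarize a block by the distinct values of its chosen tag attributes."""
    return {attr: {row.get(attr) for row in block} for attr in tag_attrs}

def block_may_match(tag: dict[str, set], criteria: dict[str, set]) -> bool:
    """False only when the tag proves no row in the block can match."""
    for attr, allowed in criteria.items():
        if attr in tag and tag[attr].isdisjoint(allowed):
            return False
    return True

def query(blocks, tags, criteria: dict[str, set], measure: str) -> float:
    total = 0.0
    for block, tag in zip(blocks, tags):
        if not block_may_match(tag, criteria):
            continue                          # pruned without reading the block
        for row in block:
            if all(row.get(a) in v for a, v in criteria.items()):
                total += row.get(measure, 0.0)
    return total
```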

Achieving that goal is the interesting part. As queries come in, the query engine records statistics on each block: how often it was excluded, and how many rows it returned when it was included. Blocks that consistently return low row counts get dumped and their data is reallocated to the "good" blocks that remain. This process repeats in the background indefinitely, so the system continuously tunes itself to query patterns and to new data coming in.
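The bookkeeping behind that might look something like the following sketch: just the per-block counters and the low-yield test, not the real reorganization logic, and the thresholds are made-up placeholders.

```python
# Skeleton of the feedback statistics driving background reorganization.
from dataclasses import dataclass

@dataclass
class BlockStats:
    reads: int = 0          # times the block survived pruning and was read
    rows_returned: int = 0  # matching rows it contributed across those reads

    @property
    def yield_per_read(self) -> float:
        return self.rows_returned / self.reads if self.reads else 0.0

def low_yield_blocks(stats: dict[int, BlockStats],
                     min_reads: int = 100, min_yield: float = 5.0) -> list[int]:
    """Block ids read often enough to judge, and judged to yield too little."""
    return [bid for bid, s in stats.items()
            if s.reads >= min_reads and s.yield_per_read < min_yield]

# In the background loop, each id returned here would have its rows
# redistributed to better-matching blocks and its tag retired.
```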

There's not quite as much math in all this as I'd like. I'm hoping that somewhere along this road there are some interesting theorems to prove. None are jumping out at me right now, but we'll see what we can come up with.
