Context
Well, ya get data and ya format it so queries run fast. Not much to say on this one.
Containers
This could all be jammed into one monolith (and I might actually do that for the class project just to reduce effort). However, it won't parallelize well unless it's properly broken into services that run off a publish/subscribe bus. So, that's how I'm laying it out.
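To make the decomposition concrete, here's a toy sketch of the pattern. The topic names are made up, and each lambda stands in for a whole service; a real deployment would run the services against an actual broker on the cluster rather than an in-process bus:

```python
class Bus:
    """Toy in-process publish/subscribe bus: fan messages out by topic."""
    def __init__(self):
        self.subscribers = {}  # topic -> list of handler callables

    def subscribe(self, topic, handler):
        self.subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic, message):
        for handler in self.subscribers.get(topic, []):
            handler(message)

bus = Bus()

# Hypothetical topics: ingestion formats raw records, the query layer
# consumes the formatted ones.
bus.subscribe("raw-data", lambda rec: bus.publish("formatted-data", rec.upper()))
bus.subscribe("formatted-data", lambda rec: print("query layer sees:", rec))

bus.publish("raw-data", "some record")
```

The point of the bus is that any of these services can be scaled out or pushed to the background without the others caring.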
I also don't know if I'll actually be able to install this on non-prod at work, but it looks better to put something like this on industrial-strength hardware, so I'm going ahead and claiming it. A 10-node HDFS cluster is way cooler than a virtual machine running on my laptop.
Components
The data layer, ingestion, and query components constitute the bulk of the work, at least if one were to do them right. As I mentioned above, since this is just a class project for the moment, I may cut quite a few corners since they're pretty standard store-and-retrieve routines. I'll make sure the actual evolutionary algorithm is working before I put too much work into these.
One callout is that the Manager component in the Data Layer is actually a fairly big deal. In a space like Hadoop, where you don't have commits or rollbacks, you have to build in quite a bit of recovery for failed tasks.
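To give a feel for what that recovery means, here's a minimal sketch of one standard pattern, with everything hypothetical (the function names are mine, and I'm using the local filesystem as a stand-in for HDFS, where rename is likewise atomic): stage the output, then make an atomic rename the commit point, so a retried or restarted task is idempotent:

```python
import os
import tempfile

def run_task(task_id, compute, out_dir):
    """Run a task so partial failures leave no half-written output."""
    final_path = os.path.join(out_dir, f"{task_id}.done")
    if os.path.exists(final_path):           # already completed; retry is a no-op
        return final_path

    fd, staging_path = tempfile.mkstemp(dir=out_dir, prefix=f"{task_id}.tmp.")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(compute())               # failure here leaves only staging junk
        os.rename(staging_path, final_path)  # atomic commit point
    except Exception:
        os.unlink(staging_path)              # clean up so the task can be retried
        raise
    return final_path
```

Anything that dies before the rename leaves only staging files, which a cleanup pass can sweep; that's the recovery the Manager has to own since the storage layer won't do it.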
The meat of the project from the perspective of the class is, of course, the Optimize container.
The components on the left are task management, so this can be properly parallelized and relegated to background status. The components on the right do the real work.

The Block Generator is the most interesting piece. I haven't worked out exactly how I want to do it, but my general thought is that I'll have a prior distribution on which attributes are most significant and generate a posterior based on the performance against the query logs. I'll then use the posterior distribution to generate a full set of blocks for the next generation of the repository and run the query history against that. If the new generation performs better than the current repository, it becomes the new repository. Either way, we regenerate the prior distribution and start over.

The creation of new generations never stops; it should consume all the idle cycles on the cluster (or at least as much as gets allocated to this process, since there are other processes on the cluster as well). A new generation is published only when improvement is shown, but the loop keeps cranking either way.
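Just to pin the idea down, here's a back-of-the-envelope sketch of that loop. Everything in it is a placeholder: the attribute names are invented, a Dirichlet prior is just one arbitrary way to put a distribution on attribute significance, and replay_cost is a stub where the real system would rerun the query history against a regenerated repository:

```python
import numpy as np

rng = np.random.default_rng()

ATTRS = ["user_id", "event_ts", "region", "sku"]  # hypothetical attributes
alpha = np.ones(len(ATTRS))                       # flat prior on significance

def replay_cost(layout):
    """Stub for replaying the query history against a candidate layout.
    The real version would time actual queries on the rebuilt blocks."""
    return float(np.random.default_rng(hash(tuple(layout)) % 2**32).uniform(1, 10))

best_layout = list(ATTRS)
best_cost = replay_cost(best_layout)

for generation in range(100):                     # in production this never stops
    weights = rng.dirichlet(alpha)                # sample significance from the prior
    # Candidate: partition blocks by attributes in order of sampled significance.
    candidate = [a for _, a in sorted(zip(weights, ATTRS), reverse=True)]
    cost = replay_cost(candidate)
    if cost < best_cost:                          # publish only on improvement
        best_layout, best_cost = candidate, cost
    # Posterior-ish update: reward attributes that lead the incumbent layout.
    for rank, attr in enumerate(best_layout):
        alpha[ATTRS.index(attr)] += 1.0 / (rank + 1)
```

The accept-only-on-improvement rule is the part I'm sure about; how the posterior update should actually weight the query-log evidence is the open question.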