Wednesday, March 29, 2017

Clock is ticking

The dreaded confluence of heavy work and heavy academics has arrived.

Homework 3 for Set theory is due on 4/4.

Our first release of the scalable cube at work is on 4/12.

My term project for Evolutionary Programming has a somewhat vague date, but I think we're supposed to be done around 4/13. I'm pretty sure I can be one of the later presenters and get an extra week or two.

Oh, yeah, and I've got this little run in Boston coming up on 4/16.

I'm making decent progress on all those fronts (even the marathon), but I'm not done with any of them. I'm honestly not sure there are enough hours in the day and I really don't want to start working through the night.


Tuesday, March 28, 2017

Saved by the test suite

I don't really need any convincing that having a good set of unit test cases is helpful. I don't strictly practice Test Driven Design. Sometimes I write the code prior to the test cases, especially when the task is sufficiently rote that I don't really have to design it. However, I don't like having too much code without test coverage and this evening is why.

I've been having to do quite a bit of refactoring of my query engine. The problem is simple enough: given the size of the data set and the number of queries, each generation of query results will take way too long. So, I've been inserting some sampling techniques to get approximate results without running the full set.
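For the curious, here's the flavor of it, boiled down to a toy; approx_count and everything about it is illustrative, not the actual engine.

import random

def approx_count(rows, predicate, sample_frac=0.01):
    # Estimate how many rows satisfy the predicate by scanning a random sample
    # and scaling up, instead of scanning the full fact set.
    sample = [r for r in rows if random.random() < sample_frac]
    hits = sum(1 for r in sample if predicate(r))
    return int(hits / sample_frac)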

That's great, but if I didn't have my test cases all automated, I'd be pretty concerned that I'm breaking things as I re-write them. With the test cases well defined and easily run after each change, it's a fairly stress-free endeavor. Make the change, run the tests, see what broke, fix it. As I'm pretty good at keeping my line count in any one method to single digits, I'm rarely more than five lines away from spotting the culprit.

Unfortunately, not everybody views things this way. We've had an outside consulting company write a bunch of code for us at work. It's well-written code; these folks are good. However, we have darn near zero test coverage. That means that if we change it (which we certainly will), we'll need to go through full-blown debugging cycles. I don't like the prospects of that, so I've put a halt on new development while we get some decent test coverage developed.

Very few developers enjoy writing tests. I get that. Frankly, neither do I. But, as the guy on the hook if a change comes in late (or, worse, if a defective change makes it to production), I've learned to like it more than the alternative.

Monday, March 27, 2017

To each his own

So, I've knocked out a whole bunch of Set Theory homework over the last two days. That made me feel a lot better after my dismal conversation Thursday. Yes, math cheers me up. Yes, I realize that's not normal.

Sunday, March 26, 2017

Course correction

I don't have much time to write tonight, but I did come up with a potentially interesting course correction in response to the conversation with my Set Theory prof. Actually, it's motivated by more than just that one conversation (I'm not that easily swayed; actually, I'm not easily swayed at all). It also occurred to me, during a conversation on Friday with some engineers from IBM, why nobody seems that interested in my topic. The ideas are good, but my basic premise is wrong. Data is not expanding faster than hardware capabilities. Hardware is actually doing a pretty good job of keeping up. It only seems like data is expanding faster because companies don't like having to replace their hardware every year. However, they don't much like replacing their software, either. So, if it's going to be a switch, they'll probably just buy the bigger box and leave the software alone.

That's not to say that there's nothing to be gained via software. There certainly is. And, that's the course correction. Rather than force a 1-off solution (that I can't possibly implement as well as the full teams at HP, IBM, Oracle, and others), I'm going to go up a layer and look more at the motivation of these problems. My goal is still data reduction, but it's a little more along traditional lines of how to best boil down data rather than how to best avoid looking at it at all.

More refinement to come as I work this through.

Saturday, March 25, 2017

Boston

Took my off day on Saturday again.

Aside from a comically small mileage base, my prep for Boston is actually going pretty well. If I was really trying to run a good time I'd probably rescind that statement but, for just showing up and being able to run well enough to enjoy the event, I think that's looking pretty good.

Wednesday's long run didn't leave any troubling aches or pains. Today's split tempo session (which was over 2 hours long with a solid 40 minutes at Tempo pace) went fine. I went out again later in the day just to loosen things up and, again, no warning signs of problem areas. And, I'm pretty good at detecting such signs; it's why I am so seldom injured.

As I'm still not training with a watch, I still have no idea where my fitness really is. I might borrow a watch for this week's intervals and the 5K on Saturday. The plan is to run the first 20 at close to the pace predicted by those activities and then not worry about it if I have to slow down in the final miles.

Thursday, March 23, 2017

Not the answer I was looking for

...but that doesn't make it wrong.

I talked with my Set Theory prof before class today about where I was going (or, rather, how I was having trouble finding my way) with my research. He asked some insightful questions and offered a realistic, but rather bleak, view of my prospects.

Much of it wasn't news to me. I already knew that teaching positions (at least, of the full-time, tenure-track variety) are getting extremely difficult to find, much less land. I already knew that coming into the academic game at this late age basically disqualified me from any position where the usual path would include further research (probably as a post-doc) along the way. I already knew that UMSL's faculty was somewhat limited. I already knew that keeping my day job made the task that much harder.

But, I have to say he made no attempt to put a smiley face on any of those things. I recall that when I was teaching undergrads, I always tried to give them the unvarnished truth rather than encourage them down a road that was more bumpy than they perceived. Some of them didn't like that much, but most of them appreciated the honesty.

One thing is clear: if even the faculty thinks I'm basically on my own in this research, I am very definitely on my own. I'm pretty sure I can finish the PhD. Whether it will be worth anything is entirely up to me.

On the upside, he did have some suggestions on how I could take things more upon myself. That, of course, is the answer I was looking for; I just didn't expect quite so much pessimism surrounding it. I'll write about the way forward another day after I've had time to synthesize it a bit more.

Wednesday, March 22, 2017

PR

I'm going to interrupt the usual flow of this blog by inserting a running post mid-week.

I set a PR (personal record) today. A totally bogus one, for sure, but a PR nonetheless. It's the farthest I've ever run home from work. The defending champion was running home from RGA West (19 miles) followed closely by multiple runs home from AG Edwards downtown (17 miles). Today smoked them all: 25 miles.

Or something like that. I don't measure my routes particularly accurately. I'm sure it's more than 24. From RGA Headquarters, I ran into the valley, crossed the Missouri on the Boone Bridge, picked up the Katy trail to the Page Bridge, crossed back into St. Louis and home through Creve Coeur Lake Park.

This route was basically the same as the 19-miler home from RGA West, except that I had to run through the valley and cross the bridge to get to the Katy whereas, from RGA West, it was an easy 2 miles downhill to the same spot. My memory of the earlier run was that the Katy was a bit monotonous. I didn't really get that this time. Yes, it's an old railroad grade along a river, but it's pretty enough.

I will confess that, upon arriving at Green's Bottom (roughly mile 15) and looking over my right shoulder to see the RGA Headquarters on the other side of the river, I was a little miffed. There was no shorter way to connect those two points, but it seemed that running for two hours should have put me more than a mile from my starting position.

Certainly not a route that I can afford to run regularly, but I enjoyed it quite a bit today. It's my last real long run before Boston. I qualify that statement because I will have a few runs in the high-teens, which many people would call long runs. I guess they are, but I don't really consider a run "long" unless it goes over 3 hours.

Tuesday, March 21, 2017

Back on

Today was the rescheduled midterm for Evolutionary Programming. I think the extra rest day yesterday helped. I didn't feel rushed or particularly pressured. This prof words questions in a, let's be kind, "unique" way. I seem to have been doing a fairly good job of parsing them on the homework. Hopefully the same was true today.

At any rate, I'm ready to get back on it. Which is good, because there's still a ton of work to do for this term project.

Monday, March 20, 2017

On the edge

I don't know if everybody responds to their threshold like this or if it's just me. Fortunately, I don't need to know. As long as I can read the signs that pertain to me, I don't really need to worry about the signs that pertain to anybody else.

Anyway, my response to approaching my limit is never one of despair. It is never one of agony (going way beyond the limit and breaking something may be agony, but that's never the warning). It is never one of fatigue or frustration.

It is apathy. If I stop caring, it means I'm pushing too hard.

In the Middletown Firecracker Criterium in 1984 it was well over 100 degrees (the race was, as always, on July 4). I rode in the pack for a few laps and then was suddenly struck by the fact that I really didn't care how well I did in the race. I slid off the back. 30 seconds later, I passed out. Bystanders report that it was rather odd to watch; I just sort of fell asleep on the bike and hit the pavement. Of course, since the pavement (fresh blacktop) was scorching hot, I immediately came to my senses. Those who came over to help tried to get me to calm down, thinking I was in shock. I finally managed to put together the words, "I'm not in shock, I'm getting burned by the road!" At which point they helped me off the street.

At my first go at grad school, I lost interest halfway through my second year. I was doing OK, but I just didn't care anymore. In retrospect, simply taking a light load in the spring of that year probably would have been enough to salvage the effort. Of course, I would have missed the whole adventure of trying to become a pro bike racer, and that would have been tragic, but at the time it was just a decision made from apathy.

At the Mark Twain 100 last fall, I was running well - winning the race, in fact. But I was then overcome with the reality that I just didn't really care that much. The heat and deceptively brutal terrain had got the better of me. I was done and quit while still very much in the hunt.

All this is to say that the fact that I'm feeling apathetic towards studies today is setting off some real alarm bells. Between work and school, I've been at about 70 hours a week for several months now. It's been hard, but I've been up to it. In the last week, I've stopped caring. That means it's time to get off the gas.

That's not easy. We're at a very critical juncture at work. I'm behind on my term project. Still, it's a reality that needs to be acknowledged: if I keep pushing, I'll quit in response. So, I'm taking today off, even though I took yesterday off as well. Tomorrow, I have an exam and don't plan on doing anything other than that and a normal 8-hour day at work. That's a pretty short respite, but sometimes that's all it takes.

Sunday, March 19, 2017

Math

Today, we're going to venture into existentialist territory: what does it mean to be a mathematician? "Somebody who does math," would be the obvious response, and that's not incorrect, just imprecise. As someone who wants a Philosopher's Doctorate in the field, I think it makes some sense to come up with something more concrete.

Obviously, Mathematics is a sufficiently large field that one person would be pretty hard pressed to understand all of it; even at what I'll call the "Masters" level (meaning, no further formal instruction is necessary; if you need more info, you know where to find it and can understand it when you look it up). Still, as with any field, there are some pillars upon which pretty much everything else rests and one really should grasp those before billing oneself as an expert.

It varies greatly by school but, from looking at the Qualifying Exams from various universities, it seems that the following areas are universally accepted as pillars:
  • Real Analysis
  • Complex Analysis
  • Linear Algebra
The next tier are widely accepted, but some schools seem less concerned:
  • Abstract Algebra
  • Measure Theory (or, its restricted cousin, Probability Theory)
  • Geometry
  • Topology
Beyond that, there doesn't seem to be much agreement. I suppose I should note that Calculus is so foundational that it doesn't even get mentioned. If you don't know Calculus, there isn't much hope of grasping the rest of this stuff.

So, how does my knowledge base stand up? Real Analysis, Linear Algebra, Measure Theory and Probability Theory are no problem. I consider those pretty solid areas for me. Topology is slightly less so, but I can hang in there.

Complex Analysis is a complete blind spot. While statisticians really have no use for that silly i number, I really should address that. Fortunately, I'm told by reliable sources that it's actually fairly straightforward material. I'll probably tackle it on my own at some point.

Abstract Algebra and Geometry are areas that I seem to be OK with, but I know I'm really not. I know enough that when they come up in the context of the topics I do know, I don't get thrown. Still, if I had to actually prove a grad-level result in either area, I'm sure I'd be stumped. Unlike Complex Analysis (which is correctly regarded as undergrad material), I'm not so sure I can address those weaknesses on my own. As such, I'm signing up for an Algebra course next fall. Not really sure if I'll get a chance to pursue Geometry at UMSL but, if such an opportunity presents itself, I'll take it.

What does this have to do with my thesis? Absolutely nothing. Call me super old school, but I think some things are just worth knowing, even if you never use them.

Saturday, March 18, 2017

Beaumont 2017

I took my off day today since I knew I wouldn't have much (any) time to study. After running 10 miles and having breakfast with Fleet Feet, I shot over to the John Smerek O-meet. This is the annual scout championship meet. It was initiated and directed for many years by John and named in his honor when he passed away (around 20 years ago, though I don't remember the exact year).

Anyway, since it's all about the scouts, most of whom don't know a whole lot about orienteering, the SLOC meet is a bit of a side item. Still, meet director David Fisher always manages to put together a decent course for those of us who appreciate our annual opportunity to run on this map (Beaumont is owned by the Boy Scouts, so we only get this one chance a year to use the map).

Given how little nav training I've been doing, I was pleasantly surprised with how well I ran. Beaumont is pretty easy terrain, since it's dominated by some pretty big ridges. Still, if you're running it full speed, you have to be reading accurately and, for the most part, I was.

My legs were pretty trashed after the first few ridges, but that's basically what I was going for. It seemed that hammering over some hills after already running would be good prep for Boston. You can see from my routes that I started looking for ways to take the edge off the steepness in the last bit of the course.

I didn't bother numbering the circles; the course starts at the triangle and finishes at the double circle.


Friday, March 17, 2017

All dressed up for nothing

Well, the prof got sick. So, no test in Evolutionary Programming yesterday. Just as well, actually. As I left my Set Theory class, it occurred to me that I might not be in the best shape for taking an exam. I don't expect the exam to be particularly tough, but concentration still counts for a lot in any timed task. So, I think that on whatever day it gets rescheduled, I'll skip Set Theory and go for a run or something so my mind is clear.

Wednesday, March 15, 2017

Evolutionary Programming Cheat Sheet (part 2)

Second part of the notes for tomorrow's test.

Quantum GA

The trick here is that we're using "quantum bits" (hereafter referred to as Qbits). A Qbit, like a quantum particle, doesn't actually have a value. You can observe it and thereby force it to a value but, left alone, it merely has a distribution of where it might actually be.

If and when quantum computers come to be, this will be extremely powerful for certain types of programs. It allows a register to contain not just a value, but an entire distribution. Obviously, if you can represent the distribution with a single register, there's no need to have some big population sample that you are pulling from.

Crossover and mutation can be done in the usual way, though crossover is often skipped in favor of just mutating the solution based on its fitness relative to the current best answer.

The catch, of course, is that we don't actually have a quantum computer just yet. So, we have to simulate it. The standard way to do this is to keep a value alpha which represents the square root of the probability that a bit is zero. You can use imaginary coefficients, but it doesn't really buy you anything in this space so, sticking to reals, alpha (and its corresponding beta, where alpha^2 + beta^2 = 1) define points on the unit circle. You then force an observation from the register, making all the Qbits go to 0 or 1 according to their probabilities, assess the fitness of the sampled result, and then rotate the point along the unit circle towards the best observation seen so far. (Implementation detail: I would think just storing the angle of the point on the unit circle would be more efficient, since you're going to have to convert it to that to do the rotation anyway.)

While this works, it's a lot of work simulating the Qbits and I'm not convinced that assessing the fitness of one observation from a distribution is any more efficient than assessing a population pulled from the same distribution. Anyway, it's a concept that really can't be evaluated until there's a legit hardware implementation. Until then, it's just another stochastic algorithm which hides the pseudo-randomness in the simulator.
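For my own notes, here's roughly what the angle-only simulation looks like in Python. This is a sketch of the idea described above, not the course's code; the step size, the clamping to [0, π/2], and the toy fitness are my own choices.

import math
import random

def observe(thetas):
    # Collapse each simulated Qbit: P(bit = 0) = cos^2(theta), i.e. alpha = cos(theta).
    return [0 if random.random() < math.cos(t) ** 2 else 1 for t in thetas]

def rotate_toward(thetas, best_bits, step=0.02 * math.pi):
    # Rotate each Qbit's angle a little toward the value it takes in the best solution so far.
    new_thetas = []
    for t, b in zip(thetas, best_bits):
        target = 0.0 if b == 0 else math.pi / 2
        t += step if target > t else -step
        new_thetas.append(min(math.pi / 2, max(0.0, t)))
    return new_thetas

def quantum_ga(fitness, n_bits, generations=500):
    thetas = [math.pi / 4] * n_bits          # start with 50/50 Qbits
    best = observe(thetas)
    best_fit = fitness(best)
    for _ in range(generations):
        bits = observe(thetas)
        f = fitness(bits)
        if f > best_fit:
            best, best_fit = bits, f
        thetas = rotate_toward(thetas, best)
    return best, best_fit

# Toy usage: maximize the number of 1 bits.
print(quantum_ga(sum, n_bits=16))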

Differential Evolution

The big idea here is to modify solutions by applying mutations of their neighbors.

For each solution (target vector), three other solutions are selected to form a donor vector. This vector is of the form: v_{i,G+1} = x_{r1,G} + F(x_{r2,G} - x_{r3,G}). F is a constant, not a function.

The donor vector is then recombined with the target vector to form a trial vector. The formula for this is contrived, to say the least. Suffice it to say that one gene is picked at random to surely come from the donor vector (thus ensuring a mutation) and all the others are picked based on a random value exceeding a preset threshold. Yes, really.

The trial vector is included in the next generation if its fitness exceeds that of the target vector. Otherwise, the target vector survives.

And that's basically it. Aside from a really contrived generation step, it looks an awful lot like Evolutionary Programming (below).
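For my notes, the whole scheme fits in a few lines of Python. This is a generic sketch (maximizing); F, CR, the population size, and the bound clamping are conventional defaults I picked, not anything from the papers.

import random

def differential_evolution(fitness, bounds, pop_size=30, F=0.8, CR=0.9, generations=200):
    dim = len(bounds)
    pop = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fits = [fitness(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Donor vector: x_r1 + F * (x_r2 - x_r3), with r1, r2, r3 distinct from i.
            r1, r2, r3 = random.sample([j for j in range(pop_size) if j != i], 3)
            donor = [pop[r1][d] + F * (pop[r2][d] - pop[r3][d]) for d in range(dim)]
            # Trial vector: one gene guaranteed from the donor, the rest by coin flip vs CR.
            j_rand = random.randrange(dim)
            trial = [donor[d] if (d == j_rand or random.random() < CR) else pop[i][d]
                     for d in range(dim)]
            trial = [min(max(t, lo), hi) for t, (lo, hi) in zip(trial, bounds)]
            # Survivor selection: the trial replaces the target only if it is fitter.
            tf = fitness(trial)
            if tf > fits[i]:
                pop[i], fits[i] = trial, tf
    best = max(range(pop_size), key=lambda i: fits[i])
    return pop[best], fits[best]

# Toy usage: maximize -(x^2 + y^2); the optimum is the origin.
print(differential_evolution(lambda v: -(v[0] ** 2 + v[1] ** 2), [(-5, 5), (-5, 5)]))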

Genetic Programming

Like Genetic Algorithms except that instead of having generations of solutions, you have generations of programs. That is, the actual algorithm changes from one generation to the next. The most common way to do this is to represent the program as a decision tree and then modify the tree from one generation to the next. So, the basic algorithm doesn't really change, but the actual flow of the algorithm is altered by the tree, which is ancillary to the actual solution.

Crossover is generally applied by crossing sub-trees. While this works on simple categorization problems, it is limited in that the context of the sub-tree might be very important. For example, in my problem, applying this blindly to the block assigner could result in assignment strategies where the sub-tree contained no viable nodes for a block passed to it because that subtree was predicated on the conditions further up the tree being met. I think there are ways around this, but the papers we were given don't really address it. If I'm going to use this, I'll need to figure it out.
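For my notes, a bare-bones subtree crossover on generic trees. It happily ignores the context problem described above, which is exactly why I'd need something smarter for the block assigner; the Node class is just an illustration.

import copy
import random

class Node:
    # A tiny program-tree node: an operator/test with children, or a terminal with none.
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

def all_nodes(tree):
    nodes = [tree]
    for child in tree.children:
        nodes.extend(all_nodes(child))
    return nodes

def subtree_crossover(parent_a, parent_b):
    # Swap a randomly chosen subtree of A with a randomly chosen subtree of B.
    # Works on copies, so the parents themselves are untouched.
    a, b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    node_a, node_b = random.choice(all_nodes(a)), random.choice(all_nodes(b))
    node_a.value, node_b.value = node_b.value, node_a.value
    node_a.children, node_b.children = node_b.children, node_a.children
    return a, b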

Areas where GP does well:
  • Interrelationship between input variables is poorly understood.
  • Finding shape and size of the solution is a big part of the problem.
  • Search domain is large and solutions are sparse.
  • It's very hard to directly obtain solutions, but easy to simulate performance on tentative solutions.
Fair enough. Footnote: The AI-Ratio, that is, the amount of "intelligence" built into the algorithm via artificial means versus the amount that a human has to add to even get the thing to work, is very high. That is, you don't need a whole lot of domain knowledge to generate a viable algorithm.

Evolution Strategies

Evolution Strategies is similar to GA, but relies more heavily (or exclusively) on the mutation step. Mutation in the real-valued space is done by adding a Normal random variable to each current gene. Elitism can be used to ensure that good solutions don't drift too far afield.

The interesting part of ES (at least for me) is how the variance of the perturbing variable is handled. There are several options:
  • Use the same variance throughout.
  • Use a variance that decreases over time.
  • Use a variance that is increased if few improvements are noted, decreased if lots of improvements are noted.
  • Build the variance into the mutation (this is the one that intrigues me). That is, randomly generate the variance first, then generate the new solution. If the new solution is accepted, you take the variance with it. Thus, when the solutions require large variances to get off local maxima, you get those. Once all the improvements are coming from the local space, the variance gets clamped down.
 It is also possible to correlate the mutations. That is, it may be that a change in one attribute only helps if another changes with it. In that case, one should generate a correlated mutation vector.

The most successful application of this is the CMA-ES (Covariance Matrix Adaptation Evolution Strategy) algorithm.

The selection step may or may not include the parent generation, μ.
  • (μ,λ)-selection: use the children only and select the best (cut back to just μ).
  • (μ+λ)-selection: keep the parents in the mix and then select.
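Here's a sketch of the self-adaptive flavor (the one that intrigues me) as a (μ+λ)-ES in Python. The log-normal update of sigma with learning rate tau is a standard trick, but everything here is my own toy version, not the course code, and it ignores correlated mutations entirely.

import math
import random

def evolution_strategy(fitness, dim, mu=5, lam=20, generations=200):
    tau = 1.0 / math.sqrt(dim)                     # learning rate for the step size
    # Each individual carries its own step size: (solution vector, sigma).
    pop = [([random.uniform(-1, 1) for _ in range(dim)], 1.0) for _ in range(mu)]
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            x, sigma = random.choice(pop)
            new_sigma = sigma * math.exp(tau * random.gauss(0, 1))   # mutate sigma first
            child = [xi + new_sigma * random.gauss(0, 1) for xi in x]
            offspring.append((child, new_sigma))
        # "+"-selection: parents stay in the pool, keep the mu fittest.
        pool = pop + offspring
        pool.sort(key=lambda ind: fitness(ind[0]), reverse=True)
        pop = pool[:mu]
    return pop[0]

# Toy usage: maximize -(sum of squares); sigma shrinks as the population closes in on 0.
print(evolution_strategy(lambda x: -sum(v * v for v in x), dim=3))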

Evolutionary Programming

Similar to GA, but rather than trying to code the genetics that produce a given behavior, it attempts to code the behavior directly (phenotype versus genotype). As such, there is no use for a crossover operator - EP uses only mutation.

The basic algorithm applies a mutation to each member of the current generation and then selects the best from the current and offspring to create the next generation.

Another novelty (though it's really just a generalization of tournament selection) is that individuals are not selected based on their absolute fitness, but by how well they stack up against a randomly selected "peer group". I'm failing to see the upside in that, though I'll concede that, in some domains, it might increase diversity. One might note that this implements a de facto elitist strategy since the best known solution will always beat any group of peers.
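Here's roughly what that peer-group scoring looks like; q, the >= tie rule, and scoring by number of wins are my own toy choices.

import random

def peer_group_scores(population, fitness, q=10):
    # Each individual's score is how many of q randomly drawn peers it beats (or ties),
    # rather than its raw fitness. The best individual always wins all q, so this is
    # elitist in practice.
    scores = []
    for ind in population:
        peers = random.sample(population, min(q, len(population)))
        scores.append(sum(1 for p in peers if fitness(ind) >= fitness(p)))
    return scores

# Survivor selection would then keep the mu individuals with the highest scores
# out of the combined parent + offspring pool.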

Mutation is driven by random perturbations based on a distribution and step size which are considered inputs. The step size (variance) may be variable over time. As with ES, the step size may be fixed (non-adaptive), vary based on some function of time and convergence (adaptive), or be part of the mutation itself (self-adaptive).

Correlations between components are allowed; necessitating a correlated mutation vector.

Among the various mutation distributions cited is an odd one: the Mean Mutation Operator. It's the sum of a standard Normal and standard Cauchy random variable. The claim is that this produces more large and small perturbations and fewer in the middle. I don't dispute that, but I'm not sure what the mathematical basis for using such an operator would be. I think it falls under the heading of "making shit up and finding out that it works reasonably well." Hey, that's how Tungsten was invented; light bulbs are good.


Tuesday, March 14, 2017

Evolutionary Programming cheat sheet (part 1)

Writing up some thoughts for my open-notes test in Evolutionary Programming on Thursday.

Genetic Algorithms

Also known as Binary GA, to indicate that the chromosomes are binary sequences. Real-valued GA is obviously just another variant where the binary sequence is interpreted as an approximation of a real number.

Closest to the natural model of Genetic evolution. Basic steps are:

Initialize population P(t)
Evaluate P(t)
While (not TerminationCondition(P(t)))
   Parents(t) = Select(P(t))
   Offspring(t) = Cross(Parents(t))
   Evaluate(Offspring(t))
   P(t+1) = Offspring(t)
   t = t + 1
   
So, basically, you loop until you're done, with each generation being selected for child mortality and breeding fitness. Pretty much the way nature works. Not a whole lot more to say on this one.

On a purely theoretical note, Real-coded GA's are somewhat more capable of capturing a phenotype rather than a genotype. That is, they get at the behavior you're really after rather than the coding of the behavior. This distinction is not a big deal in practice, since converting from a binary code to the actual trait is usually straightforward.

Parent selection: several options, the most popular are Proportional Selection (each parent is selected proportional to their fitness score), Linear Rank (population is ranked and selected by rank within population), and Tournament (head-head evaluation of competing parents).

The pros and cons of each of these are fairly obvious and not terribly interesting.
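For my own reference, quick sketches of two of them (toy versions, maximizing; the roulette wheel assumes strictly positive fitness values).

import random

def proportional_select(population, fitnesses):
    # Roulette wheel: probability of selection proportional to fitness.
    total = sum(fitnesses)
    r = random.uniform(0, total)
    acc = 0.0
    for ind, f in zip(population, fitnesses):
        acc += f
        if acc >= r:
            return ind
    return population[-1]

def tournament_select(population, fitnesses, k=2):
    # Pick k individuals at random and keep the fittest of them.
    contenders = random.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: fitnesses[i])]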

The cross step includes both crossing and mutation. Crossover can be single-point (one cut and the genetic strand is swapped at only the cut point), multiple cut, or uniform (each gene can come from either parent). 

Midpoint crossover: given two parents X, Y, Offspring = (X+Y)/2. This is stupid, by the way.
BLX-α crossover: with Δ = α·|X−Y|, Offspring = random(min(X,Y) − Δ, max(X,Y) + Δ). Somewhat less stupid. BLX-0 just samples uniformly between the two parents.
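A per-gene sketch of BLX-α, with α = 0.5 as an arbitrary default of mine:

import random

def blx_alpha(x, y, alpha=0.5):
    # Each child gene is drawn uniformly from the parents' interval, widened by
    # alpha times its length on each side.
    child = []
    for xi, yi in zip(x, y):
        lo, hi = min(xi, yi), max(xi, yi)
        d = alpha * (hi - lo)
        child.append(random.uniform(lo - d, hi + d))
    return child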

Generation Gap: The proportion of the population that is replaced each cycle.
Elitism: Fraction of population that is guaranteed to survive to preserve the best solutions found so far.

Shared fitness GA

Not much different than GA, except in the fitness step. Here, the fitness is diluted by the number of close neighbors. That is, if there are 5 neighbors within the neighborhood threshold, then the fitness will be divided by 5. In the early goings of a GA, this can help disperse clustering at local minima, since the fitness of such is diluted. Of course, by the end, one wants to be looking at the real fitness of the solution.
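In code, the dilution step might look like this; I count the individual itself as one of its neighbors so the divisor is never zero, and the distance function and radius are whatever makes sense for the encoding.

def shared_fitness(population, fitness, distance, radius=1.0):
    # Divide each raw fitness by the number of neighbors within the sharing radius.
    raw = [fitness(ind) for ind in population]
    shared = []
    for i, ind in enumerate(population):
        neighbors = sum(1 for other in population if distance(ind, other) <= radius)
        shared.append(raw[i] / neighbors)
    return shared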

Variable Crossover GA (VGA)

I'll confess I totally misunderstood where this was going when I first read it. It's a fairly powerful technique when your solution space is not easily defined by a fixed chromosome (e.g., machine learning problems where the chromosome may have to encode an arbitrarily large decision tree).

Given two parents X, Y of arbitrary length. Pick a random cut from X (any portion of the chromosome). Find the corresponding intervals of Y that make semantic sense. That is, if the chunk of X can fit into several different intervals of Y and still be interpreted properly, identify all such chunks. Make the swap.

The point is you might be swapping strands of much different lengths. Suppose (as in the case of my term project) that a chromosome identifies an attribute and a range of values for that attribute. There are many such attributes on a chromosome. Now, split the chromosome for X. This may include 1 or more attributes and may contain all or none of the values for that attribute. Now, figure out how this fragment fits into Y and still makes sense. Obviously, you need the attribute to be consistent with the values. You also need to be swapping out coherent criteria, that is, you can replace one attribute criterion with several, but you can't apply the values for one attribute in Y to a completely different attribute in X. So, your resulting chromosomes could be of any length based on where you made the cut on X and how many ways there were to fit that cut into Y.
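As a toy illustration of the length-changing part: suppose a chromosome is just a list of (attribute, values) criteria and we only cut at criterion boundaries. That's a simplification of mine; the real operator also has to handle cuts that land mid-criterion and match them up semantically.

import random

def vga_crossover(x, y):
    # Swap a randomly chosen run of criteria from X with a randomly chosen run of Y.
    # The runs can have different lengths, so the children can grow or shrink.
    a1, a2 = sorted(random.sample(range(len(x) + 1), 2))
    b1, b2 = sorted(random.sample(range(len(y) + 1), 2))
    child_x = x[:a1] + y[b1:b2] + x[a2:]
    child_y = y[:b1] + x[a1:a2] + y[b2:]
    return child_x, child_y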





Monday, March 13, 2017

Big Iron

That's what we used to call mainframes. Today, we got the production hardware for our scalable reporting platform: 10 nodes, each of which carries 32 CPUs, a terabyte of RAM, and 3 terabytes of disk. In short, more artificial computing power than existed on the entire planet when I was born.

For a number of reasons, some of which actually make sense, developers aren't allowed in the data center. But, I'm told this setup easily fits in the back of a minivan, even when racked.

Of course, there's still a long way to go before we start bumping off biology. It was only a few years ago that the total artificial computing capacity of the planet surpassed a single human brain. However, it will only be a few years more before a "single" machine can do that (I use quotes because, with all architectures moving towards massively parallel, it's getting increasingly difficult to define a "single" machine; the box boundary is rather arbitrary.) Given that software lags hardware by around 10 years, my estimate of 2030 when machines and humans reach relatively equal footing is looking about right (I've maintained that date since the late 90's).

Yaya will be 27. She's going to have some interesting choices in life.

Sunday, March 12, 2017

Vocation

Sorry for any readers who might have been wishing for an off-day throwback race report. I suppose I should just state plainly that those will be coming less frequently. As far as competition goes: I've moved on.

Moved on to what, though? Today, prior to our main service at 10:45, we had a chance to listen to a woman who is currently in seminary talk about her calling. As is becoming increasingly typical in Protestant churches, this was not a 22-year-old who was pursuing the normal career path. This woman is 45. She's spent her whole life in ministry of some sort, but the call to the priesthood is one she has only recently embraced (though she claims she's been hearing it for quite some time).

Anyway, it was wonderful to hear her story, but also quite interesting to go through a rather simple exercise she had for all of us attending. First, we listed the various vocations that we, as laity, had embraced at one time or another in our lives. Easy enough. Next, she asked us to reflect on them and pick the one that we felt was our most important calling. For me, this was quite easy. I wouldn't be trashing my life to get a PhD if teaching wasn't what was pulling me.

Then she asked us to think about how we knew this was our calling. Well, that's still a pretty easy one for me. I've always wanted to teach and, when I have done it, my experiences have done nothing but reinforce that. So far, so good, but then she asked the question so few of us bother to ponder: why?

Ummm.

This is a hard one for me. Not because I can't come up with the answer. Rather because I know the answer and don't particularly like it.

There are, of course, many altruistic motivations to teach. It's very easy to pretend that you're taking the high road: leaving a more lucrative career as a practitioner, wanting to share your knowledge rather than just take it with you to retirement on the beach, wanting to give back to a field that has given you so much, blah, blah, blah. While none of those are false, none of them are the motivation, either.

I want to teach because I like doing it. I became a bike racer because I liked doing that. I became a consultant because I liked doing that (and, after a few rather lean years, liked the paycheck quite a lot, too). I run because I like to run. I go to church because I like to go to church. I cook because I like to cook.

I'm actually pretty hard pressed to find anything I do that I don't like doing. Sure, there are certain aspects of my activities that suck (I hate filling out my timesheet, and I don't mean I dislike it - I hate it). But, taken as a whole, I like doing the things I do.

Now, there's obviously nothing wrong with liking what you do. But, if the reason you do all these things is simply because you like them, at some point you have to ask if you're just being selfish. Well, I don't really need to ask; I know I'm basically selfish. I always have been and that's not likely to change at this point in life.

I'm not really sure what to do with all that but, this being Lent and all, now is a good time to at least think about it.

Saturday, March 11, 2017

We got data

I'm not exactly sure why I have 2.3 billion fact rows in my data set. All indications were that I would only have 780 million. However, all the rows have good referential integrity and there are no duplicates, so I'm going with it. I think it has to do with the way we exclude partitions from the cube. The Plan dataset is only 5 years, but a lot of the submissions were 20-year projections. The partitions in the cube total 780 million rows, but I picked up all the other stuff as well. At least, I think that's what happened. I'm not going to spend a day trying to verify that because I don't really care; I needed data, I got data.

Anyway, the load worked and it's still "only" 100 GB, so it's not killing my laptop. I now have 4680 blocks to work with. I expect the block count to rise significantly when I start partitioning them, since partitions never fill perfectly. I'll fill in the dimensions today and get going on the query engine. Once that's done, I can run a baseline and start reblocking stuff (which is the whole point of this exercise).

I am a little worried about performance. With this many rows, running the queries is going to take a while, at least until the blocks start getting closer to optimal. I'll basically be table-scanning the entire fact set until I can start excluding partitions. That might mean that I only get one overnight run on the early partitions. Unless that improves near the end, the most I can hope for is around 30 generations between now and when the project is due. That's not really very many. I'd rather have a few hundred.

On the other hand, it does make for a more realistic test. If we were to implement this thing in Production, the time to do the blocking would be when the new version of the business dimensions arrives. We use slowly-changing dimensions, so any attribute blocking would have to be re-done at that time anyway. Since we get dimension updates twice a day, this thing will need to tune itself pretty quickly. I'm fine with that, I just don't know that my first stab at a solution is going to converge that quickly.

Friday, March 10, 2017

Crunching

I spent most of the week refactoring my caching because I was concerned that I'd run out of memory on my laptop (no such worries once I move this to the Hadoop cluster; it can easily fit the entire data set in memory). Anyway, the big pull is underway. I'll be taking in 760 million rows, which is almost twice as many as I used for CISS. While the bigger rowset will slow down processing a bit, it will also mean that I can use a larger block size, which should give better options for partitioning. I'm using a 500,000 row block limit, so I'll have over 1500 blocks to work with. That seems like a good sample size. We typically partition our cubes into around 1000 partitions, so it's a pretty realistic number.

As you can see from the monitor shot below, Oracle is the problem as far as pulling the data goes. My PC isn't even breaking a sweat; it's waiting on Oracle to retrieve rows. There is a faster way to do this, which we discovered when we hooked up the HDFS ingestion: there's a native driver that pulls entire partitions. As long as you're OK with getting every row and column in a partition, it's really fast. I could have done that here, but I'm OK with waiting a few hours for the load to complete. I have to sleep sometime.


Thursday, March 9, 2017

Constructing the Reals

You wouldn't think that you'd have to jump through too many hoops to construct the Real numbers. However, as I've mentioned before, there's nothing "real" about them. So, one does need to take some care to make sure one knows exactly what we're talking about.

There are two basic approaches to this. The axiomatic approach is to start defining properties that any set that claims to be the real numbers must satisfy. Then one needs to prove that there's exactly 1 set that satisfies those axioms. Too few axioms and you get competing sets. Too many, and you can't satisfy them with anything. Of course, when I say that exactly 1 set satisfies them, I mean that any set that satisfies them has an isomorphic mapping to the reals. An isomorphic mapping is one where we just change the names, but all the relationships are left intact. For example, if we decided to call the additive identity element "A" instead of "0", we'd still have the same set, the elements would just be labeled differently. (Odd side note: the only two elements that are actually "named" in the axioms are the identity elements 0 and 1. These are also the two digits that are easily confused with letters. It's almost like they wanted to drive home the whole isomorphism thing.)

The axiomatic approach is what I got in my Analysis class in undergrad.

In Set Theory, we're getting a different construction which I'll call the Algebraic approach. Here, we start with the Rationals which, I guess, are considered intuitively obvious and I won't debate that though, being only one axiom away from the reals, one certainly could. We then show that the Cauchy sequences of rationals form a ring and that one can produce an ideal whose quotient yields the Reals, with each real number being a coset.
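In symbols, as I understand it (my own summary, glossing details an algebraist would care about): let

\mathcal{C} = \{ (a_n) \subseteq \mathbb{Q} : (a_n) \text{ is Cauchy} \}, \qquad
\mathcal{N} = \{ (a_n) \in \mathcal{C} : a_n \to 0 \}.

Then \mathcal{C} is a commutative ring under termwise addition and multiplication, \mathcal{N} is a maximal ideal of \mathcal{C}, and

\mathbb{R} \cong \mathcal{C} / \mathcal{N},

with each real number being a coset (a_n) + \mathcal{N}.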

That's all great except for one small problem. Abstract Algebra is a bit of a blind spot for me. I know the basic concepts, but I've never had a formal course in it and certainly don't have the tools to actually prove stuff beyond an undergrad level.

If I hadn't taken this huge bite of a term project in Evolutionary Programming, I'd use this as the excuse to slam through a text in Abstract Algebra. As I really don't have time for that, I'm just going to have to hang on as best I can. Meanwhile, the deficiency is noted. I think that if I really want to call myself a Mathematician, I need to take (or otherwise master) a grad level course in Abstract Algebra and Analytic Geometry. Pragmatists would call me a fool as neither of those topics has even a remote bearing on my thesis, but I really do believe there are things an "expert" is just supposed to know.

Tuesday, March 7, 2017

NP Completed!

So, I finally have a bona-fide theoretical result. I finished a proof that the Blocker problem is NP-hard (and the corresponding decision problem is NP-Complete). I'm not going to publish it here just yet because, like I said, this one is for real and I don't want to get scooped. It does significantly improve the chances of my Blocker paper getting published.

While I'm withholding the specifics for the moment, I will say that sometimes it does help to think about the dual space of the problem. Once I turned it into a maximization rather than minimization problem, it practically solved itself. The proof is quite concise.

Sunday, March 5, 2017

Fun Run

I'm well enough again that I've returned to "normal" training. "Normal" meaning: the best I can do given that I'm in grad school.

Turns out that's not very good. There are many runners who seem to get by just fine on 40-50 miles a week, but I'm not one of them. I've always been a high-mileage trainer. Running still feels fine, but I'm really going slow these days. I had thought that maybe I'd at least try to run Boston firm so I was finishing reasonably near the front (a 3:10 would probably put me in the first 3-4,000 finishers). Forget that; this one is going to be a pure fun run.

Saturday, March 4, 2017

Data done

Finally finished the data layer for Blocker. It shouldn't have taken so long and I'm not sure what I was doing wrong. Anyway, it's done now; all unit tests pass. Granted, it's not thread-safe, so I can't run in parallel on Hadoop. Also, the block caching is, shall we say, naive. It works, but it's certainly not a good caching strategy. Fortunately, neither of those two things have any impact on whether this is a good term project, so I'll defer them until later.

Assuming I can get the data actually loaded tomorrow (reasonable; the ingestion routine is very similar to what I used for CISS and it's the same data set) and that I can get the query engine working this week (also reasonable, though it's probably 15-20 hours of work), I'll have four weeks to actually develop the blocking algorithm and then another two weeks to produce results and write it all up.

Doable, but tight for sure.

Friday, March 3, 2017

Slow coding

I'm not sure why, because there's nothing particularly hard about it, but I'm really taking way too long to code the data layer for Blocker. Spent another five hours on it tonight. It's mostly done, but hooking it to the Oracle extract and pulling all the data is looking like it will consume the weekend.

Thursday, March 2, 2017

More ideas on the NP-Complete proof

It occurs to me that it might be easier to reduce either Set Cover or Set Packing to my blocking problem. The two problems are duals of each other, meaning that you can verify whether you've hit a minimum in one by comparing it to the maximum of the other. Since one rarely finds an optimal solution to these sorts of things, the next best thing is to look at how far apart the dual solutions are and note that the optimal result has to be somewhere between them. So, if they are close, you can probably say "good enough" and get on with life.

I think Set Cover might reduce more naturally, but even it is going to take some manipulating. Informally, the Set Cover problem is trying to find the smallest collection from a family of subsets where the union gives you the whole set. Set Packing does the opposite, it tries to find the biggest group of disjoint subsets. The fact that these are dual problems is left as an exercise to the reader.
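For reference, the classic greedy approximation for Set Cover, which is about as much code as the problem statement:

def greedy_set_cover(universe, subsets):
    # Repeatedly pick the subset that covers the most still-uncovered elements.
    # Not optimal, but the standard ln(n)-approximation.
    uncovered = set(universe)
    cover = []
    while uncovered:
        best = max(subsets, key=lambda s: len(uncovered & set(s)))
        if not uncovered & set(best):
            raise ValueError("subsets do not cover the universe")
        cover.append(best)
        uncovered -= set(best)
    return cover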

Anyway, the idea is that one can view the blocks as subsets covering queries. The idea is to find a group of blocks that satisfies all the queries. It's a little stickier than that, because there's the redundancy that comes from multiple queries (partitioning the data to optimize a single query is trivial; just put all the relevant rows for that query in one block and put everything else in another).

Another one that might work is Subset Sum (find a subset of numbers that adds up to a given value). While not in the "original 21" problems, it seems like it might work. (There are hundreds of NP-Complete problems and you can legitimately reduce from any of them, but the canon of 21 is universally accepted, so it adds some street cred.) I worked on this a bit this evening, but hit a few snags that seemed rather formidable. Still, I'm not ruling it out.

Wednesday, March 1, 2017

Ordinal numbers

Holy crap. Just when you thought Cardinal numbers were weird. Ordinal numbers are weirder!

I'm really glad this set theory stuff wasn't on the Q. It's just batshit crazy.