Monday, January 30, 2017

Evolutionary algorithm for dynamic block partitioning

I wrote up my proposal for a term project in Evolutionary Algorithms.

Problem statement

Queries against very large fact tables are slow. There are three common approaches to this problem: indexing, partitioning, and aggregation. Indexing is effective when the number of attributes attached to a row is small. When the number of attributes runs into the thousands, indexing each attribute becomes infeasible. Partitioning is effective when the query criteria allow the search engine to exclude partitions. However, if the query crosses partitions, little is gained. Likewise, aggregations are only beneficial if the query is grouped by attributes used in the aggregation.

Furthermore, the above three approaches are all static. The addition of new attributes or a change in query patterns can significantly curtail the effectiveness of the initial solution.

The goal of this research is to explore dynamic partitioning based on changing query patterns. There are several possible implementations. The first avenue will be to construct a genetic algorithm which creates an ecosystem of partitions which compete for correlated records. Over time, low performing partitions are pruned and replaced with partitions generated from higher performing partitions.

The overall objective function is to minimize the number of blocks that need to be read to satisfy a query. Since query patterns change, minimizing this directly is problematic. Thus, a surrogate fitness function for each block will be used: maximizing the query correlation of rows within the block. That is, a fit block will either return many rows for a query or can be excluded entirely. An unfit block will be searched often, but will not return many rows.
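To make that concrete, here's a rough sketch of how I imagine computing that per-block surrogate fitness from the query statistics described below. The counter names (times_scanned, times_excluded, rows_returned) are placeholders of my own, not part of any implementation yet.

    def block_fitness(times_scanned, times_excluded, rows_returned, block_size):
        """Surrogate fitness for one block: reward blocks that are either
        excluded outright or return a large fraction of their rows when read.
        All parameter names are hypothetical placeholders."""
        touches = times_scanned + times_excluded
        if touches == 0:
            return 0.0  # no query history yet; neutral fitness
        # Fraction of queries where the block never had to be read.
        exclusion_rate = times_excluded / touches
        # When the block is read, how dense are the hits?
        hit_density = rows_returned / (times_scanned * block_size) if times_scanned else 0.0
        # A fit block scores high on either term; an unfit block is scanned
        # often but yields few rows, so both terms stay small.
        return exclusion_rate + hit_density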

An outline of this approach follows.

Self-tuning partitioning ecosystem for large fact tables

Data Layout

Data is organized using a traditional star topology. Dimensions contain collections of attributes. An attribute can only appear in one dimension. The set of (attribute, attribute value) pairs in a dimension row must be unique in a dimension.

Fact rows contain one or more measures and foreign keys to dimensions. Facts are partitioned into blocks. The criteria for inclusion in a block is the output of this process at each generation.

Generation zero

The initial blocking of data is arbitrary. Data blocks can be built sequentially as data comes in or, if some information about query patterns is known, some attributes can be selected as candidates for splitting data into blocks.


As fact rows are submitted, the loading program assigns each row to a block whose criteria match the row. Blocks are examined from smallest to largest and the first block with matching criteria gets the row. If that block is full, it is split into two based on either a random criterion or the next available candidate attribute.
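A minimal sketch of that loading rule, assuming a Block object with matches(), split(), and append() operations and a length (all hypothetical stand-ins for whatever the storage layer actually provides):

    def assign_row(blocks, row, max_block_size):
        """Place a row in the smallest matching block, splitting a full block."""
        for block in sorted(blocks, key=len):          # examine smallest to largest
            if block.matches(row):
                if len(block) >= max_block_size:
                    # Split on a random criterion or the next candidate attribute;
                    # the row lands in whichever half still matches it.
                    left, right = block.split()
                    blocks.remove(block)
                    blocks.extend([left, right])
                    block = left if left.matches(row) else right
                block.append(row)
                return block
        raise ValueError("no block criteria match this row")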

Query processing

As queries come in, statistics are kept on how often a block is excluded from queries without having to examine its contents (that is, it can be determined from the block inclusion criteria that no rows would match the query) and how many rows are returned each time the block is read. The query criteria are also saved so they can be used for subsequent fitness testing. (In testing, this query load will be simulated).

Block propagation and fitness testing

During times when the system has available resources, a background task creates new generation candidate blocks. This is done by crossing and mutating the criteria for more successful blocks and then populating the new blocks by selecting rows from less successful blocks. The query history is then applied to the candidate blocks to assess their fitness.
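One way the crossover and mutation of block criteria could look, treating each block's criteria as a map from attribute to allowed values (made-up structures for the sake of the sketch, not a committed design):

    import random

    def crossover(criteria_a, criteria_b):
        """Child criteria inherit each attribute's allowed values from one parent at random.
        If the chosen parent has no constraint on an attribute, the child inherits none either."""
        child = {}
        for attr in set(criteria_a) | set(criteria_b):
            source = random.choice([criteria_a, criteria_b])
            if attr in source:
                child[attr] = set(source[attr])
        return child

    def mutate(criteria, attribute_domains, rate=0.1):
        """With small probability, toggle a value in or out of an attribute's allowed set."""
        for attr, domain in attribute_domains.items():
            if random.random() < rate:
                allowed = criteria.setdefault(attr, set())
                allowed.symmetric_difference_update({random.choice(sorted(domain))})
        return criteria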

Generation n+1

When enough generation candidates have been tested (a tunable parameter expected to be roughly 10 times the desired block count), a new set of block criteria are selected for inclusion in the next generation based on a random selection weighted by fitness. A percentage of the existing blocks (another tunable parameter, probably around half in early generations, then decreasing) is randomly selected for end of life weighted by lack of fitness.
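The selection step itself is ordinary fitness-weighted sampling; a quick sketch (random.choices is standard library, the rest is hypothetical, and a real implementation would sample without replacement):

    import random

    def select_survivors(blocks, fitnesses, survivor_count):
        """Draw block criteria for the next generation, weighted by fitness."""
        return random.choices(blocks, weights=fitnesses, k=survivor_count)

    def select_for_termination(blocks, fitnesses, kill_fraction=0.5):
        """Pick blocks to retire, weighted by lack of fitness."""
        max_f = max(fitnesses)
        unfitness = [max_f - f + 1e-9 for f in fitnesses]   # worse blocks get more weight
        return random.choices(blocks, weights=unfitness,
                              k=int(len(blocks) * kill_fraction))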

The remaining blocks are carried to the next generation without modification while the rows from the terminated blocks are added to the entire generation using the same scheme as in generation zero (smallest block matching criteria; split when full).

The generation counter is then incremented and all new queries are applied to the new set of blocks. When all queries running against the old set are complete, the terminated blocks are deleted and the storage freed. Propagation and fitness testing of the next generation of blocks resumes immediately as a background task.

Criteria Management

More recent criteria are given preference for fitness testing, though older queries should not be disregarded altogether. Additionally, there may be specific query criteria that are cyclical in nature (for example, queries typically run at the end of a quarter) and are performance critical. Such queries can be added to a pinned list that keeps them active in fitness selection.

Sunday, January 29, 2017

SLOC Cold Nose

In 1997, I won an orienteering meet. It was just a little local thing, but it was an overall win. As I hadn't done much competition of any kind since retiring from cycling in 1993, it was the first win in several years. Since then, I've managed to win the overall at some race every year. Yesterday, the streak was pushed into its third decade with a win, fittingly enough, at a local orienteering meet. I posted the best time on the advanced course at the SLOC Cold Nose.

The course was the typical brutality one would expect at Babler State Park. It's a very steep park with lots of trails and thick woods. Therefore, you are generally left with either running the straight line and taking the obstacles head on or blasting full-speed on longer trail routes. After making mistakes on three of the first four legs, I switched to option 2 and just ran as hard as I could.

Definitely not my proudest navigation but a great workout. And, yes, despite my attempts to take these things less seriously, it's still very nice to bag a win here and there.


Saturday, January 28, 2017

Fundamental Theorem of Calculus

You'd think math guys would have a better grip on the whole singular/plural thing. The Fundamental "Theorem" of Calculus is really two theorems that are typically stated together because one isn't much use without the other. Anyway, it's the type of thing that might come up on the Q. Since I'm not even very good at remembering other people's results, much less proofs of such results, I decided the best thing to do would be to write my own proof. It's a little hand-wavy, in that I tend to just claim that things converge rather than busting out the obligatory epsilons and deltas, but it illustrates the point and I'm pretty sure such rigor won't be required on a timed exam.

I also needed to minimize the reliance on other results, so I got it down to where I don't use much other than the completeness of the reals (to get convergence) and the mean value theorem (to get equality on the derivatives). FWIW, here it is. There are much better proofs out there, but, because I derived this one, I'll be able to remember it.

Part i) If f is continuous and differentiable on the interval [a,b] and f' is Riemann integrable on [a,b], then the integral of f' from a to b is f(b) - f(a).

Proof: Let P = {a = x_0 < x_1 < ... < x_n = b} be a partition of [a,b] and T = {t_i} be a set of points such that t_i sits in the i-th subinterval [x_{i-1}, x_i] of P. The Riemann-Stieltjes sum

S(P,T) = \sum_{i=1}^{n} f'(t_i)(x_i - x_{i-1})

is bounded by the upper and lower Riemann sums of f' on P. Since we get to choose the points in T, we can use the mean value theorem to select each t_i such that

f'(t_i)(x_i - x_{i-1}) = f(x_i) - f(x_{i-1}).

Thus,

S(P,T) = \sum_{i=1}^{n} \left[ f(x_i) - f(x_{i-1}) \right] = f(b) - f(a)

since all the middle points cancel out.

f' integrable implies that S converges to the integral as the norm of P goes to zero. Thus:

\int_a^b f'(x)\,dx = \lim_{\|P\| \to 0} S(P,T) = f(b) - f(a).

Part ii) If f is Riemann integrable on [a,b], F(x) is the integral of f from a to x, and F(a) = 0, then F is continuous on [a,b] and F' = f wherever f is continuous.

f integrable on a closed interval implies f is bounded on that interval. Let M ≥ |f(x)| on [a,b]. Then, for δ > 0,

|F(x+\delta) - F(x)| = \left| \int_x^{x+\delta} f(t)\,dt \right| \le M\delta \to 0 \text{ as } \delta \to 0^+.

Thus, F is continuous at x from the right. A symmetrical argument shows it is continuous from the left.

If f is continuous at x, then by the mean value theorem for integrals, for each δ > 0 there exists x_0 in (x, x+δ) such that

F(x+\delta) - F(x) = \int_x^{x+\delta} f(t)\,dt = f(x_0)\,\delta.

Dividing by δ and taking the limit as δ goes to zero (so that x_0 → x and, by continuity, f(x_0) → f(x)) gives

\lim_{\delta \to 0^+} \frac{F(x+\delta) - F(x)}{\delta} = f(x).

Again, a symmetrical argument shows the same result from the left. Therefore, F'(x) = f(x).

Q.E.D.

That's Latin for done, y'all.

Thursday, January 26, 2017

NHST

Null Hypothesis Significance Testing. It's the backbone of statistical work as generally practiced. While the philosophical underpinnings are a topic upon which reasonable people may disagree, there's no denying that an awful lot of practical science has benefited from it. So, even though I think the technique is somewhat bogus, I've learned it and learned it well.

And, that's a good thing. Because it's pretty much the entirety of the last Stats chapter which I really don't have time to read before the Q.

Monday, January 23, 2017

Set Theory

While my real aptitude is in applications, I've always enjoyed theory. I'm enjoying my set theory class enough that I took a day off from Q studying to finish my homework. I figure it's close enough to Analysis that it has to have some carryover to Q prep.

In particular, what I like about "higher" math is that it forces you to go to places that you can't possibly understand. And yet, we're able to learn things about these places and even predict how things work by relinquishing our senses and building structures that are completely abstract. It's a lot like religion, really.

Consider the complete lattice. It's a set where any subset has a least upper and greatest lower bound, but the ordering in between is murky at best. Some complete lattices, like the integers {1, 2, 3}, are easy to understand. Others, like the power set of the real numbers on the interval [0,1] where the partial order is inclusion, are a bit more difficult to wrap your head around. But, you can just not worry about that and prove stuff anyway.

Sunday, January 22, 2017

The right way to handle a loss

No, not me losing. I don't handle that particularly well and doubt I ever will.

The Packers lost today. I couldn't care less. Yaya does, though. I'm not sure what got her turned onto rooting for the Pack, but she certainly does. Anyway, she was disappointed that they won't go to the Super Bowl. She got bummed enough about it that she turned off the TV well before the game was over (it didn't seem likely that they would come back from 31-0).

Then, she went downstairs and played her trumpet for a couple hours. She came back up in pretty good spirits.

She had already practiced before the game, so this was just her way of dealing with it. I'm no psychology expert, but I'm pretty sure that dealing with disappointment by turning your attention to something you're good at is a pretty sound coping strategy.

Saturday, January 21, 2017

Haven't given up quite yet.

Seven hours studying in today. Fairly productive, too. Getting through everything remaining in three weeks still seems an extreme longshot, but there's certainly no reason to think I won't get a lot closer.

Friday, January 20, 2017

Might have to just accept what's done

I record my study time for the same reason I record my training time: it's useful to know how much I can do. I think I'm maxed out. 30 hours last week plus 40 at work. I don't feel overwhelmed, but I certainly don't feel like I could do much more. I may just have to accept that what I know now is what I'm going to bring to the Q. I honestly don't know if that's enough.

Thursday, January 19, 2017

Evaluating estimators

I've been preaching about how you can't just trust the results; you have to check to make sure they make sense. How does one do that?

There are two questions in play here. The first is: is the estimator generally good for this problem? That often turns out to be the easier thing to evaluate. The second is: is the estimate generated by this data set any good? This one is often a lot harder, especially if all you have to go on is your one data set.

The first is typically based on minimizing the expected value of some loss function. That is, on average, how far off will this estimator be times how much do I care? The squared error function is the most common: the cost is proportional to the square of the difference between the estimate and the "real" value (we'll go with the idea that a real value does exist for the moment). That cost function tends to favor estimators with low variance, since squared distance is what variance measures. Other cost functions (like absolute distance or log distance) yield different favorites. At any rate, comparing the general performance of two estimators is merely a matter of selecting your cost function and computing the expectation. Even if the expectation is not closed-form, you should be able to approximate it with numerical methods.
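As a throwaway illustration of "pick a loss, compute the expectation numerically", here's a Monte Carlo comparison of the sample mean and sample median as estimators of the center of a normal distribution under squared error loss. The setup is mine, chosen only to show the mechanics.

    import random, statistics

    def risk(estimator, n=25, trials=20000, true_mean=0.0):
        """Monte Carlo approximation of E[(estimator(sample) - true_mean)^2]
        for samples of size n from a standard normal."""
        total = 0.0
        for _ in range(trials):
            sample = [random.gauss(true_mean, 1.0) for _ in range(n)]
            total += (estimator(sample) - true_mean) ** 2
        return total / trials

    # Under squared error loss the mean wins for normal data: its risk is
    # about 1/n, while the median's is roughly pi/(2n).
    print(risk(statistics.mean), risk(statistics.median))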

Knowing whether your particular estimate is any good is a whole 'nuther thing. There are all sorts of weird data conditions that can skew an estimate. Good experiment design foresees many of these and defends against them but, at the end of the day, we are talking about random variables. Sometimes they just plain come out wacky.

Here's where the frequentist methods have some problems. Because the methods are based solely on the data set, it's very hard to deduce that the problem is the data set, itself. There are all sorts of model tests that can (and should) be performed to see if the data matches the model. But, there is no good way of telling if the data matches the "reality" that it was drawn from.

Bayesians aren't really on any firmer footing, here. If our beliefs are wrong and a nutty data set confirms those beliefs, we may well be even further from the truth than our frequentist brethren. About the best that can be said is that at least those beliefs are stated up front in the prior. This does make it a little easier for an independent observer to challenge the assertions.

Tuesday, January 17, 2017

Maximum Likelihood Estimators

There's probably no more tortured piece of frequentist machinery than Maximum Likelihood Estimators (MLE's). Note that I didn't say they are wrong; they are widely used because they are generally effective. The problem is the philosophical hoops you have to jump through to accept them.

Of course, most practitioners don't concern themselves with that; they just crank their answers. However, those advancing the field really should at least consider that the axioms upon which these advances are based make some sense.

As with the Method of Moments, the idea is simple enough. Data is more likely to be observed in some configurations than others. Given a set of data, it makes sense to look at what configuration is most likely to have produced such data. So, write your likelihood as a function of the parameter you care about and find the value of the parameter that maximizes that function.

The rub is that unlikely things do happen and just because something isn't the configuration that makes the data the "most likely" doesn't mean it should be dismissed. Furthermore, as with the Method of Moments, the MLE might be a value that makes no sense.

For example, suppose we want to know how likely a certain event is to occur. We observe a dozen trials and it doesn't happen. The MLE for the probability is zero. As in, the event can't possibly happen. Ever. That may be the Maximum Likelihood value but, assuming this isn't some fantasy event like unicorn sightings, it's clearly wrong. You can argue that the experiment was bad because the sample was too small, but that evades the question. The procedure, properly applied, yielded nonsense.
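You can watch the pathology happen with a quick grid search over the binomial likelihood for 0 successes in 12 trials (toy code, nothing more):

    def likelihood(p, n=12, k=0):
        """Binomial likelihood of k successes in n trials at success probability p
        (the binomial coefficient is constant in p, so it's omitted)."""
        return (p ** k) * ((1 - p) ** (n - k))

    grid = [i / 1000 for i in range(1001)]
    print(max(grid, key=likelihood))  # 0.0 -- the "most likely" world is one
                                      # where the event never happens at all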

OK, that can happen with any procedure, but what's important here is the reason the procedure fails. It fails because there's no place to interject common sense. The frequentist world view is that there is underlying truth out there and our experiment is trying to uncover that. That's a perfectly fine world view, but it leaves you completely at the mercy of your data. Some think that's a good thing, I do not.

All that said, MLE's are relatively easy to construct and verify. I've used them myself in past and may have cause to use them again. I just don't put much faith in them. Then again, I don't put much faith in results from any single data set. Unconfirmed research isn't much better than a guess in my mind.


Monday, January 16, 2017

Method of Moments and Satterthwaite's Estimator

This result should be near and dear to my heart since it's basically what saved my bacon on the CISS algorithm. Of course, whether that algorithm ever sees the light of publication is still an open question, but I digress.

The concept is very simple (which is why it's one of the oldest estimation techniques). You consider the moments of the random variable you're sampling, expressing them in terms of the parameter you care about. Then, you compute the actual moments of your sample distribution. Line them up and you have a system of simultaneous equations that you can use to solve for your parameters.

There are some downsides. The biggest is that such estimates may well be biased. For example, the method of moments applied to a Normal sample suggests the variance be estimated using the formula for population variance rather than sample variance. Not a huge deal if the sample size is reasonable, but a biased estimate nonetheless. Also, you may get a result that simply makes no sense.
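For the record, here's that normal example worked in code: matching the first two moments hands you the sample mean and the divide-by-n (population) variance.

    def mom_normal(xs):
        """Method of moments for a normal sample: match E[X] = mu and
        E[X^2] = mu^2 + sigma^2 to the first two sample moments."""
        n = len(xs)
        m1 = sum(xs) / n                    # first sample moment
        m2 = sum(x * x for x in xs) / n     # second sample moment
        mu_hat = m1
        var_hat = m2 - m1 ** 2              # divides by n, hence the bias
        return mu_hat, var_hat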

An example of this (and how to deal with it) is Satterthwaite's Estimator. Rather than going deep into details that you can easily look up if you care, I'll focus on the flexibility of the method (and the corresponding responsibility on the practitioner).

Satterthwaite was trying to get the denominator of a t statistic. Basically, he wanted a linear combination of his random variables to be modeled as Chi-squared, but the degrees of freedom were unknown. Applying the method directly gives an estimator that works, but might go negative. Since there's no such thing as negative degrees of freedom, that didn't sit well.

Satterthwaite obviously wasn't the sort of guy who just shrugs it off and hopes it will work out. By working additional known constraints into the equations, he came up with another estimator that can't go negative. In fact, it's still pretty much the best one known and is still used today (that is, when it makes sense to solve for a tractable estimator and not just ram the data through a MCMC simulator).

The point is that, since most distributions have lots of moments and there are also other external constraints that can be considered, the Method of Moments is less of a method than a general framework for cooking up an estimator. There are lots of possible "right" answers. When you have very little information about the sampled distribution (as in CISS), it's a really powerful technique. But, it needs to be used carefully because the results might be just a bit nutty.

Sunday, January 15, 2017

Little Woods Ultra (Last Man Standing)

10. Well, that's not so bad. I put the car in reverse and back out of the driveway. As I shift to 1st, I glance at the dash again. 6. Hmmm. By the time I get to the highway, my car is reporting 4. Apparently, the engineers at Hyundai enjoy throwing out some false hope. Well, it will probably warm up when the sun rises. It does, but as I cross the river to Edwardsville which, apparently, benefits less from St. Louis' prodigious carbon emissions, it begins to fall again. As I park, the reading is 2. That's 2F, as in Fahrenheit. So, assuming it rises to 3 in the 20 minutes between now and the start, this will tie for the third coldest start of my career. And, in case you're a new reader of these reports, that's 3rd out of a field of several thousand.

"Are you in it to win it?" asks Cheri Becker. As defending women's champ, she deserves an honest answer. Am I? I keep telling people that I not running these things to win, that I just enjoy being out there. For the most part, that's true, but this one is different. You simply could not come up with a format better suited to my strengths. Start a 4.1-mile trail lap at the top of each hour. If you present yourself at the next start, you go again. If not, you're out. When nobody else presents, you are the Last Standing and you've won. You don't have to run fast. You just have to refuse to quit. That's me in a nutshell. Still, I go with the truth: "That depends on how cold it gets tonight; I'm not looking to make this a tough guy competition." She accepts that. After all, this is supposed to be a "fun run" even though it will be longer than a marathon for most participants.

Last Man Pinning His Number On
Given the temperature and the fact that nobody has invested an entry fee in the event, it's a pretty good crew that shows up for the first start at 8AM (about 70 runners). Despite getting there in plenty of time, I've managed to not be ready for the gun. As I fumble with getting my number attached, the rest of the field heads out onto the trail. It occurs to me that I am Last and I'm Standing: I must be the winner!

Well, no. I soldier on. While the event is officially located at the cross country course at Southern Illinois University, Edwardsville, the actual course does not use the open grass trails of the cross country course. Instead, we are running on the single-track mountain bike trails that wind through the "Little Woods", the smaller of two wooded parcels on the SIUE campus. I'm fine with that, but it does mean that moving up from my dead last position will not be easy. Fortunately, the generous 1-hour limit means I don't really have to. Rather than expend energy passing, I simply run until I hit the back of the group ahead and then walk a bit until they get ahead and then run some more. I get back to the finish with five minutes to spare.

That's enough time to check out the pot-luck aid station. The closest thing this race has to an entry fee is a request that you bring something for the aid station table. As I'm pretty busy with studies, I took the easy route and just bought some Little Debbie Oatmeal Cream Pies at the store. Others have invested more and brought home made cookies, muffins, and even some pancakes being kept warm in a crock pot.

Walking in with Bill & Carrie Sona on lap 2
While the first lap worked, I'd rather run for longer stretches and take walk breaks where it feels right rather than whenever I happen to encounter traffic. On lap 2, I run all of the first mile then walk the short climb starting mile two. I then run again until the next short climb, which comes a mile later. Running mile 3 puts me at the start of 4 with well over 20 minutes to go. Fortunately, Bill Langton is with me so I have company as I walk all the way back in. We make a point of noting our time; we can walk the whole mile back as long as we have at least 16 minutes left on the lap. That information may come in handy later.

That pattern repeats for the next few laps. The temperature is still pretty frosty, but it's comfortable running, thanks to a clear sky and very little wind. The volunteers at the start/finish have a decent shelter and a large fire, so everybody is in good spirits. Bill and I stay together most of the time. The company is nice to have since the pace is not exacting a toll. At least, not yet.

We're finishing laps in around 52-53 minutes, even with walking the whole last mile. On lap 6, the field has thinned enough that I try walking some on the first mile to put a few more minutes into the trail and spend less time waiting for the start of the next lap. Unfortunately, that 55-57 minute range seems to be everybody else's target as well. Bill and I end up doing the yo-yo thing stuck between two groups for all of miles 2 and 3. We return to our former strategy on lap 7.

We did do some running
By the end of lap 8 (just over 50K, an official ultra now), it's obvious that this isn't going to be easy. First, the trail is slick enough that I'm putting more effort into staying upright than I'd like. It's not that the footing is bad, it's just not good. Also, the mandatory stops between laps are not the respite one might expect. It would be far easier to just keep going than to stand around for a few minutes tightening up waiting for the next start.

Mitigating that is the arrival of some pizza at the aid station. I have no idea how they kept it warm.

At the end of lap 9, Bill decides he's had enough. There are still five runners willing to take the start and he doesn't think it's worth hanging in just to get another place or two.

Hot potatoes are now available. It's like they're trying to bribe us to quit and eat the food instead.

Cheri Becker is the only woman to take the start, so finishing the lap will seal a repeat win for her. The other four are Travis Redden (who created this race five years ago, but has since passed on direction to Metro Tri Club), Steve Johnson, James Baca, and myself.

Cheri and James call it at the end of 10 (41 miles), while the final three dig our headlamps out of our gear bins. By the end of mile one, I've got mine turned on. Travis and Steve firm it up on the rest of the loop and finish a few minutes ahead of me.

Another pizza has arrived and they're passing out beers as well. Steve decides that sounds like more fun than another lap on a cold, dark trail.

The fact that it's come down to Travis and I generates some real excitement among the few folks who have stayed into the night. Unless there's a rules change (at no-entry-fee events, the Race Director has a pretty free hand) that forces an early completion, it's obvious that this could go on for quite some time.

On lap 12, Travis wastes no time in opening up a gap on me. There's no competitive reason to do this, so I'm left with four possibilities: 1) he's feeling really good, 2) he's counting on a rules change where fastest person in will get the win, 3) he's being an idiot, 4) he's trying to psych me out. We've run enough ultras together that I'm pretty sure he knows that (4) won't work. I figure that leaves me with a 1 in 3 shot (probably less, since Travis' recent successes would indicate that he's learned to keep his pace in check).

Eventual Winners Cheri and Travis
Unfortunately, I am running into real problems of my own. My back is starting to send the warning signs that all the little adjustments to slips on the trail have taken their toll. Before throwing in the towel, I want to make absolutely sure that Travis wasn't just running the last lap fast because he planned on bailing. I stroll in acting like I've still got all night to go. Travis doesn't bite; he's clearly going back out for at least one more. That means I need to do at least two more to win. That's possible, but it doesn't sound like much fun. I come clean and admit that I'm done.

If it was anyone else, I might have hung in there. But, I know how much Travis likes to beat me and I'm pretty sure he would have run through to morning to do it. That sort of epic battle of wills had real appeal a few years ago, but it's not what I came here for today. 50 miles at a super-easy pace is a perfectly fine long run for January; it's time to pack it in.

And, it really was a great day (and part of a night) on the trail with a large portion of the trail running community with which I have become so close. The conditions were quite fantastic as long as you didn't stand around for too long. I'm happy for Travis and I don't want to insinuate for a moment that I let him win. He came ready to play and deserved the win. For me, it was yet another step along the path to treating these things as events rather than races. A small step, but in the right direction and on the right trail.

Saturday, January 14, 2017

Finding limits

Well, that's what you do in Analysis, you find limits of sequences, like the limit as n goes to infinity of 1/n is zero.

I'm also finding my own limits. Seems that I can practice pretty much all day long, but if I'm actually learning something (or refreshing from 30 years ago), about 6 hours a day is as much as my brain will take. I seem to recall that was about what it was in my 20's, too. I could code or do homework for hours on end if needed, but intake of new information, whether reading or in lecture was capped at around 6 hours.

It's interesting (to me, anyway) that that's also the physical limit for most elite athletes (runners are a significant exception as the weight-bearing nature of the sport knocks them down to 2-3 hours per day). That could be coincidence, but I wonder if it's actually a genetic adaptation of some sort. Like, we're just meant to truly expand ourselves no more than six hours a day because that's enough. After that, you're better off honing the skills you already have.

Friday, January 13, 2017

Big calc day

I spent pretty much the whole day working Calculus problems today. As my old calc text is the typical undergrad tome, I didn't run out of exercises to do. I think it's probably enough. By the end of the day, I felt like I was spotting the patterns quickly and solving them pretty consistently (I still drop signs and invert values all the time, but I did that 30 years ago, too, and it seems to be the type of mistake math professors forgive).

One page from the book did provide some comic relief. It's a diagram demonstrating the inversion of functions:

[Textbook diagram of a function and its inverse, drawn as a computer with a tape drive.]

Wow! That must be a pretty complicated function that you need to access a tape drive to solve it! I'm trying without success to come up with any tractable function that couldn't be computed using the chip out of the cheapest wristwatch sold in the last five years.

Thursday, January 12, 2017

Cramming

Cramming is usually used to connote intensive studying at the last minute. However, with classes starting Tuesday, this extended weekend (I'm taking tomorrow off, so I have a 4-day weekend) represents my last chance to really push on the Q prep. I plan to use it as such. Let the cramming begin!

Wednesday, January 11, 2017

Equivariance

The text I'm prepping from states the Equivariance Principle as a data reduction technique. While formally true, I'm not sure I'm buying that characterization of it. Not because I think the math is bogus (I don't), but simply because it does not fit my subjective view of data reduction. Equivariance essentially reduces the number of conclusions, not the actual data. Splitting hairs? Perhaps.

Anyway, it's actually two principles. One is measurement equivariance. This means that if you measure in inches and then form your estimator in inches, you should come up with the same answer as if you measured in meters, formed your estimator in meters, and then converted that to inches. Just about everybody agrees on that part, though it's never absolutely true in real life. The measurement system will impact both the precision and accuracy of the measurements. However, if we accept that we are getting the same answers within the limits of our precision, that's good enough.

The second one is a little funky, but it's where the reduction comes from. This is the principle of Formal Invariance. It basically says that if the models for your sample space and distribution are the same, the means of reaching a conclusion should be the same, regardless of what the model is being applied to. So, for example, if you choose to estimate the probability of a coin coming up heads by applying some function to a series of coin tosses, you should be able to apply that same function to the inverse (where "success" is now getting tails) and wind up with the inverse of your estimator (that is, 1 - your estimator for heads).
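A trivial sanity check of that coin example (purely illustrative; the plain frequency estimator is just the obvious choice here):

    def estimate_success(tosses, success):
        """Frequency estimator for P(success) from a list of 'H'/'T' tosses."""
        return tosses.count(success) / len(tosses)

    tosses = list("HHTHTTTHHT")
    # Estimating heads directly, or estimating tails and inverting, must agree.
    assert abs(estimate_success(tosses, 'H') - (1 - estimate_success(tosses, 'T'))) < 1e-12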

Mathematically, a set of transformations from a sample space onto itself that obeys these two properties must form a group. That is, they are closed under inverse and composition. That's generally a pretty easy thing to prove.

OK, so where's the rub? Well, just because the math works out doesn't make it so. Models are just that, models. They aren't the real thing. Conclusions that work for some parameters make no sense for others, even when the model used is identical. You could model the height of five-year-olds as a normal random variable. It's not a perfect fit, but it's not terrible either. But, just because the left side of the tail is chopped (it can't possibly go below 10 inches, even though the normal distribution would assign some very small, but positive, probability to that), doesn't mean that some other variable modeled as a normal random variable also has a chopped left tail. The limiting of the set of conclusions based on other conclusions from similar models only works if the model exactly describes the reality. And, that's never true.

So, I'm calling shenanigans on this one.

Tuesday, January 10, 2017

Drill

I posted a few days ago that what I really need is more drill. The traditional way to do that is flash cards. I made some flash cards and I've been using them. However, it occurred to me that I spend a lot of time looking at the ones I know already. What would be better would be a way to track which ones I'm doing best on and not look at them so much.

So, I wrote a little program to do that. It came together quicker than I expected. I pretty much had it done over lunch. It simply has a list of expressions with a graph indicating which are equal to which. It picks one at random (weighted by how well I've done), then picks possible wrong answers weighted by how often I've guessed wrong and inserts at least one right answer. If you click on an answer that matches, yay, it turns green. Otherwise, it turns red. You can keep clicking until you find the right one.
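Under the hood it's just weighted random sampling; a stripped-down sketch of the selection logic (my variable names here, not the actual program's):

    import random

    def pick_card(cards, stats):
        """Pick the next expression to drill, favoring the ones I keep missing.
        stats maps card -> (times_right, times_wrong)."""
        weights = [1 + stats[c][1] / (1 + stats[c][0]) for c in cards]
        return random.choices(cards, weights=weights, k=1)[0]

    def pick_distractors(card, equal_to, cards, wrong_guesses, k=3):
        """Pick wrong answers, weighted by how often each has fooled me before;
        the caller then mixes in at least one expression that really is equal."""
        pool = [c for c in cards if c != card and c not in equal_to[card]]
        weights = [1 + wrong_guesses[c] for c in pool]
        return random.choices(pool, weights=weights, k=k)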

The "Ask" button clears the screen and picks a new set.

The rendering is done by sending the LaTeX code for the expressions out to codecogs to be turned into a gif image. It's good enough now for what I need. With a couple days' polishing, it could be useful to a more general audience. I might do that, but certainly not until after the Q.




Monday, January 9, 2017

Do what you love.

We've all heard that. And, it's pretty easy advice to take when what you love is math. I mean, why wouldn't you do math if that's what you love? It's not like it's hard to find a job if you're good at it. Well, maybe it is if your affections are limited to Abstract Geometry or Group Theory. But, even then, you should be able to find a faculty position at some mid-level school where, in exchange for teaching Calculus to a bunch of Freshmen who don't particularly want to take it, you can indulge your curiosity. However, if all you want to do is sink your teeth into deeply analytic problems, you've got options everywhere. Most of them pay pretty well, too.

But, what if what you love isn't one of those things our society chooses to reward? What if your passion is teaching Elementary School or writing poetry or painting (pictures or houses; neither is a particularly promising career)? Or, what if what you love is playing the Trumpet? Which is the case with my daughter.

Now, she's only 13. Lots of 13-year-olds enjoy music and then go on to do something else (quite often, math). But there's a difference between enjoying it and loving it. I know because my private teacher in High School spent his days teaching at Juilliard. As such, I played in ensembles with some of the very best music students in the world. Two things became clear to me.

  1. If I practiced enough, I would get as good as them.
  2. There was no way I was going to practice that much.

I simply didn't love it like they did. Yaya loves it.

Sometimes she gets distracted and forgets to practice, but I never have to nag her. I simply ask if she's practiced today and, if the answer is no, she hops up, grabs her horn, and starts to play. Usually the answer is "yes". Or, more commonly, "yes, I played a bit before school" (when it's warmer, she plays at the roadside waiting for the school bus to show up; one of her friends holds her music), "then I had band during school and jazz band after school, then I did some scales when I got home and I'm going to do some improv tonight."

As a result, she was the only 7th-grade trumpet in All Suburban Jazz this year. They had their concert yesterday and they were a pretty tight act; crazy good by the standards of Middle School. It could be that the next Dizzy Gillespie is taking trumpet in some 6th grade class in suburban St. Louis, but it's a better bet that Yaya will get the lead spot next year. Especially if she keeps working at it the way she is.

Now, just as my parents were more than happy to buy nice instruments, pay for private lessons, and send me off to music camps in the summer, I am unconditionally supportive of this as an extra-curricular activity. Well, no, that's not really true. We have had to insist that she not blow off all her other schoolwork to practice more. But, that rather obvious constraint aside, we're totally behind this.

But, what if it's more than that? Will I really be able to honestly tell her to do what she loves if what she loves is a career with such limited chances for success? I think so. I hope so. Because, if I can't, then it's just a stupid catchphrase and not real advice at all.

I do know what it means to fail in a high-risk career. I never made more than $10,000 in a year as a cyclist. I got fired from my team every single year and had to find a new one. I lived out of my car. I got divorced. I hit my 30th birthday with less than a thousand dollars to my name. I would not trade it for the world. I have lived the ensuing years with no regrets about what might have been. I want the same for her.

Thursday, January 5, 2017

Birnbaum's Theorem

This one relies on some tediously technical definitions, so I'll state it informally.

Sufficiency Principle: If S(X) is a sufficient statistic for X and you have an experiment that produces S(X), then the evidence from the experiment is the same whether you use X or S(X). This is a slight extension of the idea of a sufficient statistic, because it now brings in the notion of how the data was collected.

Conditionality Principle: If you have multiple ways of testing something and you pick one randomly, then the evidence from that test is the same as if you had intentionally run just that test. That is, the fact that some other test might have produced the same result in a different way is irrelevant.

Likelihood Principle: The evidential significance of an experiment is determined by the likelihood function. That is, any conclusions from the experiment depend only on how likely one is to observe such data given that the conclusion is true.

Note that all three of these are axioms. You can accept them or not. Birnbaum's Theorem shows that the first two imply the third and the third implies the first two. So, if you accept the Likelihood Principle, you have to buy into Sufficiency and Conditionality.

Sufficiency is pretty well accepted. The other two are subjects of considerable debate.

To see why, consider this fairly simple experiment: I want to know how often a baseball player gets on base for every time they come to bat. I can watch 20 at bats and count the times they reach base. Or, I can see how many tries it takes to get on base seven times. Suppose I choose the latter and it takes them 20 tries. That's a plausible result from the first experiment as well. Either way, I get a maximum likelihood estimate of 0.35, which is a pretty good on base percentage.

But, I'd like to know how certain of that I should be. Using standard Null Hypothesis Significance Testing (NHST) techniques, I would say that the 90% confidence interval was 0.15 to 0.60. That's because I was waiting for the seventh success, so I use a Negative Binomial distribution. But, if I was running the first experiment, my data would be distributed by a Binomial distribution. The exact same 7 out of 20 gives a confidence interval of .19 to .56.
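The kicker is that the two likelihoods differ only by a constant factor, so as functions of p they carry exactly the same information; it's the sampling distributions (and hence the NHST machinery) that differ. A quick check (math.comb is standard library):

    from math import comb

    def binomial_lik(p, n=20, k=7):
        """Likelihood of 7 successes in a fixed 20 at bats."""
        return comb(n, k) * p**k * (1 - p)**(n - k)

    def neg_binomial_lik(p, r=7, n=20):
        """Likelihood that the 7th success arrives on the 20th at bat."""
        return comb(n - 1, r - 1) * p**r * (1 - p)**(n - r)

    # The ratio is constant in p, so both experiments give the same MLE (0.35)
    # and, per the Likelihood Principle, the same evidence.
    for p in (0.2, 0.35, 0.5, 0.7):
        print(p, binomial_lik(p) / neg_binomial_lik(p))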

Why should I get two different confidence intervals for the exact same data? Why does it matter what my intentions were? I'm trying to measure the batter, not me. What's worse, suppose it wasn't even my intention. Suppose I just flipped a coin to decide which experiment to run. My confidence interval for a player's ability is now affected by the result of an independent coin toss!

For this reason, most Bayesians reject the Conditionality Principle in favor of using prior beliefs. Of course, that still means my result is going to depend on the beliefs that I brought to the experiment, not just the data, but at least I've quantified them up front.

Quantum physicists had a really hard time with this idea a hundred years ago. It just seems plain wrong that an outcome is different simply because we chose to look at it differently. At least they were honest enough to freak out about it. Sadly, the preponderance of statistical research is done in complete ignorance of this exceedingly fundamental fact.

Wednesday, January 4, 2017

Caveat Emptor

Here's a tip for any budding team leads out there: don't believe vendors. If it doesn't work now, assume it never will. I'm not going to go into specifics as publicly flaming a party rarely improves the relationship. Let's just say that if a certain piece of vendor-supplied software actually did what the vendor claimed then the project that I've been leading for the last four months would be the most successful project of my fairly long and fairly successful career.

As said software does not live up to the vendor claims, I'm going to have to do some apologizing and lay out a whole bunch of contingency plans for getting this mess into production. Kinda sucks.

The part that cheeses me off the most is that I did see it coming. We did a proof of concept project last year and I had some real misgivings about the product. The shortcomings were real, but I let myself be convinced that either the vendor would fix it or we could work around it. They didn't, we still can, but it's going to be awfully hard to do that and get into production by March.

So, the most likely path is that we swap out that piece of vaporware for something that actually works and go to prod in April.

It's not that big of a deal. The project is still a firm success. At tomorrow's demo we will extract, load, and transform a 20-million row batch in about six minutes. On the current production system, that takes 45. And, we haven't even tuned it yet. When we really turn it loose on big loads (200-million plus) that take full advantage of the horizontal scaling, 50-100 times faster is a reasonable expectation. It just sucks that the presentation layer is so flakey. If that was tight, this would have been a huge win. Now it's just another step towards improving things.

Tuesday, January 3, 2017

Cranking it

I've been doing a lot of exercises in prep for the Q over the last couple weeks. I'm doing just fine with proofs, but not so great with computation. I've never been very good at computation (that's what computers are for!). However, from the review materials the department has provided, it looks like I can expect a few questions that will require grinding out some straightforward, but intricate, derivations. So, I need to be able to crank through this stuff without doing stupid things like dropping a minus sign or inverting a constant.

I think the best approach to this is to go back to the "brain intervals" that served me well a year ago in my Algorithms class. The Q is 5 weeks away. Five weeks of interval training can make a big difference in a running race. Hopefully, the brain adapts as quickly as the body.

Sunday, January 1, 2017

2017 Goals

Those familiar with the Myers-Briggs Type Indicator will not be surprised to hear that I'm an INTJ. The "J" means that I do best when I've got a goal in mind. Good goals are measurable, attainable, and hard. So, with those criteria in mind, here are my goals for 2017.

School:
  1. Pass the qualifying exam.
  2. Pass the admission to candidacy.
  3. Get at least one paper accepted for publication.
  4. Maintain 4.0 GPA.
Work:
  1. Fully convert all our analytics to a horizontally scalable platform.
The rest of life:
  1. Stay married (don't laugh; divorce and grad school are highly correlated).
  2. Stay debt free (outside of our mortgage).
  3. Support Yaya's music (OK, that's a bad goal because it's not measurable, but I still need to keep it on the radar).
  4. The BIG buckle from Leadville (under 25 hours).