Thursday, December 28, 2017

Markov chains

I'm back. Didn't have much chance to publish over break, but I did get a fair bit of work done. One lead I'm looking into is the suggestion from my adviser that we consider the query identity function to be a Markov Chain. Markov Chains are sequences of variables that reflect a "state". How you get to the state doesn't matter. Your next state depends entirely on what state you are in. If you want to model correlation that way, you build your history into the state.

So, for example, if you had a sequence where the variables are correlated 3 deep (that is, the next value will be correlated with the previous 3), then your states would actually be 3-tuples of the last three observations. Those alone determine the distribution of the next value. Since the identity function can only take on zero or one (a row is either in the query or it ain't), the state can just be the sum (or, possibly a time-weighted average) of the previous n observations where n is your correlation horizon.

I generated some quick chains using correlation horizons of 0, 1, 2, and 10 observations. I used the same pseudo-random number stream for each so that they were easier to compare. The only difference is the transition matrix.
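For the record, here's a minimal R sketch of how a chain like the horizon-1 example below could be generated. The transition probabilities (0.2 and 0.3) are illustrative choices that give a stationary hit probability of 0.4; they're not the exact matrix I used.

# Horizon-1 (two-state) chain with stationary hit probability 0.4,
# since p01 / (p01 + p10) = 0.4.
set.seed(1)
n    <- 100
p01  <- 0.2                    # P(next = 1 | current = 0)
p10  <- 0.3                    # P(next = 0 | current = 1)
x    <- integer(n)
x[1] <- rbinom(1, 1, 0.4)
for (i in 2:n) {
  p    <- if (x[i-1] == 1) 1 - p10 else p01
  x[i] <- rbinom(1, 1, p)
}
paste(x, collapse = "")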

 0:0100001001011101101010000000001100001101001001101010010001010101010011000110010001011111010110100011
 1:1110000000001111100011000000000110000111000001101110000000010101111000000110000000011111111110100011
 2:0100000000001111101010000000000100000101000001111110000000010101010000000111010000011111111110100011
2a:0100000000000101111010000000000100000101000001111111110000000101010000000111010000000001010110100011
10:0100001000000000000000000000000100000100000001101010010001010101010011000110010000011111010110100011

Each of these chains has an overall hit probability of 0.4. That is, 40% of the values are 1. More precisely, if you ran the sequence long enough, you would converge to that percentage.  However, the correlation is much different.

On the uncorrelated row, the value switches from zero to one quite frequently. When the correlation horizon goes to 1, it switches less frequently. At 2, even less so. 2a shows that even with the same horizon, you can greatly increase the correlation by changing the transition probabilities. The last line represents a fairly modest correlation with a horizon of 10. It switches more often than 2a, but also tends to drift further from the underlying probability for any given subsequence.

The point is, you can model pretty much any correlation behavior you want with this sort of thing.

Except, the behavior I'm really after.

The reason the variables appear to be temporally correlated is because batches of related data get loaded all at once. So, if one is running a query which will include a lot from one batch and not from another, that will look like temporal correlation except at the transitions between one batch and the next. At the actual batch boundaries, the correlation is zero.

What's really happening isn't correlation of the identity function as much as correlation of a confounding variable. Batch ID is generally not in the query. But the values of other attributes that are in the query are very dependent on the batch ID.

I think we can get away with modeling it this way for now. The fact that we're getting pretty decent results without accounting for correlation at all indicates that this isn't a huge problem from a practical standpoint.

That said, the ultimate goal is to lay this algorithm on top of my dynamic partitioning which will send attribute correlation through the roof. So, we are going to have to deal with it at some point.

Sunday, December 17, 2017

Do we really want to go there?

Anybody who's been with the blog from the get-go (a fairly small group based on my page stats) knows that I originally went from looking at individual entries to block sums because adjusting for the correlation was over-complicating things. Having developed a working algorithm using just block sums, I'm a little loath to revisit that decision. Still, my adviser has made the legitimate point that, if the block sums are not sufficient statistics, then we are giving away information.

We have argued that while the individual observations are not independent, the blocks are. That's not really true, but it's pretty close to true. The last records in one block are correlated with the first records in the next adjacent block. However, since we are sampling blocks randomly, that's really not a problem; there's no systematic correlation that will throw off our results. And, if the block sums are independent, then it's a no-brainer that they are sufficient statistics for the total sum.

A question that can't be answered from just the block sums, however, is how correlated the data is. This impacts our estimate of the variance of the total sum. Recall that we're assuming a mixture distribution on the sum where it is zero with some probability (1-λ) >= 0 and distributed according to an arbitrary non-constant distribution otherwise. If the correlation were zero, then λ would always be so close to one that we'd have no need for a mixture distribution. It's only if all the observations are highly correlated that there's much chance that all the rows in a block get excluded. As the variance is very dependent on the value of λ, it would be good to get this value right as soon as possible in the sampling.

I've been using a prior λ ~ Beta(h, m) where h (hits: the number of non-zero block sums) and m (misses: the number of zero block sums) start at 2 and then get incremented for every sampled hit and miss, respectively. This obviously converges to the right answer, but it takes a while when λ really is very near (or equal to) zero or one. This is why CISS actually generates better confidence intervals for modestly selective queries than for ones where nearly every block produces a zero sum (or nearly every one produces a non-zero sum).
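For what it's worth, the update is trivial to sketch in R. The sampled hit/miss outcomes below are made up; only the Beta(2, 2) starting point comes from the actual setup.

# Beta prior/posterior on lambda: start at Beta(2, 2), add one to h for each
# non-zero block sum sampled and one to m for each zero block sum.
h <- 2; m <- 2
block_nonzero <- c(1, 0, 0, 1, 0, 0, 0)   # illustrative sample outcomes
h <- h + sum(block_nonzero == 1)
m <- m + sum(block_nonzero == 0)
h / (h + m)                               # posterior mean of lambda
qbeta(c(0.025, 0.975), h, m)              # 95% credible interval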

One recent suggestion from my adviser is to look at the number of rows selected in each block and make some deductions from that. If all the blocks are returning about the same ratio, the correlation is probably pretty low. If it's all over the place, or if many blocks return zero rows, then correlation may be a big deal. In the latter case, you'd want to sample further to insure against missing a block that had a lot of rows to return.

The question is, do we need to answer this now? Frankly, I'd rather get the first result out there and cite it as an area for future study. I don't think it will fundamentally change the algorithm, but it could complicate the analysis and I've found that this is not an easy problem to explain to people even in its simplest form (it seems lots of people who use databases and analytic tools all the time haven't given much thought to how they actually work).


Saturday, December 16, 2017

Insufficient

One of the things I discussed with my adviser yesterday was whether we were dealing with sufficient statistics (better yet, minimal sufficient statistics). Because the individual rows are correlated, it is more convenient to work with block sums and use the (mostly true) assumption that block sums are independent.

If individual rows are distributed by pdf f(x) with mean μ and they are included in the query with probability λ, then we have two parameters to estimate (λ, μ). Knowing both of these gives us the distribution of the sum. The problem is, the block sum is not a sufficient statistic for these. If we get a block sum of 50 on a block of 100 rows, there's no way to know if that was (λ = 1, μ = 0.5) or (λ = 0.5, μ = 1). Examining the individual rows would give us more insight into that, so the blocksum is not sufficient.

Of course, the blocksum is sufficient for the value we really want, the product of λ and μ, which is the expected value of an individual row. That can then be multiplied by the total number of rows to estimate the population sum.

However, if we want a confidence interval on that sum, we also need to know the variance, and that very much depends on the individual parameters, λ and μ.
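Here's a quick illustration of that point, assuming (purely for the sketch) that the non-zero rows are exponential with mean μ. Two (λ, μ) pairs with the same product give the same expected block sum but very different block-sum variances:

# Same lambda * mu = 0.5 on blocks of 100 rows: means match, variances don't.
set.seed(1)
block_sum <- function(lambda, mu, rows = 100, reps = 20000) {
  replicate(reps, sum(rbinom(rows, 1, lambda) * rexp(rows, rate = 1/mu)))
}
a <- block_sum(lambda = 1.0, mu = 0.5)
b <- block_sum(lambda = 0.5, mu = 1.0)
c(mean(a), mean(b))   # both around 50
c(var(a), var(b))     # roughly 25 versus 75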

All hope is not lost. Again, we don't really care about λ and μ, we want the variance (which happens to be a function of those two). The variance is also a function of the vector of block sums. In the infinite population case, the block sums so far are a sufficient statistic for the variance of the mean. So, in the finite case, we should have a statistic which is asymptotically sufficient. I think if I can establish that, plus some reasonable comments about the rate of convergence, that should be good enough.

Friday, December 15, 2017

St. Vincent.

I had a good meeting with my adviser today, but I've got quite a bit of work to do before I'll have anything coherent to post from it. So, instead, I'm going to talk about the run I went on after our meeting.

There's a park south of campus called St. Vincent park. I don't know what this Vincent character did to get canonized, but I'll give him the benefit of the doubt and say it was probably genuine virtue rather than simply winding up on the wrong end of some Turk's scimitar. Since many of my runs from campus go through this park, I've started naming them with virtues. St. Vincent the Humble is the shortest; it's just a loop around the park. St. Vincent the Goode combines the park loop with most of the Wayne Goode trail that runs through campus. The longest loop is St. Vincent the Patient.

Today, I decided to run a new route. I went through the park and then continued south to Forest Park. I then took the metrolink back to campus (UMSL students get free metro passes). Running through Wellston, four High Schoolers did a comically bad job of impersonating dangerous thugs. Granted, this is a very rough neighborhood and race relations in North County are challenged, to say the least. But, this was also in broad daylight on one of the busiest streets and they were having a hard time not laughing as they ran towards me. They ran along with me for a bit and heckled me while I just chuckled. Seeing that their "attack" had failed to produce the desired response, they went back to whatever they were doing before (which appeared to be nothing).

I run through the hood all the time and little incidents like these don't faze me. Nobody is seriously going to mug a runner; we don't carry anything worth stealing. Still, I don't know that I'll run this particular route often enough to warrant naming it. However, I can't pass up such an obvious pun. From here on, this route will be St. Vincent the Chased.

Tuesday, December 12, 2017

Upset

I spent the night at work (yes, Dr. Wu, paid employment; I'll get back to this research thing real soon). Well, a good chunk of the night, anyway. I'm packing it in at 11PM, which is actually earlier than I had expected.

Since it was off hours, I didn't feel bad about having FiveThirtyEight's running commentary on the Alabama election going at the same time. The fact that Alabama just elected a Democrat to the United States Senate is more than a little mind bending. It was the perfect storm for the GOP. Just goes to show that heavy-tailed distributions do exist.

While I'm certainly no fan of Moore, I actually find this whole episode rather sad. Of course Moore should have lost. Even without the whole child molestation thing, he's shown a complete disregard for the limits of power throughout his career as a judge. If anybody should be aware of those limits, it would be a judge. But, this is really unfair to the people of Alabama.

Jones doesn't speak for them. He speaks for a small and vocal minority. He will vote opposite of what the vast majority of the state wants. Whether or not you agree with what Alabama wants, this is not how democracy is supposed to work. Then again, very little of this past year has had much to do with how democracy is supposed to work.

It has occurred to me that this whole data science doctoral thing could make me somewhat valuable to political campaigns. I don't know if I want to go that route or not. On the one hand, I really do believe in democracy and wouldn't mind helping a candidate with whom I agreed. On the other hand, the whole process is just so broken right now, I might get super depressed.

I suppose I should get back to my research before stressing about it.

Sunday, December 10, 2017

Delegation

Another week of very little progress on the paper as we get year end things finished up at work.

One could argue that, since I'm in charge of the group, I could just delegate a few more things. Technically, writing code isn't even in an architect's job description, though I do a fair bit of it. This last bit was a module that I really should have handed off because it was quite a bit of work. However, it was done rather badly the first time and I just didn't want to risk it being done badly again because it's rather central to the whole system.

As with my research, I need a big block of hours to be productive coding and that simply doesn't happen during normal business hours. So, since knocking out a big chunk of code during the day is pretty much fantasy land, I end up doing it at night. I like coding at night, but that doesn't leave any room for research.

Fortunately, it's done and we don't have any big pushes for a few months. I need to start using those evening blocks for research.

Thursday, December 7, 2017

Pareto results

Just so it doesn't look like I've made no progress when I meet with my adviser tomorrow, I ran the algorithms with a Pareto distribution. The significance of using Pareto rather than Laplace is that Pareto has infinite variance (but finite mean). Laplace just has a really big variance. That appears to matter.


CISS is dealing with it fine (stripping off the tails brings us back to finite, but large, variance). The Bag of Little Bootstraps goes completely haywire. Bootstrap-Metropolis-Hastings is somewhere in the middle. Using a query that introduces correlated filtering doesn't change things much one way or the other.



So, I think this is a better illustration if we're only going to use three distributions. I suppose there's no downside to showing all four, other than all these graphs are making the paper pretty long without adding much real meat.

Tuesday, December 5, 2017

Effort vs Talent? Try Ability

There's an ongoing debate in the area of human performance as to which counts for more, effort or talent. Arthur Lydiard, the famed track coach, was once asked what the most important thing a person can do to make it to the Olympics. He answered, "choose good parents." On the other hand, the much-misquoted "10,000 hour rule" implies that it's the effort that makes the difference.

My own experience, both in professional athletics and at one of the world's top universities, is that it very much depends on what you count as success. I was good enough to ride for a pro team and good enough to get an engineering degree from Cornell, but both were very hard. The hardest things I've ever done. No additional effort would have made me a really good pro or put me at the top of Cornell's class; I simply didn't have that kind of talent. But, I did have enough to do what I did. Some people aren't even that lucky.

That said, I think a person can find modest success in just about anything unless they are boxed in from birth by truly bad genetics or circumstance. Conversely, no matter how much the stars have aligned in your favor, there comes a point where you have to put out some effort to actually accomplish something.

So, you need both. I call that "ability". It's independent of how you got there; it's just a measure of where you are. And from that point, you can choose to apply more effort to increase it (or, if you're a 54-year-old runner like me, limit its decline) or you can choose to put effort into something else where talent gives you more room to grow.

Or, you can do what so many people do and not even think about it. Simply cruising through life is an option, and I'm not even going to say it's a particularly bad one, especially if you are blessed with a lot of advantages from the outset. It's just not a path I'm interested in.

I started thinking about all this because tonight was Yaya's concert at school. The middle school had two bands (7th and 8th grade) and the high school had two (there's more mixing of years in the two ensembles there, but it can basically be viewed as varsity and JV). The 8th grade band sounded the best, at least to my ear. Sure the Symphonic Band (Varsity) played harder stuff, but they didn't play it particularly well. In their defense, they are all coming right off marching band season and threw this thing together in just a handful of rehearsals. But, that's my point. It was the extra effort of 15 weeks of prep that made the 8th graders better. At the end of the day, people are (generally) rewarded for what they actually do. Ability is the key to doing things, regardless of how you come by it.

Sunday, December 3, 2017

Pareto

I haven't done much (hardly any) work with Pareto random variables. That's actually a little odd since they show up so often in financial analysis. The Pareto distribution is the mathematical basis of the 80/20 rule. That is, 80% of the wealth (or income, or effort, or whatever you're measuring) is held by 20% of the population. Furthermore, since it's a power-law distribution, you can apply this rule as often as you like. So, if you just look at that top 20%, you find the same 80/20 rule applying. This pattern is definitely present in my data, as I noted way back when I started down this road 2 years ago.

It doesn't have to be 80/20; that's just the classic statement of the "Pareto Principle". Pareto found that 80/20 was the most common ratio with respect to wealth distribution both in his home country of Italy and pretty much every other country for which he could get data. That seems to have held up pretty well over the subsequent 125 years not only for wealth but a lot of other things as well. The actual distribution can be tuned to have an inflection point somewhere else (for example 90/10), but 80/20 works for a surprisingly large number of cases.

Believers in math religion (that is, that the world is somehow governed by math) will quickly note that the shape parameter, α, that yields the 80/20 rule is a very pretty (log 5)/(log 4). Yes, the Ancient Greeks would have loved it (except for the bit about it being an irrational number; they weren't down with that idea). However, I find that view dangerously naive. Math is very useful for describing reality, but it certainly doesn't dictate it. 80/20 is just a nice approximation, nothing more.

But, we're drifting into Philosophy and right at the moment I need to be focused on Statistics. The reason Pareto is useful to me is that it's a common distribution with an infinite variance. As such, it's a better example than the Laplace distribution I have been using. So, I think I'm going to swap it into the analysis.

There is a small problem when dealing with positive and negative numbers. Laplace is a simple symmetrical version of the exponential distribution. You can do that with Pareto as well, but you get a hole in the domain. Pareto random variables are bounded away from zero. So, if your minimum value is xm, then simply folding the distribution over on itself means that you can have observations anywhere on the real line except the interval (-xm, xm). This is not a particularly hard problem to get around; just shift the distribution closer to zero by xm.

The downside of that is that it screws up the 80/20 rule. Another option is just to not worry about values with magnitude less than xm. My model already assumes a mixture of a real-valued random variable and one that is constant zero, so I don't really need to do anything to adjust for that. The drawback here is that I'm really giving up a lot of the center of my distribution. To get the interquartile range of the standard normal with a symmetric 80/20 Pareto, the minimum value is approximately 0.37. So, the "hole" in the distribution is over a quarter the size of the inter-quartile range. By contrast, the standard normal puts nearly 30% of its distribution in that same interval and Cauchy jams in a bit more.

However, it's the tails I'm really interested in, so I'm not going to stress much over the interior. I don't really want to devote a whole paragraph in the paper to how I transformed the pseudo-random data just to do a comparison. I'll just state it as symmetrical Pareto and let the results speak for themselves (unless, of course, my adviser vetoes that approach).

Just for the record (so I don't have to derive it again) the Pareto parameters that give 80/20 distribution with an interquartile range equal to standard normal are xm=0.3713 (I'm using 4 significant digits in my parameters) and α=1.161.
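A minimal R sketch of the symmetric Pareto generator with those parameters (an inverse-CDF draw plus a random sign; the only inputs are xm and α from above):

# Symmetric ("folded") Pareto: no observations fall inside (-xm, xm).
xm    <- 0.3713
alpha <- 1.161
n     <- 10000
u     <- runif(n)
pareto <- xm / u^(1 / alpha)                      # Pareto(xm, alpha) via inverse CDF
sym    <- pareto * sample(c(-1, 1), n, replace = TRUE)
quantile(sym, c(0.25, 0.75))                      # quartiles near ±0.675, IQR about 1.35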


Friday, December 1, 2017

New hours

I'm once again feeling like progress is stuck. A few weeks ago, I felt quite differently. I don't think it's any coincidence that I felt like things were going well after I had pulled a couple all-nighters to get some stuff knocked out.

As a pentagenarian, I can't do that on a regular basis. So, I talked to my VP about shifting my work hours to open up some larger blocks of time. I can slide reading a paper in around things, but to actually get my own work done, I need to be able to focus for at least a day.

The plan is that I'll stay a bit later during the early part of the week and then leave before lunch on Friday. That will allow me to meet with my adviser and then stay late at the library. Should be good for 8-10 hours of concentrated effort. I'll also carve out one other night a week (probably Tuesday or Wednesday) where I stay late at work. Oddly enough, my "workspace" at "work" is actually well suited to efficiently getting "work" done.

The remainder of my 20 hours of schoolwork will be the aforementioned sliding in reading a paper at lunch or in the evening at home.

At least, that's the plan. I tried it out today and it didn't go very well because my adviser wasn't available and I had to get back home to feed Yaya. But, it's a good plan.

Tuesday, November 28, 2017

More thoughts on BMH

In any Metropolis-Hastings approach, you need some kind of likelihood function that you can use to decide whether to stay where you are in the chain or move to the new proposed value. I've simply been computing the likelihood of the observed block average given the proposed mean and the observed variance.

Here's the kicker, though: these are heavy-tailed distributions. The variance may not even exist. Of course it will always exist for a finite sample, but if the underlying data truly is heavy-tailed, that's a rather bogus measure.

So, now I'm thinking about how you compute a likelihood using some other measure of spread that can be reasonably estimated. Interquartile range is one obvious one, though it suffers from the fact that it's not particularly helpful in assessing the mean, particularly if the distribution is multi-modal. There might be some real math to be done here.

Monday, November 27, 2017

Size matters

Block size, specifically. And, none of the methods I've been looking at seem to care.

For example, in the BMH paper, the example uses k samples of m observations per iteration. Values of k are 25 and 50, m is 200, 500, and 1000. I'm sorry, I thought the title of your paper included the phrase "Analysis of Big Data". I don't know of a Big Data platform that uses 1000-row blocks. 1-million-row blocks would be more common.

That matters because if I have, say, 10 billion rows, that's only 10,000 blocks. If I use up 25 of them on each iteration, I'm not going to have many iterations before I've gone through the bulk of the data set. It's not that I can't see that a larger number of smaller blocks converges with less data, it's just that such an arrangement completely defeats the purpose of these algorithms; that is, to get the answer as quickly as possible. And, simply counting the number of floating point operations is a very poor predictor of how long an algorithm takes when the size of the data is larger than the size of memory.

This, of course, is one of the big differences between academic work and industry. Academics may rightly point out that fewer computations are needed with smaller blocks. What they are glossing over is the fact that computation time doesn't really matter much. It's getting stuff off the disk that takes all the time. Once the block is in memory, running the computation on the full block is basically free (unless you're doing something really nutty).

So, I'm implementing these algorithms using much larger blocks. The authors could rightly accuse me of not giving their algorithms a fair shake. I'm not. I'm actually giving them a huge advantage: I'm letting them use all the data in memory instead of just a tiny amount. My metric continues to be how many blocks are read, even though these algorithms have more computation than mine.

At the end of the day, I'll have to justify all this with some true time studies. To do that, I'll need to re-write all this stuff (again) to run on the cluster. It's actually not that much code and I was going to get around to it anyway. But, I'm hoping I can put that off until after we get this paper out the door.

Sunday, November 26, 2017

Rebirth

Between trying to get things tied up at work before leaving for Ithaca and the trip itself, I haven't had much to report the past week. Thus, there have been no reports.

However, one of the advantages of taking a break is that you tend to think of the bigger picture a bit. I'm starting to think this whole bootstrap business has some promise. It's an active area, so it will be viewed as relevant and nobody seems to be terribly concerned that all the existing methods fail (or, at least have not been proven correct) when higher moments are absent. So, one might conclude that getting some results in the absence of higher moments would be a good thing.

Here are a few of the questions that might be worth answering (I'll have to check to see if any of them have been):

  1. How can you mix bootstrapping with stratification?
  2. How to account for mixture distributions where the mixture has temporal correlation?
  3. More detailed results using BMH with mixture/hierarchical models. In particular, does a block sampling strategy have the same asymptotic properties, even in the presence of block correlation (I'm sure it does, but the proof could be messy)?
  4. What happens to all these things when the variance is unbounded?
  5. What happens when we mix the correlation-maximizing strategy of D-BESt with methods designed to account for correlation?
  6. What if we replaced the pseudo-random number generator with a chaotic function in these methods?
That last one is a bit off-topic, but it occurred to me that random number generators are really just chaos functions by another name. Maybe it would be simpler/more efficient/more effective to intentionally pick a chaos function that covers the sample space non-uniformly. Sometimes what we want isn't so much randomness, but non-periodic coverage. Non-linear differential equations are good at that, particularly when you go to higher dimensions.

Also, I still have my eye on the Blind Multi-Stage Bootstrap that I wrote about a couple weeks ago. A couple additional thoughts on that one:

  1. Does it make sense to update the BMH distribution with each sample based on what we've observed so far? If so, what sort of prior do you use to control that?
  2. What if it's the entire model that you're not sure about? That is, rather than a few competing models, is there a way to be entirely agnostic about the model? We actually had a colloquium on this last year which looked into Dirichlet models. I think I'd at least like to read up on that. 

Sunday, November 19, 2017

High tide

Once again, work and school have crested simultaneously and, for obvious reasons, work is winning.

The annoying thing is that the last few production issues at work have had nothing to do with our new platform. It's been the legacy stuff that's been acting up. I'll be happy when we can retire all of that, but I'm afraid that's at least a year away. Legacy systems don't die without a fight.

I do expect to be completely free from work concerns during my trip back to Ithaca next week. I may get a chance to do some research at Cornell's math library. I'm not sure how much I can get done in a single day there, but it would certainly make sense to try. Of course, the real point of the trip is to see my mom and sister. I don't think hiding in a library all week is a particularly good way to demonstrate that I care that my dad just died.

None of this has me particularly stressed. I expect that some weeks will be productive and others won't. That said, I really would like to get past my admission to candidacy sooner rather than later. It's a fairly arbitrary hurdle, but it has the effect of clarifying expectations for the rest of the dissertation. Right now, I still feel a bit unsettled as to how much work remains.

Friday, November 17, 2017

Right answer, wrong question

I've found a fair bit of research on using Metropolis-Hastings on time series data. In everything I've read so far the focus is on identifying the correlation. I don't really need to do that. I already know the correlation boundaries. What I need to be able to do is build those into my model so I can adjust for them.

In the meantime, I think just by pulling multiple blocks I can smooth out most queries pretty well.

Wednesday, November 15, 2017

The hard work of math

Most people think of math as "hard". By that, they really mean "incomprehensible". There's nothing hard about noting, or even proving, that the cardinality of the integers equals that of the rationals but is less than the cardinality of the reals, even if most people would prefer to simply call them all "infinite". Even most math-literate folks, like engineers and physicists, are generally content to simply make the distinction between countably and uncountably infinite rather than trying to wrap their heads around the various degrees of uncountable sets. These things aren't "hard" if you trust your axioms and your methods but, intuitively, they are pretty far out there.

On the other hand, rigorous proof and its software analog, formal validation, are objectively hard tasks. They require discipline, patience, and, above all, attention to detail. Any attempt to do them quickly defeats the purpose. They necessarily require effort.

So, having completed the interesting part of comparing the sampling algorithms, I'm now left with the hard task of making sure that everything I did was correct. It's a long night of fairly unrewarding work. But, it is the hard work of math. It needs to be done.

Tuesday, November 14, 2017

BMH preliminary results

Here's a first look at the Bootstrap Metropolis-Hastings results. Vertical scales are the same as before.


Cauchy is nutty as usual but the other two are actually pretty good. Then we try a filtered query...


Oh, wowsers, that's not nearly so good. The confidence interval misses the true value most of the time, even in the normal case. The problem, of course, is the correlation of the attributes. Here's the kicker though. Metropolis-Hastings can handle arbitrarily complex models. Currently, I've got it assuming iid inputs. That's not necessary. A model that assumes correlation in the data set might do much better. I'll work on that next.

Monday, November 13, 2017

Bootstrap Metropolis Hastings

I'm coding up the second algorithm for the comparo. This one is a bit less intuitive. First, a quick overview of the Metropolis Hastings algorithm:

The idea is to approximate the posterior distribution of some parameter given a set of data. Presumably, this distribution is difficult (or impossible) to derive, so we settle for generating an empirical distribution. It goes like this:

1) Pick a starting value for your parameter. Let's call it Y since Greek letters are a pain to produce in a normal text editor. This first value is Y0.

2) Pick another value for Y by applying some distribution to Y0. The choice of distribution is pretty arbitrary, but things converge faster if it's somewhat similar to the distribution of Y. Since we're looking at sums, a normal distribution should be close (if not spot on). So, we let Y' = Y0 + N(0,s) where the standard deviation, s, is a reasonable guess at the standard deviation of our sum.

3) Compute the relative likelihood of the two values, L = P(x|Y')/P(x|Y0), where x is our observed data set.

4) If L > 1, then Y1 = Y', that is, we "accept" Y' as our next value in the empirical distribution.

5) If L < 1, then we accept Y1 = Y' with probability L (by comparing it to a pseudo-random uniform(0,1) variable). Otherwise, we keep the original value: Y1 = Y0.

We keep doing this for a long time until the empirical distribution of {Yi} converges. That limiting distribution is our estimate of the true distribution of Y.
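For concreteness, here's a minimal R sketch of steps 1 through 5 for a single data set, using a normal likelihood with a fixed spread. The data, proposal spread, and number of iterations are all illustrative, not anything from my actual runs.

# Random-walk Metropolis for a mean Y, normal likelihood with fixed spread s.
set.seed(1)
x    <- rnorm(200, mean = 3, sd = 2)    # stand-in for the observed data
s    <- sd(x)                           # "reasonable guess" at the spread
n_it <- 10000
Y    <- numeric(n_it)
Y[1] <- mean(x)                         # step 1: starting value Y0
for (i in 2:n_it) {
  Yp <- Y[i-1] + rnorm(1, 0, s)         # step 2: propose Y' = Y0 + N(0, s)
  logL <- sum(dnorm(x, Yp,     s, log = TRUE)) -   # step 3: log of L = P(x|Y')/P(x|Y0)
          sum(dnorm(x, Y[i-1], s, log = TRUE))
  Y[i] <- if (log(runif(1)) < logL) Yp else Y[i-1] # steps 4-5: accept with prob min(1, L)
}
quantile(Y, c(0.025, 0.5, 0.975))       # empirical distribution of Y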

The Bootstrap Metropolis Hastings algorithm works pretty much the same way except that rather than computing the likelihood of the entire data set, we first select a subset and run it on that. We repeat that for multiple subsets and then average the results.

If this sounds like a lot of work for something as simple as a sample sum, you're right. We're not expecting the result to be any better than the BLB algorithm, which requires significantly less computation. I'll hopefully have confirmation of that in a couple days.

However, because the BMH algorithm returns an arbitrarily complex distribution and not just a point estimate and variance, it opens the possibility of leveraging the hierarchical model that we're using in CISS or even expanding that to something more complex. Once I'm done implementing the naive method, I'm going to try it with the hierarchical model and see if that keeps the variance from collapsing the way it did in the BLB tests on filtered queries.

Sunday, November 12, 2017

Relentless reorganization

INTJ. That's my Myers-Briggs personality type. And, as much as I'm wary about over-simplifying human behavior, I have to concede that the description of that type fits me rather well. One of the attributes is "relentless reorganizer".

And, yes, I'm going to re-frame part of my problem once again. This time, however, it's not of my own volition. My adviser felt that some more clarification could be made about what exactly we mean by a "query". That's obviously a fair thing to ask since the entire method hinges on it. I thought about giving a bunch of examples. I may still do that, but I think what would fly a lot better would be a rigorous mathematical definition. So, here's a first cut at that:

Let X = {X_i} be a set of financial measures and Y = {Y_i} be a set of attribute vectors describing each corresponding measure in X. Let I_Q: Y → {0, 1} be an indicator of whether a vector in Y should be included in a query. A query, Q: ℝ^n ⨉ Y → ℝ, is a function such that Q(X, Y) = Σ X_i I_Q(Y_i).

I_Q is called the query condition. Typically, I_Q defines a rectangular subspace of the full attribute space. That is, each attribute is limited to a range of values independent of other attributes. However, our results do not rely on that assumption. Any subset of the attribute space is an acceptable query condition.
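A toy R version of the definition, just to make it concrete (the attribute names and the rectangular condition are invented for the example):

# Q(X, Y) = sum of X_i over the rows whose attribute vector satisfies I_Q.
X <- c(120, -45, 300, 80)                                # measures X_i
Y <- data.frame(office = c("STL", "STL", "NYC", "STL"),  # attribute vectors Y_i
                year   = c(2016, 2017, 2017, 2017))
I_Q <- function(y) y$office == "STL" & y$year == 2017    # rectangular query condition
Q   <- function(X, Y) sum(X * I_Q(Y))
Q(X, Y)                                                  # -45 + 80 = 35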

If using a multi-currency model, then one could further indicate how exchange rates factor in:

Let C be the set of currencies and T be the range of time values. An exchange rate is a function F: C² ⨉ T → ℝ such that F(d, c, t) is the conversion factor from input currency c to display currency d at time t. Let Y_i^c be the input currency for X_i (that is, the financial units of measure X_i) and Y_i^t be the time attribute for X_i. Then, a query, Q: ℝ^n ⨉ Y ⨉ C → ℝ, is a function such that Q(X, Y, d) = Σ X_i I_Q(Y_i) F(d, Y_i^c, Y_i^t), where the result is expressed in units of d. As we don't specifically deal with multi-currency in this paper, I'll leave that bit out. Might need it later, though.

Saturday, November 11, 2017

Stopping rule revisited

So, my stopping rule got a little contorted through all the various machinations of my variance. Fortunately, I had it right in the code all along, I just transcribed it badly when going from the confidence interval to the stopping rule (basically, I stop when the confidence interval is sufficiently narrow).

For starters, we have:



Stated as an alpha-risk, that's



Since the denominator is always positive, this implies



so, collecting all the stuff that changes each sample on the left, we stop when

Friday, November 10, 2017

More results

I scaled up my data size and also tried some runs where the filter created a mixture distribution (a mixture of observations that are constant zero because they don't meet the criteria and rows from the original distribution that do). Results are what I was expecting but I'm much happier than someone who just had things work out according to expectations. That's because the charts so vividly demonstrate the issue.

First the Bag of Little Bootstraps method with no filtering of rows, that is, basically iid distributions:

As with previous examples, all three distributions have the same inter-quartile range. The difference is the weight on the tails. The horizontal axis is number of blocks read.

Not surprisingly, BLB does great with normal. Then again, just about any algorithm works with iid normal data; that's why so many people assume it even when they shouldn't. The heavier the tail, the worse things get, which is also what you'd expect. That said, even in the case of Cauchy data, the algorithm is holding its own. It appears the iid assumption is enough.

CISS does a bit better with the heavy-tailed distributions, but not much. To make comparisons a bit easier, I've used the same vertical scale from one method to the next. I did have to use a larger scale for Cauchy because it's so much more variable, but it's the same on both graphs.

The error bounds are jacked up on the first two; I haven't had time to look at why. If that was all we had, I'd be sad. CISS is better, but not enough better to give up the flexibility that the bootstrap can be used for any estimator, not just means and sums. However, that's not all we got. When we switch to the mixture distribution, things change radically.

That's a mess. In the first two, the variance collapses because there are too many blocks where all the records are excluded (I generated this data set with attribute correlation, which cranks up the probability of missing all records). Oddly, the Cauchy approximation is the best of the three, though I think that gets chalked up to dumb luck.

On the other hand...

Now we're talking. Good convergence and tight confidence bounds to match.

Thursday, November 9, 2017

Base results

Here are some simple charts showing the Bag of Little Bootstraps performance on three distributions with the same inter-quartile range. As one would expect, it works great on the normally-distributed data. OK, but not so great on the heavy-tailed Laplace. A train wreck on Cauchy.


Wednesday, November 8, 2017

... and less good news

Got interrupted right when I was starting to have a productive evening working on my sampling of generated data. Production failed at work. Not a terribly hard fix but it took a big bite out of the evening and now I'm up against a real deadline if I'm going to have anything at all to show to my adviser this week. Tomorrow's looking like a very long day.

Tuesday, November 7, 2017

In other news...

I do have a day job. And, while it's consuming way more of my time than I'd like, at least I can claim some success. Here are some stats from our new reporting platform compared to last year:

Data size:

  • 2016: 3 Billion rows, split among three cubes.
  • 2017: 3.8 Billion rows, all consolidated in one cube.
Median batch load time (for the week before recalc, which is the busiest week):
  • 2016: 47 minutes (review only)
  • 2017: 22 minutes (all)
Last year, the only way we could keep the "official" cube stable was to throttle the builds to once every three hours. So, you might have to wait a long time to see results there. We had a "review" cube that allowed you to see what your particular batch was going to look like that refreshed more often.

Benchmark query time:
  • 2016: 14 seconds with high variability (it was common for the query to take over a minute)
  • 2017: 5 seconds, reasonably consistent
So, we've hit our marks. But, it's come at a cost, both for RGA and for me. The project wasn't cheap and all the overtime I've put in could have gone to working on my research.

Still, it's an easier pill to swallow when the results are so obvious.

Monday, November 6, 2017

Blind multi-stage bootstrap

The multi-stage bootstrap is a bootstrap method where one first creates a hierarchy of attributes and then selects the bootstrap sample based on that hierarchy. For example, suppose we had cash flows from several offices and we have good reason to believe that cash flows within an office are correlated whereas cashflows between offices are not. Further, we know that the cash flows for life policies are fundamentally different from those for health. If the goal was to get some sense of the variability of the present value of all these cash flows, it would make some sense to try to estimate these things separately and then combine the estimates, rather than pretend that they are all just one big pool.

So, we perform the bootstrap, but rather than selecting randomly from all observations, we first select an office (proportional to the observations for each office), then pick a product type (proportional to the life/health mix for that office) and then select our bootstrap sample from the observations meeting those criteria. As with the usual procedure, we repeat this via a Monte Carlo process until we have convergence on our estimator quality.
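A rough R sketch of that procedure, with a completely made-up cash flow table and the mean as a stand-in statistic:

# Multi-stage bootstrap: pick an office (proportional to its row count), then a
# product type (proportional to the mix within that office), then resample
# observations from that cell.
set.seed(1)
cf <- data.frame(office  = sample(c("A", "B", "C"), 3000, replace = TRUE),
                 product = sample(c("life", "health"), 3000, replace = TRUE),
                 amount  = rexp(3000, rate = 0.01))
reps <- replicate(2000, {
  off  <- sample(cf$office, 1)
  rows <- cf[cf$office == off, ]
  prod <- sample(rows$product, 1)
  cell <- rows$amount[rows$product == prod]
  mean(sample(cell, length(cell), replace = TRUE))       # bootstrap replicate within the cell
})
sd(reps)                                                 # spread across replicates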

All good, but what if we don't know the hierarchy? Specifically, what if we know that a query may or may not exclude all rows from a block, but have no way of knowing which it is (without sampling the block beforehand, which defeats the purpose of ad hoc sampling). We are now in a situation where we are performing a multi-stage bootstrap but, rather than consciously picking the sampling branch, we are merely observing it as we sample.

I'm not sure what (if anything) this does to the estimator. In frequentist statistics, foreknowledge of an experimental design changes the outcome. Bayesian stats are a bit more resilient to foreknowledge because that's baked into the prior. At any rate, I don't need to address it right away because we are just doing a comparison with BLB and BMH, not really trying to expand on those methods. On the other hand, a cursory literature review didn't turn up anything on this topic, so it might be worth investigating sooner rather than later simply to stake out some theoretical ground.

Sunday, November 5, 2017

BLB and correlation

The bootstrap algorithm, in most of its many forms, assumes iid (independent, identically distributed) observations. There are some adjustments that have been proposed when the data elements are correlated with respect to your attributes (in particular, a lot has been done with time-series data). But, very little has been done to adjust for correlation of the attribute values themselves.

In both the BLB (Bag of Little Bootstraps) and BMH (Bootstrap Metropolis-Hastings) algorithms, attribute correlation is something of a killer. It's not hard to see why.

Suppose I have a query where only 1 in 100 rows is included. Since my block size is 100,000 rows, I'm still expecting to see 1000 rows included from every block. The chance of getting no rows in a block is essentially zero.

Unless, of course, the attributes are highly correlated, which they usually are in my data. Let's suppose we're going after data for one particular office and that office did all their loading on a single week. Let's further assume that the period over which all offices loaded data was four weeks and the rate of loading was fairly uniform. Since data is placed into blocks in load sequence, all the data my query cares about is going to be concentrated in the 25% of the blocks that were created during the week that office was loading.

The chance of getting no relevant rows in a block is now 75% which, you may notice, is a lot higher than 0%. We could sample eight blocks and still have a ten percent chance of not finding any relevant rows. Obviously, any estimate of the sum from such a sample would be zero. Worse, the bootstrap evaluation of the quality of that estimate will be that it's perfect. No deviations will be found, no matter how many bootstrap runs you perform. Any reasonable stopping rule would kick in and you'd get a very confidently stated wrong answer.
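A quick sanity check of those numbers, with invented block sums (75% of blocks empty, the rest arbitrary positive values):

0.75^8                                    # about 0.10: all eight sampled blocks empty
set.seed(1)
block_sums <- ifelse(runif(10000) < 0.75, 0,
                     rgamma(10000, shape = 2, scale = 500))
sample8 <- sample(block_sums, 8)
boot    <- replicate(1000, sum(sample(sample8, 8, replace = TRUE)))
sd(boot)  # exactly zero whenever the eight sampled blocks all happened to be empty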

There are a couple ways to attack this. One is to try to randomize the data a bit so there's less attribute correlation in blocks. That would work and it's not particularly hard to do, especially on a platform like Hadoop where background data re-orgs are what the system is good at.

But, what if we went the other way? Suppose we amped up the correlation so that we were maximizing the chance of finding zero rows. D-BESt, my blocking algorithm from last spring, is really good at that. Recall that for real-life production queries on real-life production data, it yielded a partitioning scheme where upwards of 90% of all blocks could be safely excluded from any search.

Well, if you know up front that they can be excluded, it's not cheating to simply exclude them and limit your search to the blocks that may actually have some information you're looking for. Such a remaining set of blocks would have a hit rate much higher than the population at large. While I developed D-BESt with the idea that it would be used to optimize full scans, I'm increasingly thinking it can make the sampling work a lot better as well.

Friday, November 3, 2017

Big Data gets it wrong

Some may interpret my meager posts over the last week as lack of progress. Well, they pretty much got that right. We are in the final two weeks of the Plan cycle at work. That's where all the offices submit their projections for the next year. It's not as easy as it might sound and my group is right in the middle of it since they use our stuff to report on the projections. Thus, it's been a pretty busy week and, while I've had time to read quite a bit of stuff, I haven't had a solid block of hours to code the comparo.

So, I'm writing completely off topic tonight.

Yaya turns 16 in 21 months. I'm not going to give her my car right away but, as she's a pretty responsible kid, I doubt I'll hold it back for too long. I'll probably change the license plate since there's pretty much no chance she'll ever run 100 miles.

So, like any self-respecting 54-year-old dude, I've been checking out the current crop of sports cars. Note that I said "sports cars" not "grand touring cars". I have nothing against fast luxury vehicles, but they don't seem like a good use of such a large sum of money.

Anyway, as I was browsing web sites looking at pictures of Miatas, 124s, Minis, and such, it occurred to me that maybe I could afford a used Lotus Elise, which is pretty much my idea of the perfect car: super light, exceedingly nimble, and sufficiently underpowered that you have to drive it right to appreciate just how great it is. So, I poke around for that.

Now, every site I go to has a banner ad for some quarter-million-dollar car. Folks, the Elise only ran $50K new. How does checking out the resale value on that suddenly put you in the market for the likes of Ferrari and Maserati?

Somebody at Google needs to tweak their eigenvectors.

Monday, October 30, 2017

Which bootstrap?

There is one small hitch in my plan to use the bootstrap algorithm in my comparo: which one to use? The term is pretty broad, basically encompassing any technique that subsamples a sample to deduce the quality of an estimator from that population. The name comes from the phrase, "pulling yourself up by your bootstraps." You don't know anything about the population distribution, so you repeatedly subsample your sample to simulate the distribution.

The Bag of Little Bootstraps (BLB) algorithm that I'm officially comparing seems to favor the percentile bootstrap, though they admit that other techniques may be more appropriate in various situations. Because my data is highly correlated within physical blocks, I'm leaning towards using the Block Bootstrap. This also plays into my desire to keep all my subsamples within a physical block of data.

It also makes for an interesting matchup between the BLB and the Bootstrap Metropolis-Hastings (BMH) algorithm. If I use blocks as my subsamples for both, it will be a straight-up fight to see which one can do a better job of estimating the true blocksum variance.
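As a sketch of what I mean, here's roughly how the block-level resampling would look in R, treating each sampled physical block's sum as the resampling unit. The block sums and counts below are invented; the real version obviously reads them from the sampler.

set.seed(1)
N          <- 10000                                 # total blocks in the population
block_sums <- rnorm(40, mean = 1e6, sd = 4e5)       # sums from the r blocks read so far
r          <- length(block_sums)
boot_est   <- replicate(5000, N * mean(sample(block_sums, r, replace = TRUE)))
var(boot_est)                                       # bootstrap estimate of the estimator's variance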

Sunday, October 29, 2017

Tyson O

I did a local orienteering meet today. Boy, am I out of practice. Slow in the terrain and my nav was super sloppy. I was on the fence about going to Kansas City for the Possum Trot. I'm not anymore. It would be a complete waste of time.

That said, it was a truly great day to be out in the woods. High 40's, partly cloudy, and no bugs thanks to our first freeze last night. A great day, even if the result wasn't.


Saturday, October 28, 2017

Six weeks

I don't recall the author who said it, but I heard a quote once that was something to the effect that "books take six weeks". What the author meant was that you have this stuff bouncing around in your head, but when you actually put pen to paper, it's six solid weeks of work to produce a novel.

I wouldn't know anything about that but it does seem to match the patterns of the few published novelists I know.

It appears that's also how long it takes for me to produce a paper. As some may have noticed from the little graphs to the right of this page, I track my school hours in Attackpoint, which is a site intended for tracking physical training. I've found that keeping a "training log" of my schoolwork helps me recognize when I'm not making progress because I'm stuck versus not making progress because I'm simply not putting the effort in.

That would be enough reason to do it, but it also provides the feedback that project managers use so much: actuals versus estimates. That is, how long did you think it was going to take and how long did it actually take. I never put out an estimate for my CISS paper, but I guess I probably would have said that a publishable result probably requires a month or so of effort.

It looks like that's about right. I still have some revisions to do, but I'm basically done with the CISS paper. Any further work will be logged under general thesis research until I have another publication squarely in mind. Right now, my hours specifically on CISS are 225. With the remaining revisions, that will go to something like 250-275; basically six weeks of solid effort. Too bad it's been spread across 18 months. Hopefully, with coursework done, the next efforts will have a better clock to calendar time ratio.

Friday, October 27, 2017

Why would you even do that?

I didn't have much to show for this week's efforts. Reading a paper and generating a bunch of random data sets doesn't make for a good demo. I decided to go ahead and keep my regularly scheduled meeting with my adviser figuring I could at least get him to validate the approach I was taking.

To that end, I put together a little table of each method being compared and why you would even consider sampling that way.

n: number of raw rows
m: block size
N=n/m: number of blocks
r: blocks read
s²: estimated variance of block sums
B: bound on variance of population sum estimator

Full Sampling

Block selection: sequential
Variance estimator: n/a
Estimator of population sum: actual sum
Stopping rule: all blocks read
Rationale: Certainty. This is what most people do now with finite data sets.
Drawback: Slow if data set is really large.

Random Sampling

Block selection: random
Variance estimator: sample variance of block sums read so far
Estimator of population sum: N * (sample mean block sum)
Stopping rule: N * s² < B
Rationale: Leverage the law of large numbers to get a good approximation with less reading.
Drawback: Variance will be understated for heavy-tailed distributions.

Bag of Little Bootstraps

Block selection: Random, each block selected becomes a "little bootstrap" subsample
Variance estimator: bootstrap estimate of blocksum variance
Estimator of population sum: N * (sample mean block sum)
Stopping rule: N * (sum s²) / r < B
Rationale: Better estimate of variance in the heavy-tail case
Drawback: More computation per block, variance may still be understated due to correlation

Bootstrap Metropolis-Hastings

Block selection: random
Variance estimator: Metropolis-Hastings
Estimator of population sum: N * (sample mean block sum)
Stopping rule: N * (sum s²) / r < B
Rationale: Another stab at getting a better variance estimate
Drawback: Same as above

CISS

Block selection: Random by strata to minimize estimate of total variance
Variance estimator: Bayesian posterior applied to stratum variance, S
Estimator of population sum: sum of strata estimates
Stopping rule: sum(S) < B
Rationale: Leverage the nature of the desired statistic to actually lower the blocksum variance
Drawback: Data must be stratified in advance


This contest is totally rigged to have CISS come out on top. All the other methods just do increasingly good jobs of telling you how hard the problem is. That's because they are general purpose methods designed to work on any statistic. CISS is using the fact that we know exactly what kind of statistic we are after (1st order U-statistic) and therefore can rearrange things to minimize the variance of such a statistic.

Wednesday, October 25, 2017

Laplace random variables

The exponential distribution really is a gem. Aside from being correct in the vast majority of cases where it is employed (as opposed to the normal, which statistical hacks will apply to just about any problem), it is so perfect mathematically, that many cool properties just seem to pop out of nowhere.

I was generating my random data sets (see step #1 from yesterday) using RStudio. There are a number of ways to get a sample of pseudo-random Laplace variables. The simplest, of course, is to just get a bunch of exponentials with half the variance you're looking for and reverse the sign on half of them.

>laplace = c(-rexp(5000, 1.02), rexp(5000, 1.02))

This gives you a vector of 10,000 random variables distributed as Laplace with a variance of about 1.92 (which yields an inter-quartile range of 1.36, essentially the 1.35 of N(0,1)). But, there's a cooler trick out there. Because of the memoryless property of the exponential distribution, the difference between two independent exponentials winds up being Laplace. This is fairly easy to derive if you think to look for it (I didn't, I just stumbled across it while looking up other things about Laplace).

Again, in R:

>laplace = rexp(10000, 1.02) - rexp(10000, 1.02)

Of course that's a silly way to do it; it requires double the computation. But, it's also kinda cool. In fact, the three distributions I'm working with can all be generated from exponential and uniform. If E1, E2 are standard exponential (mean & variance of 1) and U is uniform(0, 2π), then:

X1 = sqrt(2 E1) * sin(U) ~ N(0,1)
X2 = sqrt(2 E1) * cos(U) ~ N(0,1)
(and X1 and X2 are independent even though they use the same E1 and U)

L = E1 - E2 ~ Laplace(2)

C = X1 / X2 ~ Cauchy
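A quick R check of those identities (note the sqrt(2 E1) in the normal case):

set.seed(1)
E1 <- rexp(10000); E2 <- rexp(10000); U <- runif(10000, 0, 2*pi)
X1 <- sqrt(2 * E1) * sin(U)      # ~ N(0,1)
X2 <- sqrt(2 * E1) * cos(U)      # ~ N(0,1), independent of X1
L  <- E1 - E2                    # ~ Laplace with variance 2
C  <- X1 / X2                    # ~ standard Cauchy
c(var(X1), var(X2), var(L))      # should come out near 1, 1, 2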

Tuesday, October 24, 2017

If you can't get stuff done...

... make it look like you meant it by producing a decent plan!

That's pretty much what I'm going to have to do this week. I sized up the task, and I don't think the comparo is going to be complete by Friday. But, I do have a plan! I'll even number the steps to make it look more official.

1) Construct simulated data sets of 10,000 blocks for each of the three distributions I've been featuring in the paper (Normal, Laplace, Cauchy). Assign correlated dimension values to the measures.

2) Run CISS on the simulated data sets.

3) Run the Random sampler on the simulated data sets.

That much might actually happen by Friday.

4) Code the BLB algorithm.

5) Run BLB on the simulated data sets.

6) Observe that all three are pretty much the same for Normal (at least, I hope this is true), but that CISS starts to look better as the tails get heavier.

7) Run the algorithms using query criteria that reference the correlated dimensions, thus raising the variance of the block sums.

8) Note that CISS does way better in this case (at least, I hope this is true).

9) Be really sad if hopes aren't met.

10) Figure out what I did wrong.

11) Fix it.

12) Write up results.
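
And here's roughly what I have in mind for step 1. A sketch only; the dimension, its number of values, and the strength of the dependence are placeholders I'll tune later:

>n.blocks = 10000
>dim.val = sample(1:20, n.blocks, replace = TRUE)         # made-up dimension with 20 values
>shift = 0.5 * dim.val                                    # tie the measures to the dimension
>blk.normal = rnorm(n.blocks, mean = shift)
>blk.laplace = shift + rexp(n.blocks) - rexp(n.blocks)    # Laplace noise via difference of exponentials
>blk.cauchy = shift + rcauchy(n.blocks)

A query that filters on dim.val then zeroes out some blocks and keeps others whole, which is exactly the kind of thing that should make the block sum variance jump in steps 7 and 8.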

Monday, October 23, 2017

Baggin' them bootstraps

The first technique that I'll be using as a comparison will be the "Bag of Little Bootstraps" which I wrote about last spring. It doesn't appear that I can derive a theoretical superiority; I'll have to actually code the algorithm and run it on my data. I might want to run it on some simulated data as well. The long and short of it (mostly long) is that I'm in for some more development work.

The good news is that I think I can adapt my basic sampling engine to handle the selection, so all I need to write from scratch is the actual computation, which isn't particularly tough.
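
To have something concrete to aim at, here's the shape of that computation sketched in R for the standard error of the block sum total. This is my reading of the method, not their code, and the subset size exponent and replicate counts are placeholders I'd still have to tune.

blb.se <- function(x, gamma = 0.7, s = 20, r = 50) {
  n <- length(x)
  b <- ceiling(n^gamma)                         # each "little" subset touches only b blocks
  subset.se <- replicate(s, {
    sub  <- sample(x, b)                        # one little-bootstrap subset
    reps <- replicate(r, {
      w <- rmultinom(1, n, rep(1/b, b))         # spread n units of weight over the b points
      sum(w * sub)                              # the statistic: estimated total
    })
    sd(reps)                                    # quality assessment within this subset
  })
  mean(subset.se)                               # average the per-subset assessments
}

The appealing part for my setting is that each subset only needs b blocks read; the inner loop just reweights them, so the selection piece really can ride on the existing sampling engine.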

One thing that did amuse me a bit (I'll admit I'm being just a bit smug, here) was their section discussing "large-scale experiments on a distributed computing platform." This was describing a data set of 6,000,000 rows. Um, guys, you're short a few zeros on that number. My data set has 60,000,000,000 rows.

Sunday, October 22, 2017

The end of a good life

My father died last night. It was expected, though one is never really ready for such things. He wasn't particularly ill, he had just gotten progressively weaker and less cogent. About two weeks ago, he pretty much gave up trying. We (meaning, my sister, since she lives just outside Ithaca and I'm 900 miles away) put him in hospice, in accordance with his wishes. They took good care of him and mom got some much needed rest. Two days ago, he stopped eating. As far as we can tell, he went fairly peacefully.

When most people talk about someone having lived "a good life" they mean a life to which one might aspire. More successes than failures, a loving family, a few notable things done that left the world a better place. One could look at my dad's life and say that, but that's not what I mean when I say he lived a good life.

Dad was good. He did the Right Thing. Always. It was amazing to watch. I'd like to say it rubbed off on me and maybe it did to some extent, but his virtue was simply on another level. He wasn't sanctimonious about it, he just did it. Love your neighbor as yourself wasn't a command to him. He simply didn't know any other way to deal with people.

When I was in High School, I remember him lamenting to me that there was one guy at work that he just couldn't get along with. He said he'd never found someone with whom he couldn't get along. Even as a brash teenager, I knew better than to add my own commentary to such a confession, but inside I was thinking: "You're 50, and just now you're meeting someone you can't get along with?"

I realize it sounds like I'm describing George Bailey from It's a Wonderful Life but, just ask anybody who knew him. George would have admired him. (If for no other reason than he knew better than to hire incompetent family members and put them in charge of really important stuff; that's kind of important whether you're running a bank in Bedford Falls or a Molecular Biology lab at Cornell).

Anyway, he has passed. It's tempting to haul out the cliche that the world is a lesser place for it, but I don't think that's true. His was a good life. Lives have beginnings and endings. His has ended. And, the portion of it that I got to see was the most good life I have ever known.

Saturday, October 21, 2017

Don't ask unless you want an answer

As I mentioned a while back, I decided to run the Gumbo Flats 10K to see if my fitness was where it needed to be to put in an honest effort at Pere Marquette.

40:24.

It's not.

So, now I need to decide if I'll try to address the fitness in the mere seven weeks between now and then, accept that I'm not fit and run anyway, or not run.

Anybody who knows me knows that option 2 is a non-starter. Especially at a race where I have as much history as Pere Marquette. So it's train super hard for about five weeks or just go up there and have a nice weekend, maybe working one of the water stops.

That last one sounds pretty appealing.

Friday, October 20, 2017

Comparo!

I don't know why this never occurred to me. I guess that's why you have an adviser in grad school. I should actually compare the results of my method to some other methods. After all, why would anybody care about your method if it wasn't at least as good as stuff that's already been published?

He had a couple of suggestions for comparison. The one I'm liking the most is the "Bag of Little Bootstraps" method, mostly because it's a really cool name. It also has the advantage of being relatively easy to code. Not sure if I can knock out the result in a week, but I told him I'd try. I would really like to stay on track for taking the A exam in December or January. I think I'm pretty close. Having a comparison result will help.

Thursday, October 19, 2017

More graphs

Under the heading of "A picture is worth a thousand words", I put together another graph showing the optimal number of strata. Actually, what this shows is that you don't have to stress over the optimal number of strata. Just stratify your data and CISS will do just fine sampling it.



And, while I'm at it, I had a slight mistake in yesterday's graph which I corrected:


Wednesday, October 18, 2017

Docking the tail

Why do you dock a boxer's tail? Because it destroys everything.

Same is true with heavy-tailed distributions. Here's a little graph showing how full sampling of the tail allows us to take much less of a sample from the interior of the distribution. The optimal point of the cut depends on the distribution. As you might expect, the heavier the tails, the more you want to cut off. Normal hardly benefits at all. Cauchy works best with a fifth of the distribution relegated to the tail.



The optimal cut point also depends on the total number of blocks. Here, the total is 10,000 and we're assuming a "full" query, that is, every row is relevant. Most queries filter rows, which raises the variance of the interior.
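
The same effect is easy to see numerically (made-up Cauchy block sums and cut fractions, not the real data):

>blk = rcauchy(10000)                           # simulated heavy-tailed block sums
>interior.var = function(q) var(blk[abs(blk) <= quantile(abs(blk), 1 - q)])
>sapply(c(0, 0.05, 0.1, 0.2), interior.var)     # variance left in the interior after docking a fraction q

The untrimmed number comes out absurdly large (and unstable from run to run); by the time a fifth of the mass is relegated to the fully-sampled tail stratum, what's left is small and stable.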

Monday, October 16, 2017

Oops

Not sure how we got this far in without catching the fact that my calculation of total variance was wrong, but we did. Fortunately, it's the total variance formula that was missing a term, not the block sum variance (which would have invalidated my theorem and pretty much sent the whole paper back to square one). Also fortunate: the missing term was n/r, the total number of blocks divided by the number sampled. That converges to 1, so the asymptotic behavior doesn't change and the empirical results still stand.

So, bullet dodged and, as a bonus, having that term in there makes some optimality claims much more sustainable. I'll work on those tonight.

Saturday, October 14, 2017

Rock Bridge Revenge

I don't have too many reasons to ever go to Columbia, Missouri, but when I do, I always try to get in a run at Rock Bridge State Park. The west side of the park is loaded with cool rock features (the most prominent giving the park its name) and the larger east side is just miles of flowing singletrack through pretty woods. I've done a couple orienteering meets there, but never felt like going all the way to Columbia for a short trail race. When they added the 50K distance a few years ago, I was immediately interested, but it wasn't until this year that it fit into my schedule. After the DNF at Leadville, I almost bailed in favor of doing a fall 100, but decided that this one had waited long enough.

It's only a two-hour drive so I get up at 3AM and drive out race morning. I don't generally sleep well the night before a race anyway. Oops, did I say race? I mean, event. Actually, no, I mean race. The result notwithstanding, the truth is that I took pretty good fitness into Leadville. It bugged me that it came to nothing. I decided I'd try to hold on to it for a few more weeks and give this one an honest effort.

Since I really do need to make some forward progress on my dissertation, the only way to keep my mileage up is running to and from work. That's great for fitness, but doesn't offer much opportunity for technical training. I did plenty of trail running over the summer, but most of it was at 100-mile pace. A 50K is a completely different stride. The trails at Rock Bridge aren't terribly difficult, but they do have more than their share of what Mikell Platt calls "time bleeds", little obstacles that break your stride and steal your pace a couple seconds at a time. I manage to get out on trails a few times to sharpen things up.

Any thoughts of winning are dismissed by the presence of John Cash. I'm pretty sure I won't be keeping up with the overwhelming favorite on the women's side, AnnMarie Chappell, either. That's actually pretty comforting as it makes it less likely I'll do something stupid in the first half of the race.

Left to right: Cash, Chappell, Me, Sankalp
The first half mile is an out and back on the park road to string things out a bit. John and AnnMarie hit the trail first, followed closely by Shiva Sankalp and another runner I don't know. I'm in fifth with a few other runners fairly close behind.

The sun only rose a few minutes ago and it's hidden behind overcast skies. As a result, I'm almost wishing I'd brought my headlamp as the woods are fairly thick. Despite some heavy rain in the last couple days, the trail is in pretty good shape and I've got a half dozen machine screws drilled into the sole of each shoe, so the poor visibility isn't a huge problem. Even concentrating on the placement of each stride, there's no missing the fact that this is a truly beautiful forest. The diffuse light gives it all a deep green glow.

By a mile in, the four ahead are out of sight. After crossing the stream around mile 2, I hear no splashes behind me. It appears this is going to be a lonely run. I'm fine with that. This isn't really a distance conducive to early-race chatting; the effort is fairly firm throughout. Being alone also reduces the risk of missing a trip hazard and going down hard. (Alert readers may have picked up on the foreshadowing from the use of the word "reduces" rather than "eliminates").

The first six miles make a loop around the perimeter of the west side of the park. We don't go past any of the really cool stuff as those trails get mighty busy on a typical fall day. As this morning is uncharacteristically muggy and drizzly, we pretty much have the park to ourselves. One could reasonably conclude from the current conditions that this was just going to be a drab fall day. The weather radar tells a far different story. A fierce line of storms is heading our way with an expected arrival right around the end of the first of two laps.

That's actually a good thing for me as I tend to do better in bad conditions. I do have one concern though: the osage orange. If you've never seen one, they are about the same size and shape as a regular orange, but that's where the similarity ends. They are really hard and no good to eat. The trees aren't super huge, but they're big enough that when the fruit falls to the ground, it hits with an alarmingly loud thud. Each little breeze sends several of them to the forest floor. If the storm unleashes them all, this could turn into a live fire exercise.

After the aid station at six miles, the route crosses to the east side of the park. This loop gets run in alternate directions from year to year. This year, we go clockwise, which is the direction that has "more climb". Obviously, that's not really true since you start and finish at the same spot, but what's meant by that is that more of the uphill is true uphill rather than gentle grade alongside a stream. People who have run the course both ways confirm that the clockwise loop is a bit slower.

Shortly into this loop, I catch the unidentified runner ahead of me. He's going a lot slower than he was last time I saw him, but doesn't appear to be injured. If he's cooked himself by mile 10, he's in for a really unpleasant day. He hears me coming and steps off the trail to let me by.

The volunteers at the next aid station are particularly buoyant. I'm not sticking around to chat, but even the few seconds it takes to grab some banana slices and water are enough for them to unload quite a bit of encouragement. It's a pleasant boost, even if it isn't needed just yet. I still feel fine, as one should at this point in the race.

The lap ends without incident. The time, 2:31 elapsed, is a bit of a shock. I don't wear a watch, so this is the first pace feedback I've received. I've only once run a 50K over 5 hours; usually I'm a bit over 4. There was nothing about the trail that seemed particularly slow and I felt like I was running OK. I'm also in fourth place, so I can't be doing that badly. Maybe the measurement is wrong. I don't know. At any rate, I feel fine, so maybe just a bit more effort on lap 2 will yield a negative split.

And then the storm arrives.

It got a lot muddier than this
Thankfully, it's just really heavy rain without much wind. The feared mortar attack from the osage orange trees does not materialize. With the trail already a bit soggy, it doesn't take long for the grip to completely evaporate. Even with the screws, my shoes have no luck securing solid ground. While this dashes all hope for a negative split, it also raises a different possibility. At least one of the people in front of me is probably going to cave in these conditions.

I'm not really sure why I do so well in the mud. It's not that I particularly like it. I just seem to be good at it. My best finish (11th) in 16 tries at the fabled Pere Marquette came on the year the trail was inches deep in mud. At Psycho Wyco, where inches deep in mud is a normal year, I've pretty much kicked ass and was the eighth person ever to get the title of "mud stud" for running it in under 5 hours. Now, with the trail turning into a river, I have a very real chance of a podium finish if I can find my form for the last 10 miles.

As I cross the road to the eastern loop, I get a comical display of the opposite extreme. There had been a guy hanging around the start wearing cowboy boots. Turns out, he was actually entered in the race. He's not looking super happy with his progress. Today might not have been the day for that joke.

The east side loop is insanely slick. I let up on a few of the descents where even a small mistake could be the end of the race. Other than that, I'm hammering for all I'm worth. The rain has stopped and the cool air behind the front is a welcome relief. But, the trail won't recover until hours after I'm done. I get to the aid station and find the workers every bit as upbeat as on lap one. As I'm leaving, they offer one noteworthy fact: "San is just in front of you."

Well, the reason I didn't ask is because aid station workers are notorious for giving out ambiguous information. "Just in front of you" could mean anything from 1 minute to 10. There are only six miles left, so unless it's closer to the former, he's out of reach. Still, they wouldn't likely say that if he was further ahead than he was on the last lap. It could well be that my second lap efforts are paying off.

Indeed, only a mile from the station, I spot his orange singlet ahead of me. He doesn't seem to be struggling with the mud as much as just plain old fatigue. I catch him on one of the smaller inclines and do my best to put on a brave face as I go by. He's 23 years my junior; not someone I want to be sprinting against up the final hill. Better to make the pass stick now. I surge for the next mile and then get an opportunity to look back at a trail bend and confirm that he's not making a fight of it.

I'm pretty happy to be in the top three, but I'm also very much feeling the effort. I try to remember exactly where I am on the loop and how much there is to go. I get to a wider section of trail that leads to a small campground. I recall that from there it's a big downhill, a very steep uphill, and then another long descent to the road crossing. After that is the final mile in the west section and OH SHIT!

Normally, when you trip on something, your foot does come free; you're just so far off balance that, after a few awkward steps, you hit the ground. This is more like a movie pratfall. Whatever grabbed my foot wasn't letting go, so all my forward momentum is quickly converted to rotational. For those who haven't taken physics, that means I hit the ground really fast and really hard.

Fortunately, this is not a rocky section of trail and, aside from having the wind knocked out of me, there's no damage done. I'm back up quickly and within a minute I've found my stride. Still, I decide that for the rest of the race I'll be concentrating on the trail right ahead of me and not everything else between me and the finish.

I pass quite a few of the 25K field (who started an hour after us) in the last few miles. As always happens to me, I start panicking that someone will catch me from behind in the last mile. This pretty much never happens to me so I don't know why I stress over it. Then again, maybe it doesn't happen because I stress over it. At any rate, I'm third to the line in 5:08. They don't do age groups in this race but, for what it's worth, John and AnnMarie are both under 50.

Not a time I'm going to brag about, but the effort was certainly a good one. I really enjoyed going all in on lap 2. Age, obligations, and life in general pretty much guarantee that the limits of my performance will continue to decline (and the rate of decline will likely increase). However, given that limit, it's pretty gratifying when you know you've come very close to it. I did my best and that will have to be enough. I'm fine with that.

Friday, October 13, 2017

Just doing his job

It's probably been obvious from my posts but, in case not, I'll just state it: I would have submitted this CISS paper for publication a loooooong time ago. I'm used to the pace that things get done in the corporate world. Who cares if it's not great? Show me what you got; we'll get some feedback.

That's not really the way it works in academia. Yes, there is a peer review process. But, you're expected to have your stuff pretty much in order before submitting to that.

I was thinking about that after I met with my adviser today. I could come up with a bunch of complaints about the process. That would ignore the fact that he's basically doing his job. That job, to put it bluntly, is to treat me like crap. Not in any kind of mean way, but part of grad school is getting your ego knocked down a few rungs so you leave a bit more conscious of your limitations. Yes, those egos do seem to restore themselves once placed in faculty positions, but think how much worse it would be if they were coddled all through grad school.

Anyway, I have to grudgingly admit that the paper is way better now than it was a month ago and that wouldn't have happened if he had just rubber-stamped it. It's still not done and it needs to be done real soon or the whole dissertation timeline is going to have to be reevaluated (read: there will be a mutiny on the home front and it will never happen). Still, it will only matter if it's done right. And, that takes time.

Thursday, October 12, 2017

ECM

I sent my adviser an update of my paper this afternoon. It still needs more work, but it didn't get any tonight because I took yet another night off and attended the annual gala for the Episcopal City Mission.

I'll admit to being a bit conflicted about charity galas. I attend two per year (this one and another for the Neumann Center at Washington University). They are certainly enjoyable. And the money goes to very good things. Still, damn, wouldn't it be great if people would just give without all the trimmings because it's the right thing to do?

I guess, but Christ attended quite a few parties, so maybe it's not such a bad way to make things happen.

Anyway, ECM ministers to juvenile offenders. It's tough work. As tonight's keynote speaker made clear, it's sometimes rewarding (meeting a former offender who has become a responsible adult) and sometimes not (meeting a former offender who's leaving the courthouse en route to the big house).

That's the nature of ministry. The ends are in god's hands. We are to be concerned with the means. Sometimes that's a hard thing. I don't know where I'm going to end up when I get my degree. I could land a teaching job at a small university. That's always been the goal, but the increased reliance on adjunct faculty makes that less likely than it was 20 years ago. I might stay in industry. I like the work and the money sure is good.

But, I'm also starting to think I might wind up at a community college or 4-year state school. The pay would be bad. Crazy bad. It would almost qualify as volunteer work. The resources wouldn't be anything like what I'd have in industry or at a university. Could it mean all the difference in one person's life? I don't know. I'm pretty good at getting results out of people who care. I don't really have a track record one way or the other with people who don't.

At the very least, I want to help with the kids who nobody wants to help. I'm not remotely equipped to do that, but I'm willing to learn.

Tuesday, October 10, 2017

Optimal partitioning

I may have to punt on this one. There are just too many variables to state with certainty that a partitioning strategy for CISS is optimal.

That said, some guidance can be derived. So, I did that. Here's what I've got.

We're trying to optimize the following system:

minimize sum(rj)
where B ≥ sum((nj - rj) σj2)
and rj ≥ 1 for all j

Well, duh, that's easy. Find the critical variance σ*, above which you sample everything and below which you sample as little as the constraint allows. Just set rj=nj for all the σj2 > σ* and rj=1 when σj2 < σ*. The only stratum that gets a partial sample is the critical one where σj2 = σ*.
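
In code, that known-variance solution is just a greedy fill. A sketch (the names are mine; nj and σj2 come in as vectors, B is the variance budget):

alloc <- function(n.j, sigma2.j, B) {
  r <- rep(1, length(n.j))                          # every stratum gets at least one block
  remaining <- sum((n.j - r) * sigma2.j)            # unsampled variance, the quantity bounded by B
  for (j in order(sigma2.j, decreasing = TRUE)) {   # work down from the noisiest strata
    if (remaining <= B) break
    take <- min(n.j[j] - r[j], ceiling((remaining - B) / sigma2.j[j]))
    r[j] <- r[j] + take
    remaining <- remaining - take * sigma2.j[j]
  }
  r                                                 # nj above the critical variance, 1 below, partial at the cut
}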

That's all well and good but, if we knew the distribution of the block sums in advance, we wouldn't be sampling in the first place. Since we have to estimate σj2, things get a bit more complicated.

First off, note that the bias on the estimate is O(1/h), where h is the number of hits (sampled blocks with a non-zero sum) we've had on the stratum. If we assume that the number of hits is proportional to the number of samples (that's a bogus assumption, by the way), we get

E(rj) = q σj2 / σ*

where q is some constant that will vary by the query. Roughly. And that's just for the strata where σj2 < σ*. If σj2 > σ*, it's reasonable to assume the entire stratum will be sampled, since the bias is high.

So, while we have no closed-form expressions, there are some principles that are clear. First off, more strata means the σj2 go down. That's offset by the fact that you have to sample blocks even from strata with very low variances just to get rid of the bias. So, you really want the gaps between the σj2 to be large enough that the estimates for the lower ones can slide underneath σ* as quickly as possible.

Second, this only pays off if you can pack lots of blocks into a few strata with very low variance. With heavy-tailed distributions, that means really going after the tails. Recall from the chart from last week that the inner quartiles of Cauchy were every bit as tight as Normal; it was the tails that needed to be chopped up.

So, the guidance I'm going to suggest is to go with the σj2 = 1/nj heuristic and tune from there. Not the big-time math result my adviser was hoping for. Sigh...