Monday, October 30, 2017

Which bootstrap?

There is one small hitch in my plan to use the bootstrap algorithm in my comparo: which one to use? The term is pretty broad, basically encompassing any technique that resamples a sample to gauge the quality of an estimator of the underlying population. The name comes from the phrase "pulling yourself up by your bootstraps": you don't know anything about the population distribution, so you repeatedly resample your sample to simulate drawing from the population.

The Bag of Little Bootstraps (BLB) paper that I'm officially comparing against seems to favor the percentile bootstrap, though the authors admit that other techniques may be more appropriate in various situations. Because my data is highly correlated within physical blocks, I'm leaning towards using the Block Bootstrap. This also plays into my desire to keep all my subsamples within a physical block of data.
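
If I do go the block route, the computation itself isn't much code. Roughly what I have in mind, sketched in R with a made-up data layout (the real blocks live in the database, obviously):

# Sketch of a block bootstrap estimate of the variance of the mean block sum.
# 'blocks' is assumed to be a list of numeric vectors, one per physical block.
block_boot_var <- function(blocks, reps = 1000) {
  blocksums <- sapply(blocks, sum)
  boot_means <- replicate(reps, mean(sample(blocksums, replace = TRUE)))
  var(boot_means)    # bootstrap estimate of Var(mean block sum)
}

# e.g. 100 fake blocks of 50 values each, with some block-level correlation
fake_blocks <- lapply(1:100, function(i) rnorm(50, mean = rnorm(1)))
block_boot_var(fake_blocks)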

It also makes for an interesting matchup between the BLB and the Bootstrap Metropolis-Hastings (BMH) algorithm. If I use blocks as my subsamples for both, it will be a straight-up fight to see which one can do a better job of estimating the true blocksum variance.

Sunday, October 29, 2017

Tyson O

I did a local orienteering meet today. Boy, am I out of practice. Slow in the terrain and my nav was super sloppy. I was on the fence about going to Kansas City for the Possum Trot. I'm not anymore. It would be a complete waste of time.

That said, it was a truly great day to be out in the woods. High 40's, partly cloudy, and no bugs thanks to our first freeze last night. A great day, even if the result wasn't.


Saturday, October 28, 2017

Six weeks

I don't recall the author who said it, but I once heard a quote to the effect that "books take six weeks". What the author meant was that you have this stuff bouncing around in your head, but when you actually put pen to paper, it's six solid weeks of work to produce a novel.

I wouldn't know anything about that but it does seem to match the patterns of the few published novelists I know.

It appears that's also how long it takes for me to produce a paper. As some may have noticed from the little graphs to the right of this page, I track my school hours in Attackpoint, which is a site intended for tracking physical training. I've found that keeping a "training log" of my schoolwork helps me recognize when I'm not making progress because I'm stuck versus not making progress because I'm simply not putting the effort in.

That would be enough reason to do it, but it also provides the feedback that project managers use so much: actuals versus estimates. That is, how long did you think it was going to take versus how long it actually took. I never put out an estimate for my CISS paper, but I guess I would have said that a publishable result requires a month or so of effort.

It looks like that's about right. I still have some revisions to do, but I'm basically done with the CISS paper. Any further work will be logged under general thesis research until I have another publication squarely in mind. Right now, my hours specifically on CISS are 225. With the remaining revisions, that will go to something like 250-275; basically six weeks of solid effort. Too bad it's been spread across 18 months. Hopefully, with coursework done, the next efforts will have a better clock to calendar time ratio.

Friday, October 27, 2017

Why would you even do that?

I didn't have much to show for this week's efforts. Reading a paper and generating a bunch of random data sets doesn't make for a good demo. I decided to go ahead and keep my regularly scheduled meeting with my adviser, figuring I could at least get him to validate the approach I was taking.

To that end, I put together a little table of each method being compared and why you would even consider sampling that way.

n: number of raw rows
m: block size
N=n/m: number of blocks
r: blocks read
s²: estimated variance of block sums
B: bound on variance of population sum estimator

Full Sampling

Block selection: sequential
Variance estimator: n/a
Estimator of population sum: actual sum
Stopping rule: all blocks read
Rationale: Certainty. This is what most people do now with finite data sets.
Drawback: Slow if data set is really large.

Random Sampling

Block selection: random
Variance estimator: sample variance of block sums read so far
Estimator of population sum: N * (sample mean block sum)
Stopping rule: N * s² < B
Rationale: Leverage the law of large numbers to get a good approximation with less reading.
Drawback: Variance will be understated for heavy-tailed distributions.

Bag of Little Bootstraps

Block selection: Random, each block selected becomes a "little bootstrap" subsample
Variance estimator: bootstrap estimate of blocksum variance
Estimator of population sum: N * (sample mean block sum)
Stopping rule: N * (sum s²) / r < B
Rationale: Better estimate of variance in the heavy-tail case
Drawback: More computation per block, variance may still be understated due to correlation

Bootstrap Metropolis-Hastings

Block selection: random
Variance estimator: Metropolis-Hastings
Estimator of population sum: N * (sample mean block sum)
Stopping rule: N * (sum s²) / r < B
Rationale: Another stab at getting a better variance estimate
Drawback: Same as above

CISS

Block selection: Random by strata to minimize estimate of total variance
Variance estimator: Bayesian posterior applied to stratum variance, S
Estimator of population sum: sum of strata estimates
Stopping rule: sum(S) < B
Rationale: Leverage the nature of the desired statistic to actually lower the blocksum variance
Drawback: Data must be stratified in advance


This contest is totally rigged to have CISS come out on top. All the other methods just do increasingly good jobs of telling you how hard the problem is. That's because they are general purpose methods designed to work on any statistic. CISS is using the fact that we know exactly what kind of statistic we are after (1st order U-statistic) and therefore can rearrange things to minimize the variance of such a statistic.
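
For the record, the Random Sampling row is only a few lines of code. A rough sketch in R (the names and the in-memory blocksums vector are just stand-ins for reading blocks off disk):

# Random block sampling with the table's stopping rule: read blocks in random
# order until N * s^2 (s^2 = sample variance of the block sums read so far)
# drops below the bound B, then estimate the population sum as N * mean.
random_sampler <- function(blocksums, B) {
  N <- length(blocksums)
  order_read <- sample(N)                 # random block order
  read <- numeric(0)
  for (r in seq_len(N)) {
    read <- c(read, blocksums[order_read[r]])
    if (r >= 2 && N * var(read) < B) break
  }
  list(estimate = N * mean(read), blocks_read = r)
}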

Wednesday, October 25, 2017

Laplace random variables

The exponential distribution really is a gem. Aside from being correct in the vast majority of cases where it is employed (as opposed to the normal, which statistical hacks will apply to just about any problem), it is so perfect mathematically that many cool properties just seem to pop out of nowhere.

I was generating my random data sets (see step #1 from yesterday) using RStudio. There are a number of ways to get a sample of pseudo-random Laplace variables. The simplest, of course, is to just get a bunch of exponentials with half the variance you're looking for and reverse the sign on half of them.

>laplace = c(-rexp(5000, 1.02), rexp(5000, 1.02))

This gives you a vector of 10,000 random variables distributed as Laplace with a variance of 2/1.02² ≈ 1.92 (and an inter-quartile range of about 1.35, same as N(0,1)). But, there's a cooler trick out there. Because of the memoryless property of the exponential distribution, the difference between two independent exponentials winds up being Laplace. This is fairly easy to derive if you think to look for it (I didn't, I just stumbled across it while looking up other things about Laplace).

Again, in R:

>laplace = rexp(10000, 1.02) - rexp(10000, 1.02)

Of course that's a silly way to do it; it requires double the computation. But, it's also kinda cool. In fact, the three distributions I'm working with can all be generated from exponential and uniform. If E1, E2 are standard exponential (mean & variance of 1) and U is uniform(0, 2π), then:

X1 = √(2 E1) * sin(U) ~ N(0,1)
X2 = √(2 E1) * cos(U) ~ N(0,1)
(and X1 and X2 are independent even though they use the same E1 and U)

L = E1 - E2 ~ Laplace (variance 2)

C = X1 / X2 ~ Cauchy
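
A quick sanity check of those identities in R (just eyeballing moments and quartiles, nothing rigorous):

set.seed(42)
E1 <- rexp(1e6); E2 <- rexp(1e6); U <- runif(1e6, 0, 2 * pi)

X1 <- sqrt(2 * E1) * sin(U)    # should be N(0,1)
X2 <- sqrt(2 * E1) * cos(U)    # should be N(0,1), independent of X1
L  <- E1 - E2                  # should be Laplace with variance 2
C  <- X1 / X2                  # should be standard Cauchy (it's just tan(U))

c(var(X1), var(X2), cor(X1, X2))   # ~1, ~1, ~0
c(var(L), IQR(L))                  # ~2, ~1.39
quantile(C, c(0.25, 0.75))         # ~ -1 and 1, as standard Cauchy should be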

Tuesday, October 24, 2017

If you can't get stuff done...

... make it look like you meant it by producing a decent plan!

That's pretty much what I'm going to have to do this week. I sized up the task, and I don't think the comparo is going to be complete by Friday. But, I do have a plan! I'll even number the steps to make it look more official.

1) Construct simulated data sets of 10,000 blocks for each of the three distributions I've been featuring in the paper (Normal, Laplace, Cauchy). Assign correlated dimension values to the measures. (There's a rough sketch of this after the list.)

2) Run CISS on the simulated data sets.

3) Run the Random sampler on the simulated data sets.

That much might actually happen by Friday.

4) Code the BLB algorithm.

5) Run BLB on the simulated data sets.

6) Observe that all three are pretty much the same for Normal (at least, I hope this is true), but that CISS starts to look better as the tails get heavier.

7) Run the algorithms using query criteria that reference the correlated dimensions, thus raising the variance of the block sums.

8) Note that CISS does way better in this case (at least, I hope this is true).

9) Be really sad if hopes aren't met.

10) Figure out what I did wrong.

11) Fix it.

12) Write up results.
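
For what it's worth, step 1 in my head looks something like this. The block layout and the "region" dimension are placeholders; the real generator will be messier.

# Sketch of step 1: blocks of simulated measures, with a dimension value that
# is correlated with the block it lands in (so filtering on it later will
# raise the block-sum variance).
make_blocks <- function(n_blocks = 10000, rows_per_block = 100, rmeasure = rnorm) {
  lapply(seq_len(n_blocks), function(b) {
    data.frame(
      block   = b,
      region  = (b %% 10) + 1,          # dimension tied to the block
      measure = rmeasure(rows_per_block)
    )
  })
}

normal_blocks  <- make_blocks(rmeasure = rnorm)
laplace_blocks <- make_blocks(rmeasure = function(n) rexp(n, 1.02) - rexp(n, 1.02))
cauchy_blocks  <- make_blocks(rmeasure = function(n) rcauchy(n, scale = 0.675))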

Monday, October 23, 2017

Baggin' them bootstraps

The first technique that I'll be using as a comparison will be the "Bag of Little Bootstraps" which I wrote about last spring. It doesn't appear that I can derive a theoretical superiority; I'll have to actually code the algorithm and run it on my data. I might want to run it on some simulated data as well. The long and short of it (mostly long) is that I'm in for some more development work.

The good news is that I think I can adapt my basic sampling engine to handle the selection, so all I need to write from scratch is the actual computation, which isn't particularly tough.
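
For my own reference, the computation looks roughly like this: treat each subsample as if it were the whole population of block sums, resample with multinomial weights, and average the resulting variance estimates. A sketch in R (the names are mine, and in my case each subsample would be the sums from one physical block):

# One "little bootstrap": pretend the b values in the subsample are the whole
# population of n block sums, resample n of them via multinomial weights, and
# measure how much the estimator wobbles.
blb_one_subsample <- function(subsample, n, reps = 100) {
  b <- length(subsample)
  stats <- replicate(reps, {
    w <- as.vector(rmultinom(1, n, rep(1 / b, b)))
    sum(w * subsample) / n            # weighted mean block sum for this resample
  })
  var(stats)
}

# BLB estimate: average the per-subsample variance estimates.
blb_variance <- function(subsamples, n, reps = 100) {
  mean(sapply(subsamples, blb_one_subsample, n = n, reps = reps))
}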

One thing that did amuse me a bit (I'll admit I'm being just a bit smug, here) was their section discussing "large-scale experiments on a distributed computing platform." This was describing a data set of 6,000,000 rows. Um, guys, you're short a few zeros on that number. My data set has 60,000,000,000 rows.

Sunday, October 22, 2017

The end of a good life

My father died last night. It was expected, though one is never really ready for such things. He wasn't particularly ill, he had just gotten progressively weaker and less cogent. About two weeks ago, he pretty much gave up trying. We (meaning, my sister, since she lives just outside Ithaca and I'm 900 miles away) put him in hospice, in accordance with his wishes. They took good care of him and mom got some much needed rest. Two days ago, he stopped eating. As far as we can tell, he went fairly peacefully.

When most people talk about someone having lived "a good life" they mean a life to which one might aspire. More successes than failures, a loving family, a few notable things done that left the world a better place. One could look at my dad's life and say that, but that's not what I mean when I say he lived a good life.

Dad was good. He did the Right Thing. Always. It was amazing to watch. I'd like to say it rubbed off on me and maybe it did to some extent, but his virtue was simply on another level. He wasn't sanctimonious about it, he just did it. Love your neighbor as yourself wasn't a command to him. He simply didn't know any other way to deal with people.

When I was in High School, I remember him lamenting to me that there was one guy at work that he just couldn't get along with. He said he'd never found someone with whom he couldn't get along. Even as a brash teenager, I knew better than to add my own commentary to such a confession, but inside I was thinking: "You're 50, and just now you're meeting someone you can't get along with?"

I realize it sounds like I'm describing George Bailey from It's a Wonderful Life but, just ask anybody who knew him. George would have admired him. (If for no other reason than he knew better than to hire incompetent family members and put them in charge of really important stuff; that's kind of important whether you're running a bank in Bedford Falls or a Molecular Biology lab at Cornell).

Anyway, he has passed. It's tempting to haul out the cliche that the world is a lesser place for it, but I don't think that's true. His was a good life. Lives have beginnings and endings. His has ended. And, the portion of it that I got to see was the most good life I have ever known.

Saturday, October 21, 2017

Don't ask unless you want an answer

As I mentioned a while back, I decided to run the Gumbo Flats 10K to see if my fitness was where it needed to be to put in an honest effort at Pere Marquette.

40:24.

It's not.

So, now I need to decide if I'll try to address the fitness in the mere seven weeks between now and then, accept that I'm not fit and run anyway, or not run.

Anybody who knows me knows that option 2 is a non-starter. Especially at a race where I have as much history as Pere Marquette. So it's train super hard for about five weeks or just go up there and have a nice weekend, maybe working one of the water stops.

That last one sounds pretty appealing.

Friday, October 20, 2017

Comparo!

I don't know why this never occurred to me. I guess that's why you have an adviser in grad school. I should actually compare the results of my method to some other methods. After all, why would anybody care about your method if it wasn't at least as good as stuff that's already been published?

He had a couple of suggestions for comparison. The one I'm liking the most is the "Bag of Little Bootstraps" method, mostly because it's a really cool name. It also has the advantage of being relatively easy to code. Not sure if I can knock out the result in a week, but I told him I'd try. I would really like to stay on track for taking the A exam in December or January. I think I'm pretty close. Having a comparison result will help.

Thursday, October 19, 2017

More graphs

Under the heading of "A picture is worth a thousand words", I put together another graph showing the optimal number of strata. Actually, what this shows is that you don't have to stress over the optimal number of strata. Just stratify your data and CISS will do just fine sampling it.



And, while I'm at it, I had a slight mistake in yesterday's graph which I corrected:


Wednesday, October 18, 2017

Docking the tail

Why do you dock a boxer's tail? Because it destroys everything.

Same is true with heavy-tailed distributions. Here's a little graph showing how full sampling of the tail allows us to take much less of a sample from the interior of the distribution. The optimal point of the cut depends on the distribution. As you might expect, the heavier the tails, the more you want to cut off. Normal hardly benefits at all. Cauchy works best with a fifth of the distribution relegated to the tail.



The optimal cut point also depends on the total number of blocks. Here, the total is 10,000 and we're assuming a "full" query, that is, one where every row is relevant. Most queries filter rows, which raises the variance of the interior.

Monday, October 16, 2017

Oops

Not sure how we got this far in without catching the fact that my calculation of total variance is wrong, but we did. Fortunately, it's the total variance formula that was missing a term, not the block sum variance (which would have invalidated my theorem and pretty much sent the whole paper back to square one). Also fortunate is that the missing term was n/r, the total number of blocks divided by the number sampled. That converges to 1, so the asymptotic behavior doesn't change and the empirical results still stand.

So, bullet dodged and, as a bonus, having that term in there makes some optimality claims much more sustainable. I'll work on those tonight.

Saturday, October 14, 2017

Rock Bridge Revenge

I don't have too many reasons to ever go to Columbia, Missouri, but when I do, I always try to get in a run at Rock Bridge State Park. The west side of the park is loaded with cool rock features (the most prominent giving the park its name) and the larger east side is just miles of flowing singletrack through pretty woods. I've done a couple orienteering meets there, but never felt like going all the way to Columbia for a short trail race. When they added the 50K distance a few years ago, I was immediately interested, but it wasn't until this year that it fit into my schedule. After the DNF at Leadville, I almost bailed in favor of doing a fall 100, but decided that this one had waited long enough.

It's only a two-hour drive so I get up at 3AM and drive out race morning. I don't generally sleep well the night before a race anyway. Oops, did I say race? I mean, event. Actually, no, I mean race. The result notwithstanding, the truth is that I took pretty good fitness into Leadville. It bugged me that it came to nothing. I decided I'd try to hold on to it for a few more weeks and give this one an honest effort.

Since I really do need to make some forward progress on my dissertation, the only way to keep my mileage up is running to and from work. That's great for fitness, but doesn't offer much opportunity for technical training. I did plenty of trail running over the summer, but most of it was at 100-mile pace. A 50K is a completely different stride. The trails at Rock Bridge aren't terribly difficult, but they do have more than their share of what Mikell Platt calls "time bleeds", little obstacles that break your stride and steal your pace a couple seconds at a time. I manage to get out on trails a few times to sharpen things up.

Any thoughts of winning are dismissed by the presence of John Cash. I'm pretty sure I won't be keeping up with the overwhelming favorite on the women's side, AnnMarie Chappell, either. That's actually pretty comforting as it makes it less likely I'll do something stupid in the first half of the race.

Left to right: Cash, Chappell, Me, Sankalp
The first half mile is an out and back on the park road to string things out a bit. John and AnnMarie hit the trail first, followed closely by Shiva Sankalp and another runner I don't know. I'm in fifth with a few other runners fairly close behind.

The sun only rose a few minutes ago and it's hidden behind overcast skies. As a result, I'm almost wishing I'd brought my headlamp as the woods are fairly thick. Despite some heavy rain in the last couple days, the trail is in pretty good shape and I've got a half dozen machine screws drilled into the sole of each shoe, so the poor visibility isn't a huge problem. Even concentrating on the placement of each stride, there's no missing the fact that this is a truly beautiful forest. The diffuse light gives it all a deep green glow.

By a mile in, the four ahead are out of sight. After crossing the stream around mile 2, I hear no splashes behind me. It appears this is going to be a lonely run. I'm fine with that. This isn't really a distance conducive to early-race chatting; the effort is fairly firm throughout. Being alone also reduces the risk of missing a trip hazard and going down hard. (Alert readers may have picked up on the foreshadowing from the use of the word "reduces" rather than "eliminates").

The first six miles make a loop around the perimeter of the west side of the park. We don't go past any of the really cool stuff as those trails get mighty busy on a typical fall day. As this morning is uncharacteristically muggy and drizzly, we pretty much have the park to ourselves. One could reasonably conclude from the current conditions that this was just going to be a drab fall day. The weather radar tells a far different story. A fierce line of storms is heading our way with an expected arrival right around the end of the first of two laps.

That's actually a good thing for me as I tend to do better in bad conditions. I do have one concern though: the osage orange. If you've never seen one, they are about the same size and shape as a regular orange, but that's where the similarity ends. They are really hard and no good to eat. The trees aren't super huge, but they're big enough that when the fruit falls to the ground, it hits with an alarmingly loud thud. Each little breeze sends several of them to the forest floor. If the storm unleashes them all, this could turn into a live fire exercise.

After the aid station at six miles, the route crosses to the east side of the park. This loop gets run in alternate directions from year to year. This year, we go clockwise, which is the direction that has "more climb". Obviously, that's not really true since you start and finish at the same spot, but what's meant by that is that more of the uphill is true uphill rather than gentle grade alongside a stream. People who have run the course both ways confirm that the clockwise loop is a bit slower.

Shortly into this loop, I catch the unidentified runner ahead of me. He's going a lot slower than he was last time I saw him, but doesn't appear to be injured. If he's cooked himself by mile 10, he's in for a really unpleasant day. He hears me coming and steps off the trail to let me by.

The volunteers at the next aid station are particularly buoyant. I'm not sticking around to chat, but even the few seconds it takes to grab some banana slices and water are enough for them to unload quite a bit of encouragement. It's a pleasant boost, even if it wasn't needed just yet. I still feel fine, as one should at this point in the race.

The lap ends without incident. The time, 2:31 elapsed, is a bit of a shock. I don't wear a watch, so this is the first pace feedback I've received. I've only once run a 50K over 5 hours; usually I'm a bit over 4. There was nothing about the trail that seemed particularly slow and I felt like I was running OK. I'm also in fourth place, so I can't be doing that badly. Maybe the measurement is wrong. I don't know. At any rate, I feel fine, so maybe just a bit more effort on lap 2 will yield a negative split.

And then the storm arrives.

It got a lot muddier than this
Thankfully, it's just really heavy rain without much wind. The feared mortar attack from the osage orange trees does not materialize. With the trail already a bit soggy, it doesn't take long for the grip to completely evaporate. Even with the screws, my shoes have no luck securing solid ground. While this dashes all hope for a negative split, it also raises a different possibility. At least one of the people in front of me is probably going to cave in these conditions.

I'm not really sure why I do so well in the mud. It's not that I particularly like it. I just seem to be good at it. My best finish (11th) in 16 tries at the fabled Pere Marquette came on the year the trail was inches deep in mud. At Psycho Wyco, where inches deep in mud is a normal year, I've pretty much kicked ass and was the eighth person ever to get the title of "mud stud" for running it in under 5 hours. Now, with the trail turning into a river, I have a very real chance of a podium finish if I can find my form for the last 10 miles.

As I cross the road to the eastern loop, I get a comical display of the opposite extreme. There had been a guy hanging around the start wearing cowboy boots. Turns out, he was actually entered in the race. He's not looking super happy with his progress. Today might not have been the day for that joke.

The east side loop is insanely slick. I let up on a few of the descents where even a small mistake could be the end of the race. Other than that, I'm hammering for all I'm worth. The rain has stopped and the cool air behind the front is a welcome relief. But, the trail won't recover until hours after I'm done. I get to the aid station and find the workers every bit as upbeat as on lap one. As I'm leaving, they offer one noteworthy fact: "San is just in front of you."

Well, the reason I didn't ask is because aid station workers are notorious for giving out ambiguous information. "Just in front of you" could mean anything from 1 minute to 10. There are only six miles left, so unless it's closer to the former, he's out of reach. Still, they wouldn't likely say that if he was further ahead than he was on the last lap. It could well be that my second lap efforts are paying off.

Indeed, only a mile from the station, I spot his orange singlet ahead of me. He doesn't seem to be struggling with the mud as much as just plain old fatigue. I catch him on one of the smaller inclines and do my best to put on a brave face as I go by. He's 23 years my junior; not someone I want to be sprinting against up the final hill. Better to make the pass stick now. I surge for the next mile and then get an opportunity to look back at a trail bend and confirm that he's not making a fight of it.

I'm pretty happy to be in the top three, but I'm also very much feeling the effort. I try to remember exactly where I am on the loop and how much there is to go. I get to a wider section of trail that leads to a small campground. I recall that from there it's a big downhill, a very steep uphill, and then another long descent to the road crossing. After that is the final mile in the west section and OH SHIT!

Normally, when you trip on something, your foot does come free; you're just so far off balance that, after a few awkward steps, you hit the ground. This is more like a movie pratfall. Whatever grabbed my foot wasn't letting go, so all my forward momentum is quickly converted to rotational. For those who haven't taken physics, that means I hit the ground really fast and really hard.

Fortunately, this is not a rocky section of trail and, aside from having the wind knocked out of me, there's no damage done. I'm back up quickly and within a minute I've found my stride. Still, I decide that for the rest of the race I'll be concentrating on the trail right ahead of me and not everything else between me and the finish.

I pass quite a few of the 25K field (who started an hour after us) in the last few miles. As always happens to me, I start panicking that someone will catch me from behind in the last mile. This pretty much never happens to me so I don't know why I stress over it. Then again, maybe it doesn't happen because I stress over it. At any rate, I'm third to the line in 5:08. They don't do age groups in this race but, for what it's worth, John and AnnMarie are both under 50.

Not a time I'm going to brag about, but the effort was certainly a good one. I really enjoyed going all in on lap 2. Age, obligations, and life in general pretty much guarantee that the limits of my performance will continue to decline (and the rate of decline will likely increase). However, given that limit, it's pretty gratifying when you know you've come very close to it. I did my best and that will have to be enough. I'm fine with that.

Friday, October 13, 2017

Just doing his job

It's probably been obvious from my posts but, in case not, I'll just state it: I would have submitted this CISS paper for publication a loooooong time ago. I'm used to the pace that things get done in the corporate world. Who cares if it's not great? Show me what you got; we'll get some feedback.

That's not really the way it works in academia. Yes, there is a peer review process. But, you're expected to have your stuff pretty much in order before submitting to that.

I was thinking about that after I met with my adviser today. I could come up with a bunch of complaints about the process. That would ignore the fact that he's basically doing his job. That job, to put it bluntly, is to treat me like crap. Not in any kind of mean way, but part of grad school is getting your ego knocked down a few rungs so you leave a bit more conscious of your limitations. Yes, those egos do seem to restore themselves once placed in faculty positions, but think how much worse it would be if they were coddled all through grad school.

Anyway, I have to grudgingly admit that the paper is way better now than it was a month ago and that wouldn't have happened if he had just rubber-stamped it. It's still not done and it needs to be done real soon or the whole dissertation timeline is going to have to be reevaluated (read: there will be a mutiny on the home front and it will never happen). Still, it will only matter if it's done right. And, that takes time.

Thursday, October 12, 2017

ECM

I sent my adviser an update of my paper this afternoon. It still needs more work, but it didn't get any tonight because I took yet another night off and attended the annual gala for the Episcopal City Mission.

I'll admit to being a bit conflicted about charity galas. I attend two per year (this one and another for the Neumann Center at Washington University). They are certainly enjoyable. And the money goes to very good things. Still, damn, wouldn't it be great if people would just give without all the trimmings because it's the right thing to do?

I guess, but Christ attended quite a few parties, so maybe it's not such a bad way to make things happen.

Anyway, ECM ministers to juvenile offenders. It's tough work. As tonight's keynote speaker made clear, it's sometimes rewarding (meeting a former offender who has become a responsible adult) and sometimes not (meeting a former offender who's leaving the courthouse en route to the big house).

That's the nature of ministry. The ends are in god's hands. We are to be concerned with the means. Sometimes that's a hard thing. I don't know where I'm going to end up when I get my degree. I could land a teaching job at a small university. That's always been the goal, but the increased reliance on adjunct faculty makes that less likely than it was 20 years ago. I might stay in industry. I like the work and the money sure is good.

But, I'm also starting to think I might wind up at a community college or 4-year state school. The pay would be bad. Crazy bad. It would almost qualify as volunteer work. The resources wouldn't be anything like what I'd have in industry or at a university. Could it mean all the difference in one person's life? I don't know. I'm pretty good at getting results out of people who care. I don't really have a track record one way or the other with people who don't.

At the very least, I want to help with the kids who nobody wants to help. I'm not remotely equipped to do that, but I'm willing to learn.

Tuesday, October 10, 2017

Optimal partitioning

I may have to punt on this one. There are just too many variables to state with certainty that a partitioning strategy for CISS is optimal.

That said, some guidance can be derived. So, I did that. Here's what I've got.

We're trying to optimize the following system:

minimize sum r_j
where B ≥ sum (n_j - r_j) σ_j²
and r_j ≥ 1

Well, duh, that's easy. Find the critical variance σ*, where you want to sample everything greater and nothing less. Just set r_j = n_j for all the σ_j² > σ* and r_j = 1 when σ_j² < σ*. The only stratum that gets a partial sample is the critical one where σ_j² = σ*.
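
In R, that oracle allocation is only a few lines, assuming (and this is the whole problem) that the true stratum variances get handed to you:

# If the true stratum variances sigma2 were known, how many blocks r_j must be
# read from each stratum so that the unsampled variance, sum((n_j - r_j) * sigma2_j),
# drops below the bound B? Greedy: fill the noisiest strata first.
oracle_allocation <- function(n, sigma2, B) {
  r <- rep(1, length(n))                      # at least one block per stratum
  need <- sum((n - r) * sigma2) - B           # variance still to be removed
  for (j in order(sigma2, decreasing = TRUE)) {
    if (need <= 0) break
    take <- min(n[j] - 1, ceiling(need / sigma2[j]))
    r[j] <- r[j] + take
    need <- need - take * sigma2[j]
  }
  r
}

# e.g. a fat tail stratum and two quiet interior strata
oracle_allocation(n = c(50, 400, 550), sigma2 = c(9, 0.5, 0.1), B = 100)
# -> 50 310 1: read the tail completely, partially sample the critical stratum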

That's all well and good, but if we knew the distribution of the block sums in advance, we wouldn't be sampling in the first place. Since we have to estimate σ_j², things get a bit more complicated.

First off, note that the bias on the estimate is O(1/h), where h is the number of hits (sampled blocks with a non-zero sum) we've had on the stratum. If we assume that the number of hits is proportional to the number of samples (that's a bogus assumption, by the way), we get

E(r_j) = q σ_j² / σ*

where q is some constant that will vary by the query. Roughly. And that's just for the strata where σ_j² < σ*. If σ_j² > σ*, it's reasonable to assume the entire stratum will be sampled since the bias is high.

So, while we have no closed-form expressions, there are some principles that are clear. First off, more strata means that the σ_j² go down. That's offset by the fact that you have to sample blocks even from strata that have very low variances to get rid of the bias. So, you really want the gaps between the σ_j² to be large enough that the estimates for the lower ones can slide underneath σ* as quickly as possible.

Second, this only pays off if you can pack lots of blocks into a few strata with very low variance. With heavy-tailed distributions, that means really going after the tails. Recall from the chart from last week that the inner quartiles of Cauchy were every bit as tight as Normal; it was the tails that needed to be chopped up.

So, the guidance I'm going to suggest is to go with the σ_j² ∝ 1/n_j heuristic and tune from there. Not the big-time math result my adviser was hoping for. Sigh...

Monday, October 9, 2017

Pseudo-theorem

Well, after messing with it for several days I've decided that even if I can prove my contention that the optimal stratification has the variance inversely proportional to the number of blocks, it's still pretty bogus because the variance changes with each query.

However, in the process of working through all that, I did formalize my approach to where it's very easy to justify stratifying it that way even without the proof.

I may still have something on the optimal number of strata, and that would actually be the more useful result.

Sunday, October 8, 2017

Stewardship

I got asked to talk about stewardship at church again this year. Here's what I said.

“Usnavy. That’s a pretty name. I’ve never heard it before.”

She gave a wry smile and with a slight eye roll explained, “You’ve heard it, it was printed on the blanket I was wrapped in at the naval hospital.”

As some of you know, I’m a bit of a story teller. As such, I don’t generally put much stock in the details in the stories told by others. One should never let the facts get in the way of a good yarn. However, when this story was told to me by a fellow missionary, I accepted it straight up. I did that because, while my stint in full-time service to the Lord was brief, it was long enough for me to learn two things:
  1. You don’t need to make up crazy stories about your missionary work; they come to you.
  2. The people missionaries minister to believe that they have no power whatsoever.
You see, the reason that story is funny is also the reason it is tragic. As affluent Americans, we can’t imagine letting someone or something else tell us what to name our children. Or where we should live. Or what we should eat. What career to pursue, what school to attend, what car to drive, what color raincoat we’ll buy. We decide for ourselves. We are empowered.

If the only raincoat you can afford is the plastic bag out of a park trash can, you don’t get to pick the color.

What’s this got to do with Stewardship? Well, I’m gettin’ to that.

We have a very engaged congregation. And that’s a great thing. Everybody who can help out around here really should, if for no other reason than to realize just how much work goes into making a group of believers a functioning church.

But, there is a pattern of Stewardship in affluent churches such as ours. We tend to lend our efforts to inward ministries. For outward ministries, we are much more likely to write a check.

Frankly, as a college student with no means, that pattern was very useful to me when trying to raise funds for my missionary work. And make no mistake, outward ministries like Trinity Food Pantry and Episcopal City Mission have real expenses and they appreciate your donations very much. But by giving actual service to these ministries, something much more profound can happen.

It is often stated that these programs we put in place to help the underprivileged would work a whole lot better if the people they were aimed at would show a bit more gumption. This is true. But just how confidently does a person work towards their future when they’ve been so beaten down by a system of inequality that they don’t even think they can name their own kid?

They are children of god. They are worth more than they are getting. They need to hear that. They need to hear it from us, not our delegates. They will only believe it if we, the beneficiaries of the most imbalanced economic system in the history of the world give them the one thing we can’t buy more of. We need to give them our time.

Friday, October 6, 2017

The devil you know

Well, we knew it was coming. Production went down big time last night. We had it back running by mid-morning, but it took the rest of the day and some of the evening to clean up the mess. At least we now know one of the failure patterns. One of the disconcerting things about a complete overhaul of a system is that when it goes into production, you really don't know how it's likely to fail. All the failure patterns identified prior to go-live are addressed, but only a fool thinks there aren't more waiting out there.

The old system failed a lot, but it failed in ways we knew. It wasn't great fun to be woken at 2AM to fix it, but it generally was pretty easy to fix. The fact that the new system has been running fine in prod for over a month with no failures is certainly an indication that it's more robust than the old one. But, today was a reminder that when it does fail, we have some thinking to do.

Anyway, it's all back running again now. And, I did slide in just a little math today. I proved the bias on the CISS variance is O(1/h) where h is the number of non-zero blocks sampled so far. That result is pretty useful when deriving the optimal number of strata.

Thursday, October 5, 2017

Fake it til you make it

One trick for proving things is to proceed with your argument and, every time you hit a snag, just figure out what you need to get over that hurdle and then assume it as part of your premise. Then, when you're done, see if the condition can be relaxed without compromising the theorem. If it can, yay for that. Otherwise, at least you have something for a more limited case.

I'm still working on the proof that the optimal stratification yields strata variance inversely proportional to the number of blocks. I think I may have a way forward, but I've had to add a couple conditions. Most notably, I need the density function to be continuous and monotonic. The first is really just a convenience. The second seems like it will help quite a lot. Since I'm looking at partitioning unimodal distributions, the monotonic condition is no problem except that I need to split the distribution in two at the mode to get two disjoint monotonic densities.

All this is completely unnecessary and not even particularly useful since the distribution changes with every query, so trying to find an optimal partitioning is a bit of a fool's errand. However, it could be a really pretty proof and, as with everywhere else, there's a place in math for aesthetics.

Wednesday, October 4, 2017

Empirical stratification

I put together a little table showing how chopping the tails off of three different distributions affects the non-tail variance. Each of these has an inter-quartile range (the difference between the 25th and 75th percentile) of 1.35.

                      N(0,1)    Laplace(0,2.25)    Cauchy(0.675)
σ²                    1         3.18               inf
75th percentile       0.67      0.67               0.67
95th percentile       1.64      2.25               4.26
97.5th percentile     1.96      2.93               8.58
σ²_0.50 (trim 50%)    0.15      0.13               0.13
σ²_0.10 (trim 10%)    0.60      0.81               1.41
σ²_0.05 (trim 5%)     0.76      1.15               3.37

Clearly (and predictably), the heavy-tailed distributions benefit a lot more from chopping the tails off and sampling them completely. However, even the Normal distribution gets a 25% boost in power just by trimming the extreme 5% of the distribution. That is, if you can stop sampling with n-r blocks unsampled when the tails are left in, you can stop with (n-r)/0.76 blocks unsampled if you trim the tails. With Laplace (AKA, double exponential), you get to stop at (n-r)·3.18/1.15. That's nearly three times as many unsampled blocks.

The law of diminishing returns is also clear. While Cauchy gets some real benefit from trimming a full 10% off, even the Laplace distribution is showing only marginal improvement from the 5% trim. Trimming off everything beyond the quartiles doesn't make any sense even in the case of Cauchy. You'd be sampling half the distribution every query plus whatever you needed to get the interior estimate to converge.

A better approach is to go with multiple strata. While I'm sure the best way to tune this is to just run a bunch of tests on your actual data set, I'm also pretty sure that optimal stratification breaks the data up so that the stratum variance is inversely proportional to the number of blocks in the stratum. That may not be true and it might be really hard to prove, but it's worth a shot.
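
The trimmed rows in the table are easy to spot-check by simulation, by the way. A quick sketch, assuming my convention of cutting half the trimmed fraction from each tail:

# Spot-check of the trimmed-variance rows: variance of the interior of the
# distribution after cutting p/2 from each tail.
trimmed_var <- function(x, p) {
  cuts <- quantile(x, c(p / 2, 1 - p / 2))
  var(x[x > cuts[1] & x < cuts[2]])
}

set.seed(1)
n <- 1e6
samples <- list(
  normal  = rnorm(n),
  laplace = rexp(n, 1.02) - rexp(n, 1.02),   # IQR ~1.35, see the Oct 25 post
  cauchy  = rcauchy(n, scale = 0.675)
)
sapply(samples, trimmed_var, p = 0.05)   # should land near the σ²_0.05 row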

Tuesday, October 3, 2017

Optimal stratification

Here's the motivation behind CISS in a nutshell. I argued last week that the difference between our estimate of the sample sum based on the subsample and the true sample sum was

Var(Y - Y_r | X_r) = σ²(n - r), where σ² = Var(X_i).
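
As a quick gut check, treating the unsampled block sums as i.i.d. draws with variance σ² (which is the working assumption anyway), the claim is easy to simulate:

# Var(Y - Y_r | X_r) should be sigma^2 * (n - r): the uncertainty left over is
# just the variance of the sum of the n - r unread block sums.
sigma2 <- 2; n <- 1000; r <- 900
unread_sums <- replicate(1e4, sum(rnorm(n - r, sd = sqrt(sigma2))))
var(unread_sums)   # ~ sigma2 * (n - r) = 200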

Now, take just the case where we trim off the tails. Clearly, the variance of the remaining blocks is reduced without all those tail observations (just how much it's reduced depends on the distribution). Assuming we sample the tails in their entirety, the remaining variance is

Var(Y - Y_r | X_r) = σ_T²(n - r), where σ_T² is the variance of the non-tail elements.

This is clearly less than the original value since the only difference is σ_T² being swapped in for the greater quantity σ². So, given that this is better, what's the best possible answer? The optimization problem we're trying to solve is:

minimize r such that

  1. σ_T²(n - r) ≤ B (where B is our upper bound on the estimate variance)
  2. r > number of blocks in the tail strata (we need to sample at least one block from the middle to have an estimate of the mean)

You can crunch that through a linear programming algorithm if you want, but it's pretty obvious by inspection that the optimal solution is one that puts r-1 blocks in the tail. The problem, of course, is that we have no idea what σ_T² is. We could compute it for the entire population easily enough (we're going to have to do a complete pass of the data to stratify it anyway), but it's going to change with every query. Some queries won't have any results in the tail. Others will have lots. Some will have high correlation in the non-tail blocks (increasing the variance), others will be much more random.

So, just tearing off the tails won't really help much. We need more strata. How many more? Ah, now we're getting into rough waters. The short answer is: it depends. Mostly it depends on how volatile the stratum variances are from one query to the next. I don't know that this has a tractable answer mathematically. I could just simulate a bunch of results. Before heading down that road, I think I'll at least try to come up with a theorem that relates the distribution to the optimal partitioning for any one query.

Monday, October 2, 2017

Dual space

One thing I got fairly good at when I was at Cornell was looking at a problem from the opposite direction. Mathematically, it's known as the dual space and it's a standard trick when dealing with optimality problems (which are pretty much what you do in Operations Research, which was my major at Cornell). Essentially it means that your set of good answers is bounded by a set of infeasible answers and if you can find a point right on that boundary, you've found the best one.

For example, suppose you're trying to figure out the decimal expansion of 3/8. Consider all the decimal numbers less than or equal to 3/8. Start with a number you know is less, say, 3/10 = .3. What about .4? Nope, that's greater; it's in the dual space. Now you know that the answer has to be somewhere between those two. Try .35. Less. Great, now you're down to a 0.05 difference. Keep going until you have it close enough (or, in this case, you'll probably hit .375 exactly because this is a ridiculously easy example).
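
That toy example is literally just bisection. In R:

# Squeeze a known-feasible lower bound and a known-infeasible upper bound
# together until they pin down 3/8.
lo <- 0.3   # known to be <= 3/8
hi <- 0.4   # known to be >  3/8 (in the dual space)
while (hi - lo > 1e-6) {
  mid <- (lo + hi) / 2
  if (mid <= 3/8) lo <- mid else hi <- mid
}
lo   # 0.375 (to within the tolerance)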

That's a purely numeric problem, but the same thing comes up in algorithms, heuristics, and even general approaches. Sometimes, the best way to solve a problem is to consider all the solutions that don't solve it and why.

All this brings me to Huber estimators. Huber estimators fall into a broader class of estimators known as robust statistics. The idea is that these are statistics that don't get creamed just because one or two outliers make it into the data. The median is much more robust than the mean because throwing a crazy number into the mix doesn't change it much. The mean can get completely whacked by one big number. That's why you see things like median household income being reported as opposed to the mean. The fact that one person got rich doesn't really change the outlook for the general population, so you want the statistic to reflect that.

Huber estimators clamp observations. That is, any observation within a "normal" range is treated as itself. If it's outside of that range, it's treated as if it were at the end of the normal range. Huber estimators tend towards the mean for large samples, but are more robust, like the median, with smaller ones.
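
Something like this is the flavor. It's a one-pass sketch of the clamping idea (the real Huber M-estimator iterates, so this is closer to a winsorized mean), but it shows the point:

# Clamp observations to a "normal" range around a robust center, then average.
huber_ish_mean <- function(x, k = 1.5) {
  center <- median(x)
  s <- mad(x)                                   # robust scale estimate
  clamped <- pmin(pmax(x, center - k * s), center + k * s)
  mean(clamped)
}

x <- c(rnorm(99), 1e6)     # one absurd outlier
mean(x)                    # ruined
huber_ish_mean(x)          # barely notices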

This is, of course, the exact opposite of what I want for CISS. Those outliers are crucial. We need to find them. The middle of the distribution isn't particularly interesting. So, look at it as a dual problem. Instead of clamping the outliers and focusing on the middle, grab all the outliers and don't really worry too much about the middle. That's really what CISS does. It gets just enough data from the central strata that the estimates aren't totally bogus. But, the further you get from the mean, the more it grabs.

Because we don't really know much about the underlying distribution, we can't completely ignore the middle, but in the theoretical case where the distribution was actually known, the sampling algorithm that returns the closest estimate of the full sample sum is one that strictly samples blocks in order of their variance.

Sunday, October 1, 2017

Wilted

A coworker and friend of mine ran the MO Cowbell marathon today. I met him at 8 miles and ran with him for a bit. He seemed to be doing well. The fact that we talked the whole time indicated he wasn't pushing too hard. He was just ahead of pace for running a 3:40, which was his goal.

I saw him again at 19 and things weren't so pretty. The early clouds had dissipated and the temperature had come up ten degrees (to high 60's). He was still moving OK, but it was obvious that the heat was getting to him. He was just over 3:40 pace.

Given the conditions (temps continued into the 70's over the next hour), he held it together fairly well and managed 3:48. Still a decent finish, but not what he was hoping for. It wasn't a silly hope, either; he has the fitness to run a 3:40. Today just wasn't the day.

It's a good lesson for me going into Rock Bridge next weekend. Forecasts a week out are always suspect, but it looks like we'll get some warm weather next weekend. I'll need to measure my effort well to avoid wilting in the final hour.