Friday, September 29, 2017

More math!

I ran the revamped theoretical section past my adviser today. He's generally pleased with it, but did have one more suggestion: can we prove that the algorithm works better than random sampling? It's intuitively obvious and supported by empirical evidence, but this is a math paper, so a proof really is in order.

I said, sure we could do that. Of course, by that I meant that I could do that with an additional 10-20 hours of work.

Sigh...

Thursday, September 28, 2017

Refactoring the discussion

In software development we "refactor" our solutions all the time. Basically, this refers to arriving at the same result, but organizing the steps and structures of the program differently. Sometimes a refactor is as simple as renaming things to be clearer. Other times, it's a complete redesign. In the latter case, it's highly desirable to have a bunch of unit test cases to make sure that the program really does still perform the way it should when you're done messing with it.

I found that to be the case with the re-write of the theoretical section of my CISS paper. As I mentioned last week, I decided to re-frame the problem in terms of sampling efficiency rather than as a solution to the issue with heavy tails. I further discovered that it was a lot simpler to derive all my formulas before bringing stratification into the mix. Once I had everything demonstrated for the homogeneous case, it was pretty obvious how to generalize that to stratified data.

The rub, of course, was that in doing all this I might actually wind up with a different answer. That's where having a working prototype came in handy. After writing up all my results, I then verified that it matched what the program was actually doing. I'm glad I did this because I did have a couple minor errors in what I posted last week and they were immediately obvious when held up against the code. If you really care, I updated the post to correct the errors.

Sunday, September 24, 2017

Stuck on expectations

So, I've raced twice this year. OK, if you count local orienteering meets and the tune-up 5K for Boston, it's been more than twice. But, really showing up at the starting line with the idea that this was an "Event" (not race!) and not a training run, it's been twice: Boston and Leadville.

It's no coincidence that those are two of the most prestigious races in the world. I've been intentionally seeking out events that I have absolutely no chance of winning so I can move myself into a mindset of participating rather than racing. I did that quite well at Boston and really enjoyed running one of my slowest marathons ever. I did it a bit too well at Leadville. Apparently, I need to bring a bit more urgency to that one or I miss the cutoffs. Anyway, even though I missed out on the last 40 miles, it was still a great thing.

So, now comes Pere Marquette. There's simply no way I can show up at this and kid myself into thinking I don't care about the result. While it may not register on the world stage, this really is one of the big deals in Midwestern Trail Running. And, in that pond, I'm a big fish (or, at least, I used to be). This will be my 17th showing. In the previous 16, I've finished in the top 20 twelve times, won my division 3 times, won my age group 8 other times, and been second or third another 3 times.

It's not just me, either. The last two years several runners have been annoyingly pleased to have beaten me. Sure, I could be a jerk and point out to them that grad school tends to knock your fitness down a couple rungs. I'm not seeing the upside in that. But, it is a bit painful to see folks so happy to have smoked you when you know you weren't anywhere near your best.

I've decided I'll run another tune-up race in October. If that indicates that I have a legit shot at running under an hour, I'll run. Otherwise, I'll transfer my entry to someone else and just work the race as a volunteer.

Saturday, September 23, 2017

Reframing the problem (again)

I was starting to write up some more explanation of the underpinnings of CISS when it occurred to me that I might be framing this whole thing wrong. It's not that my specific problem isn't valid; I really do have an incredibly unstable distribution. It's just that, even if that wasn't true, stratifying the sample will still get better results when the population is finite. That last bit is really, really important and I haven't been giving it enough ink. So, here's another go at what we're trying to accomplish (again, the notation is a bit rough; I'll typeset it later).

Let {Xi} be a sequence of observations drawn without replacement from a finite set X where ||X||=n. Let Xr be the first r elements of {Xi}. Suppose we wish to estimate Y = sum(Xi, i=1...n) using just Xr. The Minimum Variance Unbiased Estimator (MVUE) is simply the sum so far divided by the fraction of the population sampled: Yr = (n/r)sum(Xi, i=1...r).

Because the population is finite and we are sampling without replacement, the error in this approximation can be estimated either from the data sampled so far, OR (this is the key to the whole line of reasoning) by looking at what's still out there:

Y - Yr | Xr = (1-n/r)sum(Xi, i=1...r) + sum(Xi, i=r+1...n)

When conditioned on Xr, the first term is a constant so the variance of our error is just the sum of the variances of the unsampled data:

Var(Y-Yr|Xr) = σ^2(n-r) where σ^2 = Var(Xi).

Note that this variance is independent of the observations taken so far. Therefore, there's no upside to doing anything other than random sampling.
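Since the notation above is still rough, a quick simulation sanity check may help (a sketch only; the normal population and the sizes n=1000, r=200 are arbitrary choices): condition on a fixed prefix Xr and repeatedly redraw the unsampled remainder, then check that the empirical error variance lands near σ^2(n-r).

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 1000, 200
sigma2 = 1.0

# Condition on a fixed prefix X_r; the error Y - Yr then depends only on the
# n - r unsampled values, so its variance should be sigma2 * (n - r).
x_r = rng.normal(0.0, np.sqrt(sigma2), r)
y_r = (n / r) * x_r.sum()  # MVUE from the first r observations

errors = []
for _ in range(20000):
    rest = rng.normal(0.0, np.sqrt(sigma2), n - r)  # redraw the unsampled items
    y = x_r.sum() + rest.sum()  # full-population total for this trial
    errors.append(y - y_r)

print(np.var(errors))  # should land near sigma2 * (n - r) = 800
```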

Now suppose the set X is partitioned into k>1 strata. Let σj^2 be the variance of observations in stratum j and nj indicate how many items of X are in stratum j. Also, let rj indicate how many items of Xr are in stratum j. Assuming that rj>0 for all j, the MVUE is now the sum of the estimates for each stratum:

Yr = sum((nj/rj)sum(Xi Ij(Xi), i=1...r), j=1...k) where Ij() indicates whether observation Xi is in stratum j.

The variance now becomes the sum of the variances for each stratum, which is dependent on which strata have been sampled the most:

Var(Y-Yr|Xr) = sum(σj^2(nj-rj), j=1...k).

The rub is that we don't actually know what the variance of each stratum is. Since we have to estimate it, a randomized sampling approach is appropriate: we generally sample from the higher-variance strata, but also go after some of the other strata, because our estimate of their variance might change.

The reason I like this new framing is that there's no mention anywhere here of heavy-tailed distributions, missing higher moments, etc. Those are still true with respect to my data set, but the algorithm actually has much broader applicability. Any time you can partition your data such that the bulk of the data goes into strata that don't add a lot of uncertainty to the outcome, you'll get much faster convergence using this method.
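As a toy illustration of the payoff (all the stratum sizes, variances, and the budget are made-up numbers), here's a sketch comparing the residual variance sum(σj^2(nj-rj)) under proportional random sampling versus greedily spending each draw on the highest-variance stratum that still has unsampled items:

```python
import numpy as np

# Made-up strata: sizes n_j and variances sigma_j^2 (illustrative only).
n_j = np.array([900, 90, 10])
var_j = np.array([0.01, 1.0, 100.0])
budget = 50

def residual_variance(r_j):
    # sum(sigma_j^2 * (n_j - r_j)): the error variance of the stratified MVUE
    return float(np.sum(var_j * (n_j - r_j)))

# Uniform random sampling: expected allocation is proportional to stratum size.
r_uniform = budget * n_j / n_j.sum()

# Greedy allocation: each draw goes to the not-yet-exhausted stratum with the
# largest variance, since one more draw from stratum j cuts the total by sigma_j^2.
r_greedy = np.zeros(len(n_j))
for _ in range(budget):
    j = max((j for j in range(len(n_j)) if r_greedy[j] < n_j[j]),
            key=lambda j: var_j[j])
    r_greedy[j] += 1

print(residual_variance(r_uniform), residual_variance(r_greedy))
```

With these numbers the greedy allocation exhausts the small high-variance stratum first and ends up with a residual variance more than an order of magnitude below proportional sampling, which is the whole point of the framing.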

Friday, September 22, 2017

CISS variance

The few who have been reading this for over a year will recall that getting the variance of the CISS estimate right was no small thing. I was reviewing that with my adviser today when it occurred to me that I haven't really made a very good case for why I'm approaching it the way I am.

Normally, the variance of a statistic is based on the data collected. This makes sense when viewing the sample as coming from an infinite population with a finite second moment. However, in the case of CISS, the first assumption is never true and the second is dubious at best. So, instead we consider the fact that, if we kept sampling until the entire population was exhausted, the CISS estimator would be exactly equal to the value we are estimating. That is, the CISS estimator Xn (where n is how many blocks we've read) converges almost surely to the value we're after.

The question, of course, is: how far off is it? As noted, the estimator isn't even unbiased in the early going (though empirical evidence suggests it becomes so quite quickly). To estimate how far off it is, we compute the variance of Y = |Xn - X|. Y converges almost surely to zero, so an estimate of the variance of the error is just E(Y^2).

Computing that directly is problematic, but computing it for each stratum is less so. Then, it's a matter of choosing a good prior and summing the variance for each stratum as I did here.
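Here's a rough sketch of what that per-stratum computation might look like (the shrinkage toward a flat prior below is only a stand-in for the actual prior, and the data and stratum sizes are invented):

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented example: sampled values per stratum (r_j observations each) and the
# total stratum sizes n_j. The prior here is a placeholder, not the paper's.
samples = {0: rng.normal(0, 0.1, 30),
           1: rng.normal(0, 1.0, 10),
           2: rng.normal(0, 10.0, 3)}
n_j = {0: 900, 1: 90, 2: 10}

prior_var = 1.0   # assumed prior guess at a stratum's variance
prior_weight = 5  # pseudo-observations credited to the prior

def error_variance_estimate():
    # The error variance comes from the *unsampled* items: sum over strata of
    # sigma_j^2 * (n_j - r_j), with sigma_j^2 shrunk toward the prior when r_j is small.
    total = 0.0
    for j, x in samples.items():
        r = len(x)
        s2 = x.var(ddof=1) if r > 1 else 0.0
        shrunk = (r * s2 + prior_weight * prior_var) / (r + prior_weight)
        total += shrunk * (n_j[j] - r)
    return total

print(error_variance_estimate())
```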

Anyway, that's all been done, but I probably should put a paragraph or two in the introduction at least explaining why I'm computing the variance of what I haven't sampled rather than the variance of what I have.

Wednesday, September 20, 2017

CISS vs. U-statistic

Leaving aside the problem of second moments, there are some other important differences between the CISS estimator and a true U-statistic.

Just to be clear, I'll digress to define terms. Given a sample with n observations, we can find functions of the observations that deliver unbiased estimates. That is, h(X1, X2, X3, ..., Xn) such that E(h()) = theta, where theta is the parameter we're after. If we can restrict h to the first r observations while preserving the unbiased expectation, we get an unbiased estimator. It works, but it's not using all the information available. To use all the information, we take the average of h over all sets of r observations from the sample. This is called a U-statistic.
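To make the definition concrete, here's a minimal sketch (the kernel and data are just an example, not anything from the paper): the kernel h(x, y) = (x - y)^2 / 2 is unbiased for the variance, and averaging it over all pairs recovers the usual unbiased sample variance.

```python
from itertools import combinations
from statistics import variance

def u_statistic(h, xs, r):
    """Average the degree-r kernel h over all size-r subsets of the sample."""
    subsets = list(combinations(xs, r))
    return sum(h(*s) for s in subsets) / len(subsets)

xs = [2.0, 4.0, 7.0, 1.0, 5.0]
# h(x, y) = (x - y)^2 / 2 has E(h) = Var(X); its U-statistic equals the
# familiar unbiased (ddof=1) sample variance.
u = u_statistic(lambda x, y: (x - y) ** 2 / 2, xs, 2)
print(u, variance(xs))
```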

U-statistics have lots of nice properties.

CISS isn't a U-statistic.

First and foremost, the estimator CISS produces is biased. That bias goes away the longer you sample. In fact, the CISS estimator converges almost surely to the sample mean, which is the strongest form of stochastic convergence you can get. (It's actually even stronger than that; CISS is the sample mean if you let r=n.) However, the first items are much more likely to be pulled from one of the strata with large values. Unless the distribution is perfectly symmetric around the mean (which is definitely not a safe thing to assume), the first few observations are likely to come from one of the two tails and the statistic will be biased towards whichever tail is fatter.

Secondly, the samples are anything but iid. Sampling a stratum reduces the chance that it will be sampled again. This is particularly true when r is small and we're looking at the blocks from the smaller strata. If one sample from a low-magnitude stratum makes it into the sample, there's just about zero chance that a second will until many higher-magnitude strata have been sampled. Even as r approaches n this doesn't really go away.

So, the CISS estimator does not qualify as h() and, even if it did, we're still not using it in a way that would make the result a U-statistic.

However...

As noted above, it does converge almost surely to a U-statistic (the sample mean, which is simply the r=1 U-statistic with h(X) = X as the kernel). Seems like there should be something there. I'm not sure how to turn that intuition into a theorem.

Tuesday, September 19, 2017

Moment of truth

So I'm reading all these theorems about U-statistics and thinking this could all work except for one really important detail: they all require that the random variable in question has a finite second moment. Aside from Normals (which my data is definitely not), there are no stable distributions with finite second moments.

It's not clear from a first reading how big a deal this is. There might be a way to get some useful results even with that condition relaxed. If not, this line of inquiry is going to be a dead end.

Monday, September 18, 2017

Named theorems

My adviser advised me (well, that's his job) to read Asymptotic Statistics by A.W. van der Vaart. Aside from having a really cool name, he does a pretty good job of getting to the point. Lots of results in not much space. As keeping track of named results has served me well over the last year, I will continue to do so. Here's the list from a mere 10 pages (don't really feel like typesetting these, so do your best to wade through the clunky notation):

Portmanteau Lemma

The following are equivalent ways of describing convergence in distribution (Xn -> X):

  • P(Xn <= x) -> P(X <= x) at all continuity points x of the cdf P(X <= x)
  • E f(Xn) -> E f(X) for all bounded, continuous f
  • E f(Xn) -> E f(X) for all bounded Lipschitz functions f
  • lim inf E f(Xn) >= E f(X) for all nonnegative, continuous f
  • lim inf P(Xn in G) >= P(X in G) for every open set G
  • lim sup P(Xn in F) <= P(X in F) for every closed set F
  • P(Xn in B) -> P(X in B) for all Borel sets B with P(X in boundary of B) = 0
Probably not terribly useful beyond the first three, but who knows?

Continuous Mapping Theorem

Let G be continuous at every point of a set C such that P(X in C) = 1. Then if Xn->X, G(Xn)->G(X). This holds for all three types of convergence (distribution, probability, and almost surely).

I already knew that one, but I don't think I've recorded it here.

Prohorov's Theorem

If Xn converges in distribution to X, then {Xn} is uniformly tight (that is, for every e>0 there exists M such that P(||Xn|| > M) < e for all n).

Conversely, if {Xn} is uniformly tight, there exists a subsequence that converges in distribution to some X.

Not sure how useful that is; maybe it will become obvious in the next 10 pages of the book.

Helly's lemma

Each given sequence Fn of cumulative distribution functions on Rk possesses a subsequence Fnj such that Fnj(x) -> F(x) at each continuity point x. Here, F(x) may be a defective cdf (that is, its total mass may be less than 1).

He also calls out the Markov inequality and Slutsky's lemma, but those two are so well known I won't bother repeating them.

This book has about 450 pages. At this rate, I'll have nearly a thousand results to learn.

Sunday, September 17, 2017

Fall eventing

Nobody believes that I'm entering races just for the experience. Sadly, I'm not included in "nobody".

Anyway, I'm happier and more productive if I'm running, even if the competitive results aren't what I'm used to. I've decided to enter a few fall "events". As these are all fairly close to home, I'll probably succumb to expectations and end up running them full-on and then getting depressed. Such is life.

Rock Bridge Revenge 50K, October 7. This is one I've wanted to do for quite some time. I've run most of the trails at Rock Bridge State Park in Columbia. It's a slow trail, but very pretty. Also has some extraordinarily cool rock features (one, Devil's Ice Box, is literally quite cool). Of course, I won't have much chance to check out the surroundings in any detail at 50K pace, but it's still a great place to run. Also, I don't have much of a rep in Columbia, so there won't be any pressure.

Gumbo Flats Pumpkin Run 10K, October 21. I probably won't be fully recovered from Rock Bridge, but I want to get some sort of reading on my fitness so I can decide whether I'll actually run...

Pere Marquette Endurance Trail Run, December 9. This is the big one. Last year convinced me that I really shouldn't do this if I'm not ready to really run. I just have too much history at this event. I've never finished in the top 10, but last year was the first time in a loooong time that I was outside the top 20. I was not as OK with that as I first thought.

I might also run the Possum Trot on November 19 but, unlike the other events listed, that's a whole weekend affair and I don't think my schedule will stand for that.

Friday, September 15, 2017

It's showtime, folks.

Had a good meeting with my adviser today. He's 100% behind my approach of looking at CISS through the lens of U-statistic convergence. He also suggested that we socialize these results a bit by holding a colloquium. I'll need a little time to put a good talk together, but not much. I suggested mid-October. Should be fun, and it will force me to develop an introduction that is both concise and cogent (the intersection of those two has been quite elusive up to now).

Thursday, September 14, 2017

Signs of life

Well, we're a month into the semester (not that semesters matter much at this point). I have actually been putting in a fair bit of work on my survey which will be the basis for my Admission to Candidacy (AKA, the A exam). I haven't been posting about it or logging time because, well, I've been busy.

Anyway, I think my adviser may have tipped me off to a much better way to frame the solution we're after. When I was asking for guidance about reference materials on sub-sampling, he brought up the two most common techniques, the Jackknife and the Bootstrap (from the names they use, you'd think these early statisticians were all frontiersmen, not dons at English and East Coast universities). I dutifully wrote a section on them, thinking all the while that this was sort of the opposite of what we were after, when it suddenly dawned on me: it's the EXACT OPPOSITE of what we are after. That opens up a whole new line of inquiry: how to undo a process.

To understand why it's the opposite, look at the premise of both methods: data is expensive. You have to get subjects, take measurements, hope nothing gets screwed up that invalidates your measurements, remove observation bias, etc. The whole idea of classical estimator theory is getting as much information out of a small data set as possible.
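Just to make that premise concrete, here's a minimal bootstrap sketch (the exponential data and every number in it are invented for illustration): one small, expensive sample gets resampled with replacement to wring out a standard-error estimate.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=25)  # one small, "expensive" sample

# Bootstrap: resample WITH replacement from the one sample we have to estimate
# the standard error of the mean -- squeezing more out of scarce data, which is
# exactly backwards from the subset-a-huge-population setting.
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(2000)]
se = float(np.std(boot_means))
print(se)
```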

In our world, the opposite is true. Data is essentially free. We're overrun by data. We have way too much of it. We're trying to subset it in a way that we can actually do something useful with it without compromising the population we're subsetting.

None of this actually solves the problem, but sometimes framing the question is half the battle. (Probably more like 10% of the battle here, but that's still not a small thing).