Friday, May 26, 2017

Summer crush

It's here again; a few weeks early this year: the summer crush. We deliver our big update to systems at work in August. User Acceptance Testing (UAT) takes place in June and July. So, June tends to be pretty insane as we try to get things wrapped up.

This year, the pressure has come a bit early because the re-platform is such a big deal. So, schoolwork is going to take a back seat for a while and I probably will only post once or twice a week through June.

Wednesday, May 24, 2017

Randomized least squares

Here's a somewhat related take on sampling. The emphasis here is on how to actually perform the least squares computation efficiently. In stats class, this is done with QR factorization. That's an O(mn²) algorithm, where m is the number of data points (rows) and n is the number of variables (columns). Even though the cost is only linear in the number of rows, when the row count goes into the trillions the result is a really, really big number.

Reference: Iyer et al., A randomized least squares solver for terabyte-sized dense overdetermined systems. Journal of Computational Science, 2016. doi: 10.1016/j.jocs.2016.09.007.

Iyer looks at ways to sample the data and then perform the regression on the sample. This is actually part of a whole field that goes by RandNLA (Randomized Numerical Linear Algebra). The basic idea is to transform the data into a smaller problem whose solution provably approximates that of the original problem with high probability.
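The basic recipe can be sketched in a few lines: sample rows, rescale each kept row by its inverse selection probability, and solve the small problem. This is a toy uniform-sampling version, not the Blendenpik approach the paper actually uses, and the problem sizes and names are all mine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: m data points (rows), n variables (columns).
m, n = 100_000, 10
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true + 0.1 * rng.standard_normal(m)

# Uniform row sampling: draw s << m rows, rescale each kept row by
# 1/sqrt(s * p_i) so the sketched problem is an unbiased stand-in.
s = 2_000
p = np.full(m, 1.0 / m)              # uniform sampling probabilities
idx = rng.choice(m, size=s, p=p)
scale = 1.0 / np.sqrt(s * p[idx])    # importance-sampling rescale
A_s = A[idx] * scale[:, None]
b_s = b[idx] * scale

# Solve the small (s x n) problem instead of the full (m x n) one.
x_hat, *_ = np.linalg.lstsq(A_s, b_s, rcond=None)
print(np.linalg.norm(x_hat - x_true))
```

The O(sn²) solve on the sketch replaces the O(mn²) solve on the full data; the whole game is picking sampling probabilities (or a mixing transform) that keep the approximation guarantee.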

Transforms that have been tried include randomized Hadamard and randomized Fourier; Blendenpik is a solver built on top of that kind of transform, and it's what this paper focuses on. The algorithm is implemented on a 5210-node supercomputer and it appears to work pretty well.

Most of the paper is straight computational considerations; primarily efficiency and preservation of accuracy. Both of these are extremely important when doing large matrix operations. Not so much in my problem domain. They do spend a little time on the randomized transform, but not enough to use it as a primary reference. I'll need to find better references on Blendenpik as well as others.

Monday, May 22, 2017

Bag of Bootstraps

I really hope at least some of my readers get the reference to Louis CK's "Bag of ..." routine.

For those that don't, oh well, you could always google it. Meanwhile, here's the math:

Reference: Kleiner, A., Talwalkar, A., Sarkar, P., Jordan, M.: A scalable bootstrap for massive data. Journal of the Royal Statistical Society, 2014, Part 4, 795-816.

The bootstrap is a technique for estimator evaluation that's been around for a while. The name comes from the common phrase about pulling oneself up by one's own bootstraps. It's a nod to the fact that the appraisal uses the same data that the estimator is derived from. That said, the technique is sound and widely used.

The problem is that with really big data sets, it becomes computationally infeasible in its basic form. The authors of this paper show how, rather than using the entire data set at once, one can chop the data set into lots of chunks, apply the bootstrap to each chunk, and essentially average the results to get the same answer. They call this the "Bag of Little Bootstraps".
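Here's a minimal sketch of the idea as I understand it (the parameter names and sizes are my own; the paper gives specific guidance on subset sizes). The trick is that each "little bootstrap" only materializes multinomial counts over a small subset, never a full n-sized resample:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data; the estimator is the mean and we want its standard error.
n = 100_000
data = rng.standard_normal(n)

def blb_stderr(data, n_subsets=10, subset_size=2_000, n_boot=50):
    """Sketch of the Bag of Little Bootstraps (parameter names are mine).

    Each inner bootstrap draws a FULL-SIZE (n) resample from a small
    subset via multinomial counts, so no n-sized array is ever built.
    """
    n = len(data)
    results = []
    for _ in range(n_subsets):
        # one small chunk of the data
        subset = rng.choice(data, size=subset_size, replace=False)
        est = []
        for _ in range(n_boot):
            # how many times each subset point appears in a resample of size n
            counts = rng.multinomial(n, np.full(subset_size, 1 / subset_size))
            est.append(np.dot(counts, subset) / n)   # weighted mean
        results.append(np.std(est))                  # per-chunk assessment
    return np.mean(results)                          # average across chunks

print(blb_stderr(data))  # should approximate 1/sqrt(n), about 0.0032 here
```

The outer loop over subsets is embarrassingly parallel, which is what makes this attractive for massive data.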

What interests me is that they have stuck with uniform sampling of items to get their smaller samples. Why? Is there some reason you couldn't bias the tails? If that still produces good results, that would be a big deal for my problem space. So, I need to see if anybody has addressed that. If not, it's a great opportunity for a result that's much more theoretical than what I have been doing while still wholly relevant.

I'm talking with my adviser about it tomorrow.

Friday, May 19, 2017

On the Theory of Sampling from Finite Populations

Well, I was right about this one: it's one of those citation classics that you simply have to put into your bibliography. It's also a really good paper. Definitely required reading for any stratified sampling work.

Morris H. Hansen and William N. Hurwitz: On the Theory of Sampling from Finite Populations. The Annals of Mathematical Statistics, Vol. 14, No. 4 (Dec. 1943), pp. 333-362.

The key result is the estimator which weights a point by the inverse of its selection probability. In this paper, this is done with replacement, that is, once you've sampled a point, you return it to the population and may well sample it again. In CISS, we don't do that. Once a block is sampled, it's removed from further consideration. That does require some adjustments, but the basic principle is the same: the strata that get fully sampled (the tails) get a lot less weight in the estimate than the strata that have only a small percentage of blocks read.
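A quick sketch of the estimator (the toy population and the size measure are my own invention): sample with replacement with probability proportional to a size measure, then weight each draw by the inverse of its selection probability.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical heavy-tailed population (illustration only).
N = 50_000
y = rng.lognormal(0, 2, size=N)
true_total = y.sum()

# Bias selection toward the tail: probability proportional to a size
# measure correlated with y (y itself would be the degenerate case where
# the estimator is exact by construction).
x = y + 1.0
p = x / x.sum()

# Hansen-Hurwitz: sample WITH replacement, weight each draw by 1/p_i.
s = 500
idx = rng.choice(N, size=s, p=p)
hh_total = np.mean(y[idx] / p[idx])

print(hh_total, true_total)
```

Despite the deliberately biased sampling, the inverse-probability weights make the estimate unbiased for the population total, which is exactly the property CISS leans on (modulo the without-replacement adjustments).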

This paper also has some good references on cluster sampling. I don't think I want to go back too much earlier than 1943 for my citations since most of the heavy lifting on cluster sampling was done in the 1980's and 1990's, but those references can be reversed: by looking for other papers that cite them, I can find newer stuff that is relevant.

Thursday, May 18, 2017

Leveraging and Weighted Least Squares

Reference: Ping Ma & Xiaoxiao Sun: Leveraging for big data regression. WIREs Comput Stat 2014. doi: 10.1002/wics.1324

The authors are at the University of Georgia and write clearly. The "paper" (I have no idea if Wiley peer reviews this publication, but it seems legit) focuses on applying linear models to very large data sets. While I have no particular interest in that, if you can fit a linear model, you can obviously estimate a mean (or, at least a median), and that I do want to do. In fact, this paper is basically the theoretical validation of CISS that I was looking for.

Ordinary Least Squares is really unstable when using heavy-tailed distributions. By intentionally biasing the sample towards the tails (and, obviously, adjusting for that bias via a Weighted Least Squares), you get better convergence. Hooray, I already knew that. Some of their proofs are useful, though. In particular, they show the estimators are unbiased. Since showing the estimator of a mean is unbiased is pretty obvious, I had done that a little less formally in my paper. Rather than beef up those arguments, I can leave them as is and just cite this as backup.
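To make the tail-biasing idea concrete, here's a hedged sketch of leverage-based sampling for least squares (my own toy setup, not the authors' experiments; note it computes exact leverage scores via QR, which is the O(mn²) cost the real methods approximate away):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy regression with heavy-tailed predictors: influential rows live
# in the tails (setup is illustrative, mine).
m, n = 20_000, 5
A = rng.standard_t(df=2, size=(m, n))
x_true = rng.standard_normal(n)
b = A @ x_true + rng.standard_normal(m)

# Leverage scores h_ii are the squared row norms of Q from a thin QR.
Q, _ = np.linalg.qr(A)
lev = np.einsum('ij,ij->i', Q, Q)       # sums to n
p = lev / lev.sum()                     # sample the "tails" more often

# Biased sample, then undo the bias via weighted least squares.
s = 1_000
idx = rng.choice(m, size=s, p=p)
w = 1.0 / np.sqrt(s * p[idx])
x_hat, *_ = np.linalg.lstsq(A[idx] * w[:, None], b[idx] * w, rcond=None)

print(np.linalg.norm(x_hat - x_true))
```

The inverse-probability weighting is what keeps the estimator unbiased even though the sampling is deliberately non-uniform, the same argument I want to cite for CISS.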

More importantly, they reference two other lines of inquiry that will get put on the must read list, plus another that probably is only marginally relevant, but might be good to know about anyway:

Priority papers:

Drineas & Mahoney: Looking at computational efficiency of sub-sampled regression. Two papers cited: Sampling algorithms for l2 regression and applications. Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, Miami 2006 and Faster least squares approximation. Numer Math 2010.

Ma & Mahoney: Generalizing the above results. Again, two papers: A statistical perspective on algorithmic leveraging. ArXiv e-prints, 6/2013 and the same title (same paper?) in Proceedings of the 31st International Conference on Machine Learning, Beijing 2014

Background:
Hansen & Hurwitz: Defines an estimator based on a weighted sample where the weight is the inverse of the probability of the point being selected (this is sort of what CISS uses, though since CISS samples without replacement, there's a little more to it). The reference is old, which means it's probably one of those results that you simply have to cite in a complete literature review: On the theory of sampling from a finite population. Ann Math Stat 1943, 14:333-362.

So, there's one paper reviewed. Everybody sing now:

99 bottles of beer on the wall...

Wednesday, May 17, 2017

Coming Eclipse

No, I'm not talking about the total eclipse this summer (though that is also plenty cool and I plan on being in Columbia, Missouri for optimal viewing when it happens). Rather I'm talking about Yaya.

As parents, it's natural to want your child to do great things. Even not great things are often puffed up to seem like great things. This is natural. I happily listen to folks telling me about how their kid scored a goal or won a Science Fair ribbon. It's not really bragging (well, usually not); it's just joy in your kid's success and you want to share it. I totally get that and I'm happy to share in it.

It's a little different when your kid gets better than you ever were.

That hasn't happened yet. I was a pretty decent musician by High School standards. My parents were fairly strict about practice time. My private teacher's day job was at Juilliard (really!). That kind of support pays dividends. People who knew what they were talking about said with a straight face that I should consider it as a profession.

I indulged that thought for a while but came to realize that it wasn't going to happen. I just didn't have the passion for it that's required to make it your life's work. Music is very cruel to those who give some but not all.

Yaya may well come to the same conclusion. She's not better than I was quite yet (she's only in 7th grade) but, barring some major change of heart, she will be. Whether she has enough passion to make it her life, only she can judge. What's obvious to me is that she has more than I did. She practices more. I never have to enforce it; if anything, I'm telling her to put it down and tend to her other subjects. She asks more. She's constantly hitting me with questions about music. The questions are getting deep enough that I'm having to bust out my music theory texts for some of them and flat out defer to her teachers on others.

Most importantly, she dabbles. She picks up new instruments on a whim. Listens to weird stuff. Tries new things. All the really good musicians I know do these things. Almost all the people I know who do these things are really good musicians.

There was a brief time when I was a bit whimsical about all this. I saw what she was doing, saw the joy it was bringing her, and wondered why I didn't take the same path. I'm past that. As much as I value all the things I learned both about music and life from my years performing, I'm fine with the fact that I gave it up and soon my daughter will be way better than I ever was. It's not pride, it's joy.

Here she is playing her feature with her school jazz band. And, yes, she likes dressing like a dude. Deal with it.


Tuesday, May 16, 2017

Roadmap

So, when you want to get somewhere, it's helpful to have a map. In that spirit, here's the lay of the land for my summer literature review:



The two rectangles at the top are what I've got now: completed algorithms with demonstrated performance on reasonable test sets. That seems like a good start. So, how to tie them together and pull in enough bona-fide research to make it a dissertation?

My thought is this can evolve into the lower rectangle, which is basically a framework for performing multi-dimensional analysis on huge data sets in real time. The multi-dimensional part is what I've been missing up until now and it's the glue that holds all this together. I didn't really understand that until I put together my term presentation for D-BESt. Generalizing this to relational databases will kill a lot of the optimizations. Restricting the scope greatly increases the potential gains and also moves the research into a bit more of a niche (less likely to get scooped).

The fact that it's a niche, doesn't mean it doesn't have very real value. There are still lots of applications for multi-dimensional analysis. And, multi-dim datasets are growing much faster than existing tools can keep up with.

To get to the lower rectangle will require adding some important features, which are indicated by the diamonds. I thought about drawing arrows, but I don't know that I'll necessarily do them in the most logical order, which would be top to bottom. I'll probably implement them in the order I come up with decent ideas.

As for the ideas, the rounded boxes indicate related topics I can draw from. Again, I considered arrows, showing what related to what, but it got really busy. In general, the closer a rounded box is to a rectangle or diamond, the more relevant it is.

I'm figuring this creates a seed group of around 100 papers, which should be a good start. I'm sure I can't fully digest that many papers over the summer, but I should be able to at least locate them and get the major points into a survey. I'll revisit the details as I move through the actual research.

Friday, May 12, 2017

Full marks

I'll say this for my Evolutionary Programming prof: he turns around grades in a hurry. Less than 12 hours after the exam closed, my grade had been posted. I got full marks on the final which sealed an A in the course. As that's most likely a wrap on my coursework (barring some odd circumstance, the remaining credit hours will be research), that pretty much seals up a 4.0 GPA. Of course, nobody particularly cares about your GPA in a PhD program but, it's still a nice thing to have.

Thursday, May 11, 2017

Final final

Stranger things have happened, but I'm pretty sure I just took my last Final Exam. Ever.

Seems like it went OK, though I'll abstain from bold predictions and just wait for grades to be released.

Wednesday, May 10, 2017

Genetic Programming

Genetic programming is an approach to arbitrary problems where the goal is not so much to optimize a particular function, but to produce a program that optimally solves a more general problem. That is, find a program that maps an input space to an output space as accurately as possible. While the idea has been kicking around since the 1950's, it was the work of Koza in the late 1980's that established it as a field in its own right.

Notes:

Genetic programs typically store their algorithm as a tree. The tree may represent an arithmetic expression, a classification scheme, or any other logical structure where operations are applied sequentially based on the results of prior operations.

Fitness is assessed via supervised learning. That is, inputs are selected where the correct output is known. The fitness of the program is how closely the program output matches the correct outputs.

Crossover is applied by grafting pieces of the tree from one solution to another.

Mutation is applied by changing part of the tree.

Theoretically, the search space is all computer programs which are guaranteed to terminate. In reality, it works best to limit the complexity of the tree to keep things from drifting off into irrelevant parts of the solution space where the gradient is very flat.

A special structure, the Automatically Defined Function (ADF), allows evolved subroutines to be reused. Such functions encapsulate more complex operations and can be grafted into the tree as part of mutation or crossover.
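A toy sketch of these notes (my own minimal encoding, no ADFs): expression trees, fitness as squared error on known input/output pairs, crossover by subtree grafting, mutation by subtree replacement.

```python
import copy
import random

random.seed(0)

# A node is a list: ['+', left, right], ['x'], or ['const', value].
OPS = {'+': lambda a, b: a + b,
       '-': lambda a, b: a - b,
       '*': lambda a, b: a * b}

def evaluate(node, x):
    if node[0] == 'x':
        return x
    if node[0] == 'const':
        return node[1]
    return OPS[node[0]](evaluate(node[1], x), evaluate(node[2], x))

def random_tree(depth):
    if depth == 0:
        return ['x'] if random.random() < 0.5 else ['const', random.uniform(-2, 2)]
    return [random.choice(list(OPS)), random_tree(depth - 1), random_tree(depth - 1)]

def fitness(tree, cases):
    # Supervised learning: squared error against known input/output pairs.
    try:
        err = sum((evaluate(tree, x) - y) ** 2 for x, y in cases)
    except OverflowError:
        return float('inf')
    return err if err == err else float('inf')   # map NaN to inf

def subtrees(tree, acc=None):
    acc = [] if acc is None else acc
    acc.append(tree)
    if tree[0] in OPS:
        subtrees(tree[1], acc)
        subtrees(tree[2], acc)
    return acc

def crossover(a, b):
    # Graft a random subtree of b onto a random internal node of a copy of a.
    child = copy.deepcopy(a)
    target = random.choice([n for n in subtrees(child) if n[0] in OPS])
    target[random.choice([1, 2])] = copy.deepcopy(random.choice(subtrees(b)))
    return child

def mutate(tree, depth=2):
    # Replace a random subtree with a fresh random one.
    child = copy.deepcopy(tree)
    target = random.choice([n for n in subtrees(child) if n[0] in OPS])
    target[random.choice([1, 2])] = random_tree(depth - 1)
    return child

# Tiny demo: evolve toward y = x*x + 1 on a few cases, with simple elitism.
cases = [(x, x * x + 1) for x in range(-3, 4)]
pop = [random_tree(3) for _ in range(50)]
for _ in range(30):
    pop.sort(key=lambda t: fitness(t, cases))
    keep = pop[:10]
    pop = keep + [crossover(random.choice(keep), random.choice(keep))
                  for _ in range(30)] + [mutate(random.choice(keep)) for _ in range(10)]
pop.sort(key=lambda t: fitness(t, cases))
print(fitness(pop[0], cases))
```

Even this toy shows why complexity limits matter: without a depth cap, crossover happily grows trees whose values blow up, exactly the drift into irrelevant parts of the solution space mentioned above.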

Tuesday, May 9, 2017

NSGA-II

I joked at the start of my term presentation that one thing I've learned in this class is that you can't use actual words in the name of your algorithm (hence D-BESt). NSGA-II follows form and stands for Non-dominated Sorting Genetic Algorithm (the II marking the revised version). It is designed for solving multi-objective problems where the objectives conflict (that is, improving one may hurt another).

Primary paper: Actually two, since NSGA-II is a refinement of the original by the same principal author. But, we'll just give the latter for now: Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. (2002). "A fast and elitist multiobjective genetic algorithm: NSGA-II". IEEE Transactions on Evolutionary Computation, 6(2): 182-197.

Notes:

The idea of all multi-objective algorithms is not to find a single optimal solution, but rather to identify the Pareto front: the set of non-dominated points. A point is non-dominated if no other point is at least as good on every objective and strictly better on at least one.

NSGA sorts the existing solutions into "fronts". The first front contains the points not dominated by any other point in the sample. The second front contains the points dominated only by points in the first front. This is repeated for as many fronts as you want.
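The front-sorting step can be sketched directly from the definition (this is the plain quadratic version, not NSGA-II's "fast" non-dominated sort, and it assumes every objective is minimized):

```python
def dominates(a, b):
    """a dominates b: at least as good on every objective,
    strictly better on at least one (minimization assumed)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def sort_into_fronts(points):
    """Peel off fronts: front 1 is the non-dominated set, front 2 is
    what becomes non-dominated once front 1 is removed, and so on."""
    remaining = list(points)
    fronts = []
    while remaining:
        front = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining if q is not p)]
        fronts.append(front)
        remaining = [p for p in remaining if p not in front]
    return fronts

# Two conflicting objectives: (1, 4) and (4, 1) are mutually non-dominated,
# and (2, 2) dominates (3, 3), which in turn dominates (5, 5).
print(sort_into_fronts([(1, 4), (4, 1), (2, 2), (3, 3), (5, 5)]))
```

The repeated full scan is what NSGA-II's bookkeeping (domination counts and dominated-sets) avoids; the output fronts are the same.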

The fitness is then determined by the front and by how "crowded" the point is. This is to force the solution away from a single point, since the idea is to capture the front, not a single optimum.

Points are then selected for the next generation based on fitness (tournament selection is most common).

Having selected parents, the rest of the generation is normal Genetic Algorithm (GA) stuff. Cross, mutate, etc. The NSGA-II algorithm does not mandate any particular strategy for these steps.

Convergence is somewhat problematic to assess since you can't measure progress against a single optimal value. Elitist strategies are employed so there can at least be an assessment of whether the best values from the previous generation are surviving in the first front of the next. If they consistently wind up in the first front, that's a clue that the front isn't moving very much.

Monday, May 8, 2017

Particle Swarm Optimization

PSO is a fairly well-established method of stochastic optimization so there are many references. The foundational work was:

Eberhart, R. C. and Kennedy, J.: A new optimizer using particle swarm theory. Proceedings of the Sixth International Symposium on Micro Machine and Human Science, pp. 39-43. IEEE Service Center, Piscataway, NJ. Nagoya, Japan, 1995.

In a nutshell: Solution points are given a velocity in the solution space. The velocity is modified towards the global best and the "personal" best for that solution. The set of solutions then moves through the solution space with velocities updated at each iteration.

Notes:

The actual velocity update is randomized as follows:
v = v + c1 * rand() * (pbest - present) + c2 * rand() * (gbest - present)
where c1 and c2 are "constants" (they may change from one iteration to the next, but they are the same for all particles in an iteration), rand() is a fresh uniform random draw on [0, 1] for each term, pbest is the position of the best solution for this particular particle and gbest is the position of the best solution found globally.

c1=c2=2.0 is popular.

In order to keep the particles from leaving the valid solution space, clamping is usually applied on both the velocity within each dimension and the resulting location.
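Putting the update and clamping notes together, here's a minimal PSO sketch on a toy function (all names, sizes, and the clamp values are my choices; no inertia weight, matching the bare update above):

```python
import numpy as np

rng = np.random.default_rng(4)

def sphere(x):
    """Toy fitness to minimize: sum of squares, optimum at the origin."""
    return np.sum(x * x, axis=-1)

n_particles, dim, iters = 30, 5, 200
c1 = c2 = 2.0                                  # the popular choice
vmax = 0.5                                     # velocity clamp

pos = rng.uniform(-5, 5, (n_particles, dim))
vel = np.zeros((n_particles, dim))
pbest = pos.copy()
pbest_val = sphere(pos)
gbest = pbest[pbest_val.argmin()].copy()

for _ in range(iters):
    r1 = rng.random((n_particles, dim))        # fresh random draws per term
    r2 = rng.random((n_particles, dim))
    vel = vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    vel = np.clip(vel, -vmax, vmax)            # clamp velocity per dimension
    pos = np.clip(pos + vel, -5, 5)            # ...and keep positions in-bounds
    val = sphere(pos)
    improved = val < pbest_val                 # update personal bests
    pbest[improved] = pos[improved]
    pbest_val[improved] = val[improved]
    gbest = pbest[pbest_val.argmin()].copy()   # elitist global best

print(sphere(gbest))
```

On a unimodal function like this the swarm homes in quickly; the local-extremum trap discussed below only shows up on multimodal fitness landscapes.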

Since the global best is kept for purposes of velocity, this is effectively an elitist algorithm.

There are no genetic operators (crossover or mutation).

This is a heuristic - convergence is highly likely, but not guaranteed. There is no good way to derive how close a solution is to the actual global maximum based on the algorithm performance. There may, of course, be other things known about the problem that can provide an upper bound for the difference.

The main upside of the algorithm is its simplicity. There are few tuning parameters and practically no restrictions on the fitness function (other than it being computable). Convergence is generally quite fast.

The downside is that the algorithm is very prone to getting stuck on local maximums, particularly if the gradient of the fitness function is steep in the vicinity of the local extreme point. Once the particles have been pulled away from a global maximum due to faster convergence at a local maximum, there is nothing that pulls them back in that direction.

Sunday, May 7, 2017

Short break

The original plan was to finish up this semester and then take a two-week break. After this weekend, I'm revising that.

Yaya was out of town as her school band had a trip to Chicago. Kate and I took advantage of that, and some truly fine weather, to spend some quality down time together. Even though it was the shortest of breaks, it was surprisingly adequate to renew my spirits. So, I'm going to shift right into paper-a-day mode tomorrow. I have three papers to review in prep for my last exam on Thursday, so this actually works out pretty well.

Friday, May 5, 2017

The way forward

While I take back none of my concerns, the right path forward is always just that: forward. It may well be that this endeavor is doomed but, if that turns out to be the case, it must be due to things outside of my control, not things that I simply gave up on.

With that said, here's the plan for the summer: read and comment on a paper every day (every day interpreted as 6 days a week minus a few vacation days). The only way I'm going to get a faculty member on board with this research is to show that it's viable. To do that, I need to complete my survey paper. To do that, I need to survey the literature.

It's not particularly inspiring (the "literature" is much more of a wasteland than one might hope). It's a lot of work. It's tedious at times. It involves a lot of effort wasted on papers that end up having no bearing on the problem. It's also what you do to prepare yourself for PhD research. I knew that going in; I had to do it for my master's work as well. I'll do it again. I just hope it works.

Thursday, May 4, 2017

Last class ever?

Well, last graded class, perhaps. Today was the final class for both Set Theory and Evolutionary Algorithms. It's certainly conceivable that I'll take another class in an actual classroom for credit. But, most likely, I won't. All my remaining PhD credits will likely be directed research and I don't see why I'd take any other class for a grade.

Of course, it's not like I'll never sit in a "class" again. One doesn't last long in my field without continually learning stuff. That often involves sitting in on a seminar or even a formal training course. Even as part of my journey at UMSL, I'll certainly attend lectures and colloquiums. But those aren't really the same thing as a true graded class.

I don't mind taking classes, but I surely won't miss exams. It's not that I'm bad at them; I just don't like them. Exams force you to put down bad answers. I don't like doing that. There are times in life when you have to make a call on the spot. But, unless we're talking trivial decisions, that's by far the exception rather than the rule. The common adult pattern is to take some time if you need it to get the answer right. Exams don't let you do that.  I'll be really happy next week to know that I'll probably not have to take an exam ever again.

Wednesday, May 3, 2017

I find your lack of faith disturbing...

I got a comment lately that my blog posts had gotten a bit dark. Well, I think they may be about to get a lot darker.

Simply put, I don't think anybody on the faculty at UMSL believes in me. I'm not sure why that's the case. I would think that a 4.0 student who's passed the Q and has two papers more or less ready for publication would generate some interest. But, that is definitely not the vibe I'm getting. One faculty member has more than subtly suggested that I may be at the wrong place.

I have no idea what to do in this situation. Frankly, I never even contemplated it. It's one thing to fight your own doubts; quite another to address those of others (especially when you don't even know where those doubts are coming from).

I'm not giving up yet but, if I can't find someone who is really serious about working with me on a thesis, I'll have to concede that this is a hopeless task.

Tuesday, May 2, 2017

That's what.

Ha, ha! Yeah, that whole "I'm not sure what to do with myself in the evening" thing didn't last long. My Set Theory prof changed his mind about the last homework. So, while I was planning on informally running through the exercises anyway, that meant I now had to actually do them right and typeset them. So, that took a while (but not as long as I feared it would; this last set wasn't nearly as rough as some of the earlier ones).

Anyway, it's here if you want to take a peek. I'll probably do a little revising, but not much. I want to turn it in Thursday and move on to finals prep.

Monday, May 1, 2017

Now what?

Well, I actually know the answer to that. Work is heating up big-time and the semester isn't completely done. I still have another Set Theory assignment and there's the little matter of finals. That said, it is almost surreal to have my term project out of the way. It's been so all consuming over the past few weeks that it just seems odd to not immediately start working on it after dinner.