Monday, August 31, 2015

Asymptotic convergence vs. bounded domain performance

Much of what we're looking at in Analysis of Algorithms is asymptotic convergence. This is the traditional way to assess an algorithm's performance. Basically, it's looking at the shape of the bounding curve as the amount of input data goes to infinity. That's all well and good but, in reality, the input data doesn't go to infinity. There's some point at which, either due to the problem itself or limitations of the system, the data size reaches a bound of its own.

At work we've exploited this fact and designed systems to be performant within a bounded domain. For example, the system I spend most of my time on is a big ETL engine that takes submissions ranging from a few thousand to several hundred million input rows. The Translate step of the ETL does pretty much the same thing to each row, so the process is O(n). However, we scale out onto a 32-node grid, chunking the process into groups of roughly 200,000 rows. So, for data sets between 200,000 and 6.4 million rows (200K * 32), we have an effectively constant-time process. Anything less than 200K rows processes so fast that nobody cares about performance, so the only submissions affected by O(n) processing are the relatively rare ones larger than 6.4 million rows.
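The arithmetic behind that claim is easy to sketch. Here's a back-of-the-envelope model in C; the per-chunk time is a made-up placeholder, not our real number:

#include <stdio.h>

#define CHUNK_ROWS 200000L     /* rows per chunk */
#define GRID_NODES 32L         /* parallel workers on the grid */
#define MINUTES_PER_CHUNK 5.0  /* hypothetical per-chunk cost */

int main(void)
{
    long sizes[] = { 150000L, 1000000L, 6400000L, 60000000L };
    int i;
    for (i = 0; i < 4; i++) {
        long n = sizes[i];
        long chunks = (n + CHUNK_ROWS - 1) / CHUNK_ROWS;       /* ceiling */
        long waves  = (chunks + GRID_NODES - 1) / GRID_NODES;  /* ceiling */
        printf("%10ld rows -> %4ld chunks, %3ld wave(s), ~%.0f min\n",
               n, chunks, waves, waves * MINUTES_PER_CHUNK);
    }
    return 0;
}

Up to 6.4 million rows, everything fits in one wave across the grid, so elapsed time is flat; past that, the wave count (and the wall clock) starts growing linearly with n again.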

It's not that we don't care about asymptotic performance; we get submissions as large as 60 billion rows, so we do need to be cognizant of what that does to the system. But, by setting a bound on the data that we wish to optimize for and then tuning the system to that bound, we can deliver performance much better than we could if we had to tune for the worst case.

Sunday, August 30, 2015

Week 1 status

I went to a wedding reception last night and a lot of my friends were there (the bride and groom are both ultrarunners, so a pretty good portion of the guests were folks I know from racing). Naturally, I got a lot of questions about how school was going. I've been fielding such queries at work, too.

The answer, of course, is that it's going fine, but that shouldn't be a surprise after just one week. It reminds me of the early parts of marathons where spectators will cheer you at mile 3 saying, "Looking good!" Well, I should hope so. If you're in trouble at that point, there isn't much chance of even finishing, much less doing well.

That said, there is cause for optimism, even at this early stage. I have managed to get the books out every day. I'm up to date on reading and have done every problem at least in my head, if not formally writing it out. The adjustment back to classroom learning wasn't as big as I thought it might be. There's a very long way to go, but so far, so good.

Saturday, August 29, 2015

Lead time

Finished reading the chapter on the history of language development this morning. One thing that really hit me was the amount of time it takes to get a language out the door. I realize that designing a language is no small thing, and neither is building a good implementation, but I would have thought 5 or 6 years would be enough time to do that. It appears to be closer to 10. Many languages put something out in less time, but the version that gained wide acceptance usually arrived a decade after the idea got a green light.

Friday, August 28, 2015

Relic

Yeah, I'm old. Ancient by the standards of Computer Science.

It's weird looking at these little code samples in Sebesta and reading about the languages in the past tense (though he's pretty good about acknowledging that a lot of legacy code is still in use, even if new development has moved on). On the one hand, it does seem a long time ago that I wrote programs in APL, FORTRAN (which is dead, BTW; the current language is Fortran), BASIC, PL/C, Pascal, and SNOBOL. However, I really did write professional programs in all of those, and seeing them presented in a historical context makes me feel pretty old. Attending classes with people less than half my age does a pretty good job of that, too.

Thursday, August 27, 2015

Assembly and C

Sebesta makes an interesting and (to my mind, anyway) highly debatable statement in his review of programming language evolution: that the development of assembly languages had no real impact on high-level languages, so they aren't considered part of the review.

On the one hand, I see his point. If anything, the influence ran the other way: assembly pre-processors started offering macros that borrowed control structures and data definitions from high-level languages in an effort to impose some structure on an unruly code base. However, to view the development of C without considering assembly misses the whole point of the language.

During the 1960s, application programmers moved away from assembly en masse for obvious reasons: the performance hit was far outweighed by the gains in development time and maintenance. System programmers, on the other hand, stuck with assembly because they needed both the efficiency and the direct access to the hardware. Recognizing this, C pulled the capabilities of assembly up into the semantics of a high-level language. If C had not provided things like register variables, absolute memory addressing, and pointer arithmetic, not to mention the radical idea of "in-line" assembly blocks, it would not have gained acceptance among system programmers.
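To make that concrete, here's a minimal sketch of the kind of code those features enable. The device address and status bit are invented for illustration, and since in-line assembly syntax is compiler-specific, I'll leave that one at a comment:

/* Illustration only: the device address and ready bit are made up. */
#define DEV_STATUS ((volatile unsigned char *)0xF040)

void wait_for_device(void)
{
    register unsigned char status;   /* register variable: a hint to keep
                                        it in a CPU register */
    do {
        status = *DEV_STATUS;        /* read an absolute memory address */
    } while ((status & 0x01) == 0);  /* spin until the ready bit is set */
    /* a GNU compiler would also accept an in-line assembly block here */
}

unsigned char read_device_byte(void)
{
    wait_for_device();
    return *(DEV_STATUS + 1);        /* pointer arithmetic: the data register
                                        sits just past the status register */
}

No other high-level language of the day would let you write that; in assembly, it's routine.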

Consider the following piece of code that exists in just about every system program ever written in C:
while (*target++ = *source++);
We've written this so many times that it's come to be thought of as a high-level structure, but it's not. A high-level structure imposes structure on top of the underlying implementation; this is the opposite. Machine designers figured out very early on that copying strings was both common and expensive, so they started building in instructions that would perform block memory-to-memory copies given two starting addresses and a terminating condition (typically a zero value or a specified byte count). The "loop" above captures all the side effects of such an instruction, so the compiler can replace the construct with that single instruction.

C is loaded with these sorts of things. *p is an indirect MOV; += is an ADD; i-- is a DECR. Array indices start at zero instead of one because that's how assembly works. Assignments can be made inside conditional expressions because no competent assembly programmer would ever re-test a result they just saved. All these features are mappings of assembly constructs into a high-level context, not the other way around.
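A couple of standard-C idioms make the point. Here's a minimal, self-contained sketch:

#include <stdio.h>

/* Walk a buffer the way an assembly programmer would: bump a pointer
   and count a register down to zero. */
unsigned checksum(const unsigned char *buf, int n)
{
    unsigned sum = 0;
    while (n--)          /* i-- : a decrement-and-branch */
        sum += *buf++;   /* += is an ADD; *buf++ is an indirect load
                            with post-increment */
    return sum;
}

int main(void)
{
    int c;
    /* Assign and test in one expression; the saved result is never
       fetched and re-tested. */
    while ((c = getchar()) != EOF)
        putchar(c);
    return 0;
}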

As C still sees heavy use as a systems language, not to mention the enormous application code base written in its offspring C++ and its grandchildren Java and C#, I'd say that languages, and the programmers who use them, have been very much influenced by the assembly instruction set designs of the 1960s (even if many of those programmers have never written code at the machine level).

Wednesday, August 26, 2015

All in

So, after reviewing the syllabus and text for the Programming Languages course, I decided that I really should take a second class this semester and signed up for Advanced Data Structures and Algorithms. That makes it four nights a week and probably 15-20 hours outside of class, but I'm pretty optimistic. These are core courses, so my 30 years of practical experience counts for a lot.

Tuesday, August 25, 2015

Pseudo answers

This morning I went through all the questions at the end of the first chapter of Sebesta and answered them in my head. Yes, actually writing down the answers would have been better, but I wanted to get through them all and only had about an hour to spend on it. I was a little surprised at how often I flipped back through the chapter to find a supporting fact or clarify an answer. I guess I need to be a bit less arrogant about how well I know this material. Sure, I'm familiar with it, but I don't KNOW it. The former needs to become the latter.

Monday, August 24, 2015

Courses start

Courses started today, but my class this semester is Tues/Thurs. I did read the first chapter of the book (Sebesta, Concepts of Programming Languages). Most of it was material known to me, but I didn't mind going through it. I did catch what I believe is a pretty big omission: no discussion of data-driven programming languages.

This isn't a surprising bias in academia; data-driven languages are generally vendor one-offs built to support ERP systems. That said, the people who know them are among the highest-paid folks in the industry, so it seems one might want to come to terms with them for that reason alone.

The major difference between them and imperative languages is the way variables are handled. Most imperative languages scope variables as local or global. Object-oriented languages have more scope options, since things can be visible to a method, a class, a derived class, or everybody. Data-driven languages treat variables as fields in records. As such, scoping really comes down to who can access the record, not where a variable is declared or how it gets referenced. Persistence of variables is also very different: again, it's tied to the underlying record (which may get modified, duplicated, or deleted by any number of processes at pretty much any time) rather than to code constructs.
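There's no single standard data-driven language to quote, but the flavor translates to a loose analogy in C; the record layout and names below are pure invention:

#include <stdio.h>

/* The "variables" live in the record itself. */
typedef struct {
    long   order_id;
    double amount;
    int    status;
} OrderRec;

/* Scope: any process that can reach the record can read or write its
   fields, regardless of where it was declared. */
void apply_discount(OrderRec *rec, double pct)
{
    rec->amount *= (1.0 - pct);
}

/* Persistence: the change follows the record, not the call stack. */
void post_order(OrderRec *rec)
{
    rec->status = 1;
}

int main(void)
{
    OrderRec rec = { 1001L, 250.00, 0 };
    apply_discount(&rec, 0.10);
    post_order(&rec);
    printf("order %ld: amount %.2f, status %d\n",
           rec.order_id, rec.amount, rec.status);
    return 0;
}

In a real ERP system, of course, the record lives in the database and may be touched by other processes between those two calls, which is exactly what makes the scoping and persistence story so different.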

I'll concede that, in practice, most data-driven code looks a lot like imperative code. But, where it is different, it's REALLY different. Given that the bulk of business programming is moving to ERP systems, an extra chapter on the subject would have been in order.

Wednesday, August 12, 2015

Researching bugs

Blah. I hate researching bugs. Microsoft SSAS (their BI reporting solution that sits on top of SQL Server) has more than its share of interesting quirks. Most of them have well-publicized workarounds. We hit one this week that doesn't. Basically, we need ordered sets coming in on a SELECT DISTINCT. Oracle (which we use as our underlying data store) used to always order the results of a SELECT DISTINCT. But then they started ordering the results by hash code rather than by the actual data elements. No problem; there was a hint you could use to get around that. Then, with Oracle 11, that hint stopped working. Not sure how we're going to get around this, but it's brushing up my research skills, trolling the web for answers.

Monday, August 10, 2015

Two weeks out

Well, things got very busy at work (as they always do over the summer, since our big deployment for the year is always in August). I have actually gotten a lot done on the grad school front; I just haven't been very good about recording it. With classes starting in two weeks, I'm going to try to be better about that.