Sunday, January 31, 2016

Centered

Well, my intent was to write up the eigenvalue stuff for Data Mining today but the production support work yesterday messed up my schedule a bit. I ended up taking today pretty much off rather than stress myself out this early in the semester. It's now 9:30PM and I should get to sleep so I'll just drop a little pearl that anybody who's taken a stats course has probably heard, but the application of it doesn't become clear until you start working with large data sets.

Data can always be centered. That center may be very artificial, but you can always transform it so that it does have a center. In the vast majority of cases, the sample mean is used. Of course, one could just as easily use the median or the mode. Why the mean? It comes down to how we measure variability.

The variability of data is usually measured by using a sum of squares. There are good reasons to do this when you get to higher dimensions. If you're projecting a set of points in, say, 3-space, onto a line or plane, the projection lies in that subspace and the error is a perpendicular vector. As even the Scarecrow can tell you at the end of The Wizard of Oz, perpendicular projections mean you square the sides to get the magnitudes to add up.

So, since the point of a center is to minimize the variability between the data and the center, you solve for the minimum sum of squares and that turns out to be the sample mean. If, instead, you just used linear distance rather than squared distance and minimized that, you'd get the median. And, if all you cared about was whether things were equal or not (that is, your distance from the center is zero if you are at the center and 1 if you're not), then your minimum variability is around the mode.
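
If you want to convince yourself, here's a quick numerical check (my own toy example in R, not from the coursework): scan candidate centers and see which one minimizes each notion of variability.

set.seed(1)
x = rpois(100, 3)                 # arbitrary made-up data
centers = seq(min(x), max(x), by = 0.01)

sq.loss  = sapply(centers, function(m) sum((x - m)^2))   # minimized at the mean
abs.loss = sapply(centers, function(m) sum(abs(x - m)))  # minimized at the median
zo.loss  = sapply(centers, function(m) sum(x != m))      # minimized at the mode

c(centers[which.min(sq.loss)], mean(x))     # agree (to grid precision)
c(centers[which.min(abs.loss)], median(x))  # agree (any point between the middle values works)
centers[which.min(zo.loss)]                 # compare with the most frequent value in table(x)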

It's hard to see why this matters in a stats class. Everybody knows we're going to use the mean. But, when analyzing big data sets, your distance measure may be all sorts of weird things. Certainly, the 0/1 measure that yields the mode is a very common thing to care about and finding the most frequent value (i.e., the mode) is a very common database operation. Medians also get used a lot; knowing the point that cuts your data in half is useful in many situations. It's helpful to keep in mind that the answers aren't random - they result from the conscious choice of how to measure things.

Saturday, January 30, 2016

The other side of Big Data

This will sort of be my off-day post because it's sort of an off day. I worked late last night and some more today monitoring some jobs that run long.

I'm currently the subject in a trial that, presumably, has some Big Data implications (though I have no idea what those are). For obvious reasons, the research department at work is interested in fitness trackers. My understanding is that the correlation between fitness and life expectancy is very weak (the quality of life may be better, but actual longevity is what a Life Insurer cares about). The hope is that some better correlating variables can be found with the use of the latest generation of fitness trackers.

So, lots of folks at work are wearing fitness trackers for the next month and a half. I got a Basis Peak. On the surface, it's an exceedingly cool little device. The screen is like that of a smartwatch (actually, since it pairs to your phone, I guess it qualifies as a smartwatch). The user interface is reasonably intuitive and the touchscreen is a nice mix of responsive without being twitchy.

Underneath the skin, I'm less convinced. On my run this morning, it had my heart rate varying all over the place. In reality, it was quite steady (as one would expect on an easy run with a group). The distance measured was also a bit short. It did seem to do a pretty good job of counting my strides. It collected lots of other data, too, but I have no idea how to interpret that.

Truth is, I don't really care. I know what running easy is supposed to feel like. I know what tempo pace feels like. I know what marathon pace feels like. I really don't need a device of any kind to run a desired pace. A few years back, I ran the St. Louis Track Club pace series. It's 16 "races" where your finish time is the absolute difference between your projected and actual time. In other words, if you say the 5K will take you 19:30 and you run it in 19:45, your "finish" time is 15 seconds. Obviously, you're not allowed to wear a watch during the run. My total for the series (best 3 out of 4 in a month count, so 12 total) was under two minutes. And if you think that's good, yes, it was (third overall if I recall correctly), but the series winner was half that. Good runners have pacing dialed in pretty tight.

Anyway, some of the variance I was seeing could well be user error. I just started using this thing yesterday and may well not have the strap tight enough or something else set up wrong. Regardless, it's information I just don't need and I'm not really a gadget geek so, cool as it is, I won't be running out and buying one when this study is over.

Now, what I would like to know is what they are doing with the data. Of course, it's anonymous (and, I wouldn't care if it wasn't - there's nothing about my fitness that I need to keep secret). What I mean is: how is this going to help them make more money on life insurance? My guess is that, since people are so willing to post this stuff on publicly available sites, they want to make special rate offers to folks they can identify as very good risks. It's illegal to ask for this sort of information, but if somebody wants to volunteer it, using it is fair game. Google is doing the same thing getting into car insurance (hint: if you get pissed at the driver who cuts you off, don't post that on Facebook; your insurance company WILL find out about it sooner or later and classify you as having potential for road rage).

It's far enough from my thesis topic that my interest is purely curiosity, but it's still a question I will bother to ask.

Friday, January 29, 2016

More graphing in R

Apparently, our graphs are supposed to look nice. I suppose that does matter sometimes. I had to monitor a process all evening, so I spent it messing around with R's many and varied graphing packages trying to make my homework assignment pretty. Here's an example using just the base package.

#probability and cumulative functions
p = function(x) 6 * x * (1 - x)
cdf = function(x, width) cumsum(p(x)*(width/length(x)))

#take the midpoints of 100 evenly-spaced intervals
x = ((0:99) + 0.5)/100

#plot the pdf
plot.new()
plot(x, p(x), type="l", col="blue", ylab = " ")

#plot the cdf with a reference line to show it goes to 1
lines(x, cdf(x, 1), type = "l", col="red")
lines(c(0,1), c(1,1), lty=2)

#pretty it up a bit
title("p(x) = 6x(1-x)")
text(0,1.3, "p(x)", pos=4, col="blue")
text(0,1.1, "cdf(x)", pos=4, col="red")


Oooh and aaaah (though I have no idea why blue and red saved to very dark blue and maroon - they were quite bright when displayed in RStudio).
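
For what it's worth, here's a sketch of the same picture in ggplot2, which is probably the package most people reach for when "pretty" is the requirement (assumes ggplot2 is installed; the data frame and column names are my own):

library(ggplot2)
df = data.frame(x = x, pdf = p(x), cdf = cdf(x, 1))
ggplot(df) +
  geom_line(aes(x, pdf), colour = "blue") +
  geom_line(aes(x, cdf), colour = "red") +
  geom_hline(yintercept = 1, linetype = "dashed") +
  labs(title = "p(x) = 6x(1-x)", y = NULL)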


Thursday, January 28, 2016

Small steps revisited

As I mentioned as part of The Ultrarunner's Guide to Long IT Projects, small steps is the way to get things done when you're in it for the long haul (actually, it works pretty well all the time). It seems my advisor agrees. We met today on what I would do to make the Bayesian class a true graduate course and decided we'd look for a piece of my thesis topic that could be tackled as a little chunk.

The most promising approach seems to be to try to solve a much more specialized case of the problem. Specifically, I'm going to look at how sampling affects some individual queries that we run at work. We'll tune them for both performance and accuracy and see if we can't beat the regular cube. Most importantly, we'll look at whether our confidence intervals (or, more correctly, our posterior distribution on the result) match empirical results. Even in this reduced form, this is not an easy task, but at least it's well defined.

Another avenue that occurred to me while we were tossing out ideas was the thought that rather than waiting for the posterior distribution to sufficiently converge, we could put a cost function on the error and then use hitting the expected value of that cost as our stopping rule (that is, keep sampling until the expected cost drops below a pre-set threshold.)
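
Just to make the idea concrete, here's a minimal sketch of what such a stopping rule might look like in R. Everything in it - the cost function, the threshold, and the sampler - is a made-up placeholder, not anything from the actual problem:

cost = function(err) err^2       # hypothetical cost of an estimation error
threshold = 0.01                 # hypothetical acceptable expected cost

estimate_with_stopping = function(draw_sample) {
  samples = c()
  repeat {
    samples = c(samples, draw_sample())
    # crude stand-in for the expected cost: the cost at the current standard error
    exp.cost = cost(sd(samples) / sqrt(length(samples)))
    if (length(samples) > 2 && exp.cost < threshold)
      break
  }
  list(estimate = mean(samples), n = length(samples))
}

# e.g., keep sampling a noisy measurement until the expected cost is small enough
estimate_with_stopping(function() rnorm(1, mean = 5, sd = 2))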

At any rate, I'm pretty excited to actually have some direction. Up until now, it's been lots of ideas but no clear path on how to proceed. I'm willing to put the ideas down for a bit in exchange for making some real progress.

Wednesday, January 27, 2016

Graphing in R

OK, I'll give it that much: graphing in R is trivial. I took an example diagram from the class text (cumulative proportion of coin tosses coming up heads) and tried to replicate it. Even this code is pretty verbose; you could jam all the graph commands into a single call. But, I don't like going for super-concise code as it's too hard for the next person to figure out what you did.
CummFlips = function(n, p)
  cumsum(rbinom(n, 1, p)) / (1:n)

plot.new()
plot.window(c(1,500), c(0,1), log="x")
plot.xy(xy.coords(CummFlips(500, .5)), "o")
axis(1)
axis(2)
title("Running Proportion of Heads",NULL,
  "Flip Number","Proportion")
From what I've sorted through online, the R crowd isn't nearly as fanatical about sticking to the functional roots in the language as are the adherents of its progenitors. I think a lot of R practitioners would have put the binomial vector in a variable rather than using nested functions. As a C# programmer, I certainly have no problem with using variables, but it doesn't seem necessary and if you can't even get past an exercise this simple without trashing the language design philosophy, you might want to move on to something else.
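
For reference, the variable-based version I suspect most R folks would write looks something like this (same result, it just names the intermediate vector):

flips = rbinom(500, 1, 0.5)               # simulate the 500 coin tosses
runningProportion = cumsum(flips) / (1:500)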

The graph (at least the left hand portion) obviously changes each time you run this. Here's an example output:


Tuesday, January 26, 2016

Missing observations

Data cleansing was the subject of today's lecture for Data Mining. We spent most of the time looking at what to do in the case of missing observations. Granting that filling in a few values here and there is probably simplifying things without biasing results, I'd say the strategies offered were pretty bogus. I've held for some time that Data Science is just Statistics without the rigor and this bag of unsupportable tricks didn't do much to dissuade me.

I'm pretty sure that part of the problem is simply that the techniques that have some merit are beyond the level of this course (it's a mixed grad/undergrad course so, while we PhD types will be expected to produce some meaty theoretical work, the lectures are tuned to the undergrads). That said, there's at least one strategy for missing observations that I think a lot of CS folks are too quick to dismiss: go get the data.

Yes, it's a pain. I spent hundreds of hours poring through motor vehicle records and mortgage papers preparing the data set I used for my masters work. Not because I needed a clean data set for the thesis, but because the work was actually being used by the New York State Department of Health to direct cancer investigations. Sure, it would have been easy enough to just fill in some data, but there are times when getting the answer right really matters.

You could certainly argue that many of today's data sets are too large for that level of effort (my research set was a mere 592 cases of Leukemia). But, detecting and correcting missing data is becoming a whole lot easier due to the tremendous growth in computational power. At work, we spend quite a bit of our time writing tools to detect gaps in data. When they are found, we report it to Corporate Modeling, who then chase after the providers (typically, actuaries in our remote offices) to fill it in. The result is that we have an incredibly robust set of projections data. That allows us to confidently trim the buffer between our actual and required capital and price more competitively. In the context of 4 trillion dollars of insurance in force, the million or so we spend doing this seems like effort well directed.

Monday, January 25, 2016

Terms from Data Mining ch2

More poorly organized notes, primarily definitions from Data Mining.

Data type portability - ease with which a data type can be transformed to another type required for analysis.

Data type porting - the process of transforming. Can be as simple as casting or may involve significant ETL.

Imputation - assigning a best guess for a missing value based on other values in the observation and correlations in the data set.

Latent Semantic Analysis (LSA) - Transforming text to a nonsparse space of lesser dimension through use of word counts and covariances. Leverages the following two properties of text:

  • Synonymy - multiple values mean the same thing
  • Polysemy - a value can mean more than one thing based on context

Symbolic Aggregate Approximation (SAX) - Method for converting time series to discrete sequence. Aggregates value over time windows and creates a discrete approximation for the entire window.

Discrete Wavelet Transform (DWT) and Discrete Fourier Transform (DFT) - methods for converting time series to series of real-valued coefficients.

Multidimensional Scaling (MDS) - Method for converting graph information to multidimensional vectors for purposes of determining distance between graphs. Typically uses Euclidean measure on resulting vectors, but can use other distance functions as in nonmetric MDS.

Neighborhood graph - used to represent similarity along dimension of any type.

Sampling techniques:

  • Biased - intentionally shifting sampling distribution to get more "interesting" data.
  • Stratified - form of biased sampling that assures representation in each of several segments of data.
  • Reservoir - used to ensure proper representation of moving sample from stream data.
Principal Component Analysis (PCA) - rotating multidimensional basis to reduce number of independent variables (or, more precisely, leveraging correlations in data to capture the information using fewer variables).

Singular Value Decomposition (SVD) - More general case of PCA.

Mean centering - transforming scalar values so they have a mean of zero. PCA and SVD yield the same results when values are mean centered.
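
A quick way to see the mean-centering point (a toy check of my own, not from the text): on mean-centered data, the PCA loadings and the right singular vectors from the SVD agree up to sign.

set.seed(2)
X  = matrix(rnorm(200), 50, 4)
Xc = scale(X, center = TRUE, scale = FALSE)   # mean center each column

pca = prcomp(Xc, center = FALSE)
sv  = svd(Xc)

round(abs(pca$rotation) - abs(sv$v), 10)      # ~ all zeros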

Energy of a transformation - sum of the squared distances from the origin of the original observations projected onto an orthonormalized subspace basis. SVD maximizes energy under this transformation.

Spectral decomposition (also, Schur decomposition) - Diagonalization of a symmetric matrix used by SVD.

Homophily - the tendency of individuals to associate with, and repeat joint ventures with, others who are similar to them or with whom they have prior experience. Leads to highly correlated data.





Sunday, January 24, 2016

More linear algebra

Anyone who has a kid in High School has heard, "Why do I need Algebra? I'm never going to use that." Well, you will if you do statistics for a living. And, as it turns out, you'd better be up on it for data mining tasks as well. I read through the second chapter of the primary text today, and it was packed with all sorts of non-trivial transformations based on eigenvectors. I'd write more about it, but I'm tired. I'll put a substantive post together this week. Meanwhile, I'm really glad I spent the break shoring up Linear Algebra. Looks like I'm going to need it sooner than I thought.

Saturday, January 23, 2016

2016 Frostbite Half Marathon

As I took today's race pretty seriously, I'm writing about it rather than a throwback report this week.

Run January 23, 2016

Of course I knew that grad school was going to kill my competitive running. I stress competitive because one doesn't have to race well to enjoy running. Or, so I'm told. I hope that turns out to be true because today was pretty much the confirming observation that my competitive days are done.

My intention was to run this race as well as possible to get some sense of what pace to run at Woodlands. From my regular workouts, I knew that my fitness had declined since Milwaukee, but it wasn't obvious just how much. Getting the opening pace of a marathon wrong by even a few seconds per mile can be disastrous. I did a mini-taper, taking this week light, particularly Thursday and Friday.

The mile markers at St. Louis Track Events only loosely correlate with the actual distance in miles, so it's never really clear how fast you're running. However, I was pretty sure I was somewhere in the 6:35-6:40 range for the first few miles. That's not where I wanted to be, but the effort felt right, so I went with it, hoping I might find something in the second half.

As the course is a loop, I was able to get a reasonably accurate pace observation at the close of the first lap. It wasn't pretty: 43:30. I tried to pick it up in the second half, but to no avail. It turns out, my pacing was spot on. I was turning myself inside out to hold on to it at the end and wound up with nearly even splits, finishing in a dismal 87:07; my slowest road half since 1996.

Well, shoot. Things are much worse than I had hoped. A sub-3 at The Woodlands isn't looking very likely. There was nothing about today that would indicate the time was a fluke. In fact, if I was used to running halves in around 87 minutes, I'd be patting myself on the back for such a well-metered effort. Unfortunately, to run a sub-3 full, you really need to be running halves around 85. And, as a 52-year-old marathoner, you really need to be running sub-3's to be competitive.

To those who haven't seriously competed, a couple minutes in a 90-minute race doesn't sound like a big difference, but it certainly is. I'm not quite ready to give up and call The Woodlands a fun run. After all, I am in the elite field, so I really should give it my best shot even if that's likely to be five or six minutes slower than last year. After that, I am going to acquiesce to reality and really start running just for fun. How fun that will actually be for a type-A like myself remains an open question. But, so is my thesis topic, and that's really where I need to be putting my efforts.

Friday, January 22, 2016

Buzzwords from Data Mining chapter 1

I'm sure the emphasis in this class is practical, but it never hurts to know all the terms. So I'm going to start building a glossary. There are two texts for the class and they use different terms. This is from Chapter 1 of the primary text: Aggarwal: An Introduction to Data Mining.

Four major problems: association pattern mining, clustering, classification, outlier detection.

Data stream: continuous data which has to be processed in a single pass because the raw data isn't saved (at least, not for long).

Major processing steps:

  • Data Collection
  • Processing
    • Preprocessing
      • Feature extraction - organizing raw data into collections of attributes
      • Data cleaning
    • Feature selection and transformation
      • May include Integration - combining with other data sources
  • Analytic processing and algorithms.
Basic data types (this is where there's a LOT of disparity among authors)
  • Nondependency - Data items that do not rely on neighboring fields or observations
  • Dependency - Data items that are correlated either with other fields within an observation or with neighboring observations.
    • Implicit dependencies are correlations that are generally known to exist in the data (e.g., if it was 70 degrees a minute ago, it's probably around 70 now).
    • Explicit dependencies are built into the data model; often, relationships represented as graphs.
Multi-dimensional or multivariate data - set of (records, instances, data points, vectors, entities, tuples, objects) consisting of (fields, attributes, dimensions, features).

Atomic data types
  • Quantitative (measures): continuous, discrete, finite
    • Ratio: Can be compared as ratios - i.e., quantitative with a true zero point
    • Interval (this term is from the other text): Quantitative, but not valid as ratios.
  • Categorical: discrete
    • Ordinal: Can be ordered, but no arithmetic operations apply.
    • Nominal: merely a label.
Binary and set data - special cases of categorical and quantitative.
Text - can be just about anything. I'd also put blob data in this group; even though text is technically more parsable, image and sound files can also be parsed and mined.

Contextual attributes: attributes that define the context of the data. We just call these "attributes" at work.

Behavioral attributes: attributes that represent values measured in the context. At work, we call these "measures". I don't know if we are just behind the times or if these are both accepted usages.

Multivariate time series. Multivariate data with a contextual time stamp. Duh.

Multivariate discrete sequences. Like multivariate time series, but with categorical measures and only order of observations is important (not the time between observations).

Univariate discrete sequences. Fancy phrase for strings.

Spatial and spatiotemporal data. Extends time series to more contextual dimensions. Typically stops at 3-space x time, but it could be any number of dimensions as long as ordering and distance along each dimension is meaningful.

Another form of spatiotemporal data is when the spatial component is behavioral rather than contextual, i.e., a trajectory.

Networks and graphs. Alternative representations of dependency data.

The four general problems framed as operations on a data matrix:
  • Classification: looking for relationships between columns in which one (or more) columns is predicted from others. Also known as supervised classification.
  • Clustering: looking for similarities between rows. Can also be thought of as unsupervised classification.
  • Association Pattern Mining: like classification, but looking at correlation between columns rather than predicting one from a set of others. Strictly, columns are (0,1) variables and we are trying to find subsets of the columns where all values are 1 at least s% (known as support) of the time.
  • Anomaly Detection: looking for rows that don't cluster.
Association rule confidence: the confidence of A => B is the fraction of transactions supporting A that also support B. A => B is valid at support s and confidence c if the support of A ∪ B is at least s and the confidence of A => B is at least c.
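
A toy illustration with made-up transactions (rows are transactions, columns are items; 1 means the item is in the basket):

txn = matrix(c(1,1,0,
               1,1,1,
               1,0,1,
               0,1,1,
               1,1,0),
             ncol = 3, byrow = TRUE,
             dimnames = list(NULL, c("A","B","C")))

support.A     = mean(txn[,"A"] == 1)                   # 4/5 of transactions contain A
support.AB    = mean(txn[,"A"] == 1 & txn[,"B"] == 1)  # 3/5 contain both A and B
confidence.AB = support.AB / support.A                 # 3/4 of A-transactions also contain B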




Thursday, January 21, 2016

Unit testing in R

Well, it didn't take too much research to come up with some unit test tools for R. They're clunky compared to Visual Studio, but not terribly so. The one that seems to hold the most promise is a package called testthat. Here's a quick summary of how it works:

Create the test file(s). Here's a test file for a function Im that creates an identity matrix of dimension d:

expect_null(Im(0), 'Dimension 0')
expect_equivalent(Im(1), matrix(1, 1), 'Dimension 1')
expect_equivalent(Im(2), matrix(c(1, 0, 0, 1), 2), 'Dimension 2')
expect_equivalent(Im(3), matrix(c(1, 0, 0, 0, 1, 0, 0, 0, 1), 3), 'Dimension 3')

The test files get put in a test directory and can be run individually or as a group using a test script:

library('testthat')
source('Im.R')
test_dir('tests', reporter = 'summary')
Now we can get to writing code (in file Im.R):

First try:
Im = function(d) NULL
The first test passes, the others fail. Add the trivial case:
Im = function(d)
{
  if (d <= 0)
    return(NULL)

  result = matrix(1, d, d)
  return(result)
}
First two pass. Now handle non-trivial case:
Im = function(d)
{
  if (d <= 0)
    return(NULL)

  result = matrix(0, d, d)
  for (i in 1:d)
    result[d,d] = 1
  return(result)
}
Oops, dimensions 2 and 3 still fail because I goofed the index:
Im = function(d)
{
  if (d <= 0)
    return(NULL)

  result = matrix(0, d, d)
  for (i in 1:d)
    result[i,i] = 1
  return(result)
}
and we're in business; all pass. Now we can confidently refactor. Wouldn't it be simpler to just pass the ones and zeros to the matrix data source rather than making it all zero and then overwriting? How about this:
Im = function(d)
{
  if (d <= 0)
    return(NULL)

  result = matrix(c(1, rep(0,d)), d, d)
  return(result)
}
Tests pass, but we get a warning message because the data source isn't a multiple of the row or column length. We could patch that:
Im = function(d)
{
  if (d <= 0)
    return(NULL)

  result = suppressWarnings(matrix(c(1, rep(0,d)), d, d))
  return(result)
}
That passes. Or, we could just generate the entire data source with the right length. There are d-1 sequences of one followed by d zeros and then a trailing one.
Im = function(d)
{
  if (d <= 0)
    return(NULL)

  result = matrix(
    c(rep(
      c(1, rep(0,d)),d-1),
      1),
    d, d)
  return(result)
}
 And, of course, there's no need to store the intermediate result at this point:
Im = function(d)
{
  if (d <= 0)
    return(NULL)

  return(
    matrix(
      c(rep(
        c(1, rep(0,d)),d-1),
        1),
      d, d))
}
All pass, good enough (though throwing a couple comments in there would probably be in order).
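
One possible refinement on the test side (just a suggestion, not something the package requires): wrap the bare expectations in test_that() blocks so failures come back with descriptive names.

test_that("Im returns NULL for dimension 0", {
  expect_null(Im(0))
})

test_that("Im builds identity matrices", {
  expect_equivalent(Im(2), matrix(c(1, 0, 0, 1), 2))
  expect_equivalent(Im(3), diag(3))   # diag(3) is base R's own identity matrix
})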

Wednesday, January 20, 2016

Thin slicing

Not sure why this is only occurring to me now; sometimes it just takes a while to make a connection.

Malcolm Gladwell is one of my favorite authors. In his book Blink he looks at why experts can come to correct conclusions almost instantly when a far more methodical examination of the data by non-experts turns up nothing (or, worse, conclusions that are false). He refers to the phenomenon as "thin slicing", the ability to cut through all the noise and recognize just the information that matters. This is often subconscious. Ask a master chess player why a certain positional move is bad and you might get a response like "it's just not the right move for that position." They're right, and they know they are right, but the reason they know it isn't easily articulated. They've seen thousands of similar positions and simply "know" what works and what doesn't. Tactical moves, of course, do require computation even by masters, but they will do it much faster because they are quicker at discarding variations that are clearly going nowhere.

At yesterday's lecture in Data Mining, it was mentioned that all this work is really just boiling huge data sets down to reveal the actionable information. In other words, it's thin slicing by machine.

Here's the rub. While a renowned expert may be able to get away with an opinion supported only by their reputation, machines can't. Or, at least, they shouldn't. The actuaries in my company are not going to accept conclusions without supporting documentation, even if the results match what their gut is telling them. They can't. Regulators require that the means by which decisions are made are documented and conform to accepted practice. One such accepted practice is that results be stated not as point estimates but confidence intervals.

So, how do you state a thin-sliced opinion as a confidence interval? Well, if I already had that worked out, I'd already have my PhD, but I should be able to tell you in a couple years.

Tuesday, January 19, 2016

Spring is here!

No, not real spring. It's 18F and snowing right now. But spring semester has officially begun. Today's class was "Data Mining". It's a joint upper undergrad/grad class, meaning it's the same lectures, but the grad students have to do a bunch of extra independent stuff. I'm fine with that. Actually, more than fine with it; I'm at my best when I can just dig into a problem on my own.

The prof is personable and engaging. This is his particular research interest, and that definitely shows. I was sufficiently impressed after the first class that I may ask if he'd be interested in joining my dissertation committee. I think I'll turn something in first, though. Always good to ask such things after you've done something to prove yourself.

Monday, January 18, 2016

Languages HW2

So, I put together a Courses page (see pages on right) to organize all my posts specific to my courses. Realized I never posted my second homework for Languages. Well, here it is:

CS4250: Programming Languages
HW#2: Eric Buckley - 18148800
5.6) Consider the following JavaScript skeletal program:
// The main program
var x; //main.x

function sub1() {
      var x; //sub1.x

      function sub2() {
            ...
      }
      ...
}

function sub3() {
      ...
}

Assume the execution order is: main calls sub1, sub1 calls sub2, sub2 calls sub3.
a) Assuming static scoping, which declaration is the correct one for a reference to x?
                i. sub1: sub1.x since it is local to the routine.
                ii. sub2: sub1.x since sub2 is declared within sub1, sub1 is the next available scope.
                iii. sub3: main.x since sub3 is declared directly within main and has no visibility to other locals.
b) Assuming dynamic scoping, which declaration is the correct one for a reference to x?
                i. sub1: sub1.x since it is local to the routine
                ii. sub2: sub1.x since, under dynamic scoping, the reference resolves through the caller (sub1), which has a local x.
                iii. sub3: sub1.x since the search works back up the call chain: sub3's caller (sub2) has no local x, so the reference resolves to sub1's.




5.7) Assume the following JavaScript program was interpreted using static-scoping rules.
var x; //main.x

function sub1() {
      document.write("x = " + x + "");
}

function sub2() {
      var x; //sub2.x
      x = 10;
      sub1();
}

x = 5;
sub2();

What value of x is displayed in function sub1?
                x = 5 since x refers to main.x
What value of x is displayed if dynamic-scoping rules are used?
                x = 10 since x refers to the calling program (sub2.x)

5.8) Consider the following JavaScript program:
var x, y, z;

function sub1() {
      var a, y, z;

      function sub2() {
            var a, b, z;
            ...
      }
      ...
}

function sub3() {
      var a, x, w;
      ...
}

List the variables, along with the program units where they are declared, that are visible in the bodies of sub1, sub2, and sub3, assuming static scoping is used.
                sub1: sub1.a,  main.x, sub1.y, sub1.z
                sub2: sub2.a, sub2.b, main.x, sub1.y, sub2.z
                sub3: sub3.a, sub3.w, sub3.x, main.y, main.z
5.11) Consider the following skeletal C program:
void fun1(void); /* prototype */
void fun2(void); /* prototype */
void fun3(void); /* prototype */
void main() {
      int a, b, c;
      ...
}

void fun1(void) {
      int b, c, d;
      ...
}

void fun2(void) {
      int c, d, e;
      ...
}

void fun3(void) {
      int d, e, f;
      ...
}

Give the variables visible under dynamic scoping during the last function call if the order of execution is:
a)            main calls fun1 calls fun2 calls fun3
The following table gives the variables defined at each call level:

      code    a        b        c        d        e        f
      main    main.a   main.b   main.c   -        -        -
      fun1    -        fun1.b   fun1.c   fun1.d   -        -
      fun2    -        -        fun2.c   fun2.d   fun2.e   -
      fun3    -        -        -        fun3.d   fun3.e   fun3.f

Thus, at fun3, the variables available are main.a, fun1.b, fun2.c, fun3.d, fun3.e, and fun3.f



5.12) Consider the following program, written in JavaScript-like syntax:
// main program
var x, y, z;

function sub1() {
      var a, y, z;
      ...
}

function sub2() {
      var a, b, z;
      ...
}

function sub3() {
      var a, x, w;
      ...
}

Given the following call sequence and assuming that dynamic scoping is used, what variables are visible during execution of the last subprogram activated?
c)            main calls sub2 calls sub3 calls sub1
The following table gives the variables defined at each call level:

      code    a        b        w        x        y        z
      main    -        -        -        main.x   main.y   main.z
      sub2    sub2.a   sub2.b   -        -        -        sub2.z
      sub3    sub3.a   -        sub3.w   sub3.x   -        -
      sub1    sub1.a   -        -        -        sub1.y   sub1.z

Thus, at sub1, the variables available are sub1.a, sub2.b, sub3.w, sub3.x, sub1.y, and sub1.z


Sunday, January 17, 2016

Algorithms Summary

CS 5130 - Advanced Data Structures and Algorithms

Course Outline (Chapters 1-4, 6-8, 15-16, 22-25, 29, 34, and 35 in text Cormen et al., Introduction to Algorithms, 3ed):
  1. Growth functions, Divide and conquer, Recurrences, and Asymptotic bounds.
  2. Sorting algorithms
  3. Greedy Algorithms
  4. Dynamic Programming
  5. Graph Representations
  6. Depth and Breadth first searches
  7. Minimum Spanning Trees
  8. Shortest Path
  9. NP-Completeness
  10. Approximation algorithms
  11. Linear Programming
Key Takeaways:

Recurrence trees are stupid. Unless the recurrence is so trivial that you don't need any help solving it, the tree is difficult to construct, even more difficult to verify, and ends up giving you a sloppy loose bound. Prof. Pan would probably very much object to that assessment, but on the homework I was able to produce and prove tighter bounds using other methods than I (or he) could using trees. Then I had to spend an inordinate amount of time trying to construct a tree to support the bound I had already proven correct. CPU time just isn't that precious anymore - write the recurrence in some language that supports list arithmetic (so your computation doesn't cause arithmetic overflow; you often have to go well beyond reasonable size values to get the true asymptotic behavior) and crank the actual numbers. Then plot it against various curves to find a tight bound. Takes less time and gives better results.
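
As a minimal sketch of that approach (a toy recurrence of my own, using plain doubles rather than anything overflow-proof): crank the numbers for T(n) = 2T(n/2) + n and plot the ratio against the candidate bound.

# assume T(1) = 1 and n is a power of 2
T_rec = function(n) if (n <= 1) 1 else 2 * T_rec(n %/% 2) + n

n  = 2^(1:20)
Tn = sapply(n, T_rec)

# if T(n) is Theta(n log n), this ratio should flatten out at a constant
plot(log2(n), Tn / (n * log2(n)), type = "o",
     xlab = "log2(n)", ylab = "T(n) / (n log n)")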

That said, I really need to shore up my Calculus. It's very rusty and that hurt me not just with recurrences, but with bounds in general. Think Calc might come up on a qualifying exam for math? Project for the summer, I guess.

Kinda odd, given the course title, that the two sections of the book that we skipped completely were "III - Data Structures" and "V - Advanced Data Structures". I skimmed through them and there's a ton of good stuff in there. This text is a keeper.

I think that graph algorithms in general are a very fertile field for probabilistic algorithms. Might talk to Prof. Pan about that since that's pretty much his thing.

The NP-Completeness stuff seemed very natural to me. A lot of the students seemed to struggle with the concept of reductions. It's a technique I've always liked. I remember being given an open problem in Simulation at Cornell where the Prof (Lee Schruben, yes, really) asked if a specification method was sufficient for any kind of simulation. As we generally did in such situations, most of the students gave it a go and when nothing obvious presented itself, we gave up. It was, after all, a problem that Schruben himself conceded was "probably true, but might be really hard to prove." Dave Tate was a bit more tenacious and showed that you could use the method to specify the simulation of a Turing Machine and since anything that can be simulated can be simulated on a Turing Machine, the method must be complete. It was brilliant (Schruben agreed) and I resolved to become a master of restating problems. I don't know that I've fully achieved that, but I love the technique and use it any time I can.


Stuff from class that's probably worth remembering:
  • Bounds
    • Loose upper: o(f)
    • Upper: O(f)
    • Tight: Θ(f)
    • Lower: Ω(f)
    • Loose lower: ω(f)
  • Bounds can be demonstrated by definition or by taking the limit of the ratio of the bounded function to the bound as the argument goes to infinity. If it converges to a nonzero constant, the bound is tight. If it goes to infinity or zero, it is loose. If the limit doesn't exist, the test is inconclusive. (A tiny numeric illustration of the ratio test follows this list.)
  • Time and space bounds of exponential order or greater are considered intractable, even though small and/or specialized cases may be solvable. Conversely, polynomially bounded problems are considered solvable, even if the order of the polynomial is too high for practical computation.
  • If an upper bound for a recurrence is leaving an annoying low order term during an induction proof, try SUBTRACTING that term from the bound. Counterintuitive, since you're actually tightening things, but by putting the tighter bound back into the recurrence, you get less slop after sending it through the inductive step.
  • "Master method" for recurrences (not really broad enough to be considered a master method in my mind, but that's what it's called and it is often useful). Works for T(n) = aT(n/b) + f(n), a≥1, b>1, f(n) asymptotically positive plus a few more caveats.
  • All compare-and-exchange sorts are Ω(n²). All sorts on keys of unbounded cardinality are Ω(n log n). Linear-time sorting is possible with restrictions on keys.
  • Optimal Sub Structure (OSS) for Dynamic and Greedy algorithms: Optimally solving subproblems yields global optimums.
  • Dynamic Programming works when the subproblems are Independent, Overlapping (used more than once by the global solution), and exhibit OSS.
  • Greedy Choice Property - there exists a globally optimal solution that incorporates the locally optimal choice. This and OSS enables greedy algorithms to always yield optimal results.
  • Greedy algorithms often return "good enough" results even when there are exceptions to the Greedy Choice Property.
  • Memoization (which is very hard to say without sounding like you have a speech impediment): caching the results of subproblems so they only get computed once - and, more loosely, leaving breadcrumbs along the way so you can reconstruct the solution when you're done.
  • Graph representations: Adjacency Lists and Adjacency Matrices. Lists are smaller and faster for sparse graphs. External note from the creators of HP Vertica, who have been spending some time with us at work: in the Big Data space, Adjacency Matrices are the way to go, since trading space for time is almost always a good thing.
  • All the basic graph algorithms (depth first search, breadth first search, minimal spanning tree, shortest path, etc) come down to how fast you can recognize a connected set of nodes. Several interesting data structures for managing sets are helpful in such operations. Notably, Fibonacci Heaps.
  • Many graph problems can be restated as linear systems and solved with Linear Programming solutions. This was very much news to me, and surprising news at that. I'm sure it got mentioned in one of my LP classes at Cornell, but I missed it.
  • Standard form for LP problems: maximize cᵀx subject to Ax ≤ b, x ≥ 0.
  • Slack form for LP problems: s = b - Ax, s ≥ 0. This form leads to the basic (on the left) and non-basic (on the right) variables which get pivoted in the Simplex method.
  • NP does NOT mean Non Polynomial. It means Nondeterministic Polynomial. I already knew that, but it's the sort of thing you want at the front of your brain because otherwise you might blurt out the wrong answer while making a bad joke about a hard problem after a couple of beers at happy hour and immediately be classed as The Guy Who Thinks He's Smart But Doesn't Know Shit.
  • Any NP-Complete problem can be stated as any other NP-Complete problem. Obviously, some of the reductions are more straightforward than others.
  • When showing NP-Completeness, it is OK to restate an optimization problem as a decision problem. That is, if the decision problem (is there a solution bigger than x?) is NPC, the optimization problem (what's the best possible solution) is also intractable (assuming P <> NP; if it turns out that P = NP, then the whole conversation is moot).
  • Approximation schemes come in several grades:
    • Approximable (APX): for some fixed ε > 0, the scheme is polynomial in n.
    • Polynomial-Time (PTAS): for any fixed ε > 0, the scheme is polynomial in n.
    • Fully Polynomial-Time (FPTAS): polynomial in both n and 1/ε.
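
The promised numeric illustration of the ratio test from the bounds bullet above (a toy example of my own): check whether 3n² + 10n is Θ(n²), and whether n³ is a loose bound.

n = 10^(1:8)
f = 3 * n^2 + 10 * n

round(f / n^2, 4)   # settles at 3: nonzero constant, so n² is a tight bound
round(f / n^3, 8)   # heads to 0: n³ is a loose upper bound (f is o(n³))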

Saturday, January 16, 2016

2011 Ultra Race Of Champions

Seems about time to throw a true disaster tale into the mix. This week's off day throwback race report is from my ill-fated brush with the best.

Run September 24, 2011

Elite. As with many positive adjectives, overuse has robbed it of much of its meaning. Even in cases where the standard is objective, the criteria are bewilderingly inconsistent. The Houston Marathon considers you elite if you can cover the distance faster than five minutes per mile. Choice Hotels sets the bar a bit lower; book 10 rooms in a year and you've got it. And, while Houston wants no part of me and Choice Hotels regularly sends me emails noting I'm just short of their threshold, the selection committee at the Ultra Race Of Champions (UROC) seemed to think I was worthy of the designation. So it was that I travelled to the Blue Ridge Mountains of Virginia to toe the line with some of the best ultra runners in the world.

My Carol's Team mate (and designer of our jerseys and logo), Kevin Robertson, will be crewing for me. He lives on the east coast and picks me up at the Baltimore airport. From there, heavy rain and heavy traffic slow our progress such that we don't get to Wintergreen Resort until shortly before the "Meet the Elites" panel is to convene at 7PM. With runners like Geoff Roes and Dave Mackey present, I'm certainly not the star attraction, but Race Directors J. Russell Gill and Francesca Conte (who simply go by "Gill and Fran") seem genuinely happy that I've made it. Twenty or so of the elite field are on hand to field questions. The event lasts about an hour, after which I get in a short jog and some dinner before heading to bed.

UROC has negotiated some excellent rates with Wintergreen so we stay at a condo only half a mile from the start/finish. While that obviates the normal race morning rush, I still wake up at 4:30AM so I can get my oatmeal and coffee two hours ahead of the 7AM start. The morning is damp and cool. A few gaps in the mostly cloudy sky to the east reveal a very pleasing sunrise just as the elites are being called to the line.

Gill says we'll be running a "parade lap" around the parking lot and back through the start to string things out a bit before hitting the trail. I don't know if the time counts or not, but the men's field seems to think it does and heads off at something approaching my 10K pace. I settle in with the women's pack which adopts a much more reasonable tempo. While my goal is simply to beat anybody in this field, I'm happy enough to sit at the back for now. After a few miles of singletrack we hit the big climb up to the Wintergreen Summit aid station at the high point of the course. It's a very steep mixture of singletrack and road. Despite mixing walking and running, I pass a couple of the women. By the time I reach the aid station I can't see anyone ahead or behind. The first five miles have certainly spread the field.

In hindsight, the next section of the course is probably the most critical. It starts easily enough, winding its way down the mountain on a mix of grass and dirt singletrack. After a mile or so, the trail hits the steeper portion of the ridge and becomes more technical. The added rocks are a blessing as they force me to slow down a bit. After another mile, however, the course switches to roads for the steepest part of the descent. I try to hold back, but there just doesn't seem to be any stride length that can mitigate the pounding of running down a grade this steep. It's with considerable relief that I reach the gatehouse at the entrance to the resort and start the climb back up to the top of the ridge. I get to the Reed's Gap aid station (9.3 miles) at 8:45, which isn't far off my intended pace (I could tell from the profile the opening would be slow), but I have a terrible feeling that the big descent has done some real damage to my legs. I tell Kevin that he should expect a long day.

The next section is about 5 miles of easy running on the Blue Ridge Parkway to White Rock Gap. Then comes the descent off the ridge to Sherando Lake. It's a fair drop on singletrack, but spread out over three miles so it's also pretty easy running. There's just one problem: neither section feels easy. Not really hard, either, but not easy. And at barely a quarter of the way in, everything should be feeling easy right now. Near the bottom, I pass the leaders who are coming back having done the lap around the lake. The issue is clearly in doubt as the top five are all within a few minutes of each other. At the lake, Kevin is upbeat, but there's concern in his face. If my stride has degraded enough that he's noticing it, I'm in trouble.

The loop around the lake is a gentle and picturesque trail of just over a mile. Back at the aid station, I grab plenty to eat and drink as the next aid station is over seven miles away on top of the ridge. There are no crew accessible stations until Whetstone Ridge at mile 33, so I won't be seeing Kevin for quite some time. "I'll run with you when you get there," he says. I take it as something of a vote of confidence that he thinks I'll still be running at that point.

For the moment, running is still a happening thing. In fact, the uphill muscles are in fine shape and I run the ascent to the ridge feeling pretty good. The bulk of the 100K field is coming down now and they aren't as good at "running skinny" as the leaders were. Fortunately, I'm the one going up, so stepping off the trail a few times to let them by doesn't slow me down much. After the ascent, the trail becomes undulating and quite rocky. My quads begin some serious complaining. Oddly, while the pain is significant, I'm actually covering the ground pretty well and pass a couple runners along the way.

After the Bald Mountain station (25.9 miles) the course returns to pavement for 3.4 miles to the water drop at Spy Run Gap. I cover this in under 34 minutes and just as I'm thinking 10-minute miles aren't that terrible, the road heads downhill.

The searing sensation in my quads is something I normally associate with the day after a race gone bad. I don't ever recall my legs hurting so much during an event. By the bottom of the hill I am cursing out loud with each step. As the road levels off, the pain relents a bit, but there's no getting around the fact that my quads are gone with half the race to go. I stagger on towards Whetstone Ridge trying to remember if I've ever run in such pain. I'm not tired; in fact I feel rather fresh, but every contraction of the quadriceps feels like it might be the last. I start seeing returning runners. Some of them look fine, but some look surprisingly shot. It's obvious I'm not the only one who's been caught out by the opening part of the course. At Whetstone Ridge I'm told several of the elites have already called it a day.

Kevin is ready to run with me and we head out to the turnaround on the Dragon's Back trail. The aid station workers had encouraged us by saying the trail was quite runnable and that turns out to be true. With the exception of one really rocky section, it's a fairly gentle path. Kevin notes that my stride seems to be coming back. Indeed, I am running better, though the pain is still pretty intense. At the turnaround, a password is posted that we have to remember to verify we went all the way out. The word is "quadzilla." We get a laugh out of that; it certainly won't be hard to remember in my current state.

The fields are getting mixed as the non-elites started only 15 minutes behind us. We pass quite a few runners both heading out and coming back. It seems I'm somewhere between 15 and 20 in the elite field and barely in the top 30 overall. Normally, I'd be pretty happy with that as I'm typically a good second half runner. Today, however, I don't kid myself that my final result will be anything like that good. It's just a matter of time before the legs quit altogether and I have to walk it in.

By the time we get back to the Whetstone Ridge station, my brief resurgence has ended. I stumble a few times on the trail, but manage to stay upright. Kevin looks at me seriously and says, "I'm worried about you." "I am, too," I reply. I'm heading back into the section of the course that isn't crew accessible, which means that if I don't pack it in now, I'll pretty much have to get all the way back to White Rock Gap on my own. It's only 12 miles and seven of them are on roads, so it seems a reasonable gamble. As far as I can tell, the pain is merely indicative of some very tight muscles and no long-term damage is being done.

After a short climb up from the aid station, the road is gently downhill for a couple miles. It hurts in ways I don't have words to describe but I'm able to run it. I rejoice as I reach the bottom of the steep climb up to Spy Run Gap (the one that caused the cursing coming down) and settle into a firm walk. At the top of the hill I get back to running and manage to get back to the Bald Mountain station only five minutes slower than the trip out.

It's 5:00PM and I have a half marathon to go. If I had any legs at all, I'd still have an outside shot at finishing by dark. I don't, so my real concern is covering the 5.3 miles to White Rock Gap before the sun goes down at 7. While taking the headlamp at Whetstone Ridge would have been prudent, I'm pretty sure I can make it. My worry is the last 2 miles into the station which takes a different route by way of a trail on the east side of the ridge rather than going back to the lake. I've been warned by local runners that it is exceedingly steep and technical. Also, since it's on the east face, it will get dark sooner. The aid station workers think I'll be fine, but I'm sure they don't realize how much trouble I'll have on the technical descent.

The first mile is jeep track and I run it knowing that this is probably the last bit of the course where both my feet will be off the ground at the same time. Once I get to the rocks, I slow to a walk. As the trail begins to descend off Bald Mountain, the walk literally becomes a crawl. I simply can't step both forward and down anymore, so I have to go down sideways grabbing onto trees and boulders to keep my legs from giving out. I get to the fork onto the new section just before 6PM.

For about half a mile the trail is gentle, bringing me to the top of a magnificent cliff. There's a waterfall just to the north and I know we cross that stream somewhere below the base which is a really long way down. The descent is a series of rocky switchbacks. Even sideways isn't working particularly well anymore; I have to turn completely around and crawl backwards on all fours to get down the bigger drops. I'd hate to bail with less than two miles of tough trail remaining, but I'm not going to be stuck in the woods after dark without a light. I set a turnaround time of 6:40. If I don't cross the stream by then, I come back up and walk to White Rock Gap on the road. I get to the base of the cliff at 6:20, but the trail continues to drop. At several points it gets tantalizingly close to the stream, but then the stream drops away some more. Finally, at 6:32, I reach the bridge and get to hustling back up the other side. The trail back up is easier and I breathe a huge sigh of relief as I waddle into the White Rock Gap station at 6:55.

There are nine miles of road remaining and I've got my light. It won't be pretty, but I'll finish. Except, I don't have my light. I look around the parking by the aid station and there's no sign of Kevin or his car. The aid station workers try to call ahead to the next station figuring he's probably there, but the radio operator isn't able to get through. Then comes one of those gestures that ultra runners just take for granted, but my experience in other sports has shown it to be quite exceptional: the crew for another runner digs up a spare light and gives it to me. I'm only on the road for a couple minutes before Kevin comes by. He had missed the note that he was to go to this station amongst the scribbles we had put in the race bible last night and finally decided to come back when it was clear I was getting caught by dark. I take the opportunity to change into a long-sleeved shirt and tell Kevin to return the light and then find a real dinner. Everything will be closed by the time I get done and I don't want him to have to settle for aid station fare after all he's done for me today.

With any hope of a respectable finish gone, I settle into a firm walk for the rest of the way home. No point in making the recovery any longer than it already will be. I'm surprised by how few people pass me. Only six come by on the way to Reeds Gap and nobody passes me after that. The first to pass asks how I'm doing and I reply truthfully that I'm wrecked and a bit disappointed. "You usually go faster than this?" "Yeah, by the time I finish I'm gonna be close to my 100 mile PR." "Wow, this is the farthest I've ever run." And off he goes.

It occurs to me that I may have been a bit of a jackass. Beating one of the elites might just be the proudest moment of this guy's life. At least on this day, he was better than me and I had no business suggesting it was a fluke. When the next guy comes up and asks the same question, I stick to the facts, "This course kicked my ass." "Yeah, well, hang in there." "You, too." And off he goes.

The 1-mile descent down to the gatehouse from Reeds Gap is every bit as unpleasant as I thought it would be. About halfway down, I look up and see an airplane emerging through the clouds with its landing lights on. This seems a bit odd since there is nowhere to land on the ridge and it's going the wrong way to be on approach for the valley. Then, I realize it's not a jet, but the floodlights next to the finish. I've lived in the Midwest long enough that I'm just not used to looking that far up and seeing land.

As much as it is a grunt, the climb to the finish is a welcome change of pace. I cross the line at 9:47PM (14:47 elapsed time). Gil is there and gives me a sincere handshake. Inside, Fran is serving up pasta, capping the excellent support and organization throughout the weekend. I confess to them that I'm a bit ashamed to reward placing me in the elite field with my worst run in recent memory. They take the opposite stance, saying how happy they are to see people hanging in to finish. That's a view echoed by Kevin who notes that there are a lot of less able runners struggling in when some of the biggest names in the sport have given up. I'm not too hard on the top guys for being pragmatic and saving themselves for another race when this one went bad, but the point is well taken: at the end of the day, it's not what you can do but what you actually do that matters.

While I am the last elite to finish, half of them didn't; I'm 12th in the elite men's field so the goal of not being last was met (it's not totally bogus, either, as I passed another runner in my field on the trail heading out to Bald Mountain while he was still in the race). Overall, I'm 44th which is pretty dismal. The disappointment for me is not that I finished badly, but that this should have been a very good course for me. The first 9 and last 4 miles were absolutely brutal but, aside from the descent by the waterfall, everything else was relatively fast running on roads and moderately technical trail; just the sort of terrain I typically tear up. Instead, I covered all of that distance hobbled by the fact that I'd destroyed my quads in the first 90 minutes. It's not like I would have won the thing, but a good performance was there for the taking.

At any rate, I'm not one to pine over what might have been. Upon hearing my tale of woe, several folks have offered condolences. I need no pity. First, there's the simple fact that I knew that running four ultras in eight weeks was probably asking a bit much from my body. I wouldn't really want to give back the second place at Howl or the win at Flatlanders. This was the price and I accept that. More importantly, while the performance was a disaster, the experience was quite fine. I have a friend who paid $20,000 to go to a fantasy camp and play baseball with major league players. He says it's the best money he ever spent. I got to play with the same level of athletes and it wasn't fantasy, it was a real game. I'd be just a bit disappointed if they hadn't dished me a thumping. I don't suppose many race directors will be inviting me to join the elite field again anytime soon but, if I really want the status, I could always stay a few more nights at the Comfort Inn.

Friday, January 15, 2016

Resources

Math posts are still on hold while I get my Algorithms summary done. This one is a little less fluffy than yesterday.

Among the grave crimes of Corporate HR is the trend towards dehumanizing the workforce. In particular, companies have taken to referring to the folks that work for them as "resources" or, even worse, "assets". Listing a person as an asset on your books became illegal on December 6, 1865 with the ratification of the 13th amendment. I've been supporting HR and accounting systems in various forms for 20 years and, trust me, if companies could claim employees as assets, they most certainly would.

The reason they can't is because to claim something as an asset, you have to actually have ownership of that thing. Even though it's clearly an investment that will be realized over time, you can't capitalize the cost of a training class or tuition reimbursement. Why? Because that person can walk out your door at any moment and never come back and there's not a thing you can do about it. The only exception to this is the military, which is protected under desertion laws and can (and does) require service in exchange for education. Some companies try to get around this by requiring you to pay back the cost if you leave. Of course, that's completely unenforceable; the worst they could do is ding your credit rating. At any rate, if a company ever asks you to do that, you might want to ask them why they're so scared you'll bail. Maybe it's not a very good place to work.

Resources and assets are things you use to get a job done. You may do well to conserve them or use them wisely, but you're not required to. They belong to you. People are living creatures created in the image of god. To treat them as resources is an insult to them and their maker. Even if you don't believe that, you'll get a lot more out of them if you act like you do. Unlike resources, people can choose how useful they want to be. Motivated people are incredibly more capable than people who don't care about the organization they work for. Why companies want to diminish this by reducing them to financial line items is beyond me.

My current client doesn't want to do that. They're taking THE ENTIRE IT staff (employees and contractors) to Star Wars this afternoon. You wouldn't believe what happened to morale when that was announced unless you were here to see it. They'll have to expense the cost, but the gains will be realized for quite some time.

Thursday, January 14, 2016

Practical engineering triumph of the day

Hey, I warned you that some of these posts would be fluff...

Aside from running and singing, cooking is one of my off-hours passions. I've done it since college, but I've only taken it seriously in the last six or seven years. As with running and singing, I've had to back off quite a bit to make time for studies. One of the obvious optimizations is to make meals that really are two meals. Chili mac is a favorite. I make chili one day and then dump the leftovers in a baking dish, top it with mac&cheese, and bake it the following night.

Now, as anyone whose cooking skills have advanced beyond mac&cheese from a box knows, the trick is getting the cheese sauce the right consistency. The ratio of butter, cheese, milk, and flour can be tweaked on the fly to get the desired thickness; the problem is lumps. If you add flour at the end, it all clumps up. I've usually mixed it into the milk with a whisk and then added the milk slowly to the melted cheese and butter until it's the thickness I want. That works, but you still wind up with a few lumps. Nobody really cares because, after you bake it, the little flour lumps are indistinguishable from little pieces of pasta. It's just one of those things that is fun to get really right (plus, there are lots of other uses for cheese sauce where the lumps do matter).

Enter the immersion blender. Does a great job of getting rid of the lumps but, since I only need a few ounces of milk, it's hard to get it to actually immerse. As you might guess from the name, it doesn't work very well when not fully immersed. My immersion blender is from Cuisinart and it came with a 2-cup measure. I didn't read the pamphlet that came with it, so I just figured that this was one of those things that get thrown in with an appliance whether you need it or not. It's a nice graduated measuring cup and it's microwave safe so, fortunately, I held on to it.

It occurred to me while splattering myself a few weeks back trying to use the blender in the actual pan on the stove with the melted sauce (DON'T TRY THAT; IT DOESN'T WORK AND HOT CHEESE WILL BURN YOU), that the measuring cup is just barely wide enough for the immersion blender to fit into. Could it be that Cuisinart actually designed these things to work together? So, this morning, I tried it (I usually assemble the dish in the morning so Kate can put it in the oven before I get home). Wha'dya know? It whipped the tablespoon of flour into 4 ounces of milk perfectly. As I gently stirred it into the melted cheese and butter, a blemish-free sauce formed in the pan before my eyes. It's like Velveeta except that, being made from substances found in nature, it actually tastes really good.