Monday, February 15, 2016

Data in R

So, I'm working with the data in R now. I haven't done a whole lot yet, but here are some graphs that make the problem obvious enough (no attempt at making them pretty right now):

partPlan = read.csv('data.csv')
amount = sort(partPlan$AMOUNT)  # ascending; sort() drops NAs by default
index = seq_along(amount)       # approx. 700,000 observations
plot(index, amount, type='l')
OK, obviously we have something like exponential growth going on at both tails. Let's take a closer look at the left side (the right isn't much different).

smallest = 1:1000
plot(smallest, amount[smallest], type='l')
That's ugly, but it's only a few observations at the tail. Let's trim that off and see how it looks.

nearlySmallest = 1000:10000
plot(nearlySmallest, amount[nearlySmallest], type='l')
Uh-oh. The magnitude is smaller, but the shape hasn't changed. I wonder if this keeps up?

notSoSmallest = 10000:100000
plot(notSoSmallest, amount[notSoSmallest], type='l')
Hell's bells. We're now about 1/3 of the way to the median. I had hoped things would get more linear as we moved into the middle of the data. If anything, the bend is more pronounced. What this means is that even if we stratify on order of magnitude, we're still going to run into the same problem within each stratum. And if the maximal observation for the slice we're looking at is in one of the lower strata, we may be sifting through darn near the entire data set.
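
To put a rough number on that stratification worry, here's a minimal sketch. It does not use the real data: the log-normal amounts, the medianPosition helper, and the 100-observation cutoff are all stand-ins I made up for illustration. The idea is to bucket the sorted values by order of magnitude and check where the median falls within each bucket's range. If the sorted values rose in a straight line within a bucket, the median would sit near 0.5; values well away from 0.5 mean the bend is still there.

```r
# Synthetic stand-in for partPlan$AMOUNT (assumption: positive, log-normal).
set.seed(42)
amount <- sort(rlnorm(100000, meanlog = 3, sdlog = 2))

# Stratify on order of magnitude: stratum k holds values in [10^k, 10^(k+1)).
stratum <- floor(log10(amount))

# Hypothetical helper: position of the median within the range of x.
# Roughly 0.5 when sorted values rise linearly; NA when the range is empty.
medianPosition <- function(x) {
  spread <- max(x) - min(x)
  if (spread == 0) return(NA_real_)
  (median(x) - min(x)) / spread
}

size <- tapply(amount, stratum, length)
bend <- tapply(amount, stratum, medianPosition)

# Only report strata with enough observations to say anything (arbitrary cutoff).
keep <- size >= 100
result <- data.frame(
  stratum = names(size)[keep],
  size    = as.vector(size)[keep],
  bend    = round(as.vector(bend)[keep], 3)
)
print(result)
```

Running the same thing on the real sorted amount vector (with the sign handled separately if there are negative amounts) would show whether the actual strata behave the same way.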

That's not to say I think the general idea won't work, just that it's going to have to be a bit more sophisticated about trimming the data.
