partPlan = read.csv('data.csv')
amount = sort(partPlan$AMOUNT)
index = seq_along(amount) # about 700,000 observations
plot(index, amount, type='l')
smallest = 1:1000
plot(smallest, amount[smallest], type='l')
nearlySmallest = 1000:10000
plot(nearlySmallest, amount[nearlySmallest], type='l')
notSoSmallest = 10000:100000
plot(notSoSmallest, amount[notSoSmallest], type='l')
Hell's bells. We're now roughly 1/3 of the way to the median. I had hoped things would get more linear as we moved toward the middle of the data; if anything, the bend is more pronounced. This means that even if we stratify on order of magnitude, we'll run into the same problem within each stratum, and if the maximal observation for the slice we're looking at falls in one of the lower strata, we may end up sifting through darn near the entire data set.
That's not to say the general idea won't work, just that it will have to be a bit more sophisticated about trimming the data.
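As a starting point for that trimming, it might help to see how many observations land in each order-of-magnitude stratum. What follows is only a rough sketch: `strataSummary` and the `toy` vector are made up for illustration, and it assumes the amounts of interest are strictly positive (anything else is dropped before taking logs). It is not the stratification scheme itself, just a way to size the strata.

```r
# Hypothetical helper (not part of the analysis above): bucket positive
# amounts by order of magnitude and report each stratum's size along with
# the cumulative share of observations at or below that stratum.
strataSummary = function(amount) {
  amount = sort(amount[amount > 0])   # log10 needs positive values
  stratum = floor(log10(amount))      # 0 for [1,10), 1 for [10,100), ...
  counts = table(stratum)
  data.frame(
    stratum = as.integer(names(counts)),
    count = as.vector(counts),
    cumShare = as.vector(cumsum(counts)) / length(amount)
  )
}

# Tiny made-up vector, only to show the shape of the output
toy = c(0.5, 3, 7, 12, 45, 80, 150, 900, 2500, 40000)
print(strataSummary(toy))
```

On the real data this would be called as `strataSummary(partPlan$AMOUNT)`. If most of the mass piles up in one or two strata, that is where the sifting problem will bite hardest.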