Well, we knew it was coming. Production went down big time last night. We had it back running by mid-morning, but it took the rest of the day and some of the evening to clean up the mess. At least we now know one of the failure patterns. One of the disconcerting things about a complete overhaul of a system is that when it goes into production, you really don't know how it's likely to fail. All the failure patterns identified prior to go-live have been addressed, but only a fool thinks there aren't more waiting out there.
The old system failed a lot, but it failed in ways we knew. It wasn't great fun to be woken at 2 AM to fix it, but it was generally pretty easy to fix. The fact that the new system has been running fine in prod for over a month with no failures is certainly an indication that it's more robust than the old one. But today was a reminder that when it does fail, we have some thinking to do.
Anyway, it's all back running again now. And I did slide in just a little math today: I proved that the bias on the CISS variance is O(1/h), where h is the number of non-zero blocks sampled so far. That result is pretty useful when deriving the optimal number of strata.