Saturday, March 11, 2017

We got data

I'm not exactly sure why I have 2.3 billion fact rows in my data set. All indications were that I would only have 780 million. However, all the rows have good referential integrity and there are no duplicates, so I'm going with it. I think it has to do with the way we exclude partitions from the cube. The Plan dataset only covers 5 years, but a lot of the submissions were 20-year projections. The partitions in the cube total 780 million rows, but I picked up all the other stuff as well. At least, I think that's what happened. I'm not going to spend a day trying to verify it because I don't really care; I needed data, and I got data.

Anyway, the load worked and it's still "only" 100 GB, so it's not killing my laptop. I now have 4680 blocks to work with. I expect the block count to rise significantly when I start partitioning them, since partitions never fill perfectly. I'll fill in the dimensions today and get going on the query engine. Once that's done, I can run a baseline and start reblocking stuff (which is the whole point of this exercise).
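Just to convince myself the block count really will jump, here's a back-of-the-envelope sketch. The numbers and the partitioning below are made up for illustration; they aren't my actual layout, just the "partitions round up to whole blocks" effect.

```python
import math

# Illustrative numbers only -- roughly what I have today, not exact.
total_rows = 2_300_000_000
blocks_unpartitioned = 4680
block_capacity = total_rows // blocks_unpartitioned   # ~491k rows per block

def blocks_needed(partition_row_counts, capacity):
    # Each partition rounds up to a whole number of blocks, so the last
    # block of every partition is partially empty. That rounding is where
    # the extra blocks come from.
    return sum(math.ceil(n / capacity) for n in partition_row_counts)

# Hypothetical: 10,000 partitions averaging 230k rows each (same 2.3B total).
# Every partition needs a full block even though it's less than half full.
example_partitions = [230_000] * 10_000
print(blocks_needed(example_partitions, block_capacity))   # 10000 vs 4680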

I am a little worried about performance. With this many rows, running the queries is going to take a while, at least until the blocks start getting closer to optimal. I'll basically be table-scanning the entire fact set until I can start excluding partitions. That might mean I only get one overnight run on the early partitions. Unless that improves near the end, the most I can hope for is around 30 generations between now and when the project is due. That's not really very many; I'd rather have a few hundred.
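For context, the exclusion I'm talking about is basically min/max pruning per block: a query only scans blocks whose key range overlaps its predicate, and until the blocks are laid out well, every range overlaps and it degenerates into a full scan. A rough sketch of the idea follows; the names and structure are made up, not the real engine.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Block:
    key_min: int          # smallest partition-key value stored in this block
    key_max: int          # largest partition-key value stored in this block
    rows: List[Tuple]     # rows, with the partition key in position 0

def scan(blocks: List[Block], key_lo: int, key_hi: int) -> List[Tuple]:
    hits = []
    for b in blocks:
        # Skip blocks that can't possibly contain matching rows.
        if b.key_max < key_lo or b.key_min > key_hi:
            continue
        hits.extend(r for r in b.rows if key_lo <= r[0] <= key_hi)
    return hits
```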

On the other hand, it does make for a more realistic test. If we were to implement this thing in Production, the time to do the blocking would be when a new version of the business dimensions arrives. We use slowly-changing dimensions, so any attribute blocking would have to be redone at that time anyway. Since we get dimension updates twice a day, this thing will need to tune itself pretty quickly. I'm fine with that; I just don't know that my first stab at a solution is going to converge that quickly.
