Friday, March 10, 2017

Crunching

I spent most of the week refactoring my caching because I was concerned that I'd run out of memory on my laptop (no such worries once I move this to the Hadoop cluster; it can easily fit the entire data set in memory). Anyway, the big pull is underway. I'll be taking in 760 million rows, almost twice as many as I used for CISS. While the bigger row set will slow down processing a bit, it also means I can use a larger block size, which should give better options for partitioning. I'm using a 500,000-row block limit, so I'll have over 1,500 blocks to work with. That seems like a good sample size; we typically partition our cubes into around 1,000 partitions, so it's a pretty realistic number.
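For the curious, the blocking itself is nothing fancy. Here's a minimal sketch of the idea (names and helpers are illustrative, not the actual project code): buffer rows into fixed-size blocks and hand each completed block off to the cache.

```python
BLOCK_LIMIT = 500_000  # rows per cached block

def blocks(rows, limit=BLOCK_LIMIT):
    """Group an iterable of rows into lists of at most `limit` rows."""
    block = []
    for row in rows:
        block.append(row)
        if len(block) == limit:
            yield block
            block = []
    if block:          # final partial block
        yield block

# 760 million rows at 500,000 rows per block works out to roughly
# 1,520 blocks -- the same ballpark as the ~1,000 partitions we
# typically use for a cube.
```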

As you can see from the monitor shot below, Oracle is the bottleneck as far as pulling the data goes. My PC isn't even breaking a sweat; it's waiting on Oracle to retrieve rows. There is a faster way to do this, which we discovered when we hooked up the HDFS ingestion: a native driver that pulls entire partitions. As long as you're OK with getting every row and column in a partition, it's really fast. I could have done that here, but I'm OK with waiting a few hours for the load to complete. I have to sleep sometime.
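The slow path I'm actually using looks roughly like the sketch below: stream rows out of Oracle in batches with cx_Oracle. The credentials, DSN, and query are placeholders, and the native partition-level driver mentioned above isn't shown.

```python
import cx_Oracle

conn = cx_Oracle.connect("user", "password", "host/service")  # placeholder connection
cur = conn.cursor()
cur.arraysize = 10_000                       # fetch rows in larger round trips
cur.execute("SELECT * FROM fact_table")      # placeholder query

while True:
    rows = cur.fetchmany()                   # pulls `arraysize` rows per call
    if not rows:
        break
    # ... append rows to the current cache block ...

cur.close()
conn.close()
```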

