Big Data - Deep or Wide?

There has been a revolution going on in the Business Intelligence (BI) world in recent years. Those who follow the trends in BI and data warehousing are probably aware of the growing interest in a wave of database systems expressly developed to analyze the enormous data stores created by the maturing internet juggernauts. Companies such as Google, Facebook, Amazon, and Yahoo now want to analyze hundreds or even thousands of terabytes of data that they consider essential to their business. Welcome to the brave new world of "Big Data." Technologies such as NoSQL database systems and the MapReduce programming model, and products such as Hadoop, Hive, and Pig, are becoming more and more mainstream, and consequently more and more the topic of discussion on blogs and at conferences.

The question is, how much of this really pertains to the world of higher education management systems, i.e., the institutions that run SunGard HE, Datatel (both now Ellucian), Jenzabar, Campus Management, and PeopleSoft systems? Aren't they also struggling to make sense of copious amounts of data? As someone who has worked with BI and reporting in this space for most of the last decade, I find the focus on "Big Data" solutions a bit frustrating, because I see these tools addressing a different problem from the one we face in the higher education BI world. This may seem a little counter-intuitive, as there certainly is more data than ever involved in running our campuses and institutional systems. Wouldn't tools focused on "Big Data" help us too?

To illustrate, if you think of data as a swimming pool, the typical "Big Data" applications work with swimming pools that are very, very deep and contain a whole lot of water. The ability to pump huge volumes of water is what the job is all about. On the other hand, I see our data in the higher education management space as a swimming pool that is comparatively shallow but has an incredibly broad surface area. The overall volume of water is not comparable to those "Big Data" swimming pools, but the surface area may be much greater, and the structure and interrelationships of the different parts of the "swimming pool" are very complex.

Typically, an institution is not dealing with mammoth volumes of administrative data (unless it is a really big school doing clickstream analysis on its websites and learning management system, perhaps). The total number of customers at our enterprises (our students) and the number of items they typically buy (classes, housing, meal plans) are relatively modest, again compared to the Amazons of the world. However, the variety of types of data we deal with is huge, and ranges from housing preferences to complicated faculty contract tenure payments to accounts payable records to course prerequisite and degree requirement rules, etc. The list of business transactions that occur in the management of an institution is incredibly diverse and complex. It is a wide swimming pool of data with a huge surface area, though as I said, perhaps not that deep at any point.

As someone working in the higher education reporting world, I am looking for support not for "Big Data," but for what I think of as this "Wide Data" paradigm. Rather than tools that support incredible throughput on massive data sets, this implies a need for tools that help in the analysis of complex data sets. In particular, these tools should make us more nimble in quickly modeling and integrating new data sources into BI and data warehousing environments. This data must be readily available for our reporting and analytic delivery to our end-users.

There is another trend in the BI world which may prove much more fruitful for our future endeavors, in my view: the emergence of in-memory databases such as SAP HANA, or even Microsoft's PowerPivot and BI Semantic Model, which are essentially making the whole idea of pre-aggregated measures a thing of the past. But more on that in a future post.