Drowning in a Data Lake

Two years ago, all anyone wanted to talk to me about was how data lakes would revolutionize the way we "do" analytics. The popularity of data lakes was built on the hype that consolidating data removes information silos. Great theory, bad practice. Organizations cannot simply store data wherever and however they please. Take a state revenue agency, for example. There are rules and security requirements around data and what can be "co-mingled". Anyone familiar with IRS Publication 1075 knows exactly what I mean. Protecting data is vital to compliance with state and federal regulations, and it requires strict governance; otherwise, agencies leave themselves wide open to privacy and compliance risks.

Do these data lake initiatives really make sense for your organization? The answer is most likely no, but not for the reasons you think. Regulations, cost, and IT infrastructure are important, but the most limiting factors are people and good ideas. Do you have people who can develop and execute on good ideas that will extract value from the data lake? Even if you do, can they be devoted to answering questions with the data lake, or do they have a day job that will keep them from ever using it?

Data lake initiatives often fail because organizations are obsessed with getting data in without a clear understanding of what information they want to get out. Ingest now, analyze later = bad idea. I was speaking recently with someone who described how hard it was to get their data lake started, but after two years and several million dollars, they are now ingesting over 1 TB per day into the lake. The project was declared a success. I asked: "What are you analyzing with all these data?" His response: "I'm not sure, but at least the data are all in one place."

Without a clear understanding of how to analyze these data, he is drowning in his data lake. These data will continue to grow untamed, something he admitted. The focus of the project becomes keeping the lake filled instead of filtering the lake for information they can use and act on. Before jumping into the lake, organizations need some idea of what information should come out of it and whether its users understand how to use it.

After about ten minutes of discussing his data lake, it became clear that there were some simple questions they forgot to ask:

  • Can we correlate case notes, call center data, and customer demographics to improve our call resolution times?
  • Does anyone know how to conduct a text mining project so we can examine patterns in transcribed customer calls? What is NLP and does anyone know how to read the output from this procedure? Our analytics vendor told us NLP was great, but does anyone know why?
  • Are the chatbots we deployed solving customer inquiries? Do we need to alter the scripts for the bots?
  • Do we have a person who can interpret the regression coefficients from our call volume forecasting model? What is a regression coefficient? (See the sketch after this list.)

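To make that last question concrete: a regression coefficient is just the model's estimate of how much the outcome moves when one input changes. Below is a minimal sketch, using an entirely made-up call volume data set (the flags and numbers are invented for illustration, not taken from the conversation above), of what "interpreting the coefficients" looks like in practice.

```python
# Hypothetical sketch only: variable names and numbers are invented for illustration.
import numpy as np

# Hypothetical daily call center data. Each row is one day:
# [intercept, is_monday, is_filing_deadline_week]
X = np.array([
    [1, 1, 0],   # Monday, normal week
    [1, 0, 0],   # mid-week, normal week
    [1, 1, 1],   # Monday, deadline week
    [1, 0, 1],   # mid-week, deadline week
], dtype=float)
y = np.array([520, 410, 980, 760], dtype=float)  # calls received that day

# Ordinary least squares fit of call volume on the two flags.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, monday_effect, deadline_effect = coef

# A regression coefficient is the expected change in daily calls when its
# flag flips from 0 to 1, holding the other flag constant.
print(f"baseline calls per day:         {intercept:.0f}")
print(f"extra calls on Mondays:         {monday_effect:.0f}")
print(f"extra calls in a deadline week: {deadline_effect:.0f}")
```

If nobody in the organization can read output like this, the forecasting model sitting in the lake isn't answering anything.
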
If you haven't thought through the problems you are trying to solve with the data that will fill the lake, how will you be able to define the return on your investment? It is doubtful that a data lake like this one will deliver the results on which it was sold.