The concept of the Data Lake is shrouded in mystery. What is it? Is there some kind of Loch Ness data monster swimming around at the bottom of it?
As a term, Data Lake is still being defined as it develops. Put simply, a Data Lake holds raw data in its native format for later use. Pentaho’s CTO James Dixon is credited with coining the term. The practice has grown in popularity recently and is used heavily in Big Data initiatives. As a tool, the Data Lake is disrupting the Data Integration market and helping to redefine the way enterprises handle their data. To give a more in-depth definition, a Data Lake stores disparate information while ignoring almost everything else about it. Unlike a data warehouse or a data mart (a small slice of a data warehouse that users extract their data from), the lake pays no attention to how or when its data will be used, governed, defined or secured.
Big Data initiatives have begun to use Data Lakes much more of late because a Data Lake holds all of its data in an unstructured, unorganized format. The data is not specialized, meaning that it can be manipulated in a variety of ways. In many cases, Big Data works better this way. In the past, data warehouses were sufficient storage areas for data because they were better organized, and that’s still true. However, it becomes difficult for data scientists to uncover insights when data is pre-organized. Sure, it may take longer to get from point A to point B, but what the Data Lake has going for it is that all of the information stored within it is available at any given time, in its native format.
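To make the idea of keeping data in its native format and shaping it later more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the sample records, the RAW_LAKE list and the read_with_schema helper are hypothetical and do not come from any particular Data Lake product.

```python
import json

# Hypothetical raw records, kept exactly as they arrived.
# They have different shapes and share no predefined schema.
RAW_LAKE = [
    '{"device": "thermostat-1", "temp_c": 21.5, "ts": "2014-08-01T10:00:00"}',
    '{"user": "alice", "action": "login", "ts": "2014-08-01T10:01:00"}',
    '{"device": "thermostat-1", "temp_c": 22.0, "ts": "2014-08-01T11:00:00"}',
]


def read_with_schema(raw_records, required_fields):
    """Parse each raw record and keep only those that have every required field."""
    rows = []
    for line in raw_records:
        record = json.loads(line)
        if all(field in record for field in required_fields):
            rows.append({field: record[field] for field in required_fields})
    return rows


# Two different analyses impose two different structures on the same raw data.
temperature_rows = read_with_schema(RAW_LAKE, ["device", "temp_c"])
activity_rows = read_with_schema(RAW_LAKE, ["user", "action"])
```

The structure is chosen at read time, so the same untouched records can serve a temperature analysis and a user-activity analysis without being reloaded or reorganized.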
In a competitive world where every scrap of data matters, the Data Lake can be intriguing. Considering that the Internet of Things is the next big topic in Data Integration, the Data Lake’s popularity should continue to grow. Unlike a data warehouse, the Data Lake is not limited to specific, static structures. Further, with a Data Lake, the analyst dictates the kinds of analysis that are possible using the data, rather than the storage structure dictating them.
Not everything is perfect at the lake, however. While it does allow for more advanced searching of larger volumes of data, there are no unique identifiers. The extractor has to start from scratch to create each new data analysis because there is no metadata. It’s a lot more difficult to search through a pool of unfiltered data when nothing has a category or class designation. In short, gaining value from a Data Lake is difficult.
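A rough sketch of what starting from scratch can look like in practice follows. The RAW_BLOBS sample and the find_records_with_field helper are hypothetical, and the example assumes the only parser on hand is a JSON one. With no metadata describing the contents, every item has to be opened and examined.

```python
import json

# Hypothetical lake contents: raw blobs in mixed formats,
# with no catalog or metadata describing any of them.
RAW_BLOBS = [
    '{"order_id": 17, "total": 40.0}',
    'sensor=7;reading=0.93',
    '{"customer": "bob", "city": "Austin"}',
    'free-text note with no structure at all',
]


def find_records_with_field(blobs, field):
    """Scan every blob, since nothing says in advance which ones contain `field`."""
    matches = []
    examined = 0
    for blob in blobs:
        examined += 1
        try:
            record = json.loads(blob)
        except json.JSONDecodeError:
            # Not JSON; a real extractor would need another parser here.
            continue
        if isinstance(record, dict) and field in record:
            matches.append(record)
    return matches, examined


matches, examined = find_records_with_field(RAW_BLOBS, "order_id")
```

Here one matching record is found, but only after all four blobs have been examined, and two of them could not be read at all with the parser available.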
Since the data is not defined, there is no oversight as to what exactly is being dumped into a lake. Is the data useful? No one knows until it is analyzed. In a data warehouse, at least, data can be organized by quality. Here, it’s all mixed together. This also raises security concerns. If no one knows what kind of data resides in the lake, they might not find out that some of it is corrupt until it’s too late. Shortcomings in this space are important to note, as organizations have started using this technology with no real push for security measures. Compromises in security need to be addressed.
Business Intelligence tools have a tough time sifting through all the mud at the bottom of the lake. BI solutions, for the most part, are engineered to analyze organized data. They simply don’t function at a high level when asked to take on completely unstructured information. Though data warehouses provide a lot less raw data, that data is drastically more defined. A skills gap was already one of the biggest problems in the Data Integration space. The use of the Data Lake requires more highly skilled integrators, something that may not be available for quite some time.
In a recent post, Gartner warned against falling into the “Data Lake Fallacy.” Its viewpoint was clear: enterprises need to be careful about jumping right into Data Lakes and using them as their main integration source for analytics. Gartner argues that while there are benefits, the industry has yet to adapt, and applications within the enterprise environment are uncertain at this point.
Andrew White, VP at Gartner, writes: “The need for increased agility and accessibility for data analysis is the primary driver for Data Lakes. Nevertheless, while it is certainly true that Data Lakes can provide value to various parts of the organization, the proposition of enterprise-wide data management has yet to be realized.”
The industry has started to latch on to the Data Lake initiative. Informatica has just joined forces with Pivotal and Capgemini to put forth a Data Lake program they call Business Data Lake, a solution that aims to rein in the Data Lake and make it usable for a wider audience of businesses.
They describe the initiative: “Current Big Data solutions face limitations and are not comprehensive enough to support the data pipelines and real-time capabilities required for operational systems and often do not meet the required levels of data governance, quality and security. The Business Data Lake addresses these issues and helps businesses leverage their data in a way that makes sense, from both an individual and business perspective, rather than just a single enterprise view.”
One has to think that Data Lakes will continue to grow in popularity as the Internet of Things boom looms and all its connected devices begin to stream into the marketplace. For now, though, it seems safer to store data outside of the lake, given the concerns outlined above. It doesn’t look like Data Lakes will make warehouses obsolete any time soon, at least until someone finds a way to give them enough organization and security to make them worthwhile for medium-sized initiatives. But if someone swoops in to organize and secure the Data Lakes, won’t that just make them Big Data warehouses?