The Emergence of the Enterprise Data Lake: A Brief
The concept of the Data Lake is shrouded in mystery. What is it? Is there some kind of Loch Ness data monster swimming around at the bottom of it?
The Data Lake is being defined as developers evolve it. Put simply, a Data Lake holds raw data in its native format for later use. Pentaho’s CTO James Dixon is credited with coining the term. Data Lake deployment has grown rapidly in recent years and is now a key cog in a wide variety of Big Data initiatives. Data Lakes are disrupting the Data Integration market and helping to redefine the way enterprises handle their data. To give a more in-depth definition, a Data Lake stores disparate information while ignoring almost everything about it. Unlike a data warehouse or a data mart (a small slice of a data warehouse from which users extract their data), the lake pays no attention to how or when its data will be used, governed, defined or secured.
Data Lakes store data as it arrives, without imposing any structure or organization on it. The data is not specialized, meaning that it can be manipulated in a variety of ways. In many cases, Big Data works better this way. In the past, data warehouses were sufficient storage for enterprise data because they were better organized, and that’s still true. However, it becomes difficult for data scientists to uncover insights when data is pre-organized. Sure, it may take longer to get from point A to point B, but what the Data Lake has going for it is that all of the information stored within it is available at any given time, in its native format. In a competitive world where every scrap of data matters, the Data Lake can be intriguing. Further, with a Data Lake, the analyst decides which kinds of analysis to run on the data, rather than a predefined structure deciding which analyses are possible.
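The idea of keeping data in its native format and deciding on the analysis later is often called schema-on-read. A minimal sketch in Python might look like the following. Everything in it is hypothetical and for illustration only: the `RAW_LAKE` list, the `read_with_schema` and `spend_per_user` helpers, and the sample records are assumptions, not part of any real product.

```python
import csv
import io
import json

# Hypothetical "lake": raw payloads kept exactly as they arrived,
# each paired with the name of its native format.
RAW_LAKE = [
    ("json", '{"user": "ana", "amount": "12.50", "note": "first order"}'),
    ("csv", "user,amount\nben,7.25\nana,3.00"),
]


def read_with_schema(lake, schema):
    """Parse raw payloads only at read time, keeping the fields in `schema`.

    `schema` maps a field name to a function that casts its raw value.
    """
    rows = []
    for fmt, payload in lake:
        if fmt == "json":
            records = [json.loads(payload)]
        elif fmt == "csv":
            records = list(csv.DictReader(io.StringIO(payload)))
        else:
            continue  # unknown formats simply stay in the lake, untouched
        for record in records:
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows


def spend_per_user(lake):
    """One possible analysis, chosen long after the data was stored."""
    totals = {}
    for row in read_with_schema(lake, {"user": str, "amount": float}):
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals
```

Calling `spend_per_user(RAW_LAKE)` applies a schema chosen by the analyst at read time. A different question could pass a different schema to `read_with_schema` without touching the stored payloads.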
Not everything is perfect at the shore, however. While a Data Lake does allow for more advanced searching of larger volumes of data, its contents carry no unique identifiers. Since there is no metadata, whoever extracts the data has to start from scratch to create each new analysis. It’s a lot more difficult to search through a pool of unfiltered data when nothing has a category or class designation. In short, gaining value from a Data Lake is difficult.
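To make the missing pieces concrete, here is a rough sketch of what an identifier and a category designation could look like if they were recorded when data is dumped in. This is an illustration under stated assumptions, not a description of how any particular Data Lake works: the `ingest` and `find_by_tag` helpers, the in-memory `lake` and `catalog` dictionaries, and the sample sources and tags are all hypothetical.

```python
import hashlib


def ingest(lake, catalog, payload, source, tags):
    """Store raw bytes unchanged and record a small metadata entry for them."""
    # Assumption: a SHA-256 hash of the content serves as the unique identifier.
    object_id = hashlib.sha256(payload).hexdigest()
    lake[object_id] = payload
    catalog[object_id] = {"source": source, "tags": set(tags), "size": len(payload)}
    return object_id


def find_by_tag(catalog, tag):
    """Return identifiers of objects carrying `tag`, without reading any payload."""
    return sorted(oid for oid, meta in catalog.items() if tag in meta["tags"])


lake, catalog = {}, {}
first_id = ingest(lake, catalog, b'{"temp": 21.5}', source="sensor-7", tags=["iot", "json"])
second_id = ingest(lake, catalog, b"id,total\n1,9.99", source="billing", tags=["csv"])
```

With entries like these, `find_by_tag(catalog, "iot")` answers from the catalog alone. Without them, the only option is to open and inspect every raw payload in the lake.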
Since the data is not defined, there is no oversight of what exactly is being dumped into a lake. Is the data useful? No one knows until it is analyzed. In a data warehouse, at least, data can be organized by quality. Here, it’s all mixed together. This also raises security concerns. If no one knows what kind of data resides in the lake, no one may find out that some of it is corrupt until it’s too late. These shortcomings are important to note, as organizations have started using this technology with no real push for security measures. Compromises in security need to be addressed.
Business Intelligence tools have a tough time sifting through all the mud at the bottom of the lake. BI solutions, for the most part, are engineered to analyze organized data. They simply don’t function at a high level when asked to take on completely unstructured information. Though data warehouses provide a lot less raw data, that data is drastically more defined. A skills gap was already one of the biggest problems in the Data Integration space. Using a Data Lake requires even more highly skilled integrators, who may not be available for quite some time.
In a recent post, Gartner warned against falling into the “Data Lake Fallacy.” Its viewpoint was clear: enterprises need to be careful about jumping right into Data Lakes and using them as their main integration source for analytics. Gartner argues that while there are benefits, the industry has yet to adapt, and applications within the enterprise environment are uncertain at this point. Andrew White, a VP at Gartner, writes: “The need for increased agility and accessibility for data analysis is the primary driver for Data Lakes. Nevertheless, while it is certainly true that Data Lakes can provide value to various parts of the organization, the proposition of enterprise-wide data management has yet to be realized.”
One has to think that Data Lakes will continue to grow in popularity as the Internet of Things boom looms and its connected devices begin to stream into the marketplace. For now, though, given the concerns outlined above, it seems safer to store data outside of the lake. It doesn’t look like Data Lakes will make warehouses obsolete any time soon, at least until someone finds a way to give them enough organization and security to make them worthwhile for all Big Data initiatives.
But if someone swoops in to organize and secure the Data Lakes, won’t that just make them Big Data warehouses?