Companies vary in their approach to Data Management. Some enterprises collect only a few types of data, so the traditional data warehouse approach works quite well for them. For others, the expanding range of sources from which they retain data is forcing a change in viewpoint, and they’ve moved to collecting all of their data in a Data Lake. Having covered the technological differences between these two approaches, we now need an extended look at how the composition of the Data Lake changes the way it functions.
The benefits of the Data Lake approach are numerous, and as data volumes continue to expand, companies are increasingly recognizing the need for a more agile way to manage unstructured enterprise data. Enter the Data Lake, a technology usually associated with the Hadoop platform, which has taken the enterprise world by storm, with many of the world’s top companies investing in it. Data Lakes typically impose few, if any, controls on what is collected, meaning data of any size or scope can be stored. The companies that find this approach intriguing are the ones collecting data from non-conventional sources, including social media outlets and IoT product sensors, to name only two.
In this way, the Data Lake enables the storage of data in its raw form, with virtually no limit on how much data can sit in the waters at any one time. The data can be historical or streamed in real time. This allows organizations to use data as it is funneled into the Data Lake or long after. Since data is collected straight from the source systems, companies don’t have to set resources aside to regulate it. The Data Lake lets digital businesses go beyond the capabilities of data warehouses, but it does not come without added responsibility, and herein lies the crux of the issue.
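To make the "raw form, regulate later" idea concrete, here is a minimal sketch of the schema-on-read pattern a Data Lake relies on. Everything here is hypothetical (a local directory stands in for a lake zone such as HDFS or S3, and the field names are invented): records land exactly as received, with no validation, and a schema is only imposed when someone reads the data back.

```python
import json
import datetime
import pathlib

# Hypothetical local directory standing in for a Data Lake "raw" zone
# (in practice this would be HDFS, S3, or similar object storage).
LAKE_ROOT = pathlib.Path("lake/raw/sensor_events")

def ingest_raw(record: dict) -> pathlib.Path:
    """Land a record in the lake exactly as received -- no schema, no validation."""
    day = datetime.date.today().isoformat()
    target_dir = LAKE_ROOT / day
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / f"event_{len(list(target_dir.iterdir()))}.json"
    path.write_text(json.dumps(record))
    return path

def read_with_schema(path: pathlib.Path) -> dict:
    """Schema-on-read: interpret the raw record only at query time."""
    raw = json.loads(path.read_text())
    return {"device": raw.get("device_id"), "temp_c": raw.get("temperature")}

# Unmodeled fields are kept at ingest and simply ignored at read time.
p = ingest_raw({"device_id": "sensor-7", "temperature": 21.5, "extra": "unmodeled"})
print(read_with_schema(p))  # {'device': 'sensor-7', 'temp_c': 21.5}
```

The appeal is that ingestion costs nothing up front; the hidden cost, discussed below, is that nothing stops unusable or hazardous records from landing alongside the good ones.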
The Data Lake is a breath of fresh air to many, especially those within an organization who regularly work with data and need the ability to run analysis on demand, without waiting for IT. However, given the free-spirited nature of what can be stored in the lake’s architecture, lakes can quickly turn into data dumps. This is a consequence of deregulating what can be stored, which creates headaches for anyone who needs to query through potentially billions of records to find a single piece of information. Data Lakes were, by design, built to avoid heavy Data Governance, and that is a very big problem for those who take their data seriously.
This unregulated “data dump” can also become a breeding ground for bad data, and not just data that doesn’t live up to an organization’s high standards. Think about it: if there’s no governance structure in place to vet what comes in, how can a business be sure that what is dumped into its lake won’t potentially hurt it, let alone provide any lasting business value? This is where security plays a bigger and bigger role as data volumes become, in many cases, too much for companies to handle. Enterprises are trying to ensure they can answer the questions worth asking, but because of the structure of a Data Lake, they often wind up breaking compliance and regulatory rules instead.
This is where many organizations run into trouble. They were pitched some admittedly great ideas by a vendor marketing team about how they would be able to take full advantage of the Hadoop platform via an awesome new tool, only to find out that its connectors were simply dragging in data that has no use for them now or in the future, much of it potentially hazardous to the health of their data infrastructure. On the same front, what makes the Data Lake unique is that it can hold an undefined amount of unstructured data for an indefinite amount of time with very little oversight. Data architects and scientists are left in a conundrum as a result.
Forward-thinking solution providers are undoubtedly aware of this murky situation, and what could differentiate one vendor from another in the not-so-distant future is which ones can collect and store overwhelming amounts of unstructured data while providing just enough governance and security to ward off the trash that turns a Data Lake into a data dump. It’s a thin line to walk, however, since too much regulation could wound a technology that many data enthusiasts see as the future of Data Management. Data Lakes don’t yet allow data professionals accustomed to the highly structured data warehouse to have their cake and eat it too.
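What "just enough governance" might look like can be sketched as a lightweight gate at ingest time. This is only an illustration under invented assumptions (the required and blocked field names, and the two zone labels, are all hypothetical): records missing basic provenance, or carrying obviously sensitive fields, are routed to a quarantine zone instead of the lake, while everything else flows through untouched.

```python
# Hypothetical minimal governance gate for lake ingestion. The field names
# below are invented for illustration: records must carry provenance
# (a source tag and an id) and must not contain obviously sensitive fields
# (a stand-in for real compliance checks).
REQUIRED_FIELDS = {"source", "record_id"}
BLOCKED_FIELDS = {"ssn", "credit_card"}

def gate(record: dict) -> tuple[str, dict]:
    """Route a record to the 'lake' zone or a 'quarantine' zone with a reason."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        return "quarantine", {**record, "_reason": f"missing {sorted(missing)}"}
    leaked = BLOCKED_FIELDS & record.keys()
    if leaked:
        return "quarantine", {**record, "_reason": f"blocked fields {sorted(leaked)}"}
    return "lake", record

zone, rec = gate({"source": "iot-gateway", "record_id": "42", "temp": 19.0})
print(zone)  # lake
zone, rec = gate({"temp": 19.0})
print(zone)  # quarantine
```

The design point is that the gate rejects nothing based on shape or schema, so the lake keeps its flexibility; it only enforces the minimum provenance and compliance checks that keep the lake from becoming a dump.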