Missing the T? Data Storage ETL an Oversight, Says KNIME CEO

By Michael Berthold
Best Practices,

ETL Data Storage

Solutions Review’s Expert Insights Series is a collection of contributed articles written by industry experts in enterprise software categories. In this feature, KNIME CEO Michael Berthold offers commentary on why data storage ETL is an oversight in the enterprise right now.

Making sense of all of an organization’s data (and ideally enriching it with lots of data “out there”) is easy to say and still surprisingly hard to do.

Rather than seeing holistically, how many people interact with content on our website, on our forum, or on social media, wouldn’t it be nice to see activity grouped by the organization? We’d see not just an individual’s view of the most recent blog but also her colleagues’ comments on LinkedIn the following day. It would be even better if we could see the connection between the two–enabling us to distinguish between high engagement on a single team or interest from a new department. Wouldn’t it be great if an account manager tasked with growing a given account could spot patterns between support calls, social media comments, and online-store visits–even if some of that data came from a recently acquired company.

The biggest problem to allow for this continued “making sense of (all of our) data” is the nasty combination of ever-changing requirements or questions seeking an answer with ever-changing data sources that need continuous cleaning, transforming, and integrating. Without first organizing and adding structure to all those data sources, it’s impossible to derive interesting insights. The prominent claim that “data is the new oil” is surprisingly apt. Like oil, data in its raw form is initially useless– only once you refine it is it valuable and useful.

But how do I get to this state of well-organized data?

The solution for this used to be to build a data warehouse i.e. define the one, and only proper structure once and for all and then live with it. When that turned out to be infeasible since data and data sources are ever-changing, data lakes became popular – until they also turned out to be, well, rather messy. Then things moved to the cloud, but that didn’t really solve the problem of reaching and maintaining a state of well-organized data. Instead of solving it via smart (or not so smart) storage setups, meta query or federated setups promise another answer. Still, they, too, only solve a part of the puzzle.

Keeping your data accessible goes beyond just figuring out how to store the data. Teams also need a way for transformation (the “T” in ETL) to happen as needed without compromising resources or time. In this piece, we argue that low-code offers exactly that flexibility–giving anyone access to just the insights they need, as they need them.

But first, let’s revisit what’s been tried so far.

ETL Data Storage

Data Warehouse, the Dream

Data Warehouses have been the holy grail for ages but are rarely spotted in real life. The truth is that they are easy to imagine, hard to design, and even harder actually to put to work.

Let’s say we came up with the one true relational model to structure all the data floating around in an organization. In an automotive plant, for instance, perhaps your database holds manufacturing data (e.g., cycle times, lot priorities), product data (e.g., demands and yields), process data (e.g., control limits and process flows), and equipment data (e.g., status, run time, downtime, etc). If you can make sure all this data is properly cleaned, transformed, and uploaded–which is a big If–then theoretically, you’d see immediate benefits because the architects of the data warehouse made it easy for you to ask specific questions of your data. Perhaps you’d be able to reduce costs related to equipment failures. Or better optimize inventory because you become familiar with the patterns of demand versus yields. Or improve end-of-line testing for higher product quality.

But what happens when we want to add new data from a new machine? Well, we rework the relational model–something that is expensive, difficult, and often politically challenging. And what happens when we want to evaluate our CO2 footprint, so we need to connect data from suppliers and data from logistics? We, again, rework the relational model.

Even if people are successfully using our data warehouse to create new insights, new requirements will pop up that we did not think about when we first designed the structure of our warehouse. So rather than freezing that structure once and for all, this will quickly turn into a never-ending construction site, which will never have a coherent, consistent structure that includes all current data of interest. This will, at the very least, delay finding the answers to new questions but more likely make it simply impossible. Not at all the agile, self-service data warehouse we had in mind when we started this project years ago.

Data Lakes, the Mess

After data warehouses, the industry came up with the idea of a “data lake” – don’t worry about structure (not even in the data itself), just collect it all and figure out later how to organize it when you actually need it. That was made possible by increasingly cheap storage facilities and NoSQL storage setups. Distributed mechanisms to process this data were also developed, MapReduce being one of the most prominent examples back then.

Our manufacturing, product, process, and equipment data is never cleaned or transformed but dumped, as-is, into one centralized storage facility. When analysts want to make sense of this data, they rely on data engineers to custom-build solutions that include cleaning and transforming for each bespoke question. Although we don’t need to rebuild an entire relational model, data engineers do need to be involved in answering each and every business question. Also, an old problem resurfaced: lots of data keeps sitting across the organization in various formats and storage facilities, and even newer data continues to be generated outside of that swamp.

Data Lakes force us, just like data warehouses, to ensure all data sits within that one house or lake; we just don’t need to worry about structure before moving it there. And that’s precisely the issue – the organizing of data into the proper structure still needs to be done; it just gets done later in the process. Instead of structuring the warehouse upfront, we now need to deal with the mechanisms to add structure to the data lake at the time when we look for insights in our data. And we need the help of data engineers to do that.

The next generation of this type of setup moved from on-premise distributed storage clusters to the cloud. The rather limiting map-reduce framework gave room to more flexible processing and analysis frameworks, such as Spark. Still, the two main problems remained: Do we really need to move all our data into one cloud to be able to generate meaningful insights from all of our data? And how do we change the structure after it’s been living in our data lake? This may work for a new company that starts off with a pure cloud-based strategy and places all of its data into one cloud vendor’s hands. Still, in real life, data has existed before, outside of that cloud, and nobody really wants to lock themselves in with one cloud storage provider forever.

Leave the Mess Where it is

One big problem of all the approaches described so far is the need to put it all into one repository – may that be the perfectly architected warehouse, my inhouse data lake, or the swamp in the cloud.

Federated approaches try to address this by leaving the data where it is and putting a meta layer on top of everything. That makes everything look like it all sits in one location but under the hood it builds meta queries ad hoc, which pull the data from different locations and combine them as requested. These approaches obviously have performance bottlenecks (Amdahl’s law tells us that the final performance will always depend on the slowest data source needed) but at least they don’t require a “once and for all” upload to one central repository. However, querying data properly is much more than just building distributed database queries. Structuring our distributed data repositories properly for every new query requires expert knowledge for all but basic operations.

Missing the ‘T’

The central problem of all these approaches is the need to define the overall structure, e.g. how all those data storage fragments fit together. Beforehand in case of data warehouses, at analysis time for data lakes, through automatic query building for federated approaches.

But the reality is different. In order to truly aggregate and integrate data from disparate sources we need to understand what the data means so we can apply the right transformations at the right time to arrive at a meaningful structure in reasonable time. For some isolated aspects of this, automated (or even learning) tools exist, for instance for entity matching in customer databases. But for the majority of these tasks, expert knowledge will always be needed.

Ultimately, the issue is that the global decision of how we store our data is based on a snapshot of reality. Reality changes fast, and our global decision is doomed to be outdated quickly. The process of extracting insights out of all available data is bottlenecked by this one be-all-end-all structure.

This is why the important part of “ETL”, the Transformation is either assumed to have been figured out once and for all (in data warehouses), completely neglected (in data lakes), or pushed to a later stage (in federated approaches). But pushing the ‘T’ to the end has, despite making it someone else’s problem, a performance impact as well. If we load and integrate our data without proper transformations we will often create extremely large and inefficient results. Even just ensuring database joins are done in the right order can change performance by several orders of magnitude. Imagine doing this with untransformed data, where customer or machine IDs don’t match, names are spelled differently, and properties are inconsistently labeled. It’s impossible to get all of this right without timely domain expertise.

Putting the ‘T’ where it belongs

Transformation needs to be done where it matters and by the person who understands it.

Low Code allows everybody to do it on the fly, SQL or other experts inject their expertise (code) where it’s needed. And if a specific type of load, aggregate, transform process should be used by others, it’s easy to package it up and make it reusable (and also auditable if needed because it’s documented in one environment – the low code workflow). Low-code serves as a lingua franca that can be used across disciplines. Data engineers, analysts, and even line-of-business users can use the same framework to transform data at any point of the ETL process.

Should that low code environment be an integral part of (one of) our data storage technologies? Well, no–unless we plan to stick with that data storage environment forever. Much more likely we’ll want to keep the door open to add another type of data storage technologies in the future or maybe even switch from one cloud provider to another one (or go a completely hybrid path and use different clouds together). In that case a low code environment, which after all, is home to lots of our experts’ domain expertise by now, should make it easy to switch those transformation processes over to our new data environment.

Final Thoughts

Why did warehouses fail and data lakes don’t provide the answer either? Just like with software engineering, the waterfall system doesn’t work for dynamic setups with changing environments and requirements. It needs to be agile, explorative when needed, and documentable/governable when moved into production. But since data transformation will always require expertise from our domain experts, we require a setup that allows us to add this expertise continuously to the mix as well.

In the end, we need to provide the people who are supposed to use the data with intuitive ways to create the data aggregations and transformations themselves – from whatever data sources, however they want. And at the same time we want to keep the doors open for new technologies that will arise, new tools that we want to try out, and new data sources and types that will show up.

This article was written by Michael Berthold on March 30, 2023

Michael Berthold

Michael Berthold is co-founder of KNIME, the open analytics platform. He recently left his chair at Konstanz University and is now CEO of KNIME AG. Before that, he held positions in both academia (Carnegie Mellon, UC Berkeley) and industry (Intel, Tripos). He has co-authored several books (the second edition of the Guide to Intelligent Data Science appeared recently), is an IEEE Fellow, and a former president of the IEEE-SMC society.

Best Practices

Missing the T? Data Storage ETL an Oversight, Says KNIME CEO