Solutions Review’s Expert Insights Series is a collection of contributed articles written by industry experts in enterprise software categories. In this feature, Matillion co-founder and CTO Ed Thompson offers commentary on why modern data integration means zero-friction ETL.
Data integration remains the linchpin of any data analytics strategy. And the poster child for data integration has always been ETL: extract, transform, load. So when AWS CEO Adam Selipsky touted a Zero-ETL future during the 2022 re:Invent keynote, it piqued everyone's interest. However, Selipsky's speech only tells part of the story.
ETL essentially automates the extraction and loading of data from source to destination, often moving data from one or more database clusters to a cloud data warehouse. To be valuable, it must also encompass the complex transformation logic that gets that data into a useful state for analysis. Selipsky's case for Zero-ETL centers on enabling organizations that rely exclusively on the Amazon ecosystem for their data management and analytics needs to give their data teams more seamless access to data, with much less manual work or management, ultimately making near-real-time analytics easier to perform. However, even in an ideal world where data from any system in existence could be easily integrated, this still misses the point of what constitutes an ETL or ELT pipeline.
In concept, 'Zero-ETL' sounds great. Who wouldn't want all of their data instantly ready for analytics? But the concept does not match the reality of cloud data architectures. The fact is that all of the world's data does not solely live in Amazon – so this argument for Zero-ETL really only applies within the Amazon bubble, not to those who house their data elsewhere. Zero-ETL may sound like a dream, but today's mix of centralized, hybrid cloud, and multi-cloud data sources keeps it well out of reach.
Rather than focusing on eliminating ETL altogether, organizations should instead focus on Zero-friction ETL. The future of data management and analytics should be more focused on accessibility to data and data processes that everyone uses and understands, which in turn helps to drive an organization’s data productivity. Thankfully, there are a few tactics organizations are implementing to make ETL more manageable.
Simplifying the ETL Process
ETL is a three-phase process: extraction of data from one or more sources; transformation of that data so it is clean, consistent, and sanitized; and loading of that data into a destination where it can be analyzed and bring value to an organization. Zero-ETL, as currently constituted, glosses over a number of these critical processes. While it's not feasible to get rid of ETL entirely, there are ways to make each step less laborious throughout the full data movement process.
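The three phases described above can be sketched in a few lines. This is a minimal, illustrative example only – the source records, field names, and SQLite destination are all hypothetical stand-ins (a real pipeline would read from a live source and load a cloud warehouse):

```python
import sqlite3

# Hypothetical raw records as they might arrive from a source system:
# inconsistent casing, stray whitespace, amounts as strings, gaps.
raw_rows = [
    {"id": "1", "customer": "  Acme Corp ", "amount": "100.50"},
    {"id": "2", "customer": "globex", "amount": "89.99"},
    {"id": "3", "customer": "Initech", "amount": None},  # fails validation
]

def extract():
    """Extract: pull raw records from the source (stubbed as a list here)."""
    return raw_rows

def transform(rows):
    """Transform: clean, sanitize, and type the data for analysis."""
    clean = []
    for row in rows:
        if row["amount"] is None:  # drop records that fail validation
            continue
        clean.append({
            "id": int(row["id"]),
            "customer": row["customer"].strip().title(),
            "amount": float(row["amount"]),
        })
    return clean

def load(rows, conn):
    """Load: write cleaned records into the analytics destination."""
    conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (:id, :customer, :amount)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Even in this toy pipeline, the transform step – validation, normalization, typing – is where most of the judgment lives, which is exactly the part a "zero" framing glosses over.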
There are many ways organizations can make each phase of the ETL process easier, and the tactics that follow are a great starting point. In tandem, organizations should also work to ensure that data productivity is emphasized throughout each phase. Doing so will empower data teams to load more data, transform it faster, and sync it with core systems throughout the organization, all while orchestrating the process to remain efficient in future ETL batches.
Don’t forget the T!
There is a definitional issue at the heart of what Adam Selipsky said as well. Everything he talks about is data movement with light, simple transformation, but data transformation is much more than that. Transformation is about encoding business rules, business processes, and logic to clean and combine disparate data silos into analytics-ready data sets. AWS seems to consider this part of the process trivial, ignoring years of data integration history.
Data integration has always been dominated by low-code tools, and there are many to choose from besides Matillion. Informatica, Talend, IBM DataStage, and the like have been big players in the market for years, and there is a reason for this. Transformation logic is hard to maintain, and expressing it as complex SQL only increases that burden and makes it harder to manage. Therefore, it's very valuable to have low-code tools that simplify the development and maintenance of transformations in data pipelines.
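To see why transformation is more than "light and simple" data movement, consider even a small hand-written SQL transform. The tables, keys, and the 10 percent gold-tier uplift rule below are invented for illustration, with SQLite standing in for a warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two hypothetical silos: orders from a transactional database and
# accounts from a CRM export, with mismatched customer keys.
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_code TEXT, amount REAL);
CREATE TABLE crm_accounts (account_code TEXT, region TEXT, tier TEXT);
INSERT INTO orders VALUES (1, 'ACME', 250.0), (2, 'ACME', 120.0), (3, 'GLOBEX', 75.0);
INSERT INTO crm_accounts VALUES ('acme', 'EMEA', 'gold'), ('globex', 'AMER', 'silver');
""")

# Even this small transform needs key normalization (UPPER), a join
# across silos, an aggregate, and an encoded business rule (gold-tier
# accounts get a 10% revenue uplift for forecasting) -- the kind of
# logic that grows hard to maintain as hand-written SQL.
analytic_ready = conn.execute("""
    SELECT c.region,
           SUM(CASE WHEN c.tier = 'gold' THEN o.amount * 1.10
                    ELSE o.amount END) AS forecast_revenue
    FROM orders o
    JOIN crm_accounts c ON o.customer_code = UPPER(c.account_code)
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
```

Multiply this by hundreds of sources and rules, and the case for tooling that manages transformation logic, rather than a pile of SQL scripts, makes itself.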
Make ETL “Everyone-Ready”
Until recently, there hasn't been a single approach that lets data teams with varying technical skills collaborate on one project, move data faster, and reduce the toil associated with processes like ETL. However, new innovations and disruptive technologies are opening these processes to a larger number of people with different technical skill sets.
ETL is made more accessible by adopting low-code/no-code approaches that meet users where they are, no matter their skill level, enabling access to heterogeneous business data on their terms. This saves data teams the time spent translating what extracted and transformed data means for other teams, and lets those teams run certain queries on their own. It also reduces manual cycles elsewhere, as people can find and make sense of the data they need. In the end, making ETL processes "everyone-ready" and empowering everyone to be more productive with data helps cut down on the toil associated with ETL.
Prepare for Multiple Data Sources
As data sources and data volumes continue to grow, the reality is that organizations will require multiple data sources rather than having everything locked in one place. As this occurs, Zero-friction ETL becomes even more critical for organizations to access, move and make sense of their data to drive business value.
In practice, the data environment for most organizations is not exclusive to a single vendor or cloud. Most organizations have their source data in multiple databases or applications from multiple vendors – whether it be an Oracle database for customer orders, Salesforce for CRM, or Workday for HR. Many also rely on a multi-cloud, and often multi-data-platform, strategy for storing their data and for analytics. To deliver its full value, Zero-ETL would require organizations to commit to a single ecosystem, resulting in vendor lock-in and limiting the power of their analytics.
We all want to end the ETL toil. But there will always be a need to move data, integrate it, crucially transform it, and orchestrate data pipelines across platforms. Critical business data comes from dozens, if not hundreds, of sources – and it's essential to get all this data into your analytics infrastructure and, once there, into business- and analytics-ready formats as quickly as possible. Only then does the really interesting and valuable work start of transforming data into something that is useful to analyze. By implementing the tactics outlined above into their data strategy, organizations can make ETL processes as easy as sending an email and accelerate the benefits data transformation provides to their business.
- Modern Data Integration Means Zero Friction ETL - March 28, 2023