A Two-Part Solution to the Data Integration Challenge of ETL

By Tim King, Executive Editor at Solutions Review

As the costs of computation, storage, and internet bandwidth have plummeted, data has increasingly become the lifeblood of organizations striving to optimize their operations. Many essential services that organizations rely on are now hosted in the cloud. These services generate huge volumes of valuable data concerning customer relationship management, enterprise resource planning, event tracking, payment processing, and more. As the Internet of Things grows, a multitude of sensors, devices, and vehicles will each soon stream terabytes of data per day.

To be made useful, all of the data must be centralized into a data warehouse. How would an organization do so? Traditionally, extract, transform, and load (ETL) has been the standard for centralizing business data. Engineers build custom software to extract data from API endpoints, then write a series of orchestrations to transform data and prepare it for replication to a data warehouse. This process was specifically designed to conserve computation, storage, and bandwidth, and it made sense at a time when all three were scarce.

Building the technology to extract and transform data is extremely laborious. A typical data connector can take more than the equivalent of two person-months to build and maintain over the course of a year. The problem only grows as connectors to more data sources are added, to say nothing of the fact that some API endpoints are poorly documented or extremely abstruse.

A number of data integration and management tools exist to simplify this process, but traditional ETL pipelines remain extremely complex, code-intensive, and inaccessible to teams with limited engineering resources. Moreover, the architecture of an ETL pipeline is inherently brittle and risky. Upstream schema changes and changing analytics needs can lead to extensive revisions of the code base and prolonged downtime, leaving organizations flying blind.
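To make the brittleness concrete, here is a minimal sketch in Python of a transform step that runs before loading. Everything in it is hypothetical and for illustration only: the field names (customer_name, client_name, amount_cents), the sample batches, and the in-memory list standing in for a warehouse. The point is simply that a transform hard-coded to the source schema fails as soon as the source renames a field, so nothing reaches the warehouse until the code is revised.

```python
# Hypothetical batches: the second one reflects an upstream rename
# of "customer_name" to "client_name".
ORIGINAL_BATCH = [{"customer_name": " acme ", "amount_cents": 1250}]
RENAMED_BATCH = [{"client_name": " acme ", "amount_cents": 1250}]


def transform(record):
    """Transform step that assumes fixed source field names."""
    return {
        "customer": record["customer_name"].strip().title(),
        "amount_usd": record["amount_cents"] / 100,
    }


def run_etl(records):
    """Extract -> transform -> load, with the transform gating the load."""
    warehouse = []  # stand-in for a real data warehouse table
    for record in records:
        # Any exception raised here stops the load for the whole batch.
        warehouse.append(transform(record))
    return warehouse


if __name__ == "__main__":
    print(run_etl(ORIGINAL_BATCH))
    try:
        run_etl(RENAMED_BATCH)
    except KeyError as missing:
        print(f"pipeline halted, missing field: {missing}")
```

In this sketch the rename surfaces as a KeyError inside the transform, which is why a small upstream change can turn into code revisions and downtime.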

The difficulties posed by brittle ETL pipelines not only interfere with day-to-day decision-making but also hamper an organization’s ability to pursue higher-value, time-sensitive uses of data, namely predictive modeling and machine learning.

The solution to these challenges involves two parts. The first part is to leverage the plummeting cost of computation, storage, and internet bandwidth by moving all data infrastructure to the cloud and using extract, load, and transform (ELT) instead of ETL. By delaying transformations, ELT allows faithful, continuous replication of all data from the source straight to the data warehouse. Analysts and engineers can then perform transformations with full confidence that they won’t destroy or obscure any of the original source data.
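The ordering described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production pipeline: an in-memory SQLite database stands in for a cloud data warehouse, and the table name raw_payments, the view name paid_revenue, and the sample rows are all hypothetical. The data is loaded exactly as received, and the transformation is defined afterward as a view, so the raw table is never modified.

```python
import sqlite3

# Hypothetical rows as they might arrive from a payment-processing source.
SOURCE_ROWS = [
    (1, "Acme", 1250, "paid"),
    (2, "Acme", 800, "refunded"),
    (3, "Globex", 4300, "paid"),
]


def load_raw(conn, rows):
    """Extract and load: copy source rows into the warehouse unchanged."""
    conn.execute(
        "CREATE TABLE raw_payments ("
        "id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT)"
    )
    conn.executemany("INSERT INTO raw_payments VALUES (?, ?, ?, ?)", rows)


def create_revenue_view(conn):
    """Transform: define a derived view on top of the raw table."""
    conn.execute(
        """
        CREATE VIEW paid_revenue AS
        SELECT customer, SUM(amount_cents) AS total_cents
        FROM raw_payments
        WHERE status = 'paid'
        GROUP BY customer
        """
    )


def run_elt():
    """Run load first, then transform, and return the open connection."""
    conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse
    load_raw(conn, SOURCE_ROWS)
    create_revenue_view(conn)
    return conn


if __name__ == "__main__":
    conn = run_elt()
    print(conn.execute("SELECT * FROM paid_revenue ORDER BY customer").fetchall())
```

Because the view only reads from raw_payments, an analyst can drop or redefine it when analytics needs change without re-extracting anything from the source.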

The second part of the solution is to leverage the division of labor by using an off-the-shelf solution. Since the transformations do not take place until after the data is warehoused, every organization using ELT with the same data source faces exactly the same problem, featuring the same schemas. This allows third parties to design and build standardized solutions, and it allows organizations to use those solutions in lieu of building pipelines by hand.

The range of cloud-based data sources will only continue to grow in quantity and variety in the foreseeable future. As big data becomes ever bigger, the traditional ETL approach will increasingly run up against the limitations of its design philosophy. The new way of ELT and modular, third-party data connectors does away with these limitations, multiplying the impact of data engineering efforts and enabling timely, data-driven decision making.

By Charles Wang

Charles Wang is the Product Evangelist at Fivetran. A self-described “thinker and nerd”, Wang aims to educate Fivetran customers about the technical benefits of the solution. He attended the University of Chicago and earned a Bachelor’s degree in Economics and Political Science. Connect with Charles on LinkedIn.

This article was written by Tim King on June 18, 2019

Tim King

Executive Editor

Tim is Solutions Review's Executive Editor and leads coverage on data management and analytics. A 2017 and 2018 Most Influential Business Journalist and 2021 "Who's Who" in Data Management, Tim is a recognized industry thought leader and changemaker. Story? Reach him via email at tking@solutionsreview dot com.
