A Two-Part Solution to the Data Integration Challenge of ETL
As the costs of computation, storage, and internet bandwidth have plummeted, data has increasingly become the lifeblood of organizations striving to optimize their operations. Many essential services that organizations rely on are now hosted in the cloud. These services generate huge volumes of valuable data concerning customer relationship management, enterprise resource planning, event tracking, payment processing, and more. As the Internet of Things grows, each of a multitude of sensors, devices, and vehicles will soon stream terabytes of data per day.
To be useful, all of this data must be centralized in a data warehouse. How would an organization do so? Traditionally, extract, transform, and load (ETL) has been the standard for centralizing business data. Engineers build custom software to extract data from API endpoints, then write a series of orchestrations to transform data and prepare it for replication to a data warehouse. This process made sense at a time when computation, storage, and bandwidth were all scarce, and it is specifically designed to conserve those resources.
Building the technology to extract and transform data is extremely laborious. A typical data connector can take more than two person-months of work to build and maintain over the course of a year. The problem only grows as more connectors to more data sources are added, to say nothing of the fact that some API endpoints are poorly documented or extremely abstruse.
A number of data integration and management tools exist to simplify this process, but traditional ETL pipelines remain extremely complex, code-intensive, and inaccessible to teams with limited engineering resources. Moreover, the architecture of an ETL pipeline is inherently brittle and risky. Upstream schema changes and changing analytics needs can lead to extensive revisions of the code base and prolonged downtime, leaving organizations flying blind.
The difficulties posed by brittle ETL pipelines not only interfere with day-to-day decision-making but also hamper an organization’s ability to pursue higher-value, time-sensitive uses of data, namely predictive modeling and machine learning.
The solution to these challenges involves two parts. The first part is to leverage the plummeting cost of computation, storage, and internet bandwidth by moving all data infrastructure to the cloud and using extract, load, and transform (ELT) instead of ETL. By delaying transformations, ELT allows faithful, continuous replication of all data from the source straight to the data warehouse. Analysts and engineers can then perform transformations with full confidence that they won’t destroy or obscure any of the original source data.
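The load-then-transform ordering described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: an in-memory SQLite database stands in for the cloud data warehouse, and the source records, the table name `raw_payments`, and the view name `revenue_by_customer` are all hypothetical.

```python
import sqlite3

# Hypothetical source records, standing in for the response of a payments API.
SOURCE_ROWS = [
    {"id": 1, "customer": "acme", "amount_cents": 1250, "status": "paid"},
    {"id": 2, "customer": "acme", "amount_cents": 800, "status": "refunded"},
    {"id": 3, "customer": "globex", "amount_cents": 4300, "status": "paid"},
]


def load_raw(conn, rows):
    """Extract and load: copy every source record into the warehouse unchanged."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_payments ("
        "id INTEGER PRIMARY KEY, customer TEXT, amount_cents INTEGER, status TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO raw_payments "
        "VALUES (:id, :customer, :amount_cents, :status)",
        rows,
    )
    conn.commit()


def create_revenue_view(conn):
    """Transform: define the analytics model as a view over the raw table.

    The raw rows are never modified, so the view can be dropped and
    redefined whenever analytics needs change.
    """
    conn.execute("DROP VIEW IF EXISTS revenue_by_customer")
    conn.execute(
        "CREATE VIEW revenue_by_customer AS "
        "SELECT customer, SUM(amount_cents) AS revenue_cents "
        "FROM raw_payments WHERE status = 'paid' GROUP BY customer"
    )


def run_demo():
    """Run load, then transform, and return the derived rows plus the raw row count."""
    conn = sqlite3.connect(":memory:")  # assumption: SQLite as a stand-in warehouse
    load_raw(conn, SOURCE_ROWS)
    create_revenue_view(conn)
    revenue = conn.execute(
        "SELECT customer, revenue_cents FROM revenue_by_customer ORDER BY customer"
    ).fetchall()
    raw_count = conn.execute("SELECT COUNT(*) FROM raw_payments").fetchone()[0]
    conn.close()
    return revenue, raw_count


if __name__ == "__main__":
    print(run_demo())
```

Because the refunded payment is filtered out only in the view, it remains available in `raw_payments` if a later analysis needs it. In an ETL pipeline that filtered before loading, that record would never have reached the warehouse.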
The second part of the solution is to leverage the division of labor by using an off-the-shelf solution. Since the transformations do not take place until after the data is warehoused, every organization using ELT with the same data source faces exactly the same problem, featuring the same schemas. This allows third parties to design and build standardized solutions, and organizations to use these solutions in lieu of building pipelines by hand.
The range of cloud-based data sources will only continue to grow in quantity and variety in the foreseeable future. As big data becomes ever bigger, the traditional ETL approach will run up more and more against the limitations of its design philosophy. The new way of ELT and modular, third-party data connectors does away with these limitations, multiplying data engineering efforts and enabling timely, data-driven decision-making.
By Charles Wang
Charles Wang is the Product Evangelist at Fivetran. A self-described “thinker and nerd,” Wang aims to educate Fivetran customers about the technical benefits of the solution. He attended the University of Chicago and earned a bachelor’s degree in Economics and Political Science. Connect with Charles on LinkedIn.