If you work with data on a daily basis, you’re probably familiar with the process: data is captured from any number of sources and stored in a data warehouse until it is ready for extraction and transfer into an analytics tool. Traditionally, the integration and analytics tools were separate products altogether. More and more BI providers are beginning to include integration capabilities inside their analytics platforms, and many of the top enterprise integration vendors are moving away from the relational ETL method. Vendors that started out offering Data Integration solutions are now branching out, with many moving toward data management platforms built on Apache Hadoop and Spark.
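The extract-transform-load pattern described above can be sketched in a few lines. This is a minimal illustration only: the inline CSV source, the `sales` table, and the in-memory SQLite "warehouse" are all assumptions made for the example, not any vendor's actual pipeline.

```python
import csv
import io
import sqlite3

# Extract: read raw records from a source (an inline CSV stands in
# for a database, API, or file feed).
raw = io.StringIO("id,amount,region\n1, 19.99 ,EU\n2, 5.00 ,us\n")
rows = list(csv.DictReader(raw))

# Transform: coerce types and normalize values before loading.
cleaned = [(int(r["id"]), float(r["amount"].strip()), r["region"].upper())
           for r in rows]

# Load: write the cleaned rows into a relational "warehouse" table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (id INTEGER, amount REAL, region TEXT)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
total = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

The key point is that the schema (`id`, `amount`, `region`) must be defined, and the data made to conform to it, before anything lands in the warehouse — which is exactly the constraint that breaks down as sources multiply.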
With data volumes exploding, unstructured data on the rise, and analytics vendors siphoning off features that were once exclusive to integration providers, one has to ask: is Data Integration dead?
The truth is, Data Integration has never been more important. However, the way organizations traditionally integrated data, via the “relational” ETL method, is beginning to wither away. Legacy ETL tools cannot withstand the sheer volume, variety, or velocity of data that companies now generate, rendering them unusable. Companies are collecting more data than ever, and from an ever-expanding range of sources: IoT device sensors, mobile applications, machine data, social media, and more. With the size, scope, and diversity of data growing at an unprecedented rate, the traditional data warehouse/ETL system no longer holds up.
Business users no longer speak directly to their data warehouse. Services now do that on our behalf, moving, mixing, and matching those data sources with Big Data systems. This has created a need for Data Lakes, which, according to some experts, will largely replace data warehouses in the future. Thus, existing systems have to evolve further to automate data access for the end-user. The complexity of data sources cannot be discounted, and now that companies are moving vast amounts of data to the cloud, these pipelines need to be rebuilt. For Data Integration to work for enterprise organizations today, it has to be able to collect all of the data, not just the stores held within a specific source.
In a world where a growing majority of collected data is unstructured, it is unrealistic to expect a relational data warehouse to integrate it in a way that drives valuable business insight. These legacy systems simply cannot keep up with the volume of data that needs to be integrated into other environments. Data Lakes address this problem. By being able to store any type of data at virtually any scale, newer data management tools such as Hadoop and Spark can fill the middleware role that legacy integration tools once did. Within this evolving paradigm, every piece of data should be collected.
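The “store everything first, impose structure later” idea behind a Data Lake can be sketched as follows. The record shapes, field names, and the list standing in for lake storage are illustrative assumptions, not any particular product's format:

```python
import json

# A tiny stand-in "lake": heterogeneous records stored verbatim,
# with no schema enforced at write time.
lake = [
    json.dumps({"type": "sensor", "temp_c": 21.5}),
    json.dumps({"type": "clickstream", "page": "/home", "ms": 120}),
    json.dumps({"type": "sensor", "temp_c": 22.1, "humidity": 0.4}),
]

# Schema-on-read: structure is imposed only when a consumer queries,
# so new record shapes never break ingestion.
temps = [json.loads(line)["temp_c"]
         for line in lake
         if json.loads(line).get("type") == "sensor"]
avg_temp = sum(temps) / len(temps)  # mean of the sensor readings
```

Note the contrast with the warehouse approach: the clickstream record, which has entirely different fields, is accepted without any upfront modeling, and each consumer extracts only the structure it needs at read time.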
Traditional integration tools and techniques required stringent controls over data quality and governance in order to extract usable sets from the data warehouse. This is no longer the case: the Data Lake can house any data type without upfront modeling. This also enables a service-based approach to data management, which enterprises are now employing to feed critical applications and analytics platforms. For what it’s worth, Gartner folded ETL into its pure-play Data Integration Tools Magic Quadrant back in 2006. If your organization relies only on structured data and relational data warehouse techniques, then you’ve got nothing to worry about. But if your company collects, or plans to collect, data from many different sources, the ETL method of doing business has certainly gone the way of the dinosaurs.