Searching for ETL and data integration software can be a daunting (and expensive) process, one that requires long hours of research and deep pockets. The most popular enterprise data management tools often provide more than what’s necessary for non-enterprise organizations, with advanced functionality relevant to only the most technically savvy users. Thankfully, there are a number of free and open source ETL tools out there. Some of these solutions are offered by vendors looking to eventually sell you on their enterprise product, and others are maintained and operated by a community of developers looking to democratize the process.
In this article, we examine free and open source ETL tools, first with a brief overview of what to expect and then with short blurbs about each of the currently available options in the space. We aim to keep this directory as complete and up to date as possible.
Apache Airflow is a platform that allows you to programmatically author, schedule, and monitor workflows. The tool enables users to author workflows as directed acyclic graphs (DAGs). The Airflow scheduler executes tasks on an array of workers while following the specified dependencies. Airflow provides rich command line utilities that make performing complex surgeries on DAGs simple. The user interface also enables users to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
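To make the idea of running tasks "while following the specified dependencies" concrete, here is a minimal sketch that uses only Python's standard library. It is not the Airflow API; the task names and the `run_order` helper are hypothetical and exist only to illustrate how a DAG constrains execution order.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"extract"},
    "load": {"validate", "transform"},
}


def run_order(deps):
    """Return one execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

Because "validate" and "transform" do not depend on each other, more than one valid order exists, and a scheduler is free to run such tasks in parallel on different workers.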
Apache Kafka is a distributed streaming platform that enables users to publish and subscribe to streams of records, store streams of records, and process them as they occur. Kafka is most notably used for building real-time streaming data pipelines and applications and is run as a cluster on one or more servers that can span more than one datacenter. The Kafka cluster stores streams of records in categories called topics, and each record consists of a key, a value, and a timestamp.
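The data model described above (topics holding records made of a key, a value, and a timestamp) can be sketched with a toy in-memory stand-in. This is not a Kafka client and does not talk to a cluster; `Record`, `ToyBroker`, and the topic name "orders" are assumptions made purely for illustration.

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    key: str
    value: str
    timestamp: float = field(default_factory=time.time)


class ToyBroker:
    """In-memory stand-in for a cluster: one append-only list per topic."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, key, value):
        """Append a record to the topic and return its position (offset)."""
        self._topics[topic].append(Record(key, value))
        return len(self._topics[topic]) - 1

    def read(self, topic, offset=0):
        """Return records from `offset` onward; reading does not remove them."""
        return self._topics[topic][offset:]


broker = ToyBroker()
broker.publish("orders", "order-1", "created")
broker.publish("orders", "order-1", "paid")
for record in broker.read("orders"):
    print(record.key, record.value)
```

The point of the sketch is that records are stored rather than consumed destructively, so several independent subscribers can each read the same stream from their own position.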
Apache NiFi is a system used to process and distribute data, and it supports directed graphs of data routing, transformation, and system mediation logic. NiFi features a web-based user interface that enables users to toggle between design, control, feedback, and monitoring. It is highly configurable (dynamic prioritization, back pressure, flow modification at runtime) and is designed for extension. NiFi also offers multi-tenant authorization and internal authorization and policy management.
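Back pressure, one of the configuration options mentioned above, generally means that a queue between two processing steps stops accepting new items once it reaches a threshold, so a fast producer cannot overwhelm a slow consumer. The sketch below shows that general idea with a bounded queue from Python's standard library. It is not NiFi's API or implementation; the `Connection` class and its `threshold` parameter are hypothetical.

```python
import queue


class Connection:
    """Bounded queue between two processing steps."""

    def __init__(self, threshold):
        self._queue = queue.Queue(maxsize=threshold)

    def offer(self, item):
        """Try to enqueue an item; return False once the threshold is reached."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            return False

    def take(self):
        """Remove and return the oldest queued item."""
        return self._queue.get_nowait()


conn = Connection(threshold=2)
accepted = [conn.offer(n) for n in range(3)]
print(accepted)  # the third offer is refused because the queue is full

conn.take()  # the downstream step drains one item
print(conn.offer(99))  # space is available again, so the offer succeeds
```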
CloverETL (now CloverDX) was one of the first open source ETL tools. The Java-based data integration framework was designed to transform, map, and manipulate data in various formats. CloverETL can be used standalone or embedded, and connects to RDBMS, JMS, SOAP, LDAP, S3, HTTP, FTP, ZIP, and TAR. Though the product is no longer offered by the provider, it can still be downloaded from SourceForge. CloverDX also continues to support CloverETL in line with its standard support agreement.
Jaspersoft ETL is a part of TIBCO’s Community Edition open source product portfolio that allows users to extract data from various sources, transform the data based on defined business rules, and load it into a centralized data warehouse for reporting and analytics. The tool’s data integration engine is powered by Talend. The Community Edition offers a graphical design environment, more than 500 connectors and components, and job versioning. TIBCO also offers an open source business intelligence solution we’ve covered in a previous resource.
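The extract, transform, and load steps described above can be sketched in a few lines of standard-library Python. This is a generic illustration, not Jaspersoft ETL or Talend code; the sample data, the `sales` table, and the business rules (drop rows with no amount, upper-case the currency code) are all assumptions.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real job would read from files, APIs, or databases.
RAW_CSV = "id,amount,currency\n1,10.50,usd\n2,,usd\n3,7.25,eur\n"


def extract(text):
    """Parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Apply the assumed rules: skip rows with no amount, upper-case currency."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append((int(row["id"]), float(row["amount"]), row["currency"].upper()))
    return cleaned


def load(rows, conn):
    """Write the cleaned rows into a SQLite table standing in for a warehouse."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count)
```

Graphical tools such as the ones in this list let users assemble the same three stages from prebuilt connectors and components instead of writing them by hand.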
KETL is a production-ready ETL platform designed to assist in the development and deployment of data integration efforts that require ETL and scheduling. It allows for the management of complex data manipulation while leveraging an open source data integration platform. The KETL engine consists of a multi-threaded server that manages various job executors. Each executor performs a specific function, and job executors fall into the categories of SQL, OS, XML, Sessionizer, and Empty.
Pentaho Kettle offers ETL capabilities using a metadata-driven approach. Now part of the Hitachi Vantara Community, the tool features a graphical drag-and-drop design environment and a standards-based architecture. Pentaho allows users to create their own data manipulation jobs without entering a single line of code. It uses a common, shared repository which enables remote ETL execution as well. Hitachi Vantara also offers open-source business intelligence tools for reporting and data mining.
Talend Open Studio
Talend Open Studio for Data Integration is a free and open-source ETL tool. It provides a graphical design environment, ETL and ELT support, and versioning, and it enables the export and execution of standalone jobs in runtime environments. The software features a variety of connectors for RDBMS, SaaS, packaged applications, and technologies like Dropbox, Box, SMTP, FTP/SFTP, LDAP, and more. Talend also offers open-source solutions for data preparation and data quality, among others.
GeoKettle is a metadata-driven spatial ETL tool designed to integrate different spatial data sources for building and updating geospatial data warehouses. It is a spatially-enabled version of Pentaho Kettle. GeoKettle also benefits from geospatial capabilities from mature open source libraries like JTS, GeoTools, and deegree. The tool also features a cartographic viewer to preview your transformations, including map customization tools and basic cartographic functions.
HPCC Systems is an open source platform that incorporates a software architecture implemented on commodity shared-nothing computing clusters. It is configurable to support both parallel batch data processing and high-performance data delivery applications using indexed data files. HPCC’s ETL engine is called Thor and uses ECL, a scripting language designed specifically to work with data.
Apatar is a free and open-source data integration software package designed to help business users and developers move data in and out of a variety of data sources and formats. The tool requires no programming or design to accomplish even complex integration with joins across several data sources. Apatar provides a visual interface to minimize the impact of system changes. The tool comes with a pre-built set of integration tools and enables users to re-use previously built mapping schemas as well.
If you’re looking for an enterprise data management solution, consult our freshly updated Data Integration Buyer’s Guide.