
Apache Software Foundation Announces Arrow as a Top-Level Project


In a recent press release, the Apache Software Foundation announced a new top-level project: Apache Arrow. According to the foundation, Arrow is a high-performance cross-system data layer for columnar in-memory analytics. Arrow is expected to accelerate analytical workloads, in some cases by more than 100 times. In addition, the Big Data tool will enable multi-system workloads by eliminating cross-system communication overhead.

Arrow was initially seeded with code from another project, Apache Drill. However, it builds on a number of open source collaborations and, according to the foundation, establishes a de facto standard for columnar in-memory processing and interchange. Code committers to Apache Arrow include developers from a variety of other Big Data projects, including Calcite, Cassandra, Drill, Hadoop, HBase, Impala, Phoenix, Spark and others.

Jacques Nadeau, Vice President of Apache Arrow and Vice President of Apache Drill, adds: “The Open Source community has joined forces on Apache Arrow. Developers from 13 major Open Source Big Data projects are already on board – by introducing a new era of columnar in-memory analytics, we anticipate the majority of the world’s data will be processed through Arrow within the next few years.”

In many workloads, 70 to 80 percent of CPU cycles are spent serializing and deserializing data. Apache Arrow addresses this problem by enabling data to be shared between systems and processes with no serialization, deserialization or memory copies. Arrow also supports complex data with dynamic schemas, such as the JSON data commonly found in IoT workloads, modern applications and log files. Implementations are available for a number of programming languages, including Java, C++ and Python, to allow greater interoperability among Big Data solutions.
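To make the difference between copying and sharing concrete, here is a minimal sketch that uses only the Python standard library. It does not use Arrow's API or its memory format; the column name `prices` and its values are made up for illustration. It simply contrasts a serialize/deserialize round trip, where the consumer ends up with its own copy, with a view over the same contiguous column buffer, which is the general idea the press release describes.

```python
import pickle
from array import array

# One column of a table stored contiguously in memory (a columnar layout).
# The name and values are hypothetical.
prices = array("d", [9.5, 12.0, 7.25, 3.0])

# Path 1: serialize and deserialize, as when two systems exchange data in
# a format neither uses internally. The consumer receives its own copy.
copied = pickle.loads(pickle.dumps(prices))

# Path 2: give the consumer a view over the same buffer. Nothing is
# encoded, decoded or copied.
shared = memoryview(prices)

# Changing the producer's column reveals which consumer reads the same memory.
prices[0] = 100.0

copy_sees_update = copied[0] == 100.0   # the pickled copy is unaffected
view_sees_update = shared[0] == 100.0   # the view reads the updated value
column_total = sum(shared)              # scan the column through the view
```

The round trip through `pickle` costs CPU time in both directions and doubles the memory used, while the `memoryview` costs neither. Arrow applies the same principle across processes and languages by agreeing on a common in-memory columnar layout.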

Parth Chandra, a member of the Apache Arrow and Apache Drill Project Management Committees, notes: “Real world use cases often include complex combinations of structured and rapidly growing complex data. Already tested with Apache Drill, the efficient in-memory columnar representation and processing in Arrow will enable users to enjoy the performance of columnar processing with the flexibility of JSON.”

You can see Apache Arrow live in the wild at this year’s Strata + Hadoop World in sunny San Jose, California, in March.

For Apache’s full press release, click here.

