Combining Spark & Batch Processing for Real-Time Analytics

By Tim King, Executive Editor at Solutions Review

By Yann Delacourt

Companies that use Hadoop’s big data processing platforms typically look to one of two integration modes depending on their usage. The two modes, asynchronous and synchronous, each come with their own benefits and limitations. As the pace of business increases, more and more organizations are looking to use these integration modes interchangeably to pull as much benefit and analysis from their data as possible.

Asynchronous mode, often referred to as “batch,” is typically used for methodical, overnight processing. Organizations process huge data sets this way to meet the needs of most traditional corporate analytics initiatives. For instance, when a bank branch integrates the day's deposits into its books, it often uses batch processing.

However, demand for quicker insights is driving corporate analytics teams to look for technology that supports real-time integration and, ultimately, predictive analytics. The latency of batch processing makes this impossible. If a financial institution needs to detect and stop fraud as it happens, or an e-retailer wants to recommend a related add-on purchase while the customer is still shopping, batch processing won’t cut it.

Spark, an Apache Software Foundation project in the Hadoop ecosystem, provides an option for real-time integration. This multifunction analysis engine allows for a synchronous integration mode, commonly referred to as “streaming.” Spark quickly processes large data sets and covers the same functions as MapReduce, but with vastly superior performance: both data acquisition and processing can run as much as 50 to 100 times faster than with MapReduce.

Streaming works by processing a collection of events over a period of time, but it only makes a record of the group, and so doesn’t provide a timestamp for each and every record. Data quality can also suffer when streams of data arrive out of order or with missing records, so batch-processed records may still be necessary in some parts of the business or in regulated industries.
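The windowing idea described above can be illustrated with a minimal sketch. It is written in plain Python rather than with Spark's own API, and the event data, the function name `tumbling_window_counts`, and the 10-second window are all assumptions made for the example.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Group (timestamp_seconds, key) events into fixed windows and count per key.

    Only one summary per window is kept; the individual event
    timestamps are not part of the result.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        # Every event is assigned to the window that contains its timestamp.
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}

# Hypothetical click events: (seconds since start, product id).
# Note that the event at second 3 arrives after the event at second 12.
events = [(1, "shoes"), (4, "socks"), (12, "shoes"), (3, "shoes"), (11, "hat")]

summary = tumbling_window_counts(events, window_seconds=10)
print(summary)
```

In this toy version the late event still lands in the right window because every event is held in memory before the summary is produced. A real streaming engine has to decide how long to wait for stragglers, which is one reason a later batch pass over the complete records can still be useful.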

When companies combine these two modes of processing, however, they get the best of both worlds. The newest wave of data integration technology supports both integration modes while making it possible to switch between them transparently. Previous generations allowed switching, but only with a complete overhaul of the data integration layer. Transparent switching simplifies the development of processing jobs and the management of their overall life cycle, including updates, changes, and re-use.
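One way to picture switching between modes without rewriting the integration logic is to keep the business rule in a single function and call it from both a batch runner and a streaming runner. The sketch below is plain Python under stated assumptions, not the API of Spark or of any particular integration product; `total_by_account`, `run_batch`, `run_streaming`, and the deposit data are hypothetical names chosen for illustration.

```python
from itertools import islice

def total_by_account(records, totals=None):
    """Shared logic: add each (account, amount) record to the running totals."""
    totals = dict(totals or {})
    for account, amount in records:
        totals[account] = totals.get(account, 0) + amount
    return totals

def run_batch(records):
    """Batch mode: process the whole data set in one pass."""
    return total_by_account(records)

def run_streaming(record_iter, micro_batch_size=2):
    """Streaming mode: process small chunks as they arrive, yielding updated totals."""
    totals = {}
    it = iter(record_iter)
    while True:
        chunk = list(islice(it, micro_batch_size))
        if not chunk:
            break
        totals = total_by_account(chunk, totals)
        yield dict(totals)

# Hypothetical deposits: (account id, amount).
deposits = [("A", 100), ("B", 50), ("A", 25)]

batch_result = run_batch(deposits)
stream_results = list(run_streaming(deposits))

print(batch_result)
print(stream_results[-1])
```

Because both runners call the same function, a change to the rule is made once and applies in either mode, and the final streaming state matches the batch result for the same input.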

The e-retailer that was looking for a way to provide recommendations may now combine browsing history data with the very latest information available, even from social networks. Banks can now do more than synchronize daily activity: they can create data lakes to store all internal and external market data, then compile the data with no volume restrictions and integrate it with other types of data for a predictive program. Combining Spark and batch processing also enables huge volumes of data to be extracted for predictive maintenance, or to predict the outcomes of various scenarios.

Retail and banking are just the tip of the iceberg. Combining Spark and batch processing offers unprecedented analytical potential, letting analytics reflect the current reality of the business with greater accuracy. Data-driven companies in any industry that take advantage of this technology will find that they are able to maximize the value derived from their data and stay ahead of market needs and customer demands.

Yann Delacourt is director of product management at Talend. His field of expertise covers data integration, big data, and analytics. Yann has more than 15 years of experience in the software industry, having held various leadership positions in product management and engineering at SAP & Business Objects. Connect with him on LinkedIn.

This article was written by Tim King on August 14, 2015

Tim is Solutions Review's Executive Editor and leads coverage on data management and analytics. A 2017 and 2018 Most Influential Business Journalist and 2021 "Who's Who" in Data Management, Tim is a recognized industry thought leader and changemaker. Story? Reach him via email at tking@solutionsreview dot com.
