Combining Spark & Batch Processing for Real-Time Analytics
By Yann Delacourt
Companies that use Hadoop’s big data processing platforms typically choose one of two integration modes depending on their usage. The two modes, asynchronous and synchronous, each come with benefits and limitations. As the pace of business increases, more and more organizations are looking to use these integration modes interchangeably to draw as much benefit and analysis from their data as possible.
Asynchronous mode, often referred to as “batch,” is typically used for methodical, overnight processing. Organizations process huge data sets this way to meet the needs of most traditional corporate analytics initiatives. For instance, when a bank branch integrates the day’s deposits into its books, batch processing is often used.
However, demand for quicker insights is driving corporate analytics teams to look for technology that supports real-time integration and, ultimately, predictive analytics. The latency of batch processing makes this impossible. If a financial institution needs to detect and stop fraud as it happens, or an e-retailer wants to recommend a related add-on purchase, batch processing won’t cut it.
Spark, an Apache Software Foundation project that works within the Hadoop ecosystem, provides an option for real-time integration. This multifunction analysis engine allows for a synchronous integration mode, which is commonly referred to as “streaming.” Spark quickly processes large data sets and conveniently includes the same functions as MapReduce, but with vastly superior performance: both data acquisition and processing can be managed at a processing speed 50 to 100 times greater than MapReduce.
Streaming works by processing a collection of events over a period of time, but it only makes a record of the group, and so doesn’t provide a timestamp for each and every record. Data quality can also suffer when streams of data arrive out of order or with missing records, so batch-processed records may still be necessary in certain areas of the business or in regulated industries.
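To make the grouping idea concrete, here is a minimal sketch in plain Python. It is not Spark’s API; the function names, the five-second window and the sample events are all assumptions made for illustration. It shows how events collected over a period of time can be reduced to one record per group, at which point the individual timestamps are no longer kept.

```python
from collections import defaultdict


def micro_batches(events, window_seconds):
    """Group (timestamp, amount) events into fixed windows keyed by window start."""
    windows = defaultdict(list)
    for ts, amount in events:
        # Every event in the same window shares one key; arrival order does not matter.
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start].append(amount)
    return dict(sorted(windows.items()))


def summarize(windows):
    """Keep a single record per window; per-event timestamps are dropped here."""
    return [
        {"window_start": start, "count": len(amounts), "total": sum(amounts)}
        for start, amounts in windows.items()
    ]


# Hypothetical events: (seconds since start, transaction amount), arriving out of order.
events = [(1, 20.0), (7, 5.0), (3, 12.5), (11, 40.0), (9, 2.5)]
summary = summarize(micro_batches(events, window_seconds=5))
```

Each entry in `summary` describes a whole window rather than a single transaction, which is why a downstream consumer that needs record-level detail may still have to go back to batch-processed data.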
When companies combine these two modes of processing, however, they get the best of both worlds. The newest wave of data integration technology supports both integration modes while making it possible to switch between them transparently. Previous generations allowed switching, but only with a complete overhaul of the data integration layer. Transparent switching simplifies processing development and the management of the overall life cycle, including updates, changes, and re-use.
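One way to picture switching between modes without reworking the integration layer is to keep the business rule in one place and let either mode call it. The sketch below is plain Python under stated assumptions: the rule, the threshold, the function names and the sample transactions are hypothetical, and it does not represent any particular product’s implementation.

```python
def flag_large(transactions, threshold=1000.0):
    """Shared business rule: return the transactions above the threshold."""
    return [t for t in transactions if t["amount"] > threshold]


def run_batch(all_transactions):
    """Batch mode: apply the rule once over the full data set."""
    return flag_large(all_transactions)


def run_streaming(chunks):
    """Streaming mode: apply the same rule to each small chunk as it arrives."""
    flagged = []
    for chunk in chunks:
        flagged.extend(flag_large(chunk))
    return flagged


# Hypothetical day of transactions, also split into two chunks to mimic a stream.
day = [
    {"id": 1, "amount": 250.0},
    {"id": 2, "amount": 4200.0},
    {"id": 3, "amount": 1800.0},
    {"id": 4, "amount": 90.0},
]
chunks = [day[:2], day[2:]]

batch_result = run_batch(day)
stream_result = run_streaming(chunks)
```

Because both entry points reuse `flag_large`, a change to the rule is made once and applies to both modes, which is the kind of re-use the paragraph above describes.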
The e-retailer that was looking for a way to provide recommendations may now combine browsing history data with the very latest information available, even from social networks. Banks can now do more than synchronize daily activity: they can create data lakes to store all internal and external market data, then compile the data with no volume restrictions and integrate it with other types of data for a predictive program. Combining Spark and batch processing also enables huge volumes of data to be extracted for predictive maintenance, or to predict the outcomes of various scenarios.
Retail and banking are just the tip of the iceberg. Combining Spark and batch processing offers unprecedented analytical potential for reflecting the current reality of the business with greater accuracy. Data-driven companies that take advantage of this technology, across all industries, will find that they are able to maximize the value derived from their data and stay ahead of market needs and customer demands.
Yann Delacourt is director of product management at Talend. His field of expertise covers data integration, big data and analytics. Yann has more than 15 years of experience in the software industry, having held various leadership positions in product management and engineering at SAP & Business Objects. Connect with him on LinkedIn.