Hadoop vs. Spark: Which Big Data Framework Is Better?

By Tim King, Executive Editor at Solutions Review

A recent Forbes post by internationally recognized Big Data expert Bernard Marr covers a key question for today’s data-driven organizations: Hadoop or Spark? You may find yourself asking that very question when deciding which Big Data framework to deploy at your company. The column was very informative, so I am taking the liberty of summarizing it and adding my own thoughts for those of you facing this conundrum.

Spark has overtaken Hadoop as the most active open source project. The two are not directly comparable in what they offer, but they have similar uses, and they are the two most commonly used Big Data frameworks in the enterprise.

For years, Hadoop was the leading open source Big Data framework. More recently, however, Spark has become the more popular of the two Apache tools. The two are not mutually exclusive and can be paired together in certain circumstances. Spark has been reported to run up to 100 times faster than Hadoop; however, it does not provide its own distributed storage system.

Distributed storage is an important factor in many of today’s Big Data projects, as it allows multi-petabyte datasets to be spread across the hard drives of many ordinary computers rather than held on a single, expensive specialized device. These systems are also scalable: capacity grows by adding more machines.
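To make the idea concrete, here is a minimal Python sketch of how a distributed store might cut a file into blocks and keep copies on several machines. It is a toy model under stated assumptions, not Hadoop's actual placement logic: the function names, the node names, the tiny block size, and the round-robin policy are all hypothetical and chosen only for illustration.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical toy policy: a real system would also weigh rack layout,
    free space, and load when choosing where replicas go.
    """
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def surviving_replicas(placement: dict[int, list[str]], failed_node: str) -> dict[int, list[str]]:
    """Return, for each block, the nodes that still hold it after one node fails."""
    return {
        block_id: [node for node in holders if node != failed_node]
        for block_id, holders in placement.items()
    }


# Illustrative values only: a 1000-byte "file", 256-byte blocks, four made-up nodes.
data = b"x" * 1000
blocks = split_into_blocks(data, block_size=256)
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_blocks(len(blocks), nodes, replication=2)
after_failure = surviving_replicas(placement, failed_node="node-b")
```

Because every block lives on two machines in this sketch, losing any single node still leaves at least one copy of each block, and adding machines to the `nodes` list adds capacity without changing the code.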

Spark requires a third-party system for organizing files in a distributed way. For this reason, many Big Data projects install Spark on top of Hadoop, where Spark’s advanced analytics applications can use data stored in the Hadoop Distributed File System (HDFS).

Where Spark really has the edge over Hadoop is speed, Marr writes. Spark handles most of its operations in memory, copying the data from the distributed storage system into RAM. This reduces the time-consuming reading from and writing to slow, mechanical hard drives that Hadoop’s MapReduce system requires. MapReduce, by contrast, writes all of the data back to the physical storage medium after each operation.
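The difference can be illustrated with a small Python sketch that runs the same three-step pipeline two ways and counts disk accesses. This is a toy model, not either framework's real execution engine: `run_disk_based` and `run_in_memory` are hypothetical names, JSON files in a temporary directory stand in for the storage layer, and a real in-memory engine may still spill to disk when RAM runs short.

```python
import json
import os
import tempfile


def run_disk_based(records, steps, workdir):
    """Read from disk before each step and write the result back after it."""
    path = os.path.join(workdir, "stage_0.json")
    with open(path, "w") as f:
        json.dump(records, f)  # the input dataset starts out on disk (not counted)
    disk_ops = 0
    result = records
    for i, step in enumerate(steps, start=1):
        with open(path) as f:
            current = json.load(f)
        disk_ops += 1  # one read per step
        result = [step(x) for x in current]
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump(result, f)
        disk_ops += 1  # one write per step
    return result, disk_ops


def run_in_memory(records, steps, workdir):
    """Read the input once, then keep every intermediate result in RAM."""
    path = os.path.join(workdir, "input.json")
    with open(path, "w") as f:
        json.dump(records, f)  # the input dataset starts out on disk (not counted)
    with open(path) as f:
        current = json.load(f)
    disk_ops = 1  # the single initial read
    for step in steps:
        current = [step(x) for x in current]
    return current, disk_ops


# Illustrative pipeline: three simple transformations over five numbers.
steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_ops = run_disk_based(list(range(5)), steps, workdir)
    mem_result, mem_ops = run_in_memory(list(range(5)), steps, workdir)
```

Both versions produce the same output, but the disk-based run touches storage twice per step while the in-memory run touches it once in total. The gap widens as pipelines get longer, which is one reason iterative workloads tend to benefit most from in-memory processing.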

Marr explains: “Spark’s functionality for handling advanced data processing tasks such as real-time stream processing and machine learning is way ahead of what is possible with Hadoop alone. This, along with the gain in speed provided by in-memory operations, is the real reason, in my opinion, for its growth in popularity. Real-time processing means that data can be fed into an analytical application the moment it is captured, and insights immediately fed back to the user through a dashboard, to allow action to be taken. This sort of processing is increasingly being used in all sorts of Big Data applications, for example recommendation engines used by retailers, or monitoring the performance of industrial machinery in the manufacturing industry.”
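The loop Marr describes, where each reading is analyzed the moment it arrives and the result is fed straight back, can be sketched in a few lines of plain Python. This is not Spark's streaming API; the `monitor` function, the window size, the threshold, and the sample temperature readings are all assumptions made up for illustration.

```python
from collections import deque


def monitor(readings, window_size, threshold):
    """Yield (reading, rolling_mean, alert) as each reading arrives.

    `readings` can be any iterable, including a live generator, so each
    result is available before the next reading has been produced.
    """
    window = deque(maxlen=window_size)  # keeps only the most recent readings
    for value in readings:
        window.append(value)
        mean = sum(window) / len(window)
        yield value, mean, mean > threshold


# Hypothetical machine temperatures; an alert fires once the
# three-reading rolling mean rises above 80.
sample = [70, 72, 71, 90, 95, 99]
results = list(monitor(sample, window_size=3, threshold=80))
alerts = [alert for _, _, alert in results]
```

In a production setting the `sample` list would be replaced by a live feed and the alert would drive a dashboard or notification, but the shape of the computation is the same: update a small piece of state per event and emit an answer immediately.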

The bottom line is that the two frameworks aren’t really in competition, especially considering they are born of the same Apache womb. There is some crossover in functionality between the two, and the companies that monetize these frameworks typically offer both, allowing the buyer to choose which functions to use. One thing to note, though, is that since Spark is in its infancy, it is likely behind Hadoop in security and support infrastructure. Companies that provide enterprise Big Data solutions will let customers work with whichever framework they choose; it all depends on what kind of data the organization has stored.

Bernard Marr was featured in our top Big Data Twitter follows earlier in the year; check it out.

This post was inspired by an article originally published in Forbes.


This article was written by Tim King on November 6, 2015

Tim King

Executive Editor

Tim is Solutions Review's Executive Editor and leads coverage on data management and analytics. A 2017 and 2018 Most Influential Business Journalist and 2021 "Who's Who" in Data Management, Tim is a recognized industry thought leader and changemaker. Story? Reach him via email at tking@solutionsreview dot com.
