A recent Forbes post, written by internationally recognized Big Data expert Bernard Marr, covers a key question for today’s data-driven organizations: Hadoop or Spark? You may find yourself asking that very question when deciding which Big Data framework to deploy at your company. The column was very informative, so I am taking the liberty of summarizing it and adding my own thoughts as a courtesy to those of you who may be facing this conundrum.
Spark has overtaken Hadoop as the most active open source Big Data project. The two are not directly comparable in terms of what they offer, but they have similar uses, and they are the two most commonly used Big Data frameworks in the enterprise.
For years, Hadoop was the leading open source Big Data framework. More recently, however, Spark has become the more popular of the two Apache tools. The two are not mutually exclusive and can be paired together in certain circumstances. Spark has been reported to work up to 100 times faster than Hadoop's MapReduce; however, it does not provide its own distributed storage system.
Distributed storage is an important factor in many of today’s Big Data projects, as it allows multi-petabyte datasets to be spread across any number of ordinary computer hard drives, rather than requiring expensive machinery that holds everything on one device. These systems are also scalable: more drives can be added as the dataset grows.
Spark requires a third-party system for organizing files in a distributed way. For this reason, many Big Data projects install Spark on top of Hadoop, where Spark’s advanced analytics applications can make use of data stored in the Hadoop Distributed File System (HDFS).
Where Spark really has the edge over Hadoop is speed, Marr writes. Spark handles most of its operations in memory, copying data from the distributed storage system into much faster RAM. This reduces the amount of time-consuming writing and reading to and from slow, mechanical hard drives that is required under Hadoop’s MapReduce system. MapReduce, by contrast, writes all of the data back to the physical storage medium after each operation.
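To make the difference concrete, here is a toy sketch in plain Python. It uses neither Spark nor Hadoop; the stage functions and the two runner functions are hypothetical names I made up for illustration. One runner writes each stage's output to disk and reads it back before the next stage, loosely mimicking the MapReduce pattern described above, while the other keeps intermediate results in memory.

```python
import json
import os
import tempfile


# Three made-up processing stages (illustrative only).
def double(values):
    return [v * 2 for v in values]


def add_one(values):
    return [v + 1 for v in values]


def keep_multiples_of_three(values):
    return [v for v in values if v % 3 == 0]


STAGES = [double, add_one, keep_multiples_of_three]


def run_with_disk_between_stages(values, stages, workdir):
    """Persist every stage's output to disk and reread it for the next stage."""
    disk_writes = 0
    current = values
    for i, stage in enumerate(stages):
        output = stage(current)
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump(output, f)   # write intermediate result to storage
        disk_writes += 1
        with open(path) as f:
            current = json.load(f)  # read it back before the next stage
    return current, disk_writes


def run_in_memory(values, stages):
    """Chain the stages directly; intermediate results never touch the disk."""
    current = values
    for stage in stages:
        current = stage(current)
    return current, 0


if __name__ == "__main__":
    data = list(range(10))
    with tempfile.TemporaryDirectory() as workdir:
        disk_result, writes = run_with_disk_between_stages(data, STAGES, workdir)
    memory_result, _ = run_in_memory(data, STAGES)
    print("same result:", disk_result == memory_result)
    print("intermediate disk writes:", writes, "vs 0 in memory")
```

Both runners produce the same answer; the only difference is how many times intermediate data makes a round trip to storage. On a real cluster with large datasets, those round trips are where much of the time goes, which is the gap Marr is describing.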
Marr explains: “Spark’s functionality for handling advanced data processing tasks such as real-time stream processing and machine learning is way ahead of what is possible with Hadoop alone. This, along with the gain in speed provided by in-memory operations, is the real reason, in my opinion, for its growth in popularity. Real-time processing means that data can be fed into an analytical application the moment it is captured, and insights immediately fed back to the user through a dashboard, to allow action to be taken. This sort of processing is increasingly being used in all sorts of Big Data applications, for example recommendation engines used by retailers, or monitoring the performance of industrial machinery in the manufacturing industry.”
The bottom line is that the two frameworks aren’t really in competition, especially considering that both are Apache projects. There is some crossover functionality between the two, and the companies that monetize these frameworks typically offer both, allowing the buyer to choose which functions they prefer to use. One thing to note, though, is that since Spark is still in its infancy, it is likely behind Hadoop in security and support infrastructure. Providers of enterprise Big Data solutions will let customers work with whichever framework they choose; it all depends on what kind of data the organization has stored.