By Bernd Harzog
6. Relationships between sources of data are crucial: Consider some retail sales data, broken down by product, store and geography. Each customer has an account with us, so we also know basic demographics such as gender and age range. All of this data comes from the online business system. Now suppose we need to understand whether the operation of the site affects revenue. That means combining end-user experience data, application performance data and IT Operations data with the business data as all of these streams arrive. Batch ETL cannot do this, because there is no time to run it. It has to be replaced with real-time, continuous discovery of the relationships between streams of data from disparate sources as that data arrives.
7. Statistical relationships are only valuable against deterministically related data: Much hope has been placed in “Big Data analytics” and “machine learning.” Both are powerful techniques, and both are benefiting from enormous innovations created by very smart people doing world-class work in self-learning algorithms. But, to the earlier point of “garbage in, garbage out,” even the best algorithms will produce mediocre or worthless results when applied to data whose items have no inherent relationship.
8. The graph database is a crucial innovation: The graph database is an important part of the solution to this problem. LinkedIn can show you whom you are connected to, along with your entire “social graph” (who is connected to whom). We need to connect our business, IT Operations and IoT data in the same manner.
9. The Hadoop stack is not the only future of Big Data: The Hadoop stack has provided a dramatic improvement over the capabilities of the previous, now legacy, data warehouse class of solutions. But when it comes to real-time, continuous stream processing, where data must be ingested continuously and processed for immediate consumption, Hadoop is itself a legacy solution. The modern innovations in this area include Cassandra, InfluxDB, FiloDB, ScyllaDB, Spark, Kafka, and various graph databases such as Neo4j and Titan. In fact, there is even an acronym for a real-time Big Data stack: SMACK, for Spark, Mesos, Akka, Cassandra and Kafka. If you want to run it at scale, then run SMACK HARD, where HARD stands for High Availability, Redundancy and Distributed.
10. Real-time Big Data will bind the entire online enterprise together: Every part of the modern online enterprise is producing valuable streams of data. Each stream constitutes “Big Data” on its own. Together, they constitute a real-time deluge of data that must be collectively related, processed and made useful within seconds of ingest. Successful online enterprises will use their ability to take action upon real-time Big Data to achieve advantage over slow-moving rivals. In today’s world, it is not about the big vs. the small. It is about the fast vs. the slow.
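To make points 6 through 8 a little more concrete, here is a minimal Python sketch of relating streams as they arrive. It is an illustration under stated assumptions, not a description of any particular product: the entity names (`store:web`, `app:checkout`, `host:web-01`), the event fields, and the `RelationshipGraph` and `enrich` helpers are all hypothetical, an in-memory dictionary stands in for a graph database, and a plain list stands in for a live stream such as a Kafka topic.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class RelationshipGraph:
    """Hypothetical in-memory stand-in for a graph database.

    Stores known (deterministic) relationships between entities,
    e.g. which application serves a store and which host runs it.
    """
    edges: dict = field(default_factory=lambda: defaultdict(set))

    def relate(self, a: str, b: str) -> None:
        # Relationships are treated as undirected for simplicity.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def connected(self, start: str) -> set:
        """Return every entity reachable from start, excluding start."""
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for neighbor in self.edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        return seen - {start}


def enrich(events, graph):
    """Join each sale with the latest metric of every related entity.

    Events are processed one at a time, in arrival order, so there is
    no batch ETL step: the join happens as the data arrives.
    """
    latest = {}  # entity -> most recent metric event seen so far
    for event in events:
        if event["kind"] == "metric":
            latest[event["entity"]] = event
        elif event["kind"] == "sale":
            related = graph.connected(event["entity"])
            context = {
                entity: latest[entity]["value"]
                for entity in sorted(related)
                if entity in latest
            }
            yield {**event, "context": context}


# Assumed topology: the web store is served by the checkout app,
# which runs on host web-01.
graph = RelationshipGraph()
graph.relate("store:web", "app:checkout")
graph.relate("app:checkout", "host:web-01")

# Assumed interleaved stream of IT Operations, APM and business events.
events = [
    {"kind": "metric", "entity": "host:web-01", "name": "cpu_pct", "value": 97},
    {"kind": "metric", "entity": "app:checkout", "name": "response_ms", "value": 4200},
    {"kind": "sale", "entity": "store:web", "product": "sku-1", "amount": 19.99},
]

results = list(enrich(events, graph))
for row in results:
    print(row)
```

The design choice worth noting is that the sale is joined only to metrics from entities it is actually connected to in the graph. Any statistical analysis run on the enriched rows then starts from deterministically related data rather than from streams that merely happen to share a timestamp.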
Bernd Harzog is the CEO and Founder of OpsDataStore Inc., where he is responsible for the strategy, execution and financing activities of the company. Before Bernd founded OpsDataStore, he was the CEO and founder of APM Experts, CEO of RTO Software, Inc., founding VP of Products at Netuitive, a general manager at XcelleNet, and a Research Director for the Gartner Group focusing upon the Windows Server Operating family of products. Connect with him on LinkedIn.