How to Reinvent Your Data Streaming Architecture: A Brief
This is part of Solutions Review’s Premium Content Series, a collection of contributed columns written by industry experts in maturing software categories. In this submission, Upsolver Co-Founder and CEO Ori Rafael outlines key ways to reinvent your data streaming architecture with examples.
Data volumes are exploding, driven mostly by tech-led companies (not just the giants, but also emerging players in fintech, adtech, edtech, and more). However, old-school companies have jumped in as well, by adding IoT sensors to automobiles, factory lines, and oil pipelines, plus collecting and analyzing customer interactions from their websites, digital products, and customer support. Data is used for everything from automating highway toll payments to measuring earthquakes to monitoring assembly lines, as The Wall Street Journal reports. But most data never leaves the nest. IDC estimates nearly two-thirds of 2020’s data existed only briefly. Of the remaining third, much sits in storage for years, unused.
This is because data is born as a new event in some source system, and the value of these events degrades very quickly. You have to act on new data while it still matters, and to understand and act on real-time events you have to implement streaming data technologies.
Make no mistake, exploiting streaming data requires a paradigm shift. We’ve spent decades taking data events and batch processing them on an hourly, daily, or weekly basis. Delivering real-time action requires a stream processing mindset, where new data is continually compared to historical data to identify interesting changes the business can act upon.
This means that your organization can’t continue to rely on the data infrastructure it has deployed over the last three decades. Unless it is updated to support a streaming process, your analytics will fare poorly relative to what is possible.
In short, you need to change your mindset and reinvent your architecture. Here is how you can do that.
Architect for Events
While data is born as an event, these events are usually batched because that’s the process supported by traditional data infrastructure. A traditional data pipeline typically involves batch processing with defined start and end times (say, a batch processed at the end of each hour).
Obviously, you can’t enable real-time analytics using batch. Batch processing creates delay, and that delay equates to missed opportunities. To get to fresher analytics, you must adopt a streaming data pipeline approach, in which logic runs on each new event as it arrives, letting you detect changes as they happen.
Besides providing fresher data, stream processing has the added benefit of being backward traceable. A well-designed streaming architecture employs an event sourcing approach that keeps a log of every change made to the data set since its inception. This allows you to change your logic and rerun the new logic on old data. So, if you discover a bug, or an unexpected change to your source data, you can rerun your pipeline. This makes your data operations more flexible and resilient.
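To make the event sourcing idea concrete, here is a minimal sketch in Python. All names are illustrative, not part of any specific product: an append-only log of raw events, and a replay function that rebuilds derived state from scratch. Because the raw log is intact, fixing a bug in the processing logic is just a matter of rerunning history with the corrected version.

```python
# Minimal event-sourcing sketch: keep an append-only log of raw events,
# derive state from it, and replay the full log whenever the logic changes.

event_log = []  # append-only log of every raw event

def record(event):
    """Ingest a raw event by appending it to the immutable log."""
    event_log.append(event)

def replay(logic):
    """Rebuild derived state from scratch by running `logic` over the full log."""
    state = {}
    for event in event_log:
        logic(state, event)
    return state

# Original (buggy) logic: counts every click, including bot traffic.
def count_clicks_v1(state, event):
    if event["type"] == "click":
        state[event["user"]] = state.get(event["user"], 0) + 1

# Fixed logic: exclude bots. Since the raw events were preserved,
# we can rerun history instead of living with a corrupted aggregate.
def count_clicks_v2(state, event):
    if event["type"] == "click" and not event.get("bot", False):
        state[event["user"]] = state.get(event["user"], 0) + 1

record({"type": "click", "user": "alice"})
record({"type": "click", "user": "bob", "bot": True})
record({"type": "click", "user": "alice"})

print(replay(count_clicks_v1))  # {'alice': 2, 'bob': 1}
print(replay(count_clicks_v2))  # {'alice': 2}
```

A real pipeline would persist the log durably (in a message queue or data lake) rather than in memory, but the replay principle is the same.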
Reengineer for Freshness at Scale
Imagine you want to implement a “next best offer” system, which combines real-time behavioral data from an app user with all sorts of contextual data about them (e.g. browsing history, location, demographics) to determine which offers make the most sense for that individual. This kind of in-the-moment action relies on data freshness enforced by strict service level agreements (SLAs).
As these SLAs tighten, as the volume of data you source and process grows, and as more data consumers drive up the number of pipelines you run, you will need to scale your infrastructure flexibly to maintain data freshness.
As your requirements change and your demand grows, make sure that you can maintain real-time performance by using cloud data processing services with elastic scaling. Ensure that you don’t have any long-running operations that will slow down your process, or excessive memory utilization that can lead to service level inconsistencies.
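One simple way to make a freshness SLA operational is to measure, per event, the lag between when the event occurred and when the pipeline finished processing it. The sketch below is illustrative (the 5-second SLA and field names are arbitrary examples, not from the article):

```python
# Illustrative freshness check: lag = processed_time - event_time,
# flagged against a (hypothetical) 5-second SLA.

SLA_SECONDS = 5.0

def freshness_lag(event_time, processed_time):
    """End-to-end lag for one event, in seconds."""
    return processed_time - event_time

def sla_violations(events, sla=SLA_SECONDS):
    """Return the events whose lag exceeded the SLA."""
    return [e for e in events
            if freshness_lag(e["event_time"], e["processed_time"]) > sla]

events = [
    {"id": 1, "event_time": 100.0, "processed_time": 101.2},  # 1.2 s lag: OK
    {"id": 2, "event_time": 100.0, "processed_time": 109.0},  # 9.0 s lag: violation
]
print([e["id"] for e in sla_violations(events)])  # [2]
```

Tracking this lag as a metric over time tells you when to scale out before consumers notice stale data.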
Implement a Real-Time Data Lake
Streaming data necessitates a number of new technologies. You need a way to ingest events in a stream, store them affordably, process them efficiently, and distribute the transformed data to various analytics systems. The good news is that there are technologies available and proven in the marketplace. Together, these technologies constitute a real-time data lake.
Streaming ingestion: You need a message queue or event streaming service, such as Apache Kafka (self-managed or hosted), Amazon Kinesis, or Azure Event Hubs, that collects events into streams.
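To show the shape of streaming ingestion without requiring a live broker, here is a toy in-memory "topic" that mimics the partition-and-offset model Kafka and similar services use. This is purely illustrative; a real deployment would use a client library against an actual Kafka, Kinesis, or Event Hubs endpoint.

```python
# Toy model of a partitioned event stream. Events with the same key land on
# the same partition (preserving per-key order), and consumers track their
# own read position via offsets rather than deleting messages.

class Topic:
    def __init__(self, partitions=2):
        self.partitions = [[] for _ in range(partitions)]

    def produce(self, key, value):
        """Route an event to a partition by key, preserving per-key order."""
        p = sum(key.encode()) % len(self.partitions)  # deterministic toy hash
        self.partitions[p].append((key, value))
        return p

    def consume(self, partition, offset):
        """Read events from a partition starting at a consumer-tracked offset."""
        return self.partitions[partition][offset:]

topic = Topic(partitions=2)
p = topic.produce("sensor-a", {"temp": 21.5})
topic.produce("sensor-a", {"temp": 22.1})

# Both events for "sensor-a" are on the same partition, in arrival order.
print([v["temp"] for _, v in topic.consume(p, 0)])  # [21.5, 22.1]
```

The key point is that the log is durable and replayable: consuming from offset 0 again yields the same events, which is what makes the event sourcing pattern above possible.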
Data lake storage: Streaming data can accumulate to a huge size, so a cloud data lake based on object storage such as Amazon S3 or Azure Data Lake Storage (ADLS) is the most economical way to handle it. Many tools will allow you to connect message queues to a data lake.
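A common convention in data lake storage is to partition files by event date so queries can skip irrelevant data. The sketch below writes events to a date-partitioned layout on local disk, standing in for an object store path like `s3://bucket/events/...`; the layout and field names are illustrative, not prescribed by any particular tool.

```python
import json
import os
import tempfile

# Date-partitioned lake layout: one directory per event date, so a query
# for a single day only has to read that day's files.

def partition_dir(root, event):
    return os.path.join(root, f"date={event['date']}")

root = tempfile.mkdtemp()  # stand-in for an object storage bucket
events = [
    {"date": "2023-05-01", "user": "alice", "action": "click"},
    {"date": "2023-05-02", "user": "bob", "action": "view"},
]
for i, e in enumerate(events):
    part = partition_dir(root, e)
    os.makedirs(part, exist_ok=True)
    with open(os.path.join(part, f"event-{i}.json"), "w") as f:
        json.dump(e, f)

print(sorted(os.listdir(root)))  # ['date=2023-05-01', 'date=2023-05-02']
```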
Processing platform: Once data is stored in the data lake, you can either use Apache Spark – jobs are written in Python, Java or Scala – or a SQL-based tool such as Upsolver to blend recent real-time data with historical data to feed downstream systems. This blending occurs continuously as new data arrives.
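The essence of this blending step, stripped of any particular engine, is a continuous join of arriving events against historical context. The plain-Python sketch below illustrates the shape of what a Spark or SQL pipeline would do at scale; the field names and segments are hypothetical.

```python
# Continuous "blending": enrich each arriving event with historical context
# keyed by user, the way a stream-to-table join works in Spark or SQL.

historical = {  # e.g. a reference table loaded from the data lake
    "alice": {"lifetime_purchases": 12, "segment": "loyal"},
    "bob": {"lifetime_purchases": 0, "segment": "new"},
}

def enrich(event, history):
    """Merge an event with that user's historical profile (if any)."""
    context = history.get(event["user"],
                          {"lifetime_purchases": 0, "segment": "unknown"})
    return {**event, **context}

stream = [{"user": "alice", "action": "view_offer"}]
enriched = [enrich(e, historical) for e in stream]
print(enriched[0]["segment"])  # loyal
```

In a real pipeline this runs continuously as new events arrive, and the historical table itself is kept up to date by the same streaming process.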
Data lake optimization tools: Data lakes are very affordable, but they require performance optimization, something we call PipelineOps. Here are some examples. First, a compaction process is required to turn millions of single-event files into bigger files that are efficiently processed. Second, blending streaming and batch data requires orchestration of the tasks required to execute the processing job. Third, a state store is required for stateful processing that joins batches with streams.
These processes can be accomplished through a set of specific tools such as Airflow for orchestration or RocksDB, Redis or Cassandra as a state store, with these tools glued to the processing engine via code. Alternatively, you can implement a declarative data pipeline platform such as Upsolver, which automates these PipelineOps functions.
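Of the PipelineOps tasks above, compaction is the easiest to picture. Here is a toy compaction pass that merges many single-event files into one newline-delimited file, then removes the small originals; the file naming and format are illustrative stand-ins for what a platform like Upsolver automates over object storage.

```python
import json
import os
import tempfile

# Simulate a landing directory full of tiny one-event files.
src = tempfile.mkdtemp()
for i in range(5):
    with open(os.path.join(src, f"event-{i}.json"), "w") as f:
        json.dump({"id": i}, f)

def compact(directory, out_name="compacted.jsonl"):
    """Merge all single-event JSON files into one larger file.

    Reading one big file is far cheaper than opening millions of small ones.
    """
    small = sorted(f for f in os.listdir(directory) if f.startswith("event-"))
    out_path = os.path.join(directory, out_name)
    with open(out_path, "w") as out:
        for name in small:
            with open(os.path.join(directory, name)) as f:
                out.write(json.dumps(json.load(f)) + "\n")
    for name in small:  # remove the small files only after the merge succeeds
        os.remove(os.path.join(directory, name))
    return out_path

path = compact(src)
print(len(open(path).readlines()))  # 5
print(os.listdir(src))              # ['compacted.jsonl']
```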
Analytics systems: Once you’ve successfully optimized your data lake, you will be able to output your streaming data as “live tables” (they auto-update as new events arrive) that any analytics system can use.
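A "live table" can be thought of as a keyed view that is updated in place as each event arrives, so downstream queries always see current state. A minimal sketch, with illustrative column names:

```python
# Live-table sketch: each arriving event updates the row for its key,
# so analytics systems reading the table always see the latest counts.

live_table = {}

def apply_event(table, event):
    """Fold one event into the keyed view."""
    row = table.setdefault(event["user"], {"clicks": 0})
    row["clicks"] += 1

for e in [{"user": "alice"}, {"user": "alice"}, {"user": "bob"}]:
    apply_event(live_table, e)

print(live_table["alice"]["clicks"])  # 2
```

In practice the "table" lives in the data lake in an open format that BI tools and query engines can read directly.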
Streaming Is the Way Forward – and It’s Time to Get Moving
As you can see, streaming requires a mindset shift, thoughtful planning and new infrastructure. However, if you invest the time and effort to implement modern streaming processes and systems, you will be able to leverage fresh data and timely analytics insights.