This is part of Solutions Review’s Premium Content Series, a collection of contributed columns written by industry experts in maturing software categories. In this submission, CTERA CTO Aron Brand offers an introduction to edge to cloud data pipeline components.
Looking to get ahead of the competition? Edge-to-cloud data pipelines can provide the data-driven edge needed to succeed. If you look at any organization, whether it’s a for-profit business, a nonprofit, or a government agency, they’re all trying to become more data-driven for one simple reason: it’s the best way to make decisions.
In the past, organizations would make decisions based on gut instinct, or on what the boss said. But that’s no longer good enough. With the advent of big data and powerful data analysis tools, organizations can now make decisions based on hard evidence.
Data-driven decision-making has a number of advantages. First, it leads to better decisions. Second, it leads to faster decisions. Third, it’s more transparent, because everyone can see the data that went into the decision. The problem is that most organizations are not very good at data-driven decision-making.
Why? Because they don’t have the right data where it’s needed.
The challenge is that in the enterprise, 75 percent of data is created at the edge: in branch offices, on mobile devices, and by remote smart devices. How do you quickly and easily access, analyze, and draw conclusions from data that is spread across hundreds of silos? You need to aggregate all this data in one place, in a data lake: a centralized, curated, and secured repository that lets you collect all your data, both structured and unstructured, from across your organization and make it available for analytics and data processing.
The key to success with data lakes is to update them continuously so that the data stays timely rather than going stale. This means that you need a way to collect data from all of your silos and ingest it into your data lake without delay.
In simple terms, you may have plenty of data, but like milk, your data has an expiration date. To extract the most value from your data, you need to be able to collect it quickly, store it efficiently, and analyze it effectively. Otherwise, your data will go bad before you can use it to make informed decisions.
So, in addition to having a data lake, you need two more pieces of the puzzle. The first is a way to ingest data quickly so that it is available for analysis as soon as possible.
The second key piece of the puzzle is to be able to react to the data that you have ingested, quickly enough to be useful. This means having the ability to analyze the data and then take action based on what you have learned.
An edge-to-cloud data pipeline can provide these missing pieces of the puzzle. Edge-to-cloud data pipelines provide an efficient way to manage data that is generated at the edge. By automating the ingestion, preparation, and management of data, these pipelines enable organizations to receive actionable intelligence in real time.
A typical edge-to-cloud data pipeline consists of three components:
- A cloud gateway: Located at the edge, for securely ingesting data files into the cloud.
- A data lake: Implemented as a cloud filesystem or object store, for storing the data.
- A message broker: Such as Apache Kafka, for notifying cloud-based consumers of new data assets as they are ingested into the data lake.
The cloud gateway provides a way to ingest data quickly and reliably into the data lake, which stores the data in its raw format and serves as a central repository for long-term data retention. The message broker enables real-time processing of data at scale, without incurring the typical delays of a periodically scheduled extract, transform, load (ETL) process.
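To make the flow between these three components concrete, here is a minimal Python sketch. It is an illustration only, not CTERA's or Kafka's actual API: `DataLake`, `MessageBroker`, and `CloudGateway` are hypothetical in-memory stand-ins, and the topic name `new-data-assets`, the site name `branch-01`, and the file name are all assumptions made for the example. In a real deployment the lake would be a cloud filesystem or object store and the broker would be a system such as Apache Kafka.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DataLake:
    """Hypothetical in-memory stand-in for a cloud filesystem or object store."""
    objects: Dict[str, bytes] = field(default_factory=dict)

    def put(self, key: str, data: bytes) -> None:
        # Store the data in its raw format under the given key.
        self.objects[key] = data

    def get(self, key: str) -> bytes:
        return self.objects[key]


@dataclass
class MessageBroker:
    """Hypothetical in-memory stand-in for a broker such as Apache Kafka."""
    subscribers: Dict[str, List[Callable[[dict], None]]] = field(default_factory=dict)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self.subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # Deliver the event to every consumer subscribed to the topic.
        for handler in self.subscribers.get(topic, []):
            handler(event)


@dataclass
class CloudGateway:
    """Hypothetical edge gateway: writes to the lake, then notifies consumers."""
    site: str
    lake: DataLake
    broker: MessageBroker
    topic: str = "new-data-assets"  # assumed topic name

    def ingest(self, filename: str, data: bytes) -> str:
        key = f"{self.site}/{filename}"
        self.lake.put(key, data)  # 1. store the raw file in the lake
        # 2. announce the new asset so consumers can react without polling
        self.broker.publish(self.topic, {"key": key, "size": len(data)})
        return key


# Usage: a cloud-side consumer reacts as soon as a file lands in the lake.
lake = DataLake()
broker = MessageBroker()
seen = []
broker.subscribe(
    "new-data-assets",
    lambda event: seen.append((event["key"], len(lake.get(event["key"])))),
)

gateway = CloudGateway(site="branch-01", lake=lake, broker=broker)
gateway.ingest("camera-7.mp4", b"\x00" * 1024)
print(seen)  # [('branch-01/camera-7.mp4', 1024)]
```

The point of the sketch is the ordering inside `ingest`: the file is written to the lake first and the notification is published second, so a consumer that receives the event can always read the asset it refers to. This event-driven handoff is what replaces a periodically scheduled ETL job.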
To give you an idea of how this works in the real world, consider the following use cases:
- A retail company uses edge-to-cloud data pipelines to ingest video surveillance data from branches to the cloud in real time. This lets the company analyze how shoppers browse in stores, quickly identify potential issues such as theft, and take corrective action.
- A railway operator uses edge-to-cloud data pipelines to consolidate signaling log data from multiple train stations into a global filesystem, creating a single source of truth for all their stakeholders and data analysts, who then use the data to improve the rail network.
- A global records management company uses edge-to-cloud data pipelines to ingest scanned documents from hundreds of points of presence across multiple continents to create a global service for optical character recognition (OCR) and document processing.
By consolidating data from multiple sources and making it available in real time, edge-to-cloud data pipelines make it possible to quickly and effectively analyze data to make informed decisions. In a world where the ability to make better decisions faster than the competition is the key to success, edge-to-cloud data pipelines can give you the edge you need.
An Introduction to Edge to Cloud Data Pipeline Components (August 22, 2022)