
More Common Principles for Cloud Data Architecture Design Patterns
There are many ways to describe the design pattern of a CDA. This section presents descriptions based on the number of clouds and which established “micro architectures” (e.g., warehouses, lakes, operational databases) dominate the macro architecture of the CDA.
Cloud-Driven Design Patterns
Some design patterns are determined by the number of clouds involved and the amount of interoperability that occurs among them:
- Single-Cloud Data Architecture: Features data centralization on a single cloud, usually accompanied by a single, central team for data and analytics.
- Multi-Cloud Data Architecture: Distributes data across two or more clouds, which introduces the challenges of managing distributed data and of moving data among multiple clouds.
- Inter-Cloud Data Architecture: Merely storing data on two or more clouds, as in the Multi-Cloud pattern above, is not yet a functional architecture. However, Multi-Cloud can be extended with integration and analytic processes so that there is rigorous interoperability across the multiple cloud platforms and the datasets on them. This is called Inter-Cloud (see the sketch after this list).
- Hybrid CDA: This design pattern distributes a CDA across both on-premises and cloud platforms. It may or may not include the interoperability of Inter-Cloud (but it should, for highest value).
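To make the Inter-Cloud idea concrete, here is a minimal sketch of an integration process that keeps one dataset synchronized across clouds. The ObjectStore abstraction, the replicate() job, and the dataset key are assumptions for illustration; a real deployment would wrap each provider's storage SDK behind a similar interface.

```python
# Minimal sketch of Inter-Cloud interoperability: the same dataset is readable
# on two or more clouds, and an integration process keeps the copies current.
# ObjectStore is a hypothetical abstraction, not a real library API.
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    """One cloud's object storage, reduced to the two calls this sketch needs."""

    @abstractmethod
    def read(self, key: str) -> bytes: ...

    @abstractmethod
    def write(self, key: str, data: bytes) -> None: ...


def replicate(dataset_key: str, source: ObjectStore, targets: list[ObjectStore]) -> None:
    """Copy one dataset from its source cloud to every other participating cloud.

    In an Inter-Cloud design this job would run on a schedule (or react to change
    events) so that analytic and integration processes on any cloud can query a
    local, current copy instead of reaching across cloud boundaries at query time.
    """
    payload = source.read(dataset_key)
    for target in targets:
        target.write(dataset_key, payload)
```

The same pattern extends to analytic interoperability, for example federated queries that treat each cloud's synchronized copy as a local table.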
Database Design Patterns Commonly Deployed within a CDA
Some CDA macro design patterns are determined by the micro architecture that dominates them. Here are some examples:
Data Lake-Focused Cloud Data Architecture
Increasingly, data architects are designing CDAs where the vast majority of data managed is in the data lake partition of the CDA. This is true even if the CDA includes other substantial partitions, such as a data warehouse, data science labs, or consolidated operational datasets (e.g., complete view of customer). The idea is that the lake becomes the source of all refined datasets and use cases within the CDA. Furthermore, the lake retains massive volumes of raw data long-term, to support as-yet unknown analytics or use cases. In these design patterns, the data lake touches and influences all use cases and internal datasets.
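As a minimal sketch of this lake-focused pattern, the following shows a raw zone retained long-term and a refined zone derived from it, so that downstream consumers draw on the lake rather than the original sources. The zone names, paths, and the placeholder refine() step are illustrative assumptions, not a standard.

```python
# Minimal sketch of a lake-focused layout: raw data lands once, in full, and
# every refined dataset (warehouse loads, data science features, customer views)
# is derived from the lake rather than re-extracted from source systems.
from pathlib import Path

LAKE_ROOT = Path("/cda/data-lake")
RAW_ZONE = LAKE_ROOT / "raw"          # long-term retention of unmodified source data
REFINED_ZONE = LAKE_ROOT / "refined"  # cleansed, modeled datasets for downstream use


def refine(raw_file: Path) -> bytes:
    """Placeholder transformation; a real pipeline would cleanse and model the data."""
    return raw_file.read_bytes()


def publish_refined(dataset: str) -> None:
    """Derive a refined dataset from the raw zone so consumers never read raw data directly."""
    out_dir = REFINED_ZONE / dataset
    out_dir.mkdir(parents=True, exist_ok=True)
    for raw_file in (RAW_ZONE / dataset).glob("*"):
        (out_dir / raw_file.name).write_bytes(refine(raw_file))
```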
Data Warehouse-Focused Cloud Data Architecture
This is similar to the CDA design pattern focused on the data lake, except that users prefer to manage more data in the warehouse partition than in the lake. Alternatively, a particular CDA may have no data lake at all, in which case the warehouse fulfills the lake's purposes.
Lakehouse-Focused Data Architecture
The line between the lake and the warehouse is already fuzzy, and it gets fuzzier as time passes. A lakehouse micro architecture is well suited to a CDA that demands an agile data warehouse. However, a lakehouse typically exists only to serve the warehouse, not other use cases and datasets. In other words, a data lake is enterprise scale, whereas a lakehouse usually exists within the scope of a single data warehouse.
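For illustration, here is a minimal lakehouse sketch, assuming a Spark session with the open-source Delta Lake table format configured; the paths and table names are placeholders. The point is that warehouse-style tables and SQL sit directly on lake storage, scoped to a single warehouse workload.

```python
# Minimal sketch of a lakehouse pattern: warehouse-style tables kept directly on
# lake storage. Assumes the delta-spark package is configured on the session;
# other open table formats follow the same shape.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

# Land curated data as an ACID table on lake storage (the "house" on the lake)...
orders = spark.read.parquet("/cda/data-lake/refined/orders")
orders.write.format("delta").mode("overwrite").save("/cda/lakehouse/orders")

# ...then query it with ordinary warehouse-style SQL.
spark.read.format("delta").load("/cda/lakehouse/orders").createOrReplaceTempView("orders")
spark.sql("SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id").show()
```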
Real-Time Focused Data Architecture
Streams and other real-time data make special demands of a CDA. In reaction, technical users add a layer to the CDA specifically for capturing and processing real-time data. Names for this include: Real-Time Layer, Speed Layer, and Streaming-First Layer. That layer typically implements message brokers (Kafka, Pulsar), application programming interfaces (APIs, with a focus on REST), and various tools for data integration and app integration. The point is that streams need special tech to capture real-time data and achieve business benefit from its freshness.
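For example, a minimal Speed Layer consumer might look like the following sketch, which assumes a Kafka broker at localhost:9092 and a customer-events topic; both are placeholders. Unlike a scheduled batch load, it reacts to each event as it arrives, which is what makes the data's freshness usable.

```python
# Minimal sketch of a Speed Layer consumer: capture real-time events as they
# arrive instead of waiting for the CDA's batch/bulk data movement.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker address
    "group.id": "cda-speed-layer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["customer-events"])       # placeholder topic

try:
    while True:
        msg = consumer.poll(1.0)              # block briefly for the next event
        if msg is None or msg.error():
            continue
        event = msg.value().decode("utf-8")
        # Hand the fresh event to stream processing, alerting, or lake/warehouse landing.
        print(f"captured real-time event: {event}")
finally:
    consumer.close()
```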
Here are a few observations about the CDA’s Speed Layer:
- The Speed Layer may be a distinct, autonomous layer, or it may be functionality included in the Data Fabric layer of the CDA.
- The Speed Layer’s competency in real-time tech is a complement to the batch/bulk competency that serves the vast majority of the CDA’s data. One does not replace the other.
- They both interoperate regularly with the CDA’s shared layers for storage and data consumption.
Now that we’ve discussed most of the salient components and characteristics of CDAs, let’s pull all that information together in a single big picture. The reference architecture in Figure 1 illustrates most components of the average data architecture, as discussed in this guide. Here are a few principles derived from that illustration.
Know and respect the four broad architectural areas
Figure 1 reveals the four broad areas that are typical of the modern Data Architecture, namely (from bottom up) Compute Platform, Data Storage, Data Fabric, and Data Consumption. Note that each, in turn, divides into multiple areas. This is normal, since few IT architectures exist as an isolated silo or exhibit a monolithic design. All architectures assume a multiple tool, platform, and solution environment, and each of those can have its own substantial design pattern or architecture.
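As a minimal illustration (not an exhaustive reading of Figure 1), the four areas and a few example components under each can be sketched as a simple checklist; the component lists below are assumptions for illustration.

```python
# Minimal sketch of the four broad architectural areas, each subdividing into
# multiple components; the example components are illustrative, not exhaustive.
CDA_LAYERS = {
    "Compute Platform": ["cloud infrastructure", "containers", "serverless"],
    "Data Storage": ["data lake", "data warehouse", "lakehouse", "operational databases"],
    "Data Fabric": ["data integration", "streaming and messaging", "catalog and semantics"],
    "Data Consumption": ["business intelligence", "advanced analytics", "data science tools"],
}

for layer, components in CDA_LAYERS.items():
    print(f"{layer}: {', '.join(components)}")
```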
Expect architectures within architectures
The average data architecture is a kind of “macro architecture,” which includes multiple “micro architectures.” Some micro architectures also contain other micro architectures. For example, in Figure 1, the Data Warehouse is included (in the Storage area); most warehouses have their own well-designed architecture, but it typically relies heavily on Data Integration’s architecture (in the Data Fabric area). Similarly, Data Lakes and Data Lakehouses are usually based on a distinct architectural design pattern. As another example, handling Real-Time Data (which relies on messaging and APIs) usually has its own micro architecture because it differs from the rest of the macro data architecture (which relies on bulk and batch technologies for data movement).
Think of your CDA as more than data layers
Note that the actual data architecture per se in Figure 1 consists of the Data Fabric and Data Storage layers. Other layers are included because they interact deeply and regularly with the two data architecture layers. For example, all architectures (for data, applications, networking, security, etc.) must run on some kind of Compute Platform, although that layer does not define the solution at hand. Also, there are always numerous Advanced Analytics and Business Intelligence tools in the Data Consumption layer that access the Data Fabric and Data Storage layers relentlessly, so they are included as a matter of completeness.
Make your CDA more than a tech stack
Figure 1 illustrates CDAs and other data architectures as a technology stack. But it also assumes that other conceptions of the CDA are possible (as listed and discussed earlier in this guide), though they are not included in the illustration. For example, a tech stack can be so simple that it is merely an inventory or portfolio manifest for the many components included. However, remember that a process-based conception would stress the numerous processes that move across the architecture, such as those for data integration, data refinement, federated queries, virtualization, accessing semantics, and so on. Other processes could be mostly driven by humans, as with governance, curation, data documentation (usually via semantics), and development work.
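To contrast with the stack view, here is a minimal sketch of a process-based conception of the same CDA: it names the processes that flow across the layers rather than the layers themselves. The step names and their ordering are illustrative assumptions.

```python
# Minimal sketch of a process-based view of a CDA: a pipeline of named processes
# that cross the stack's layers, rather than a list of the layers themselves.
from typing import Callable


def ingest(batch: dict) -> dict:
    """Data integration: land source data in the lake (Data Storage layer)."""
    return {**batch, "landed": True}


def refine(batch: dict) -> dict:
    """Data refinement: cleanse and model the landed data (Data Fabric layer)."""
    return {**batch, "refined": True}


def serve(batch: dict) -> dict:
    """Consumption: expose the refined data to BI and analytics tools."""
    return {**batch, "served": True}


PIPELINE: list[Callable[[dict], dict]] = [ingest, refine, serve]


def run(batch: dict) -> dict:
    for step in PIPELINE:
        batch = step(batch)
    return batch


print(run({"source": "orders"}))
```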
Diligently manage your CDA’s potentially massive software portfolio
As illustrated in Figure 1, there are dozens of tool and platform types that could be used with a CDA – and the figure omits many, for reasons of space. (Readers should pore over Figure 1 to see the many tool and platform types mentioned.) When organizations mature into using most of the possible tools, they end up with a massive software portfolio. It behooves organizations to keep their portfolios lean, because of the costs of acquiring and maintaining all those tools, plus training and staffing for each, and making diverse tools and platforms interoperate with the rest of the CDA.
Don’t forget about the data
Tech stack drawings (as in Figure 1) focus on the stack layers and especially the large portfolio of tools and platforms involved. This runs the risk of forgetting the raison d’être of the CDA: data and managing it for business advantage. A CDA design is just a starting point or an overview. Be careful not to burn up so much time and resources on the large-scale design that there is little left for the actual work of developing and maintaining data management solutions that give the organization data of the best possible relevance, quality, and business value.