
What is Cloud Data Architecture?
Cloud Data Architecture (CDA) has become important, as increasing numbers of user organizations adopt cloud as their preferred computing platform, migrate existing data-driven solutions from on premises to cloud, and build new analytics solutions on cloud. A well-designed and curated CDA is now a critical success factor for high-value, data-driven programs on cloud, namely analytics, data science, self-service data, monitoring operations via data, and sharing data across enterprise departments and partners.
Hence, many IT organizations are in the midst of migrating older data and analytics solutions to cloud and/or building new ones there. Whether migrating or building anew, they need to design and maintain a solid data architecture on cloud, if they are to cope with cloud’s massive scale and data diversity or – more importantly – to attain the maximum business value of data. Yet, many organizations still have limited skills and experience with clouds, analytics, and any kind of data architecture.
To help those organizations plan a new CDA or modernize an existing one, this guide introduces the modern Cloud Data Architecture. By “modern,” we mean solutions built with up-to-date technologies and their best practices, such as clouds, open source, or data science. Note that modern technologies are available both on premises and on cloud, and some hybrid data architectures span both. So, modern is not exclusively cloud, and many of the principles discussed here apply to all compute environments, not just cloud.
This guide is organized around the principles for various aspects of CDAs. By “principles,” we mean a mix of best practices, guidelines, and tips. In particular, we drill into CDA principles for defining and designing a CDA, adjusting to cloud differences, attaining business and technology benefits from a CDA, identifying business use cases for CDAs, making a business case for the CDA, choosing among the various styles of CDAs, applying the CDA’s reference architecture, governing data in CDAs, and planning a CDA project.
Why Care about Data Architecture?
- High-value, data-driven use cases demand lots of data. A well-designed data architecture provides a home for this data.
- Distributed data continues to be a challenge, and data architecture addresses this.
- A CDA centralizes data to simplify and control usage, plus raise business value of data
- Governance requirements are intensifying, and a centralized CDA facilitates this kind of control
- Cloud is a chance to reset your practices and improve your data architecture
- Digital business transformation needs a unified data architecture to enable many of its goals
Defining Cloud Data Architecture
Let’s start with a couple of broad definitions of our terms:
- “A data architecture is a large-scale design that describes the many components of a data environment that includes numerous datasets, data platforms, tools, use cases, and interfaces, plus how these components integrate and interact. The architecture is usually accompanied by policies and rules for data governance, data standards, modeling, and development procedures.”
- “A cloud data architecture is a data architecture whose components are wholly or partially deployed on one or more clouds.”
Fundamental Principles
Most CDAs are guided by well-established assumptions about what is required within a data architecture, plus how the architecture should operate and be handled:
- It’s about the data, first and foremost: After all, the point of a CDA is to provide just the right data, in the right condition, and at the right moment for the many use cases that consume data.
- It’s also data about the data: This includes multiple forms of metadata, data catalogs, data lineage, and other ways of describing data elements and data structures.
- It assumes numerous datasets, integrated for common goals: This is especially true of CDAs that mostly serve analytics use cases, since the aggregation of data from many sources is a regular first step in analytics development and usage.
- It documents and leverages relationships across datasets: In other words, individual datasets are important, but relations across them are, too. The complete view of the customer is a classic example; it is not complete without data from the many source systems that interact with the customer.
- It involves both data at rest and data in motion: In most CDAs, the vast majority of data enters the architecture via high-latency bulk and batch data movement technologies. However, some data arrives via streams or APIs, which operate close to real time. A mature CDA will support both latencies.
- It enables and tracks data movement across datasets: Data tends to travel within an architecture, as it is combined and transformed multiple ways for multiple use cases.
- It runs on data platforms and data management tools: Software products for database management systems (DBMSs) and data integration, reporting, analytics, and so on play important roles in an architecture. This is why some people think that their software portfolio is their architecture; the truth is that the software is just one aspect of the CDA.
- A data architecture unifies all the above: Unification takes many forms. Data semantics provides views of data that can encompass the entire architecture. Data management processes can sync data across a CDA’s many datasets. An individual tool within the architecture may interoperate directly with other tools.
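As an illustration of the "data at rest and data in motion" principle, here is a minimal Python sketch in which one dataset accepts both a bulk batch load and a stream of individual events. The `Dataset` class, `nightly_extract`, and `order_events` are hypothetical names invented for this example; a real CDA would use data integration and streaming tools rather than in-memory lists.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterable, Iterator


@dataclass
class Dataset:
    """Hypothetical in-memory stand-in for a dataset in the CDA."""
    name: str
    rows: list = field(default_factory=list)

    def load_batch(self, records: Iterable[dict]) -> int:
        """Data at rest: append a whole bulk extract in one call."""
        batch = list(records)
        self.rows.extend(batch)
        return len(batch)

    def consume_stream(self, events: Iterator[dict]) -> int:
        """Data in motion: append events one at a time as they arrive."""
        count = 0
        for event in events:
            self.rows.append(event)
            count += 1
        return count


def nightly_extract() -> list:
    # Stand-in for a bulk file or table export (assumed data).
    return [{"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": 35.5}]


def order_events() -> Iterator[dict]:
    # Stand-in for a message stream or API feed (assumed data).
    yield {"order_id": 3, "amount": 12.25}


orders = Dataset("orders")
batch_count = orders.load_batch(nightly_extract())
stream_count = orders.consume_stream(order_events())
```

Both paths land in the same dataset, which is the point of supporting both latencies within one architecture.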
Principles for Conceptualizing a CDA
There are many ways to think of a cloud data architecture. It is important to consider them all, because each provides a unique insight into CDAs:
The data is the whole point of the CDA: At the end of the day, it is the data of a data architecture that matters most. That’s because users need the data to create their own data products, to use the data to answer business questions, to plan and strategize at the business management level, etc. If you haven’t designed an architecture that can gather and manage the data that users and the business require, then you have failed with your CDA.
CDA as a Series of Data Models and Design Patterns: This includes local data models as seen in each database table or data file format (XML, JSON, Parquet). But a data architecture is also about how various databases, data files, and other datasets relate to each other. For example, the complete view of a customer may be constructed with data elements that came from many datasets across a CDA. Each dimension of a data mart or each metric in a dashboard may be likewise distributed over the CDA. These relationships unify a CDA on its broadest scale, as much as the technology stack, software portfolio, and numerous processes of a CDA do.
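The complete view of a customer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three source datasets (`crm`, `billing`, `support`), their fields, and the `build_customer_view` helper are all hypothetical, and a real CDA would typically do this with joins or integration pipelines across data platforms.

```python
from collections import defaultdict

# Hypothetical source datasets; names and fields are illustrative only.
crm = [{"customer_id": 101, "name": "Ada Lopez", "segment": "enterprise"}]
billing = [{"customer_id": 101, "open_invoices": 2}]
support = [
    {"customer_id": 101, "ticket": "login issue"},
    {"customer_id": 101, "ticket": "export fails"},
]


def build_customer_view(crm, billing, support):
    """Combine data elements from several datasets into one record per customer."""
    view = defaultdict(lambda: {"tickets": []})
    for row in crm:
        view[row["customer_id"]].update(name=row["name"], segment=row["segment"])
    for row in billing:
        view[row["customer_id"]]["open_invoices"] = row["open_invoices"]
    for row in support:
        view[row["customer_id"]]["tickets"].append(row["ticket"])
    return dict(view)


customer_view = build_customer_view(crm, billing, support)
```

The shared `customer_id` key is the relationship that ties the datasets together; without it, each dataset describes only one facet of the customer.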
The Technology Stack for CDAs: Although on cloud, a CDA still has a “technology stack” akin to data and analytics solutions on premises. The stack usually has the following macro layers, and each includes micro layers: [FOOTNOTE: See Figure 1 for a Reference Architecture that adds more details to this tech stack.]
- Data consumption – Where end-users with tools for reporting, self-service, and analytics access the data of the architecture, as it is repurposed for multiple use cases.
- Data fabric – A layer that combines data integration, pipelining, metadata, catalogs, and interfaces, which brings data into the CDA, then refines and moves data from one area of the architecture to another.
- Data storage – Where data is persisted in cloud storage, then managed via database management systems (DBMSs) and file systems, plus accessed for reuse.
- Compute platform – This is the actual cloud infrastructure, which replaces or complements traditional on-premises infrastructure.
CDA as a Software Portfolio: A CDA, like any architecture, requires many types of software, including tools (for reporting, analysis, integration, quality, etc.), DBMSs, and data platforms. The software portfolio is an important consideration, but too many organizations think that acquiring a portfolio is the same as building an architecture. The portfolio is only one aspect of the CDA.
CDA as a Platform for Many Processes. The kinds of processes that a CDA supports can vary widely, including:
- Data integration and data refinement processes. This is where data enters the CDA, then is processed to make it suited to multiple use cases.
- Processes for metadata and other data semantics. This typically involves the development of metadata and its sharing via a repository or catalog.
- Processes for specific end-user practices. For example, there can be data access, refinement, and consumption processes that are unique to self-service data usage, report creation, dashboarding, data science, and other analytics. Each of these processes touches multiple areas within the CDA, possibly with multiple tools used by multiple people.
- Processes for controlling CDA daily access and usage. This includes data governance, data stewardship, data curation, and multiple forms of security.
A true CDA is a Unified Environment. As we’ve seen, a CDA incorporates many components. However, these should not be deployed as disconnected silos. A true architecture requires that components interoperate and share data rigorously.
A well-run CDA is governed, curated, and documented. Policies for data governance help to avoid compliance infractions, as users access the data of the CDA. Data curation limits indiscriminate data dumping. And data entering the CDA should always be documented via metadata and catalog entries, so it can be found and reused by everyone.
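To make the documentation and curation ideas concrete, the following Python sketch models a tiny catalog that refuses undocumented datasets and can trace lineage. The `Catalog` and `CatalogEntry` classes, the dataset names, and the rule that an owner and description are required are assumptions made for illustration, not features of any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    owner: str
    description: str
    upstream: list = field(default_factory=list)  # names of source datasets


class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Curation rule (an assumption for this sketch): no undocumented data.
        if not entry.owner or not entry.description:
            raise ValueError(f"{entry.name}: owner and description are required")
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Find datasets whose name or description mentions the keyword."""
        keyword = keyword.lower()
        return sorted(
            name for name, e in self._entries.items()
            if keyword in name.lower() or keyword in e.description.lower()
        )

    def lineage(self, name):
        """Return all upstream dataset names, nearest first (breadth-first)."""
        seen, queue = [], list(self._entries[name].upstream)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            if current in self._entries:
                queue.extend(self._entries[current].upstream)
        return seen


catalog = Catalog()
catalog.register(CatalogEntry("raw_orders", "sales-ops", "Order extracts from the order system"))
catalog.register(CatalogEntry("clean_orders", "data-eng", "Deduplicated orders", ["raw_orders"]))
catalog.register(CatalogEntry("revenue_mart", "analytics", "Monthly revenue by region", ["clean_orders"]))

mart_lineage = catalog.lineage("revenue_mart")
order_datasets = catalog.search("orders")
```

Rejecting entries at registration time is one simple way to limit indiscriminate data dumping, while the `upstream` links are what let a user trace where a dataset came from.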
How Data Architecture Differs on Cloud
Managing and architecting data on cloud is mostly the same as on premises. However, there can be a few notable differences to which user organizations must adapt:
Cloud is familiar, yet different, too. Even seasoned IT personnel will need to adjust to how data, analytics, and other tools, platforms, and practices are different on cloud.
Cloud data architecture leverages old skills, but demands new ones, too. Large-scale data architecture is a fairly new practice, and some organizations still have little or no experience with data clouds. They need to develop new skills, hire architects, and learn cloud best practices.
Costs are quite different on cloud. Consumption-based licenses are a good idea, but it takes a while for an organization to learn what its actual consumption is on cloud. Therefore, an organization needs to go through a few billing cycles (while carefully monitoring usage) to learn the cost. (For more detail, see the section on Financial Governance, later in this guide.)
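As a rough sketch of learning cost from a few billing cycles, the Python below multiplies observed usage by unit rates and averages the results into a baseline. The rates, the usage figures, and the two-part pricing model (compute hours plus storage) are hypothetical; actual cloud pricing varies by provider and contract.

```python
from statistics import mean

# Hypothetical unit rates; real prices vary by provider and contract.
RATE_PER_COMPUTE_HOUR = 2.00
RATE_PER_TB_MONTH = 20.00


def cycle_cost(compute_hours, storage_tb):
    """Consumption-based cost for one billing cycle."""
    return compute_hours * RATE_PER_COMPUTE_HOUR + storage_tb * RATE_PER_TB_MONTH


# Assumed usage, as monitored over the first three billing cycles.
observed = [
    {"cycle": "month 1", "compute_hours": 100, "storage_tb": 5},
    {"cycle": "month 2", "compute_hours": 150, "storage_tb": 6},
    {"cycle": "month 3", "compute_hours": 125, "storage_tb": 7},
]

costs = [cycle_cost(c["compute_hours"], c["storage_tb"]) for c in observed]
baseline = mean(costs)  # a starting point for budgeting, not a forecast
```

With these assumed numbers, the three cycles cost 300, 420, and 390, which gives a baseline of 370 per cycle. The spread between cycles is as informative as the average, since it shows how much consumption can swing.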
Design patterns can be different on cloud. For example, on-premises data lakes tend to be small (a few terabytes of data or less), due to the costs of on-premises storage and CPUs, plus data center support costs. Storage and compute are more reasonably priced on cloud (and elasticity reduces maintenance time, staff, and cost). Hence, cloud-based data lakes can affordably grow to greater scale. This is also true of similar design patterns, such as data warehouses and data science datasets.
Data modeling can be different on cloud. Due to the favorable speed and scale of cloud data platforms, there is less need to pre-cache data and tweak data models for performance, as is habitually done on premises. Also, simpler relational modeling is the norm on cloud; for example, data dimensions are usually modeled via a star schema instead of proprietary hierarchical cubes.
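For readers unfamiliar with the star schema, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for a cloud data platform. The table names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One central fact table surrounded by dimension tables: the "star".
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'hardware'), (2, 'software');
INSERT INTO dim_date VALUES (10, 2024, 1), (11, 2024, 2);
INSERT INTO fact_sales VALUES (1, 10, 100.0), (1, 11, 50.0), (2, 10, 75.0);
""")

# Dimensions are resolved at query time with plain relational joins,
# rather than being pre-aggregated into a cube.
rows = conn.execute("""
SELECT p.category, d.year, SUM(f.amount) AS total
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
JOIN dim_date AS d ON d.date_id = f.date_id
GROUP BY p.category, d.year
ORDER BY p.category
""").fetchall()
```

Here the aggregation is computed on demand from the fact table, which reflects the reduced need to pre-cache data when the platform is fast enough to answer such queries directly.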