Philip Russom, Ph.D., an independent analyst for data and analytics, offers commentary on the Gartner view of data engineering from the recent Gartner Data & Analytics Summit 2023.
The Gartner Data & Analytics Summit was held on March 19-22, 2023 in Orlando, Florida, and I was lucky enough to attend. It covered the leading topics of data and analytics (D&A), plus their best practices, tools, technologies, and team structures. Data engineering was covered in some of the better sessions of the summit. So, please allow me to summarize the Gartner view of data engineering, as presented at the Gartner D&A Summit 2023.
Defining Data Engineering
The best practices of data engineering concern collecting raw data from multiple sources, then merging and transforming that data to be re-used and optimized for analytics and other use cases. In other words, a data engineer makes raw data usable for multiple user types, ranging from business end-users to data scientists and all the diverse user types between those two.
Data engineers have long relied on skills in extract, transform, and load (ETL) and other forms of data integration. However, today’s data engineer is also cross-trained in many other data management disciplines, such as data quality, metadata management, data cataloging, data modeling, and multiple analytics disciplines. Furthermore, as the data engineering role continues to evolve, it increasingly relies on DataOps and agile methods.
Data Engineering Best Practices
The last analyst presentation of the summit was one of the best, namely “5 Ways to Enhance Your Data Engineering Practices,” presented by Robert Thanaraj, a Director Analyst at Gartner. The presenter began by explaining that data and analytics leaders need to improve their data engineering practices by delivering data products efficiently, automating release processes, proving business value early, eliminating operations overhead, and fostering collaboration. Thanaraj discussed these areas for improvement in detail, framing them as five best practices for data engineering and stating each as an actionable recommendation. My summary of those five best practices follows.
Best Practice 1: Avoid Data Engineering ‘Wastage’ with a Value-First Model
Engineers can put value first by asking: Will the solution have the anticipated effect on business value? If the answer is no, the engineer should discard or reformulate the idea and try another. According to speaker Thanaraj, it is best to “prove feasibility and business value ahead of data engineering efforts.”
Best Practice 2: Increase Release Velocity by Automating Testing & Release Processes
Testing and release processes are heavily manual tasks among most data engineering teams today. However, they can be automated by functions in data management tools and operationalized by adopting the DataOps method. In fact, Thanaraj cites a Gartner research note (Market Guide for DataOps Tools) that predicts: “By 2025, a data engineering team guided by DataOps practices and tools will have been 10 times more productive than teams that do not use DataOps.” He also stated that “DataOps complements data integration [tools] for efficiency and automation support.”
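To make the idea of automated testing concrete, here is a minimal sketch of a release gate that runs data checks without manual intervention. It is only an illustration under stated assumptions: the field names (`customer_id`, `amount`), the three checks, and the function names are hypothetical and do not come from the Gartner presentation or from any particular DataOps tool.

```python
from dataclasses import dataclass
from typing import Callable

# A row is a plain dict; the field names used below are hypothetical.
Row = dict[str, object]


@dataclass
class Check:
    name: str
    passes: Callable[[list[Row]], bool]


def no_null_ids(rows: list[Row]) -> bool:
    # Every row must carry a customer_id.
    return all(r.get("customer_id") is not None for r in rows)


def unique_ids(rows: list[Row]) -> bool:
    # No customer_id may appear twice.
    ids = [r.get("customer_id") for r in rows]
    return len(ids) == len(set(ids))


def non_negative_amounts(rows: list[Row]) -> bool:
    # amount must be numeric and not below zero.
    return all(
        isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0
        for r in rows
    )


# Assumed set of checks; a real team would define its own.
CHECKS = [
    Check("no_null_ids", no_null_ids),
    Check("unique_ids", unique_ids),
    Check("non_negative_amounts", non_negative_amounts),
]


def failed_checks(rows: list[Row]) -> list[str]:
    # Names of all checks that the dataset does not pass.
    return [c.name for c in CHECKS if not c.passes(rows)]


def release_gate(rows: list[Row]) -> bool:
    # The dataset may be released only when every check passes.
    return not failed_checks(rows)
```

A pipeline could call `release_gate` on each candidate dataset and block the release whenever it returns `False`, which replaces a manual review step with a repeatable one.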
Best Practice 3: Manage Data as a Product to Promote Reusability & Modularity
The delivery of data products from data engineers and other producers should offer a user experience similar to that of a marketplace. A data marketplace is a collaborative environment that enables a publish-and-subscribe process: data custodians publish data products; data consumers access data products and subscribe to them via self-service; and data subject owners approve or audit data products.
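The publish-and-subscribe process of a data marketplace can be sketched in a few lines of code. The example below is a toy, in-memory model, not a description of any real marketplace product. The class and method names are hypothetical, and the rule that a product must be approved before anyone can subscribe is my own simplifying assumption about how approval might work.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    name: str
    custodian: str
    approved: bool = False  # set by a data subject owner
    subscribers: set[str] = field(default_factory=set)


class Marketplace:
    """Toy in-memory marketplace (hypothetical design)."""

    def __init__(self) -> None:
        self._products: dict[str, DataProduct] = {}

    def publish(self, name: str, custodian: str) -> DataProduct:
        # A data custodian publishes a new, not yet approved product.
        product = DataProduct(name, custodian)
        self._products[name] = product
        return product

    def approve(self, name: str) -> None:
        # A data subject owner approves the product.
        self._products[name].approved = True

    def subscribe(self, name: str, consumer: str) -> None:
        # Self-service subscription; assumed to require prior approval.
        product = self._products[name]
        if not product.approved:
            raise PermissionError(f"{name} has not been approved yet")
        product.subscribers.add(consumer)

    def catalog(self) -> list[str]:
        # Only approved products are visible to consumers.
        return sorted(n for n, p in self._products.items() if p.approved)

    def subscribers_of(self, name: str) -> set[str]:
        return set(self._products[name].subscribers)
```

In this sketch a custodian calls `publish`, an owner calls `approve`, and a consumer then calls `subscribe` without needing help from the data engineering team.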
Product managers are key members of the data product development team. A product manager develops the product vision and hypotheses; owns the product roadmap; determines timing for handoff of data products; and trains business partners. Note that product managers and data engineers must coordinate efforts carefully, so that data products built by data engineers align with business requirements, as defined by the product manager.
Best Practice 4: Eliminate Operational Overhead for Citizen Users
For example, data engineers should create usage metrics for data products and their marketplace, then track those metrics to understand who is using each data product, for how long, and how often. Metrics can reveal unused products that should be archived, as well as popular ones that should be promoted to the user base.
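As a rough illustration of such usage metrics, the sketch below summarizes an access log to flag unused and popular data products. The log format (product, user, day), the 90-day idle window, and the threshold of three distinct users are all assumptions chosen for the example, not figures from the presentation.

```python
from collections import Counter
from datetime import date, timedelta


def summarize_usage(events, products, today, idle_days=90, popular_min=3):
    """Summarize access events for a set of data products.

    events:   list of (product, user, day) tuples (assumed log format)
    products: names of all published data products
    """
    cutoff = today - timedelta(days=idle_days)
    # Keep only accesses inside the look-back window.
    recent = [(p, u) for p, u, d in events if d >= cutoff]

    access_counts = Counter(p for p, _ in recent)
    users_by_product = {}
    for p, u in recent:
        users_by_product.setdefault(p, set()).add(u)

    # Candidates for archiving: no access in the window.
    unused = sorted(p for p in products if access_counts[p] == 0)
    # Candidates for promotion: enough distinct recent users.
    popular = sorted(
        p for p in products
        if len(users_by_product.get(p, ())) >= popular_min
    )
    return {
        "access_counts": dict(access_counts),
        "unused": unused,
        "popular": popular,
    }
```

The `unused` list suggests products to review for archiving, and the `popular` list suggests products worth promoting to the wider user base.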
Best Practice 5: Make Collaboration Explicit with Cross-Functional Teams
Data engineers must communicate and collaborate with several persona types, including data personas (i.e., engineers, architects, scientists, stewards), technology personas (programmers, DevOps specialists, test engineers), and business personas (business analysts, domain experts, citizen roles), as well as product managers and business stakeholders.
How Data Engineering Should Support Data Science
One of the most provocative sessions at the Gartner D&A Summit was “Forbidden Questions Bold Data Engineers Should be Discussing With AI Aficionados,” by Gartner analyst Erick Brethenoux. The presentation is based on two assumptions:
Data science operates on data
This is obvious, but bears repeating, because we too often focus on the science and take the data for granted. This can lead to poorly formed analytic outcomes based on poorly amassed datasets.
Data engineering must accommodate the special data requirements of data science
This is especially true of machine learning (ML), which has many stages, each with unique data requirements. There are subtle differences (which are also success factors) for training data, testing data, production data output by deployed analytic models, retraining data, and so on.
Analytic Models Work Best with Complete Data
Humans can fill gaps in knowledge based on our experience, but analytic algorithms are not good at this. Therefore, data engineers must provision analytic datasets that have no gaps. This forces the data engineer to find obscure datasets inside and outside the enterprise, which include data missing from analytic datasets. In some cases, inferred or derived data must be synthesized for those gaps, by processing other data. “Complete data” may also mean that the data engineer provisions datasets that are alternatives to the primary ones, in case an analytic algorithm can discover diverse insights from diverse datasets.
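A small sketch may help show what finding gaps and synthesizing derived data can look like in practice. The field names (`quantity`, `unit_price`, `total`) and the derivation rule are hypothetical and chosen only for illustration. Real derivations depend on the dataset and should be agreed with the data scientists who will use it.

```python
def find_gaps(rows, required):
    """Map row index -> list of required fields that are missing."""
    gaps = {}
    for i, row in enumerate(rows):
        missing = [f for f in required if row.get(f) is None]
        if missing:
            gaps[i] = missing
    return gaps


def fill_derived(rows):
    """Return copies of rows with 'total' derived where possible.

    Assumed rule: total = quantity * unit_price. Derived values are
    flagged so they are never mistaken for observed values.
    """
    filled = []
    for row in rows:
        row = dict(row)  # copy, so the input rows are not modified
        can_derive = (
            row.get("total") is None
            and row.get("quantity") is not None
            and row.get("unit_price") is not None
        )
        if can_derive:
            row["total"] = row["quantity"] * row["unit_price"]
            row["total_is_derived"] = True
        filled.append(row)
    return filled
```

Running `find_gaps` again after `fill_derived` shows which gaps remain and therefore which data still has to be sourced from elsewhere, inside or outside the enterprise.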
Erick Brethenoux’s presentation concluded with a description of five tests that data scientists and data engineers can apply to data, to assure that datasets are complete, gapless, and of good quality, and not merely a reflection of the preferences of the analytic model’s creator. Each test is a collection of questions about the data’s lineage, validation, stewardship, inference/derivation, and compliance with AI/ML standards. These are the “forbidden questions” (mentioned in the title of the session) that data engineers should be discussing with their data science counterparts.
- Gartner D&A Summit 2023: The Gartner View of Data Engineering - March 30, 2023