5 Data Reliability Engineering Interview Questions & Answers
Solutions Review’s Expert Insights Series is a collection of contributed articles written by industry experts in enterprise software categories. In this feature, Bigeye CEO Kyle Kirwan offers common data reliability engineering interview questions and example answers below.
A data reliability engineer (DRE) is responsible for putting tools and frameworks in place so that the organization can achieve fresh, accurate, high-quality data. In the past, this practice has fallen on a hodgepodge of business, operations, and technical teams. The DRE is the first role to formalize these efforts, acknowledging directly how data is a key influence on organizational strategy and decision-making.
The DRE is still an emerging role. If it’s done successfully, the DRE streamlines repetitive tasks that block data and engineering teams. The DRE is the linchpin across the organization that unites teams and data.
In the early 2000s, Google convened the first Site Reliability Engineering (SRE) team to reduce downtime and latency in software development. The principles and practices of SRE have since been applied to infrastructure and operations as teams build software at scale.
DREs bring a similar set of principles and practices to data. They manage the data infrastructure, which includes data pipelines, databases, warehouses and data lakes, archives, deployments, and documentation. The right DRE will bring in folks from multiple parts of the organization in order to get data right, just as a Site Reliability Engineer (SRE) must work in both infrastructure and software engineering domains to keep things reliable.
The DRE role
In researching the DRE role, you might come across the following responsibilities:
- Research data-related problems and collaborate across several teams to drive solutions
- Define business requirements that inform data governance and quality assurance
- Write tests that validate those business requirements
- Conduct data quality testing and monitoring
- Work across engineering and product teams to optimize data pipelines for reliability
To interview for a DRE role, you’ll likely have end-to-end experience with data engineering across solutions like SQL and Azure. You’ll know data visualization platforms like Tableau and Looker. You may have experience managing infra-as-code and building cloud-based databases with complex data inputs. Experience with data-centric applications like Hadoop is a plus, as is familiarity with general programming languages like Python, Java, and C. While you don’t need a computer science degree to be a DRE, you will likely need the equivalent work experience that gives you a strong foundation in engineering.
Common Data Reliability Engineering Interview Questions & Answers
Can you explain what data reliability engineering is and why it’s vital to an organization?
Answer: Based on Google’s Site Reliability Engineering (SRE) principles, Data Reliability Engineering is a combined set of tools and processes that modern data teams use to solve data challenges in a scalable way.
This role is vital to an organization because it touches everything, from monitoring to standard-setting to change and incident management. DREs keep applications and infrastructures working reliably across data warehouses and pipelines, which can massively impact the bottom line by preventing expensive data outages. DREs also help to streamline data management that might have previously been handled in disparate, messy ways across teams, creating bottlenecks and confusion down the line.
Can you give me an example of when you had to troubleshoot a data pipeline or quality issue?
Answer: One of the simplest issues in my previous role was a truncation problem. Somebody working on part of our data pipeline set the type of a column in their ETL job such that it was only storing integer amounts on our transactions. Cutting off the cents from millions of small transactions added up to a huge discrepancy in the totals being reported to executives looking at any dashboards that depended on that pipeline.
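To make the arithmetic concrete, here is a minimal sketch of how an integer-typed column loses money row by row. The amounts and variable names are made up for illustration and are not from the actual pipeline.

```python
from decimal import Decimal

# Hypothetical sample: a handful of small transactions, in dollars.
amounts = [Decimal("4.99"), Decimal("12.50"), Decimal("0.75"), Decimal("3.20")]

# The correct total keeps the cents.
true_total = sum(amounts, Decimal("0"))

# An integer-typed column drops the fractional part of every row.
truncated_total = sum(int(a) for a in amounts)

# Each row loses up to 99 cents, so the gap grows with the row count.
discrepancy = true_total - truncated_total

print(f"true total:         {true_total}")
print(f"truncated total:    {truncated_total}")
print(f"lost to truncation: {discrepancy}")
```

Four rows already lose 2.44 here; the same effect across millions of rows is what surfaced in the executive dashboards.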
Tracing the data lineage back from the dashboard until we located the offending table helped us solve the issue quickly. Manually inspecting the SQL for every stage in the pipeline to trace the dependencies would have been far more painful.
Because I had both lineage tracing and an incident management policy in place (this was a SEV3 because it only affected internal stakeholders, but it still actively interfered with the execution of our business), I was able to take care of the issue within the same working day, and earned a lot of trust for our team from the executives who were affected.
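As a rough illustration of what tracing lineage back from a dashboard means, the sketch below walks a dependency graph upstream and lists the tables to inspect, nearest first. The graph, the table names, and the `upstream_of` helper are all hypothetical; a real lineage tool would build this graph for you.

```python
from collections import deque

# Hypothetical lineage graph: each node maps to the upstream nodes it reads from.
UPSTREAM = {
    "revenue_dashboard": ["daily_revenue"],
    "daily_revenue": ["transactions_clean"],
    "transactions_clean": ["transactions_raw"],
    "transactions_raw": [],
}


def upstream_of(node, graph):
    """Return every upstream dependency of `node`, nearest first (breadth-first)."""
    seen, order = set(), []
    queue = deque(graph.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        queue.extend(graph.get(current, []))
    return order


# Tables to check, starting with the ones closest to the dashboard.
candidates = upstream_of("revenue_dashboard", UPSTREAM)
print(candidates)
```

Checking candidates in this order keeps the search close to the symptom before moving toward the raw sources.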
How do you help your organization maintain simplicity as it evolves the data stack?
Answer: The goal is to have “boring” pipelines that just work. Teams with “exciting” pipelines are regularly fire-fighting and managing incidents. Maintaining simplicity as the data stack evolves requires some serious streamlining and adherence to data reliability engineering principles.
One simple thing I’ve led our team to do is to answer “could this be built into the existing data model?” whenever someone wants to create a new table in our warehouse. New tables give a sense of freedom to the owner, because they can change them more freely without worrying about what they affect. The downside is that we now have data sitting in more places, more ETL jobs running, more joins needed to write queries, etc.
We added an RFC (request for comments) process, which includes this as a required question. It asks folks to briefly explain their change to the pipeline and gives others the chance to comment on the change before they implement it. With it, we’ve been able to reduce the rate of sprawl in our pipeline.
How do you balance embracing risk with setting standards?
Answer: Data pipelines will break in unexpected ways. Instead of avoiding data for fear of these breakages, we have to embrace risk and manage those breakages effectively through standards-setting.
One of the simplest things I’ve led our team to start doing is to set up SLAs for all of our “ending” tables, which sit at the end of the pipeline and feed our analytics dashboards. This means we know the freshness, row count, duplication level, etc. of those tables, and get alerted if any of them is violated.
People are free to edit our data pipeline with little oversight (just the RFC process I mentioned earlier), but we know we have enough monitoring in place that if an SLA is getting violated because of a change, we’re going to know and can roll things back quickly if we have to.
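A minimal sketch of an SLA check on one end-of-pipeline table might look like the following. The class names, thresholds, and sample statistics are assumptions for illustration; in practice the statistics would come from warehouse queries or a monitoring tool, and the result would feed an alert.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TableSLA:
    max_staleness: timedelta     # freshness: how old the newest load may be
    min_row_count: int           # volume: fewest rows we expect
    max_duplicate_ratio: float   # duplication: share of duplicate rows tolerated


@dataclass
class TableStats:
    last_loaded_at: datetime
    row_count: int
    duplicate_rows: int


def sla_violations(stats, sla, now):
    """Return the names of the SLA dimensions that `stats` violates."""
    violations = []
    if now - stats.last_loaded_at > sla.max_staleness:
        violations.append("freshness")
    if stats.row_count < sla.min_row_count:
        violations.append("row_count")
    dup_ratio = stats.duplicate_rows / stats.row_count if stats.row_count else 0.0
    if dup_ratio > sla.max_duplicate_ratio:
        violations.append("duplication")
    return violations


# Hypothetical numbers: loaded 9 hours ago, 1200 rows, 30 of them duplicates.
now = datetime(2023, 1, 2, 12, 0)
sla = TableSLA(timedelta(hours=6), 1000, 0.01)
stats = TableStats(datetime(2023, 1, 2, 3, 0), 1200, 30)
violated = sla_violations(stats, sla, now)
print(violated)
```

With these sample numbers the table is stale and over its duplicate budget, so both would trigger an alert while the row count passes.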
How do you monitor your data stack and prioritize issues once they’re discovered?
Answer: I use a mix of open source and commercial tools like dbt, Great Expectations, and Bigeye. This lets me test assertions about the data model and the data, as well as making the pipelines themselves observable so we have a complete understanding of pipeline behavior at all times.
I started out by integrating pipeline testing into our CI/CD pipeline, so that we could start adding test coverage to all our data models going forward. Then I added the observability tooling to blanket the entire pipeline with monitoring.
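To show the kind of assertion a CI job can run against a data model, here is a small sketch in plain Python, similar in spirit to the not-null and uniqueness tests that tools like dbt provide. The sample rows, the column name, and the helper functions are hypothetical and do not use any of those tools' APIs.

```python
# Hypothetical rows, as a CI job might fetch from a small sample of a model.
rows = [
    {"transaction_id": 1, "amount": "4.99"},
    {"transaction_id": 2, "amount": "12.50"},
    {"transaction_id": 3, "amount": "0.75"},
]


def null_count(rows, column):
    """Count rows where `column` is missing or None."""
    return sum(1 for r in rows if r.get(column) is None)


def duplicate_values(rows, column):
    """Return the set of values that appear more than once in `column`."""
    seen, dupes = set(), set()
    for r in rows:
        value = r.get(column)
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


# Functions named test_* are picked up by a runner such as pytest.
def test_transaction_id_not_null():
    assert null_count(rows, "transaction_id") == 0


def test_transaction_id_unique():
    assert duplicate_values(rows, "transaction_id") == set()
```

Failing either test would block the change in CI before it reaches the warehouse.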
For prioritization, I wrote a short incident management document and shared it with the entire data team, plus all of our data scientists and analysts at the company. One section in the document lays out our severity levels (SEV1 through SEV5) and the rules for categorizing an issue. SEV1s are critical issues that are causing a real-time customer-facing outage, and those will go to PagerDuty and wake me or another DRE up if needed. SEV5s are minor issues that we keep in a backlog and get to during a sprint, but which we won’t work on in real time, and the data science team knows that.
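The routing side of such a document can be sketched in a few lines. Only the SEV1 and SEV5 handling below comes from the description above; the "ticket" route for SEV2 through SEV4 and the route names themselves are placeholder assumptions, and no real paging API is called.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # critical: real-time, customer-facing outage
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5  # minor: handled from the backlog during a sprint


def route(severity: Severity) -> str:
    """Map a severity level to where the incident should go."""
    if severity is Severity.SEV1:
        return "page_on_call"   # wakes up a DRE if needed
    if severity is Severity.SEV5:
        return "backlog"        # not worked in real time
    return "ticket"             # assumption: the source does not define SEV2-SEV4 handling


for sev in Severity:
    print(sev.name, "->", route(sev))
```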
This combo of tools and process has helped me bring more clarity to how data engineering responds to problems, and what our data science and analytics “customers” can expect from us in terms of response times.
If you’re interested in learning more about data reliability engineering, register for DRE Con 2023 – the world’s largest virtual gathering of data reliability experts. It will be a day of hands-on demos, talks, and Q&A, emceed by Kelsey Hightower of Google and Kubernetes fame.