Chaos engineering is a growing practice. The idea is to deliberately introduce failure into your systems to test how they cope with it. Simulating failure is one of the most effective ways to learn how to protect your system. The modern conception of chaos engineering started with Netflix and its Simian Army. The Simian Army introduced too much chaos, though, which negatively impacted customers. Failure as a Service was therefore introduced to a wider audience, in a much more controlled manner.
I was lucky enough to speak with Gremlin’s site reliability engineer, Tammy Butow, to learn more about chaos engineering.
What benefits can chaos engineering bring to DevOps/DevSecOps?
I believe the primary benefits of Chaos Engineering are that it enables you to build more resilient systems, upskill your team, and retain and acquire customers.
As we move into the future, reliability and resiliency become even more important than they were in the past. There are many industries that would greatly benefit from Chaos Engineering because having a reliable product is at the core of their business: Technology, Transport, Retail, Construction, Finance, Health, Education, Government and many more.
When I think back to 2008, I didn’t expect to be able to access all of my airline information on my phone and get real-time updates on the status of my flights. 2018 is a completely different world. I now expect to be able to book an emergency flight home to Australia from wherever I am in the world, within 3 hours of the flight time, and get to where I need to go with no hassles. Customers need and expect reliable and resilient systems. If your business isn’t working as well as someone else’s, you will lose customers, but if your business is more reliable than your competitors’, you will acquire customers.
As an engineer, it is the fastest and most effective way I have found to learn the weaknesses of services, which saves you a great deal of time. Breaking things on purpose while minimizing the blast radius of your experiments helps you safely identify and prioritize what you need to fix.
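To illustrate what minimizing the blast radius can mean in practice, here is a minimal Python sketch. It assumes a hypothetical list of host names and a hypothetical error-rate metric; the function names, the `blast_radius` fraction, and the abort threshold are illustrative assumptions, not part of Gremlin's product or any other tool's API.

```python
import random


def pick_targets(hosts, blast_radius=0.1, seed=None):
    """Pick a small random subset of hosts to receive the injected failure.

    blast_radius (hypothetical parameter) is the fraction of the fleet to
    affect. At least one host is chosen whenever the fleet is non-empty.
    """
    if not 0 < blast_radius <= 1:
        raise ValueError("blast_radius must be in (0, 1]")
    if not hosts:
        return []
    rng = random.Random(seed)
    count = max(1, int(len(hosts) * blast_radius))
    return rng.sample(hosts, count)


def should_abort(error_rate, abort_threshold=0.05):
    """Return True when the observed error rate exceeds the agreed limit."""
    return error_rate > abort_threshold


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(20)]  # hypothetical fleet
    targets = pick_targets(fleet, blast_radius=0.25)
    print("Injecting failure into:", targets)
    # In a real experiment this value would come from your monitoring.
    if should_abort(error_rate=0.08):
        print("Error rate too high, halting the experiment.")
```

Starting with a small fraction and a clear abort condition keeps the impact contained while you learn, and the fraction can be widened once the system has shown it copes.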
I’m excited to see several professors across the world focus on Chaos Engineering. I recently went to visit Martin Monperrus, who is a professor at KTH University in Sweden. His research interests include self-healing software and chaos engineering. It’s great to see students understanding that we can use controlled failure injection to build more reliable systems. When they begin their careers as engineers in a few years, they will be off to a great start!
Chaos Engineering has many great benefits for our engineering profession, companies and their customers.
How much preparation should go into running a chaos experiment? Are there any risks involved?
To prepare to run Chaos Engineering experiments, I recommend making sure you have the following:
- High Severity Incident Management
- Monitoring and Observability
- Ability to measure the impact of downtime
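As a rough illustration of the last item, the sketch below turns an availability target into a downtime budget and puts a cost on an outage. The availability target, the 30-day period, and the revenue-per-minute figure are hypothetical inputs chosen for the example; real numbers would come from your own monitoring and business data.

```python
def allowed_downtime_minutes(availability_target, period_days=30):
    """Minutes of downtime permitted in the period for a given target.

    availability_target is a fraction, for example 0.999 for "three nines".
    """
    if not 0 <= availability_target <= 1:
        raise ValueError("availability_target must be between 0 and 1")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_target)


def downtime_cost(downtime_minutes, revenue_per_minute):
    """Estimated revenue lost during an outage (a simple linear assumption)."""
    return downtime_minutes * revenue_per_minute


if __name__ == "__main__":
    budget = allowed_downtime_minutes(0.999)
    print(f"Downtime budget over 30 days: {budget:.1f} minutes")
    # Hypothetical figure: 500 units of revenue per minute.
    print("Cost of a 10-minute outage:", downtime_cost(10, 500))
```

A 99.9% target over 30 days leaves about 43.2 minutes of downtime, which gives a concrete number to compare against the impact of each experiment.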
It’s also important to tell your colleagues across the company that you are embarking on the Chaos Engineering journey. I recommend doing a presentation at your company or department All Hands meeting, being available for open office hours, and sending weekly Slack updates in a #chaosengineering channel or creating email reports. It’s important to be inclusive and take the time to bring everyone on the journey with you.
At Gremlin we think the most impactful way to start running your first Chaos Engineering experiments is to run a GameDay. This is usually a face-to-face or virtual sync where we dedicate a few hours to running Chaos Engineering experiments together. We’ve written about how we run GameDays here: https://www.gremlin.com/community/tutorials/how-to-run-a-gameday/.
Is chaos engineering a useful tool for companies of any size?
Yes, Chaos Engineering is useful for all companies. From new startups to established enterprises, every company will be able to build more resilient systems by practicing Chaos Engineering.
If you are a new startup, you will be less likely to create an unstable foundation. This will improve not only reliability but also engineering productivity and happiness. If you are an enterprise, you will be able to identify the most critical weaknesses that you should focus on fixing first. You can then add these to your engineering and product roadmaps and prioritize them.
It will only take you a few minutes to run your first chaos engineering experiment.