Google Site Reliability Engineers: What It’s Like to Recreate a Data Center Disaster

Google asks: what better way to test a backup and disaster recovery plan than to create a disaster and see how long it takes site managers to respond and recover?

Imagine what it’s like to have what may be the coolest job ever: Google Site Reliability Engineer. Much like the hackers that large corporations hire to infiltrate their databases on purpose, Site Reliability Engineers quite literally force entry and wreak havoc on unsuspecting Google divisions. Casey Morgan describes the team’s plan of action:

“The team, which wears super-cool leather jackets with military-inspired patches, runs a simulated war on Google’s infrastructure that they call DiRT (disaster recovery testing). This “war” involves everything from causing leaks in water pipes to staging protests to attempting to steal disks from the servers—whatever it takes to bring down the infrastructure. The data center attacks aren’t real, but they are hard to distinguish from an actual event, even though the SRE team has a little fun by attributing each attack to a fictional event like a zombie, alien, or supernatural attack.”

The magic is in the timing: Google plans a site attack about once per year, and database incident managers have no idea when an event will happen, or whether they are in fact on the DiRT to-do list.

Kripa Krishnan is the Google engineer who heads the annual exercise. She lays down some rules for the attack team: “Do not attempt to fix anything. As far as the people on the job are concerned, we do not exist. If we’re really lucky, we won’t break anything.” Historically, the first report of a simulated attack from Google incident managers around the world has come within five minutes.

“My role is to come up with big tests that really expose weaknesses,” Krishnan says. “Over the years, we’ve also become braver in how much we’re willing to disrupt in order to make sure everything works.”

