Exploration of the Facebook Outage and What Enterprises Can Do Better
Facebook, Instagram, and WhatsApp experienced a sudden outage on September 3rd from 4:40 to 5:40 EST. All three applications are owned by Facebook but run on different clouds. So, what could have caused this?
To gain a deeper understanding, we spoke with Archana Kesavan, senior marketing director at ThousandEyes. ThousandEyes offers monitoring services in a variety of solution areas. Archana stated:
“Given the commonality in ownership of the three applications, it is possible that a common API or microservice was at the root of the service disruption. This outage illustrates how complex interdependencies at a micro-service level could easily cause widespread disruptions even across separate services running on wholly different hosting infrastructures.”
We further chatted with Archana to figure out what could have been done differently, if anything, and what enterprises should do to avoid a similar scenario.
How was the Facebook outage solved?
Our data did not reveal any ISP or external network issue. Given the commonality in ownership in the three applications, it is possible that the root cause was an internal API or microservice.
Do complex micro-service dependencies complicate root cause analysis?
Yes. Modern application designs allow the flexibility to locate microservice components anywhere. This creates a complex mesh of network paths and inter-service dependencies that all need to work synchronously to deliver an optimal user experience. When application performance degrades, it can be challenging to determine which of these network paths or microservice components is the root cause.
Complex interdependencies at a microservice level can easily cause widespread disruptions, even across separate services running on wholly different hosting infrastructures.
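To make the shared-dependency idea concrete, here is a minimal Python sketch that walks a dependency map and reports the components every affected application relies on, directly or indirectly. The service names and the map itself are hypothetical and are not drawn from Facebook's actual architecture; in practice the map would have to come from your own service inventory or tracing data.

```python
from typing import Dict, Iterable, List, Set


def transitive_deps(graph: Dict[str, List[str]], service: str) -> Set[str]:
    """Return every component that `service` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(graph.get(service, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited; this also guards against cycles
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen


def shared_deps(graph: Dict[str, List[str]], services: Iterable[str]) -> Set[str]:
    """Return the components that all of the given services depend on."""
    common = None
    for service in services:
        deps = transitive_deps(graph, service)
        common = deps if common is None else common & deps
    return common or set()


# Hypothetical dependency map: three apps on three different clouds
# that all authenticate through one internal service.
DEPENDENCIES = {
    "photo-app": ["media-api", "cloud-a"],
    "chat-app": ["messaging-api", "cloud-b"],
    "social-app": ["feed-api", "cloud-c"],
    "media-api": ["auth-service"],
    "messaging-api": ["auth-service"],
    "feed-api": ["auth-service"],
}

if __name__ == "__main__":
    apps = ["photo-app", "chat-app", "social-app"]
    print(sorted(shared_deps(DEPENDENCIES, apps)))
```

When several applications fail at the same moment, the components in this intersection are reasonable first candidates to investigate, even though the applications share no hosting infrastructure.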
What preventative measures can enterprises take to avoid unexpected outages?
The best way to protect yourself from unexpected outages is to diversify across multiple service providers for critical services, such as DNS, to minimize single points of failure. The same principle applies to connectivity: invest in more than one ISP as your upstream provider to connect your data centers.
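As a rough illustration of checking for a DNS single point of failure, the Python sketch below groups a zone's nameserver hostnames by provider and flags the zone when only one provider is present. It assumes you already have the list of NS hostnames (for example, copied from your registrar), and it uses a deliberately naive heuristic, the last two labels of each hostname, to guess the provider. The hostnames shown are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def provider_of(nameserver: str) -> str:
    """Guess the provider from a nameserver hostname.

    Assumption: the last two DNS labels identify the provider. This is a
    simplification and is wrong for suffixes such as "co.uk".
    """
    labels = nameserver.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def group_by_provider(nameservers: Iterable[str]) -> Dict[str, List[str]]:
    """Group nameserver hostnames under their guessed provider."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ns in nameservers:
        groups[provider_of(ns)].append(ns)
    return dict(groups)


def has_single_point_of_failure(nameservers: Iterable[str]) -> bool:
    """Return True when fewer than two distinct providers serve the zone."""
    return len(group_by_provider(nameservers)) < 2


if __name__ == "__main__":
    # Placeholder hostnames, not real providers.
    single = ["ns1.dns-one.example.", "ns2.dns-one.example."]
    mixed = ["ns1.dns-one.example.", "ns1.dns-two.example."]
    print(has_single_point_of_failure(single))
    print(has_single_point_of_failure(mixed))
```

The check only looks at names, so it cannot tell whether two differently named providers share the same underlying network; it is a starting point for an audit, not a substitute for one.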
Secondly, discover and understand your dependencies in the world of the cloud so you can better de-risk your deployments. The Internet is made up of many hidden dependencies, any of which can impact your ability to connect to sites and services—even if you don’t have a direct relationship with an affected ISP.
Third, arm yourself with visibility into Internet connectivity and performance so that you can minimize the hidden dependencies that pose significant risks to your organization. It is absolutely critical to have visibility into the networks your traffic is touching.
What do you think will cause the most outages going forward?
Many things can be responsible for future outages, including natural disasters, attacks or even simple human errors. We’re seeing now how companies are implementing disaster preparedness plans to ensure their customers are not impacted during Hurricane Florence.
Attacks are also a major cause of outages. DNS is a fragile infrastructure that is often overlooked and has been a target for major attacks. Past DNS attacks, such as the one on Dyn, have had a huge blast radius, causing widespread outages and a devastating impact on businesses. BGP is another weak point in the fabric of the Internet that has been subject to attacks such as the Amazon Route 53 BGP hijack earlier this year.
User error, such as “fat fingering,” can also result in outages, as can internal misconfigurations or infrastructure failures with symptoms that manifest themselves on the network layer. User error was the likely cause of the AWS S3 outage.
How important is visibility during an outage?
Visibility is everything. As organizations rely more on cloud services and the Internet, the network has become a black box they can’t understand. During an outage, companies need to be able to pinpoint network dependencies and faults with clear insight, even for networks, services and apps that their teams don’t own or control. This is imperative to gain an accurate understanding of how the network impacts their applications, users and customers.