AWS Outage: How to survive your cloud provider outage?

Amazon’s AWS suffered a massive outage on September 20, 2015 in its Northern Virginia region, commonly referred to as US East-1. The outage lasted close to seven hours and disrupted many popular services, including Heroku, Netflix, Reddit, AirBnB, Tinder, Medium, IMDB, Product Hunt, Echo and a host of others. The root cause was a failure of DynamoDB (Amazon’s NoSQL database), which set off a domino effect that took down other Amazon services such as CloudWatch, Cognito, SES, SQS, and SWF. This is what Amazon had to say about the incident:

Between 2:13 AM and 8:15 AM PDT we experienced high error rates for API requests in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.

AWS US East-1 Sept 2015 Outage

Downtime is a reality, whether you are hosted in the cloud or in your own data center. The key to ensuring that an outage doesn’t hurt your brand, your customers and your bottom line is to have a mitigation plan for it. Your mitigation plan may include:

  1. Monitoring: If you haven’t done so already, start monitoring your service. Commit to detecting issues with your website/service before your customers tell you about them. When setting up a monitor, consider two key factors:
    1. External: Choose an external monitoring service that is separate from your cloud provider.
      1. Separation helps you catch issues that you would miss if your application and your monitoring service were co-located within the same cloud service. A co-located monitor can report a false negative when routing to your data center is broken, because it can still reach the application while your users cannot.
      2. Separation also avoids the case where the monitoring service itself (CloudWatch in this case) goes down during an outage in the region.
    2. User Location: Choose geographical locations that best represent where your users are. This will help you detect outages that are localized rather than global (the US East coast in this case).
  2. Fail-over: Plan for redundancy with automated or semi-automated fail-over. The low-hanging fruit here is to come up with a strategy and outline it in a Business Continuity Plan (BCP) document. At a minimum, the document should contain the steps to restore the service to normal in the event of a major downtime. In parallel, architect and design your application for failure. Your solution should take your specific business needs into account and provide redundancy and fault tolerance for your mission-critical applications. Options include geographical redundancy across multiple availability zones or regions, and/or using multiple cloud providers.
  3. Status Page: A growing number of businesses now use a status page to share information about downtime and the current state of their application. Amazon, Heroku and Reddit all modeled this best practice during the recent outage.
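
To make the monitoring and fail-over steps more concrete, here is a minimal Python sketch of an external health check that could run outside your cloud provider. It is an illustration, not a finished tool: the function names (`http_probe`, `is_down`, `pick_endpoint`), the three-attempt threshold and the example URLs in the note below are all assumptions made for this example.

```python
import http.client
import urllib.request
from typing import Callable, Optional, Sequence


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # Covers connection errors, timeouts, HTTP 4xx/5xx (HTTPError is an
        # OSError subclass), malformed URLs and broken HTTP responses.
        return False


def is_down(url: str,
            probe: Callable[[str], bool] = http_probe,
            attempts: int = 3) -> bool:
    """Report an endpoint as down only if every one of `attempts` probes fails.

    Retrying before alerting keeps a single dropped request from paging anyone.
    """
    return not any(probe(url) for _ in range(attempts))


def pick_endpoint(endpoints: Sequence[str],
                  probe: Callable[[str], bool] = http_probe,
                  attempts: int = 3) -> Optional[str]:
    """Return the first endpoint, in priority order, that is not down.

    Returns None when every endpoint fails, which is the signal to follow
    the manual steps in your BCP document and update your status page.
    """
    for url in endpoints:
        if not is_down(url, probe, attempts):
            return url
    return None
```

Calling `pick_endpoint(["https://us-east.example.com/health", "https://us-west.example.com/health"])` with hypothetical URLs like these returns the first endpoint that responds. What you do with the result (updating DNS, switching a load balancer, paging an engineer) depends on your own fail-over strategy. Running the same check from several geographic locations helps catch localized outages.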

We recommend a comprehensive approach to downtime mitigation with real-time monitoring, redundancy and transparency.
