AWS outage: “almost the whole internet is in trouble”

“When AWS flatlines like that, almost the whole internet is in trouble.”

That’s the conclusion of Business Insider in an article about the AWS outage that occurred Tuesday 28th February.

What an utterly ridiculous statement.

At the time of writing this blog (Thursday 2nd March), details are a bit thin on the ground as to the cause and the full extent of the problem. To put things into perspective, though, what we do know is this: one AWS service in one AWS region had its performance degraded so badly that it was unusable.

Certainly the service in question – Simple Storage Service known as S3 – is one of AWS’s biggest and most widely used services, and the US-East-1 (N. Virginia) region is by far the biggest region, but to say almost the whole internet is in trouble is a gross overstatement.

AWS is not the first cloud provider to suffer a major outage, and neither is this the first time it’s happened to AWS. Amongst some people though, there does seem to be a belief that cloud is somehow always-on, always available and doesn’t go down. Perhaps that misconception is down to industry-wide marketing hype.

The reality, of course, is that things fail. On your drive to work you could get a flat tyre and have to stop and change it, which makes you late for work. When complex technology fails it often affects more people and takes longer to resolve; it doesn’t matter whether it’s cloud or on-premises, technology failures happen and affect large groups of people. In 2012 the British bank RBS experienced a serious and prolonged outage caused by the failed upgrade of a mainframe component, which affected millions of RBS customers. That had nothing to do with cloud; it happened in the traditional old-world environment of mainframes.

Of course, the aggregated nature of cloud services means that more organisations will be affected by a failure within a cloud services provider, and the same is true – albeit in a different way – for co-located services.

What people need to remember is that the same architectural principles you’d use with on-prem, co-located or fully outsourced services apply equally to cloud. Indeed, one of the five pillars of the AWS Well-Architected Framework (a must-read for anyone running services on AWS) is dedicated to reliability. In that whitepaper they state “The Reliability pillar includes the ability of a system to recover from infrastructure or service disruptions”, and of Amazon’s 9 key reliability questions, question 2 is “How are you planning your network topology on AWS?”. Clearly, placing everything into one region is not planning for resilience.

Ironically, whilst you run the risk of getting caught up in a widespread outage through using cloud services, those same cloud services provide you with far greater resilience capabilities than would ever be possible on-premises. All of the building blocks are there should you choose to use them: S3 has cross-region replication features which can be used for backup and redundancy, CloudFormation can be used to quickly build out infrastructure in alternative regions, and the list goes on.
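To make that a little more concrete, here is a minimal sketch of the read side of a multi-region design: try the primary region first and fall back to a replica if it fails. It is an illustration under stated assumptions, not a production pattern. The names `fetch_with_failover`, `AllRegionsFailed` and `fake_fetch` are made up for this example, and the `fetch` callable stands in for whatever client you would really use to read an object that has already been replicated to a second region.

```python
from typing import Callable, Dict, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when no region in the list could serve the request."""


def fetch_with_failover(
    key: str,
    regions: Sequence[str],
    fetch: Callable[[str, str], bytes],
) -> Tuple[str, bytes]:
    """Try each region in order and return (region, data) from the first success.

    `fetch(region, key)` is a hypothetical callable supplied by the caller;
    in a real system it might wrap a storage client bound to that region.
    """
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return region, fetch(region, key)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors[region] = exc
    raise AllRegionsFailed(f"{key!r} unavailable in all regions: {errors}")


def fake_fetch(region: str, key: str) -> bytes:
    """Stand-in fetcher that simulates an outage in one region (assumption)."""
    if region == "us-east-1":
        raise ConnectionError("simulated regional outage")
    return f"{key} served from {region}".encode()


if __name__ == "__main__":
    region, body = fetch_with_failover(
        "logo.png", ["us-east-1", "eu-west-1"], fake_fetch
    )
    print(region, body)
```

The ordering of the region list is the design decision here: the primary goes first and the replica is only touched when the primary fails. A real implementation would also want timeouts, retries and some thought about how stale the replica is allowed to be.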

AWS go to great lengths to promote the shared responsibility model for security; I’d add that resilience is also a shared responsibility. Cloud providers will do their part to ensure availability, but failures do occur, and it’s your responsibility to ensure that you’ve understood and met the resilience needs of your application by using the wide range of services available.

In the enterprise technology world, which is my background, building resilient applications is the norm. All of that knowledge is publicly available, and the principles are almost entirely applicable to cloud. That, of course, provides the answer to this comment from web monitoring firm Apica, as quoted in the same Business Insider article: “Yet Apple, Walmart, Newegg, Best Buy, Costco, and surprisingly Amazon/Zappos were not affected by the outage”.