Recent Cloud Outages Not for the Faint of Heart
July 30, 2012
By David A. Kelly, Upside Research
A spate of recent cloud outages has raised new concerns about infrastructure vulnerability when it comes to running business-critical services in the cloud. Given the recent push to move Big Data to the cloud (see the previous article on Hadoop), we are at an important inflection point regarding the viability of the public cloud and what enterprises should be doing about it.
Recently, a rash of severe thunderstorms knocked out a number of servers run by Amazon Web Services in the US-EAST-1 region (Virginia) for hours. Notable companies impacted included Netflix, Instagram, and Pinterest. Service was restored within hours for most affected customers. This outage followed another newsworthy outage a few weeks earlier at the same data center, which did similar damage to enterprise confidence in cloud computing. To date, at least one company has defected from Amazon Web Services because of the outages, citing a loss of confidence and damage to its brand image.
Some experts have cautioned enterprises not to overreact to the outages: such events are to be expected, and creating geographically dispersed redundancy will effectively minimize or eliminate the impact of a similar event in the future. Enterprise IT leaders shot back with a fair question: shouldn't the cloud services company be providing this geographic spread of workloads as part of the initial design of a customer's service?
The answer lies somewhere in between, but ultimately rests at the feet of the enterprise. It is important for enterprises to pick a cloud services provider based on the redundancy and systems the provider has in place to prevent disruptions, and on the processes it uses to restore service as quickly as possible when unexpected events occur. But the enterprise also has a role. Just as enterprises have to create disaster recovery plans for all their major systems, they need to create a similar plan for the event that their outside service providers suffer hiccups or service outages beyond their control.
A useful exercise for any enterprise weighing the reliability of public cloud services like those provided by Amazon Web Services: what would the outcome have been if a similar event, such as severe thunderstorms, had hit your internal or private cloud? Would it have impacted service? How quickly could you have restored it? Hopefully you have a recovery plan in place that would have restored service quickly. And if you discovered that restoration took longer than hoped, you made changes to your infrastructure to prevent future outages.
One of the biggest lessons these outages can teach businesses that choose to run their most critical applications on the public cloud is to build in resiliency from the start. HP uses the phrase "The Resilient Enterprise" to describe a business that architects its critical services to withstand failures of all kinds. In light of the recent public cloud outages, this means companies should seriously consider a multi-vendor approach to cloud services, spanning cloud providers, hardware platforms, data, and software. Building in resiliency at each of these points will keep an unforeseen failure from bringing down the business, which is what matters most.
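At its simplest, the multi-vendor resiliency described above boils down to keeping an ordered list of independent providers and failing over when the preferred one goes dark. The sketch below illustrates the idea only; the provider endpoints and health-check logic are hypothetical placeholders, not any vendor's actual API.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first endpoint that passes its health check, else None."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None

# Ordered by preference: primary cloud region first, an independent
# second-vendor fallback after it (names are illustrative).
ENDPOINTS = [
    "primary.aws-us-east-1.example.com",
    "fallback.other-provider-west.example.com",
]

if __name__ == "__main__":
    # Simulate the primary region going dark during an outage.
    health = {
        "primary.aws-us-east-1.example.com": False,
        "fallback.other-provider-west.example.com": True,
    }
    print(pick_endpoint(ENDPOINTS, lambda ep: health[ep]))
```

In a real deployment the health check would be an actual probe (an HTTP ping or a DNS failover policy) rather than a dictionary lookup, but the design point is the same: the failover decision lives with the enterprise, not with any single provider.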
It is up to each of the service providers (cloud services, the database provider, the server manufacturer) to do their job by delivering the goods as promised, including meeting any agreed-upon service levels. But it is up to the enterprise to ensure that a layer of resiliency is built in, even if it increases cost, so that if one of the providers fails, the entire business does not stop.