Amazon Elastic Compute Cloud (EC2) services were disrupted for several days last April. Thousands of companies lost their services completely during this period. There is no doubt that Amazon, the leading cloud service provider, has the expertise and good practices to implement, operate and maintain a highly reliable and available service. The incident might have resulted from a chain of unprecedented events, or it might simply have been a matter of human error. Regardless of the cause, Amazon has surely learned its lesson from this incident and will probably improve its service even further in the coming months. On the other hand, what lessons should a cloud customer learn from this incident? I sought the answers between the lines of the "Summary of the Amazon EC2 and Amazon RDS Service Disruption" report published by the Amazon Web Services Team.
1. Human errors are always possible, even within a very experienced team and with a well-defined change process: The Amazon incident report states that the outage began when traffic was incorrectly redirected onto the low-capacity secondary network, while the intention was to shift it to another router on the primary network.
2. The cloud may not cope with additional, unusual or unexpected demand: When the error made during the network change caused many EC2 instances to demand more storage capacity at the same time, the system could not cope with the additional demand, and volumes became "stuck" when read or write operations were attempted on them. The same situation could result from simultaneous demand by multiple customers during normal operation (this may be a debatable argument, but consider a nationwide event driving consumers to use services at the same time; capacity is not unlimited).
3. Redundancy may not function as designed: Even customers who had signed up to use multiple availability zones experienced disruptions during this incident. The report attributes this to "a previously un-encountered bug" in which failover of the primary database replica to the backup availability zone failed, leaving the primary replica in an isolated state. Although the system was designed to be robust against failures, redundant components may not behave consistently, as in this case, where a degraded network disrupted storage replication and brought the whole system down.
4. Days of disruption are still possible despite high availability rates: Amazon cloud services advertise very good availability rates of many consecutive nines (99.99...). A customer may receive service without a single disruption for years, yet a single incident can disrupt the service for days. A service contract should take the measurement period into account for these kinds of issues.
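The arithmetic behind this point is worth spelling out. As a minimal sketch (the availability target, the 30-day month and the 3-day outage below are illustrative assumptions, not Amazon's actual SLA terms), a few lines of Python show how the measurement period changes what an availability figure permits:

```python
# Illustrative figures only -- not Amazon's SLA terms.
HOURS_PER_YEAR = 365 * 24    # 8760
HOURS_PER_MONTH = 30 * 24    # 720, assuming a 30-day billing month

def allowed_downtime_hours(availability: float, period_hours: float) -> float:
    """Downtime budget implied by an availability target over a period."""
    return (1 - availability) * period_hours

def measured_availability(outage_hours: float, period_hours: float) -> float:
    """Availability actually observed over a period containing one outage."""
    return 1 - outage_hours / period_hours

# A 99.99% target allows under an hour of downtime per year...
print(round(allowed_downtime_hours(0.9999, HOURS_PER_YEAR), 2))  # 0.88 hours
# ...but a hypothetical 3-day outage drags even the yearly figure to ~99.18%...
print(round(measured_availability(72, HOURS_PER_YEAR), 4))       # 0.9918
# ...and measured over just the affected month, the same outage means 90%.
print(round(measured_availability(72, HOURS_PER_MONTH), 4))      # 0.9
```

The asymmetry is the contractual point: a provider measuring availability per year can absorb a multi-day outage far more easily than one measured per month, so the measurement window belongs in the contract terms.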
In conclusion, cloud services offer the know-how and expertise of efficient and effective IT service procurement, but moving to the cloud is not a magic wand. A company procuring IT services from the cloud still needs to keep strategic capabilities in house to manage its IT. A risk-based approach should always be adopted when sourcing an IT service in any form of cloud offering. Good practices for supplier management, contract management and service-level management should be developed internally in order to manage cloud services as needed. The level of transparency that cloud service providers offer about their business and operational practices is also important for performing proper risk assessments and making sound transformation decisions. The operating model of the cloud service provider should be well understood. Then, internal processes, organizational structures and technical expertise should be established, and performance metrics and related service standards should be defined and included in contract terms as required. Data portability and interoperability should be considered up front as part of the risk management process and the security assurance of any cloud program.