General | 11.10.2012

Zero outage computing in digital clouds

By Roelof Louw, Cloud Expert at T-Systems in South Africa

The cloud is everywhere. And it is the main topic of discussion at IT conferences and trade shows. Nevertheless, a number of business enterprises are still sceptical when it comes to security and availability requirements in cloud environments. Cloud providers are responding to these worries with the zero outage strategy.

The seriousness of the matter became evident during CeBIT in March 2012: Facebook suffered a major outage and was unavailable for hours. Millions of users worldwide could not access the social network due to technical problems. Today mobile applications for smartphones and tablets are also at risk.

Outages of this magnitude can be very costly. In 2010 the Aberdeen Group surveyed 125 enterprises worldwide and discovered that outages of just a few minutes per year can cost an average of USD 70,000. Surprisingly, only four percent of the businesses surveyed had guaranteed IT availability of 99.999 percent. This should be unsettling, especially since experts claim that one hour of downtime in production costs some USD 60,000, and for an online shop the figure is USD 100,000. Banks are at the top of the list. They can lose up to USD 2.5 million in one hour of downtime.

Zero outage is only possible in private clouds
To win the trust of cloud sceptics despite these kinds of worst-case scenarios, external data centre operators are striving to manage their IT systems consistently according to a zero outage principle. This includes high availability of services, which, according to a definition by the Harvard Research Group, means that systems should be running at an availability level of 99.999 percent – that translates into a maximum of roughly five minutes of total downtime per year. The only exceptions to the principle of “zero outage computing” are agreements made with customers that govern new releases, updates or migrations. But are such high levels of availability realistic, and if so, how can they be achieved and maintained?
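
As a rough illustration of what “five nines” means in practice – a back-of-the-envelope calculation, not part of the Harvard Research Group definition – the annual downtime budget can be worked out directly from the availability figure:

```python
# Back-of-the-envelope downtime budget for a given availability level.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a (non-leap) year

def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum minutes of downtime per year allowed by an availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

print(downtime_minutes_per_year(99.999))  # ~5.26 minutes per year ("five nines")
print(downtime_minutes_per_year(99.9))    # ~525.6 minutes, almost nine hours
```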

Those attempting to provide the perfect cloud must be able to discover errors or failures before they arise – and take every technical step possible to prevent them from occurring. What’s more, the cause of every possible failure must be carefully analysed. It is worth noting that more outages result from software issues than from problems in the cloud architecture itself. There are also inherent differences between deployment models: users should not expect zero outages in the public cloud, which by nature runs on the public Internet and is susceptible to downtime. The trade-off is that many public cloud services come at no charge – you can have almost limitless gigabytes of storage capacity without paying for it, but you will have to do without support services.

Multiple security
But things are very different in the private cloud: using their own individually designed end-to-end network solutions, providers can guarantee high availability if their ICT architectures are built for fault resilience and transparency, with integrated failure prevention functions and constant monitoring of operations and network events. Intelligent, self-healing software is also essential, enabling rapid, automatic recovery in critical situations without any manual intervention, so that users can continue working without noticing any interruption.
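
As a minimal sketch of that self-healing idea – with hypothetical helper names, not any specific monitoring product – a watchdog loop might poll a health check and trigger an automatic restart instead of waiting for an engineer:

```python
import time

# Hypothetical self-healing loop: check_health() and restart_service() stand in
# for whatever the real monitoring and orchestration stack provides.
def watchdog(check_health, restart_service, interval_seconds: float = 5.0) -> None:
    """Poll a health check and recover automatically, without manual intervention."""
    while True:
        if not check_health():
            restart_service()        # automatic recovery step
        time.sleep(interval_seconds)

# Example (hypothetical callables):
# watchdog(lambda: billing_service_is_up(), lambda: restart("billing"))
```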

One example of high fault resilience is a RAID (Redundant Array of Independent Disks) system. RAID systems automatically mirror identical data in parallel on two or more separate storage media. If one system fails, this has no impact on the availability of the overall environment, because the mirrored systems continue running without interruption; the user is completely unaware of any issue. In addition, RAID configurations have early warning systems, and most incidents are corrected automatically without support from a service engineer.
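
A simplified sketch of the mirroring principle behind such a setup (an in-memory illustration with invented Disk and MirroredArray classes, not a real storage driver): every write goes to all replicas, so a read can still be served from a surviving copy when one of them fails.

```python
# RAID 1 style mirroring in miniature: writes go to every replica,
# reads fall back to any replica that is still healthy.
class Disk:
    def __init__(self, name: str):
        self.name = name
        self.blocks = {}
        self.failed = False

    def write(self, block_id: int, data: bytes) -> None:
        if not self.failed:
            self.blocks[block_id] = data

    def read(self, block_id: int) -> bytes:
        if self.failed:
            raise IOError(f"{self.name} has failed")
        return self.blocks[block_id]

class MirroredArray:
    def __init__(self, disks):
        self.disks = disks

    def write(self, block_id: int, data: bytes) -> None:
        for disk in self.disks:      # mirror the same block to every disk
            disk.write(block_id, data)

    def read(self, block_id: int) -> bytes:
        for disk in self.disks:      # serve the read from the first healthy disk
            if not disk.failed:
                return disk.read(block_id)
        raise IOError("all mirrors have failed")

array = MirroredArray([Disk("disk-a"), Disk("disk-b")])
array.write(0, b"customer record")
array.disks[0].failed = True         # one disk fails...
print(array.read(0))                 # ...the mirrored copy still serves the data
```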

A so-called single point of failure (SPoF), however, is especially critical for the overall IT environment. SPoFs are individual storage, computing or network elements, installed only once in the system, whose failure can shut down operations completely. Since mirroring these components is relatively expensive and complex, some IT providers do not install mirrored configurations – and that is extremely risky. With zero outage, this risk must also be eliminated. Zero outage likewise means safeguarding the data centre against power failure through the use of an uninterruptible power supply (UPS).

If an application fails, however, there will be a processing gap – for example in the form of lost transactions – no matter how quickly operations are shifted to an alternate system. The environment must be able to close this gap automatically by replaying, once the shift to the alternate system is complete, all of the processing steps that were skipped.
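
One way to picture closing that gap is a durable log of accepted transactions that the alternate system replays after taking over; the sketch below assumes such a log and a hypothetical apply callback rather than any particular middleware.

```python
# Replaying a processing gap after failover: every accepted transaction is
# appended to a durable log before it is processed, so the alternate system can
# re-apply anything that was accepted but not yet completed when the failure hit.
def replay_gap(transaction_log, last_completed_id: int, apply) -> None:
    for txn in transaction_log:
        if txn["id"] > last_completed_id:   # step skipped by the failed system
            apply(txn)                      # repeat the missing processing step

# Example: the failed node had completed everything up to transaction 41.
log = [{"id": 40, "op": "debit"}, {"id": 41, "op": "credit"}, {"id": 42, "op": "debit"}]
replay_gap(log, last_completed_id=41, apply=lambda txn: print("replaying", txn))
```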

Data protection is just as important
In the South African market, the urgency of mitigating the risk of data loss or leakage is evident from the demand for full disaster recovery and failover capabilities in solutions. In many cases organisations look to cloud providers for an IT business continuity solution. The answer, however, is not to use cloud services merely for disaster recovery, but to source cloud solutions with disaster recovery capabilities engineered in.

The same awareness of data protection is seen in the regulatory developments applicable to sourcing IT services, and cloud services in particular. In the South African market, the Protection of Personal Information Act (POPI) and King III are becoming major considerations when sourcing IT services and looking for a provider that complies with the relevant acts or frameworks. In addition, existing certifications such as ISO 27001 and Sarbanes-Oxley (SOX) / Statement on Auditing Standards 70 (SAS 70) compliance should be mandatory criteria when choosing a cloud service provider that views data protection as a critical part of its solution.

Quality needs dedicated employees
Cloud providers must make sure that their employees adhere to the same standards and processes at all locations, even across multiple time zones. Studies indicate that more than 50 percent of all outages are the result of human error. That is why training focuses on quality management as an integral element of company culture. This approach requires a central training plan, globally standardised manuals and comprehensive information from top management.

Every employee must do everything possible to prevent a potential failure or incident from even happening. And that also means having an understanding of what causes outages. They should act in accordance with the old saying “fire prevention is better than firefighting.” If the worst case should ever occur, employees must not be afraid to admit their mistakes, so that they can avoid making them again in the future. It is also vital to have a centrally organised specialist team that is ready to go into action, finding solutions to problems that arise unexpectedly and implementing these solutions throughout the enterprise. When faced with a serious outage, the shift manager can quickly call the team together to begin the recovery process. Employees working at the affected customer site can follow the action being taken via a communications system.

Quality management is an ongoing process that ensures the required knowledge is systematically updated and expanded. It will never be possible to fully guarantee zero outages in cloud processes – not even the best in class can do that – but system availability of 99.999 percent and beyond can be achieved. Businesses can make sure of this by concluding appropriate service level agreements with their providers.
