Few terms in the mission critical world are as overused and misused as the nines of availability. The five 9s level has become the uncontested yardstick for measuring reliability, yet the term remains largely misunderstood and is very often used without scientific support. The misunderstanding stems from two issues. First, the level of nines represents the probability of failure, akin to probabilities in games of chance (the odds of an event happening), not the length of the outage. Second, the nines measure relates, at best, to the reliability of the infrastructure, not of the critical load. While nines are a useful tool for comparing system performance and reliability, extrapolating minutes or seconds of "downtime" from them is fundamentally wrong.
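To make the first point concrete, here is a minimal sketch of how a nines level maps to a probability. It assumes the common convention that "N nines" means an availability of 1 - 10^(-N); the article does not state this convention, and the function names are hypothetical.

```python
import math


def availability_from_nines(nines: float) -> float:
    """Availability implied by a nines level, assuming A = 1 - 10**(-nines)."""
    if nines < 0:
        raise ValueError("nines must be non-negative")
    return 1.0 - 10.0 ** (-nines)


def nines_from_availability(availability: float) -> float:
    """Inverse mapping: the number of nines in an availability figure."""
    if not 0.0 <= availability < 1.0:
        raise ValueError("availability must be in [0, 1)")
    return -math.log10(1.0 - availability)


def odds_of_unavailability(nines: float) -> float:
    """The X in '1 in X': odds of finding the system down at a random instant."""
    return 10.0 ** nines


if __name__ == "__main__":
    for n in (3, 4, 5):
        a = availability_from_nines(n)
        print(f"{n} nines: availability {a:.5f}, "
              f"unavailable about 1 time in {odds_of_unavailability(n):,.0f}")
```

Note that the output is a probability, in the same sense as the odds in a game of chance. It says nothing about how long any single outage lasts or how many outages occur.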
The need to assess performance is obvious in a world where continuous infrastructure operation is essential to business. However, few are taking the necessary steps to build the basic model needed to understand such performance expectations.
For our purposes, availability is defined as the probability that a system will function at a future instant in time, continuously, and, presumably, until that time and beyond. Many assumptions and components are involved in this definition. Analysis of the performance of individual subsystems and components, normally expressed as mean time to failure (MTTF) and mean time to repair (MTTR), together with a clear understanding of how they interact, is essential to establishing a reliability evaluation framework. What is often ignored is that such analysis is valid only for components with a constant failure rate. Clearly, the human factor has anything but a constant failure or error rate. And, since infrastructure systems must be serviced regularly without interrupting the flow of power or critical cooling to computers, human error plays a significant role in the operating record of critical facilities, especially in an environment where concurrent maintenance is prevalent. Although computer room power may be restored within a few minutes of a failure, it will take a lot longer, perhaps hours, to return information processing and business processes to normal.
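The role of MTTF and MTTR can be illustrated with a short sketch. It uses the standard textbook relations for a repairable component with a constant failure rate: steady-state availability A = MTTF / (MTTF + MTTR) and survival probability R(t) = exp(-t / MTTF). These formulas, the series combination (which assumes independent subsystems) and all function names are assumptions for illustration; the article does not give them.

```python
import math
from typing import Iterable


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """A = MTTF / (MTTF + MTTR); valid only under a constant failure rate."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def survival_probability(mttf_hours: float, mission_hours: float) -> float:
    """R(t) = exp(-t / MTTF): chance of no failure at all during the mission."""
    if mttf_hours <= 0 or mission_hours < 0:
        raise ValueError("MTTF must be positive and mission time non-negative")
    return math.exp(-mission_hours / mttf_hours)


def series_availability(availabilities: Iterable[float]) -> float:
    """Availability of subsystems that must all work, assuming independence."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


if __name__ == "__main__":
    # Hypothetical figures, not taken from the article.
    mttf, mttr = 100_000.0, 1.0
    print("availability:", steady_state_availability(mttf, mttr))
    print("one-year survival:", survival_probability(mttf, 8_760.0))
```

With these hypothetical figures the component sits at roughly five nines of availability, yet its chance of running a full year without any failure is only about 0.92. The two numbers answer different questions, and neither captures human error, which does not follow a constant failure rate.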
Quantifying risk and correlating risk avoidance with the cost of strengthening the infrastructure is extremely important. However, it is essential to remember that such assessments have limitations. The quest for truly fault-tolerant mission critical information infrastructures is evolving rapidly, with three major trends emerging in the high-availability data center space.
* Reduce the risk of losing utility power to the facility: The events of September 11 have heightened awareness of the vulnerability of utility grids. Self-healing power systems have limitations, and the risks posed by a terrorist attack against the grid, or by a cyber attack against the computer systems that control power system operation, are of great concern. Further, increased ...