Archive for February, 2010

Nines

2010/02/05

99.999999999…%

When one computes availability as the percentage of time a system is fully functional, one typically gets a number such as 99.987%. Systems are often coarsely classified by the number of leading nines in this number (here, three nines). Sometimes one hears vernacular such as “almost four nines”, which would be appropriate for this example.

Going from a system rated at, say, three nines to one rated at four nines usually involves considerable effort and expense, because the expected (average) amount of downtime must be reduced by a factor of 10. It is far easier, however, to go from two nines to three nines than it is to go from four nines to five nines: each additional factor of 10 is harder to achieve than the last. The following table for a 7 by 24 system indicates why:

Availability   Nines         Expected downtime per year
.99            two nines     88 hours
.999           three nines   8.8 hours
.9999          four nines    53 minutes
.99999         five nines    5.3 minutes
.999999        six nines     32 seconds

If a business, due to holidays and weekends, is only open 200 days per year, and its systems only need to operate 8 hours per business day, then the problem is somewhat easier: there are only 1600 hours per year during which the system must be fully functional. The system can be taken down for maintenance and repairs during off hours without affecting the availability rating. The downtime table becomes:

Availability   Nines         Expected downtime per year
.99            two nines     16 hours
.999           three nines   1.6 hours
.9999          four nines    9.6 minutes
.99999         five nines    58 seconds
.999999        six nines     5.8 seconds
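
For readers who want to check these figures, here is a minimal sketch (in Python, not part of the original post; the function names and output format are ours) that reproduces both tables from the availability values:

    def downtime_per_year(availability, hours_per_year):
        """Expected downtime in hours, given availability as a fraction."""
        return hours_per_year * (1.0 - availability)

    def pretty(hours):
        """Pick a readable unit for a downtime figure."""
        if hours >= 1.0:
            return f"{hours:.1f} hours"
        minutes = hours * 60.0
        return f"{minutes:.1f} minutes" if minutes >= 1.0 else f"{minutes * 60.0:.1f} seconds"

    # 8766 critical hours for a 7 by 24 system; 1600 for 8 hours, 200 days.
    for label, hours_per_year in [("7 by 24", 8766), ("8 hours, 200 days", 1600)]:
        print(label)
        for a in (0.99, 0.999, 0.9999, 0.99999, 0.999999):
            print(f"  {a}  ->  {pretty(downtime_per_year(a, hours_per_year))}")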

For a business operating 8 hours per day, 200 days per year, it is easier and less expensive to design the system to have a very low downtime DURING BUSINESS HOURS, since maintenance and many repairs can be done during off hours.

Note that “Expected Downtime” is an average that must be taken over several years and over many similar systems. It is definitely NOT a maximum downtime! Typical “nines” ratings also tend not to include disasters such as fires, earthquakes, acts of war, etc. Designing for these contingencies requires special consideration, with backup and failover to remote sites.

High Availability

2010/02/04

If your business loses money when your network (or other system) is down, then you need a system with an appropriate level of high availability.

Availability is the percent of time that a system is expected to be fully functional. A typical, and not very good, value for the availability of a system might be 99% (two nines). Now, a year has approximately Y = 8766 hours in it; thus, a system that is expected to be operational 24 hours per day, 7 days per week, and that has 99% availability would be expected NOT to be fully functional for 87.66 hours every year! If a business loses $1,000 per hour when this system is down, then this translates to an expected loss of $87,660 annually due to system downtime. Of course, some businesses lose less per hour, but some lose much more! A system is (fully or partially) “down” if it is not fully functional. A first step in analyzing your system is to understand your causes of downtime.
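
As a quick sanity check on the arithmetic above, here is a tiny Python sketch (not from the original post; the $1,000-per-hour figure is the illustrative value from the text, not real data):

    HOURS_PER_YEAR = 8766        # a 7 by 24 system
    availability = 0.99          # "two nines"
    loss_per_down_hour = 1_000   # dollars; illustrative value from the text

    expected_downtime = HOURS_PER_YEAR * (1 - availability)        # about 87.66 hours
    expected_annual_loss = expected_downtime * loss_per_down_hour  # about $87,660

    print(f"Expected downtime: {expected_downtime:.2f} hours per year")
    print(f"Expected annual loss: ${expected_annual_loss:,.0f} per year")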

A business that, due to holidays, weekends, etc., only operates, say, 200 days per year and only needs its network during 8 business hours per day has only Y = 1600 critical hours in a year. System maintenance and repairs can be done during off hours, which typically RAISES the availability of its systems. One can design appropriately high-availability systems for such businesses at a much lower cost than for 7 by 24 systems.

We should point out that “expected” downtime and “expected” losses are averages taken over several years and over many similar systems. For a given system, they may well be more in one year and less in another year.

One approach to availability evaluations is to run two models. The first model estimates the true cost of downtime for your systems, and the second computes the availability of your systems. Together, as in the example above, they give the expected annual loss due to downtime. When this number is high, it makes sense to upgrade your systems to improve their availability and thereby lower the expected annual loss. By modeling the proposed changes, you maximize the return on your investment in higher availability.
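
One way to picture the two models working together is the sketch below (Python; every number in it is a made-up illustration, not a figure from this post): model 1 supplies the cost of a down hour, model 2 supplies the availability, and the comparison tells you whether a proposed upgrade pays for itself.

    def expected_annual_loss(availability, critical_hours, cost_per_down_hour):
        """Model 1 (cost of downtime) times model 2 (expected downtime)."""
        return critical_hours * (1 - availability) * cost_per_down_hour

    current  = expected_annual_loss(0.99,   8766, 1_000)    # about $87,660
    proposed = expected_annual_loss(0.9999, 8766, 1_000)    # about $877
    upgrade_cost_per_year = 20_000                           # hypothetical upgrade cost

    savings = current - proposed
    print(f"Expected annual savings: ${savings:,.0f}")
    print(f"Upgrade pays for itself: {savings > upgrade_cost_per_year}")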

Downtime Costs

2010/02/02

What does your downtime really cost? The cost of downtime for a critical business system such as a network breaks down into a number of components.

  • Lost revenue
  • Wasted or ruined materials
  • Lost information or data
  • Loss of reputation as a reliable business partner
  • Cost of inefficient or idle staff
  • Cost of staff to handle repairs
  • Cost of unused facilities
  • Repair costs
  • Insurance premium increases

These add up differently for different businesses and for different systems. Here is a table whose values we have averaged and estimated from various sources for illustration purposes only.

System                         Estimated lost revenue per hour
Airline reservation system     $80,000
Bank ATM                       $15,000
Catalog Sales                  $90,000
Credit Card Authorization      $2,600,000
Dell (see below)               $700,000
F100 Network Connection        $30,000
Package Shipping Service       $28,000
Stock Brokerage Firm – large   $3,000,000

Dell’s number was computed by taking their estimated 2002 revenue of $31B, dividing by 8766, and estimating that 20% of the sales expected during an hour of downtime would never occur at a later time. It does not account for any of the other factors above. Dell does not report the expected annual downtime of its systems, but it is estimated to be on the order of a few minutes per year. Dell’s systems handle 15,000 simultaneous users at peak times; thus, an outage at such a time would be very expensive even on a per-minute basis.
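
That back-of-the-envelope calculation can be written out as follows (a sketch using only the figures quoted above; the variable names are ours):

    annual_revenue = 31e9             # Dell's estimated 2002 revenue, from the text
    hours_per_year = 8766
    fraction_never_recovered = 0.20   # sales in a down hour assumed lost for good

    lost_revenue_per_hour = annual_revenue / hours_per_year * fraction_never_recovered
    print(f"${lost_revenue_per_hour:,.0f} per hour")   # roughly $700,000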

Availability Modeling

2010/02/01

Availability is the percent of time that a system is fully functional. This is a calculation involving averages, and it can be envisioned as follows: there is an average time to failure, that is, the system runs fine for a while, then some aspect of it fails. Sometimes a system runs fine for a long time, and sometimes it fails after a short time. The average time to failure, measured over a huge number of similar systems, is called the Mean Time To Failure, or MTTF. After a failure, the system either stops or runs in a sub-standard way while repairs (or replacements – we do not distinguish between a repair and a replacement) are being made. Sometimes repairs can be made quickly, and sometimes, for various reasons, they take a long time. (They usually take longer than expected!) One has to measure the time to repair from the time the system failed until the time that it is fully functional and no additional repair or recovery effort is being done. The average time to repair, measured over a large number of repairs of various failures of similar systems, is called the Mean Time To Repair, or MTTR. Think of MTTF and MTTR as blocks of time.

[Figure: Availability as a ratio of time blocks. At the left edge, the system starts fully functioning; time runs to the right. At the boundary between MTTF and MTTR, the system fails and stops being 100% fully functional, and the repair time begins. Repair time continues until the right edge of the diagram, when the repair is complete and the system is fully functional again.]

The following formula should seem reasonable:

A = Availability = MTTF/(MTTF+MTTR)
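
For example (a minimal sketch with invented MTTF and MTTR values, purely to illustrate the formula):

    def availability(mttf_hours, mttr_hours):
        """A = MTTF / (MTTF + MTTR)."""
        return mttf_hours / (mttf_hours + mttr_hours)

    # A system that fails about once every 1,000 hours and takes
    # about 4 hours, on average, to repair:
    print(f"{availability(1000, 4):.4%}")   # roughly 99.60%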

MTTF is often called the system’s reliability. Of course, a component of a system, for example a disk drive, has a mean time to failure; this is called the component’s reliability.

MTTR is the average system downtime. It is sometimes called the system’s repairability, since it is a measure of how fast the system can be repaired or replaced.

Since there are approximately 8766 hours in a year, the expected number of hours a 7 by 24 system is fully functional during a year is 8766×A, and the expected annual downtime is 8766-8766×A = 8766×(1-A).

In general, for systems that only need to be fully functional during certain hours, e.g. during working hours, one computes fully functional time and outage time only for that time the system is expected to be fully functional. If there are a total of Y such hours in a year then the expected hours the system is fully functional is Y×A and the expected annual downtime is Y-Y×A=Y×(1-A).
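
Putting the pieces together, a short sketch (again with invented MTTF and MTTR values) shows how the same system yields different expected annual downtime depending on how many critical hours Y it has:

    mttf, mttr = 1000.0, 4.0          # hours; invented for illustration
    A = mttf / (mttf + mttr)          # availability, about 0.996

    for label, Y in [("7 by 24", 8766), ("business hours only", 1600)]:
        print(f"{label}: {Y * (1 - A):.1f} expected downtime hours per year")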

To calculate or model a system’s availability, one must estimate both MTTF and MTTR. Both estimates pose significant challenges, but since repairs almost always involve significant human involvement, it is easy to underestimate MTTR. Of course, people make mistakes, and these often cause system outages; such mistakes are also difficult to model.

To improve a system’s availability, one can improve MTTF or MTTR or both.

Since availability has two components, reliability and repairability, we like to divide up our check lists accordingly. For reliability, poor quality is the enemy, and check list entries focus on either improving quality or compensating for known types of problems. (Of course, it is the “unknown unknowns” that really get you!) For repairability, downtime is the enemy, and check list entries focus on ways to reduce downtime.

  • Reliability
      • Hardware failures
      • Software failures
      • Security failures
      • Power failures
      • Infrastructure failures
      • Environmental failures
      • Human failures
      • System design failures
      • Process design failures
      • Accidents
      • Disasters
  • Repairability
      • Failover: automatic, manual, reconfiguration
      • Spares: on hand, guaranteed supply, frantic search
      • Notification to repair personnel: automatic, beeper, phone call, search
      • Current asset inventory
      • Current network diagram
      • Current cabling and wiring diagrams
      • Up-to-date building construction plans
      • Data backups
      • System image backups
      • Written repair and emergency processes