Availability Modeling

Availability is the percentage of time that a system is fully functional. It is a calculation involving averages, and it can be envisioned as follows. A system runs fine for a while, then some aspect of it fails. Sometimes a system runs fine for a long time, and sometimes it fails after a short time. The average time to failure, measured over a huge number of similar systems, is called the Mean Time To Failure, or MTTF.

After a failure, the system either stops or runs in a substandard way while repairs (or replacements; we do not distinguish between a repair and a replacement) are being made. Sometimes repairs can be made quickly and sometimes, for various reasons, they take a long time. (They usually take longer than expected!) Time to repair is measured from the moment the system fails until the moment it is fully functional again and no additional repair or recovery effort is under way. The average time to repair, measured over a large number of repairs of various failures of similar systems, is called the Mean Time To Repair, or MTTR.

Think of MTTF and MTTR as blocks of time.


Availability as a ratio of time blocks

At the left edge, the system starts out fully functional. Time runs to the right. At the line between MTTF and MTTR, the system fails and stops being 100% functional, and the repair time begins. Repair time continues until the right edge of the diagram, when the repair is complete and the system is fully functional again.

The following formula should seem reasonable:

A = Availability = MTTF/(MTTF+MTTR)
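
As a quick check on the formula, here is a minimal Python sketch. The function name `availability` and the sample figures (1000 hours between failures, 10 hours to repair) are made up for illustration and do not describe any particular system.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Return A = MTTF / (MTTF + MTTR) as a fraction between 0 and 1."""
    if mttf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTTF and MTTR must be non-negative")
    total = mttf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTTF and MTTR cannot both be zero")
    return mttf_hours / total


# Hypothetical system: fails on average every 1000 hours,
# and takes on average 10 hours to repair.
a = availability(1000.0, 10.0)
print(f"Availability: {a:.4%}")  # 1000/1010, roughly 99.01%
```

Both inputs must be in the same unit (hours here); the result is a pure ratio.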

MTTF is often called the system’s reliability. Of course, a component of a system, for example a disk drive, has a mean time to failure; this is called the component’s reliability.

MTTR is the average system downtime. It is sometimes called the system’s repairability, since it is a measure of how fast the system can be repaired or replaced.

Since there are approximately 8766 hours in a year, the expected number of hours a 7 by 24 system is fully functional during a year is 8766×A, and the expected annual downtime is 8766-8766×A = 8766×(1-A).

In general, for systems that only need to be fully functional during certain hours, e.g. during working hours, one computes fully functional time and outage time only for that time the system is expected to be fully functional. If there are a total of Y such hours in a year then the expected hours the system is fully functional is Y×A and the expected annual downtime is Y-Y×A=Y×(1-A).
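
The two cases above (a 7 by 24 system and a system with Y scheduled hours) can be sketched in a few lines of Python. The function name `annual_hours`, the 99.9% availability figure, and the 2000-hour working year are assumptions chosen only to make the arithmetic concrete.

```python
HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours, the figure used in the text


def annual_hours(a: float, scheduled_hours: float = HOURS_PER_YEAR):
    """Return (expected fully functional hours, expected downtime hours).

    a is the availability as a fraction; scheduled_hours is Y, the number
    of hours per year the system is expected to be fully functional.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    up = scheduled_hours * a             # Y * A
    down = scheduled_hours * (1.0 - a)   # Y * (1 - A)
    return up, down


# 7 by 24 system with an assumed availability of 99.9%
up, down = annual_hours(0.999)
print(f"7x24: {up:.1f} h up, {down:.3f} h down")

# Hypothetical working-hours system: 50 weeks * 5 days * 8 hours = 2000 hours
up_wh, down_wh = annual_hours(0.999, scheduled_hours=2000)
print(f"Working hours: {up_wh:.1f} h up, {down_wh:.3f} h down")
```

At the same availability, the working-hours system has far less expected downtime simply because fewer hours count.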

To calculate or model a system’s availability is to estimate both MTTF and MTTR. Both estimates pose significant challenges, but since repair almost always requires significant human involvement, MTTR is especially easy to underestimate. Of course, people also make mistakes, and these often cause system outages. Such mistakes are difficult to model as well.

To improve a system’s availability, one can improve MTTF or MTTR or both.
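
Because A = MTTF/(MTTF+MTTR) depends only on the ratio of MTTR to MTTF, doubling MTTF and halving MTTR give the same availability. The short Python sketch below illustrates this with made-up numbers; the baseline of 1000 hours and 10 hours is an assumption for illustration only.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """A = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical baseline: 1000 hours between failures, 10 hours to repair.
base = availability(1000.0, 10.0)

# Option 1: improve reliability (double MTTF).
double_mttf = availability(2000.0, 10.0)

# Option 2: improve repairability (halve MTTR).
halve_mttr = availability(1000.0, 5.0)

print(f"baseline:     {base:.4%}")
print(f"double MTTF:  {double_mttf:.4%}")
print(f"halve MTTR:   {halve_mttr:.4%}")
```

Which option is cheaper in practice depends on the system; the formula only says that the two are equivalent in their effect on availability.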

Since availability has two components, reliability and repairability, we like to divide up our checklists accordingly. For reliability, poor quality is the enemy, and checklist entries focus on either improving quality or compensating for known types of problems. (Of course, it is the “unknown unknowns” that really get you!) For repairability, downtime is the enemy, and checklist entries focus on ways to reduce downtime.

  • Reliability
      • Hardware failures
      • Software failures
      • Security failures
      • Power failures
      • Infrastructure failures
      • Environmental failures
      • Human failures
      • System design failures
      • Process design failures
      • Accidents
      • Disasters

  • Repairability
      • Failover: automatic, manual, reconfiguration
      • Spares: on hand, guaranteed supply, frantic search
      • Notification to repair personnel: automatic, beeper, phone call, search
      • Current asset inventory
      • Current network diagram
      • Current cabling and wiring diagrams
      • Up-to-date building construction plans
      • Data backups
      • System image backups
      • Written repair and emergency processes

