High Availability

The availability of a system is defined as MTBF/(MTBF+MTTR), where a “failure” is a time period in which the system is not fully functional, MTBF is the mean time between failures, and MTTR (mean time to repair) is the mean duration of a failure, i.e., the average time it takes to bring the system back to 100% functionality.
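
As a minimal sketch of the formula, the following Python function computes availability from MTBF and MTTR. The function name and the choice of hours as the unit are assumptions made for illustration; any unit works as long as both values use the same one.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return MTBF / (MTBF + MTTR); both arguments must use the same time unit."""
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    total = mtbf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTBF and MTTR cannot both be zero")
    return mtbf_hours / total


# Illustrative numbers only: 999 hours between failures, 1 hour per repair.
print(availability(999.0, 1.0))  # 0.999
```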

An availability rating might look like 0.9999123945. If the rating begins with n consecutive nines after the decimal point, the system is said to have n nines of availability; the example above has four nines. “High” availability usually means n = 3 or more.
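
Counting the nines can be sketched in Python as follows. The function name is hypothetical, and the rating is passed as a string on purpose: a floating-point shortcut such as floor(-log10(1 - a)) can be off by one because values like 0.999 are not exactly representable in binary.

```python
from decimal import Decimal


def count_nines(rating: str) -> int:
    """Count the consecutive nines right after the decimal point of a rating."""
    value = Decimal(rating)
    if not (Decimal("0") <= value < Decimal("1")):
        raise ValueError("rating must be at least 0 and strictly less than 1")
    # Format in plain decimal notation, then keep only the fractional digits.
    _, _, fraction = format(value, "f").partition(".")
    return len(fraction) - len(fraction.lstrip("9"))


print(count_nines("0.9999123945"))  # 4
```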

If a system is only supposed to function on certain days and at certain times, then the time used to measure MTBF and MTTR should include only the time the system is supposed to function. For example, if a system is only supposed to function on weekdays, the system may be dismantled and overhauled during the weekend and such downtime will not affect its availability rating.
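
The weekday example can be sketched as below. This is an illustration under stated assumptions, not a standard procedure: times are hours since Monday 00:00, service windows do not overlap each other, outages do not overlap each other, and availability is computed as the fraction of scheduled time the system was up, which matches MTBF/(MTBF+MTTR) when MTBF is taken as the mean uptime between failures. All names and numbers are hypothetical.

```python
Interval = tuple[float, float]  # (start_hour, end_hour)


def overlap(a: Interval, b: Interval) -> float:
    """Length of the intersection of two intervals, or 0 if they are disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def scheduled_availability(windows: list[Interval], outages: list[Interval]) -> float:
    """Fraction of scheduled service time during which the system was up."""
    scheduled = sum(end - start for start, end in windows)
    if scheduled <= 0:
        raise ValueError("there must be some scheduled service time")
    # Only the part of each outage that falls inside a service window counts.
    downtime = sum(overlap(outage, window) for outage in outages for window in windows)
    return (scheduled - downtime) / scheduled


weekdays = [(0.0, 120.0)]                # Monday 00:00 through Friday 24:00
outages = [(10.0, 12.0), (130.0, 150.0)]  # 2 h weekday fault, 20 h weekend overhaul
print(f"{scheduled_availability(weekdays, outages):.4f}")  # 0.9833
```

The 20-hour weekend overhaul falls entirely outside the service window, so only the 2-hour weekday fault lowers the result.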

Caution should be taken when interpreting vendor ratings on MTBF. Some vendors, particularly hardware vendors, count only hardware faults as failures. This overstates their MTBF, since it does not account for software faults, operator faults, environmental problems (e.g., power and cooling), etc.

If a system failure goes unnoticed or the system’s maintainers are slow to respond, then the time to repair can be quite large and the system’s availability correspondingly low. Conversely, immediate recognition of a failure and rapid repair will make the system’s availability correspondingly higher.
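
To make the effect concrete, the sketch below splits the time to repair into a detection delay plus the hands-on repair time. That split, the function name, and the numbers are assumptions chosen for illustration; the text above only says that unnoticed failures and slow maintainers lengthen the repair time.

```python
def availability_with_detection(mtbf_hours: float, detect_hours: float, repair_hours: float) -> float:
    """Availability when MTTR is modeled as detection delay plus repair time."""
    mttr_hours = detect_hours + repair_hours
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Same failure rate and same 1-hour repair; only the detection delay differs.
noticed_at_once = availability_with_detection(1000.0, 0.0, 1.0)
noticed_late = availability_with_detection(1000.0, 9.0, 1.0)
print(f"{noticed_at_once:.4f}")  # 0.9990
print(f"{noticed_late:.4f}")     # 0.9901
```

With these numbers, a nine-hour delay in noticing the failure takes the system from three nines down to two.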

