11 Fault tolerant computer systems

A computer cluster is a set of loosely or tightly connected computers that work together so that, in many aspects, they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.

A disk array is a disk storage system which contains multiple disk drives. It is differentiated from a disk enclosure, in that an array has cache memory and advanced functionality, like RAID, deduplication, encryption and virtualization.

In data storage, disk mirroring is the replication of logical disk volumes onto separate physical hard disks in real time to ensure continuous availability. It is most commonly used in RAID 1. A mirrored volume is a complete logical representation of separate volume copies.
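The write-to-every-copy, read-from-any-surviving-copy behaviour described above can be illustrated with a small in-memory model. This is only a sketch: the class `MirroredVolume` and its methods are hypothetical names chosen for illustration, and the "disks" are plain Python lists rather than real block devices.

```python
class MirroredVolume:
    """Toy model of a mirrored volume: every write goes to all healthy copies."""

    def __init__(self, num_blocks, copies=2):
        # Each "disk" is a list of blocks; all copies start out empty.
        self.disks = [[None] * num_blocks for _ in range(copies)]
        self.failed = set()  # indices of disks marked as failed

    def fail_disk(self, index):
        """Simulate the loss of one physical disk."""
        self.failed.add(index)

    def write(self, block, data):
        """Replicate the write onto every disk that is still healthy."""
        healthy = [i for i in range(len(self.disks)) if i not in self.failed]
        if not healthy:
            raise IOError("all mirror copies have failed")
        for i in healthy:
            self.disks[i][block] = data

    def read(self, block):
        """Serve the read from the first healthy copy."""
        for i, disk in enumerate(self.disks):
            if i not in self.failed:
                return disk[block]
        raise IOError("all mirror copies have failed")


volume = MirroredVolume(num_blocks=4)
volume.write(0, b"payload")
volume.fail_disk(0)          # the first disk is lost
data = volume.read(0)        # the read is served from the second copy
```

A real implementation would also resynchronise a replaced disk from the surviving copy, which this sketch leaves out.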

Error-correcting code (ECC) memory is a type of computer data storage that can detect and correct n-bit data corruption occurring in memory. ECC memory is used in most computers where data corruption cannot be tolerated under any circumstances, e.g. industrial control applications, critical databases, or infrastructural memory caches.
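As an illustration of how stored data can be corrected as well as checked, the sketch below uses a Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. It is a textbook example chosen for brevity and is not a description of the code used by any particular ECC memory module; the function names are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits (a list of 0/1) into a 7-bit Hamming codeword.

    Codeword layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, syndrome); a single-bit error is corrected.

    A syndrome of 0 means no error was detected; otherwise it is the
    1-based position of the bit that was flipped back.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1                               # simulate one bit of corruption
recovered, position = hamming74_decode(stored)
```

After the simulated corruption, `recovered` again holds the original four data bits and `position` identifies which stored bit was repaired.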

Fencing is the process of isolating a node of a computer cluster or protecting shared resources when a node appears to be malfunctioning.
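One way to picture fencing is a monitor that cuts a node off from a shared resource once the node stops reporting in. The sketch below is a simplified model under assumed names (`SharedResource`, `FencingMonitor`, a heartbeat timeout); real clusters typically fence through power control or storage-level reservations, which are not modelled here.

```python
class SharedResource:
    """Shared storage that refuses writes from fenced nodes."""

    def __init__(self):
        self.fenced = set()
        self.log = []

    def write(self, node, data):
        if node in self.fenced:
            raise PermissionError(f"{node} is fenced")
        self.log.append((node, data))


class FencingMonitor:
    """Fences any node whose last heartbeat is older than `timeout`."""

    def __init__(self, resource, timeout):
        self.resource = resource
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now):
        # Record the time at which the node last reported as healthy.
        self.last_seen[node] = now

    def check(self, now):
        # Isolate every node that has been silent for too long.
        for node, seen in self.last_seen.items():
            if now - seen > self.timeout:
                self.resource.fenced.add(node)
        return set(self.resource.fenced)


storage = SharedResource()
monitor = FencingMonitor(storage, timeout=5)
monitor.heartbeat("node-a", now=0)
monitor.heartbeat("node-b", now=0)
monitor.heartbeat("node-a", now=8)   # node-b has gone silent
fenced_nodes = monitor.check(now=10)
```

Timestamps are passed in explicitly so the example is deterministic; a production monitor would read a clock instead.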

OpenVMS is a multi-user, multiprocessing, virtual memory-based operating system designed for use in time-sharing, batch processing, and transaction processing. It was first released by Digital Equipment Corporation in 1977 as VAX/VMS for its series of VAX minicomputers. Since 2014, OpenVMS has been developed and supported by VMS Software Inc. (VSI).

In engineering, redundancy is the duplication of critical components or functions of a system with the intention of increasing reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance, such as in the case of GNSS receivers, or multi-threaded computer processing.
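The reliability gain from duplication can be estimated with a standard calculation: if each copy is available with probability a and failures are independent, the system is down only when every copy is down. The function below is a sketch of that formula; the independence assumption and the name `parallel_availability` are ours, and the figures in the example are invented for illustration.

```python
def parallel_availability(component_availability, copies):
    """Availability of `copies` redundant components, any one of which suffices.

    Assumes failures are independent, which real systems only approximate.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must lie between 0 and 1")
    if copies < 1:
        raise ValueError("at least one component is required")
    # The system fails only if every copy fails at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


single = parallel_availability(0.99, copies=1)       # one component
duplicated = parallel_availability(0.99, copies=2)   # component plus one backup
```

Under these assumptions, adding a single backup to a component that is available 99% of the time raises the combined availability to about 99.99%.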

A server farm or server cluster is a collection of computer servers, usually maintained by an organization to supply server functionality far beyond the capability of a single machine. Server farms often consist of thousands of computers, which require a large amount of power to run and to keep cool. At the optimum performance level, a server farm has enormous costs associated with it. Server farms often have backup servers, which can take over the function of primary servers if a primary server fails. Server farms are typically co-located with the network switches and routers that enable communication between the different parts of the cluster and the users of the cluster. Operators typically mount the computers, routers, power supplies, and related electronics on 19-inch racks in a server room or data center.

A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working. SPOFs are undesirable in any system with a goal of high availability or reliability, be it a business practice, software application, or other industrial system.

Stratus Technologies, Inc. is a major producer of fault tolerant computer servers and software. The company was founded in 1980 as Stratus Computer, Inc. in Natick, Massachusetts, and adopted its present name in 1999. The current CEO and president is Dave Laurello. Stratus Technologies, Inc. is a privately held company, owned solely by Siris Capital Group. The parent company, Stratus Technologies Bermuda Holdings, Ltd., is incorporated in Bermuda.

In computing, triple modular redundancy (TMR), sometimes called triple-mode redundancy, is a fault-tolerant form of N-modular redundancy in which three systems perform a process and their results are processed by a majority-voting system to produce a single output. If any one of the three systems fails, the other two systems can correct and mask the fault.
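The majority-voting step can be sketched in a few lines. Two hypothetical helpers are shown: `majority_vote` compares whole results, while `bitwise_vote` applies the classic two-out-of-three logic to each bit of an integer, as a hardware voter would. Both are illustrative and assume the voter itself does not fail.

```python
from collections import Counter


def majority_vote(a, b, c):
    """Return the value produced by at least two of the three modules."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three modules disagree, so no fault can be masked.
        raise ValueError("no majority: all three modules disagree")
    return value


def bitwise_vote(a, b, c):
    """Per-bit majority of three integers: a bit is set if two inputs set it."""
    return (a & b) | (a & c) | (b & c)


# Module 3 returns a faulty result; the other two mask it.
result = majority_vote(42, 42, 17)
word = bitwise_vote(0b1010, 0b1010, 0b0110)
```

Here `result` is 42 and `word` is `0b1010`, since in each case two of the three inputs agree.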