Failover Clustering

A failover cluster is a group of independent computers, called nodes, that are physically connected by a local or wide area network and logically connected by cluster software. The group of nodes is managed as a single system and shares a common namespace. Failover clusters support databases, messaging systems, file and print services, and virtualized workloads that require high availability, scalability, and reliability. If a server running a particular application crashes, failover clustering detects the hardware or software fault and immediately restarts the application on another node without requiring administrative intervention, a process known as failover.
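The failover behaviour described above can be illustrated with a minimal sketch. It is a toy model, not how the cluster service is implemented: the `Node` class, the `fail_over` function, and the node and application names are all hypothetical, and fault detection is reduced to a `healthy` flag.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node."""
    name: str
    healthy: bool = True
    applications: list = field(default_factory=list)


def fail_over(nodes):
    """Move applications off unhealthy nodes onto the first healthy node.

    Returns a list of (application, from_node, to_node) moves.
    """
    moves = []
    survivors = [n for n in nodes if n.healthy]
    if not survivors:
        return moves  # no healthy node left to restart anything on
    target = survivors[0]
    for node in nodes:
        if node.healthy:
            continue
        # Restart each application on a surviving node, with no operator input.
        while node.applications:
            app = node.applications.pop(0)
            target.applications.append(app)
            moves.append((app, node.name, target.name))
    return moves


node_a = Node("node-a", applications=["sql"])
node_b = Node("node-b")
node_a.healthy = False  # simulate a detected hardware or software fault
moves = fail_over([node_a, node_b])
print(moves)  # [('sql', 'node-a', 'node-b')]
```

A real cluster would also check that the target node has enough capacity and would choose among several candidate nodes; the sketch always picks the first healthy one.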

Failover Clustering Terminology

Resource: A hardware or software component in a failover cluster such as a disk, an IP address, or a network name.

Resource group: A combination of resources that are managed as a unit of failover.

Dependency: A relationship between two or more resources in the cluster architecture, in which one resource relies on another being available.

Quorum: A shared view of members in a cluster. To ensure that only one subset of cluster members is functioning at one time, a majority of members is required to be active and in communication with each other. This avoids having two subsets of members both attempting to service a request and writing to the same disk. Each node provides a single vote toward membership. A physical disk or a file share may also serve as a quorum resource and contribute a single vote toward membership.

Heartbeat: The cluster's health-monitoring mechanism between cluster nodes.

Membership: The orderly addition and removal of nodes to and from the cluster.

Global update: The propagation of cluster configuration changes to all cluster members.

Cluster registry: The cluster database, stored on each node and on the quorum resource, which maintains configuration information for each member of the cluster.

Virtual server: A combination of configuration information and cluster resources, such as an IP address, a network name, and application resources.

Active/Active failover cluster model: All nodes in the failover cluster are functioning and serving clients. If a node fails, the resource will move to another node and continue to function normally, assuming that the new server has enough capacity to handle the additional workload.

Active/Passive failover cluster model: One node in the failover cluster typically sits idle until a failover occurs. After a failover, this passive node becomes active and provides services to clients.

Shared storage: All nodes in the failover cluster must be able to access data on shared storage. The highly available workloads write their data to this shared storage. Therefore, if a node fails, when the resource is restarted on another node, the new node can read the same data from the shared storage that the previous node was accessing. Shared storage can be created with iSCSI, Serial Attached SCSI, or Fibre Channel, provided that it supports persistent reservations.
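To make the Resource group and Dependency terms above more concrete, here is a small sketch of ordering resources so that each one comes after the resources it relies on. The resource names and the dependency map are hypothetical, and the use of Python's standard `graphlib` module (Python 3.9 or later) is an illustration only, not a description of how the cluster service orders resources internally.

```python
from graphlib import TopologicalSorter

# Hypothetical resource group: each resource maps to the resources it depends on.
dependencies = {
    "ip_address": [],
    "disk": [],
    "network_name": ["ip_address"],
    "application": ["network_name", "disk"],
}


def online_order(deps):
    """Return an order in which every resource follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = online_order(dependencies)
print(order)
```

In any order this returns, the IP address precedes the network name, and both the network name and the disk precede the application.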

Validation of hardware for a failover cluster

In Windows Server 2008 and 2008 R2, clusters are qualified by running the Cluster Validation Wizard. With the validation wizard you can run a series of tests on the collection of servers you intend to include in the cluster. The process tests the underlying hardware and software. For the failover cluster to be officially supported by Microsoft Customer Support Services, the solution must meet the following requirements: all hardware and software components must meet the qualifications for the appropriate logo (Certified for...), and the configuration must pass all tests in the validation wizard (described in KB 943984).

Create a Failover Cluster in Windows Server 2008

1. Open Server Manager.

2. Select and click Add Features.

3. Select Failover Clustering.

4. Open Failover Cluster Management from the Administrative Tools folder. Click Validate a Configuration in the Management section.

5. On the first screen of the validation wizard, enter the network name of every node. (You should select all nodes that are available to ensure that your validation testing is as accurate as possible.)

6. Click Next, and then click Next again.

7. If the test was successful, click Create a Cluster.

8. Enter the nodes that will be used to form the cluster.

9. Click Next to proceed to the Access Point for Administering the Cluster page. On this page, enter the cluster name and IP address that will be used to identify and administer the cluster once it is created.

10. Click Next, and then click Next again.

Quorum in a failover cluster

Quorum is the number of voting elements that must be online for the cluster to continue running. Each element can cast a vote to determine whether the cluster can continue running. The voting elements are nodes or, in some cases, a disk witness or file share witness. Each voting element, with the exception of a file share witness, contains a copy of the cluster configuration. Quorum modes:

Node Majority: Each node that is available and in communication can vote. The cluster functions only with a majority of votes.

Node and Disk Majority: Each node plus a designated disk in the cluster storage (disk witness) can vote when available and in communication. The cluster functions only with a majority of the votes.

Node and File Share Majority: Each node plus a designated file share created by the administrator (file share witness) can vote when available and in communication. The cluster functions only with a majority of the votes.

No Majority (Disk Only): The cluster has quorum if one node is available and in communication with a specific disk in the cluster storage.
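The four quorum modes can be compared with a short vote-counting sketch. The function name, the mode strings, and the parameters are hypothetical, and the model is simplified: it treats the witness as a single yes-or-no vote and ignores which nodes can actually reach it.

```python
def has_quorum(mode, nodes_total, nodes_up, witness_up=False):
    """Return True if the cluster can run under the given quorum mode."""
    if mode == "node_majority":
        total, present = nodes_total, nodes_up
    elif mode in ("node_and_disk_majority", "node_and_file_share_majority"):
        # The witness contributes one extra vote.
        total = nodes_total + 1
        present = nodes_up + (1 if witness_up else 0)
    elif mode == "no_majority_disk_only":
        # One node plus the designated disk is enough; no majority is needed.
        return nodes_up >= 1 and witness_up
    else:
        raise ValueError(f"unknown quorum mode: {mode}")
    # A strict majority of all configured votes must be present.
    return 2 * present > total


print(has_quorum("node_majority", 5, 3))                            # True
print(has_quorum("node_majority", 4, 2))                            # False
print(has_quorum("node_and_disk_majority", 4, 2, witness_up=True))  # True
print(has_quorum("no_majority_disk_only", 4, 1, witness_up=True))   # True
```

The second and third calls show why a witness is useful with an even number of nodes: two of four nodes are not a majority, but two nodes plus the witness hold three of five votes.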

The process of achieving quorum

1) As a given node comes up it determines if there are other cluster members that it can communicate with.

2) Once communication is established, the members compare their membership view until they agree on one view.

3) A determination is made as to whether this collection of members has quorum.

4) If there are not enough votes to achieve quorum, the voters wait for more members to appear. If there are enough votes present, the cluster service begins to bring cluster resources and applications into service.

5) With quorum, the cluster becomes fully functional again.
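The steps above can be sketched as a replay of nodes coming online one at a time. This is a simplified model that assumes Node Majority and that every node that comes up can reach all earlier ones; the function and node names are hypothetical.

```python
def form_cluster(configured_nodes, arrival_order):
    """Replay nodes coming online.

    Returns the set of members present when quorum is first reached,
    or None if the voters are still waiting for more members.
    """
    needed = len(configured_nodes) // 2 + 1  # strict majority of node votes
    members = set()
    for node in arrival_order:
        if node not in configured_nodes:
            continue  # not a member of this cluster
        members.add(node)           # steps 1-2: join and agree on one view
        if len(members) >= needed:  # step 3: does this view have quorum?
            return members          # step 4: resources can be brought online
    return None                     # step 4: keep waiting for more members


nodes = ["n1", "n2", "n3"]
print(form_cluster(nodes, ["n2"]))        # None
print(form_cluster(nodes, ["n2", "n3"]))  # a set containing n2 and n3
```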

Why quorum?

When network problems occur, they can interfere with communication between cluster nodes. A small set of nodes might be able to communicate together across a functioning part of a network, but might not be able to communicate with a different set of nodes in another part of the network. If this happens, at least one of the sets of nodes must stop running as a cluster. To prevent the problems caused by a split in the cluster, the cluster software requires that any set of nodes running as a cluster use a voting algorithm to determine whether, at a given time, that set has quorum. Because a given cluster has a specific set of nodes and a specific quorum configuration, the cluster knows how many votes constitute a majority. If the number of votes falls below the majority, the cluster stops running. The nodes still listen for the presence of other nodes, in case another node appears again on the network, but they do not begin to function as a cluster until quorum exists again.

For example, in a five-node cluster that is using Node Majority, what happens if nodes 1, 2, and 3 can communicate with each other but not with nodes 4 and 5? Nodes 1, 2, and 3 constitute a majority, and they continue running as a cluster. Nodes 4 and 5 are a minority and stop running as a cluster, which prevents the problems of a split situation. If node 3 then loses communication with the other nodes, no set holds a majority and all nodes stop running as a cluster. All functioning nodes continue to listen for communication, so that when the network begins working again, the cluster can form and begin to run.
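The five-node example can be checked with a few lines of arithmetic. The sketch below assumes Node Majority and models each network partition as a set of node numbers; the function name is hypothetical.

```python
def partitions_with_quorum(total_nodes, partitions):
    """Return the partitions (as sorted lists) that hold a strict majority."""
    return [sorted(p) for p in partitions if 2 * len(p) > total_nodes]


# Nodes 1, 2, and 3 can reach each other but not nodes 4 and 5.
first_split = [{1, 2, 3}, {4, 5}]
print(partitions_with_quorum(5, first_split))   # [[1, 2, 3]]

# Node 3 then loses contact with nodes 1 and 2 as well.
second_split = [{1, 2}, {3}, {4, 5}]
print(partitions_with_quorum(5, second_split))  # []
```

Because a strict majority is more than half of all votes, at most one partition can hold it at a time, which is what rules out two sets of nodes both running as the cluster.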