PC Server High-Availability Techniques

By Tim Kearby and David Laubscher

''Most companies today are concerned about system availability, or uptime, of their local area networks. With mission-critical applications now commonplace on PC servers, companies cannot afford to incur unplanned outages of LAN resources. In many instances, companies traditionally used mainframes to host these critical applications, and executive management expects from LANs the same availability levels associated with mainframe systems. For these reasons, many system administrators have the mission to significantly improve LAN availability.''

''In this article, you'll be introduced to ways to increase system availability in a LAN environment using high-availability solutions on the market today. You'll also read several scenarios that incorporate these solutions to provide high availability in various environments.''

Simply stated, availability is the percentage of time that a system is up, running, and available for users to do productive work. Availability is calculated only for the hours during which a system is supposed to be available. For example, if your business requires a system to be up from 7 a.m. until 11 p.m., then system maintenance downtime between the hours of 2 a.m. and 4 a.m. does not count against availability.

High availability is a target that means a system will be available a higher percentage of time than if no special system features or operational procedures were employed.

As a point of reference, normal system availability in a mainframe environment has traditionally been measured at 99 percent to 99.5 percent. High-availability percentages are more often around 99.95 percent. You can reach this level only by eliminating or masking unplanned outages during scheduled periods of operation using techniques such as advanced system design, fault tolerance methods, and fast restart.

Advanced system design uses highly reliable hardware and software components in a system. These components can often anticipate failures and employ preventive measures to avoid failures, or at least prevent these failures from affecting normal system operation.

Fault tolerance is a system's ability to deliver acceptable service in the event of a component failure. Obviously, the proper system features and/or operational procedures have to be in place to keep critical system resources up, even if a piece of hardware fails.

The most common method of providing fault tolerance is through redundancy of critical resources - either in the same machine or elsewhere on the network - so a backup is available if a primary resource fails.

Fast restart is the ability to quickly detect and recover from a failure in a way that minimizes service disruption. This key concept must be exploited to achieve the high level of availability in the LAN environment that exists in the mainframe environment.

Mainframe administrators know that to achieve high availability, you must recover from a failure as quickly as possible. The first priority is to restore service. There is no time to waste in determining the proper recovery technique. Recovery plans must have been decided upon, and procedures must be in place long before the failure occurs.

A key method for achieving a fast recovery is offline diagnosis. When a failure occurs, there is no time to diagnose mishaps. Diagnosis must be done only after service has been restored. Stopping to analyze what went wrong before restoring service can significantly hurt availability.

Suppose the required system operation hours are between 7 a.m. and 11 p.m. Monday through Friday. This means your system must be available 16 hours per day, or 80 hours per week, which is about 347 hours per month. If your availability target is 99.5 percent, then downtime cannot exceed about 1.7 hours per month during normal operating hours.
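
The downtime budget follows directly from the scheduled hours and the availability target. The short Python sketch below shows the arithmetic; the function names and the 400-hour figure in the usage note are illustrative assumptions, not values from this article.

```python
def downtime_budget_hours(scheduled_hours: float, availability_target: float) -> float:
    """Maximum downtime (in hours) allowed within the scheduled hours.

    availability_target is a fraction, for example 0.995 for 99.5 percent.
    """
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    return scheduled_hours * (1.0 - availability_target)


def achieved_availability(scheduled_hours: float, downtime_hours: float) -> float:
    """Availability actually delivered; only scheduled hours are counted."""
    if scheduled_hours <= 0:
        raise ValueError("scheduled_hours must be positive")
    return (scheduled_hours - downtime_hours) / scheduled_hours
```

For instance, a hypothetical schedule of 400 hours per month with a 99.5 percent target leaves a budget of roughly two hours of unplanned downtime. Maintenance performed outside the scheduled hours never enters the calculation.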

Designing a High-Availability Solution
To design a high-availability solution for your environment, determine the required availability level by answering two key questions:
 * What percentage of time must my systems be available?
 * How quickly must I recover from an unplanned outage?

Although businesses answer these questions differently, generally unplanned outages of several hours or days are the most costly to a business. Unfortunately, lengthy outages have been common historically in LAN environments.

Once you set your goals, you must design a solution that meets them using the choices on the market today. Your first consideration has to be the basic engineering of the system itself. Solid system engineering uses more reliable components, which help to avoid failures.

Your next step is to examine individual subsystems inside the machine and look for ways to increase the single-system availability of the stand-alone machine.

Your final step is to consider the entire LAN environment and look for areas that may need complete system redundancy, such as a hot backup system. One hot topic today addressing redundancy is clustering. We discuss it later, as well as system engineering and single-system availability.

System Engineering
IBM PC Servers were designed with high availability in mind. They include many state-of-the-art technologies that are important for availability; we briefly discuss some of them here.

Power
Mention fault tolerance for power to system administrators, and they start talking about redundant power supplies and concepts such as triplicated majority redundancy.

Before we discuss ways to avoid power supply failures, keep in mind that the vast majority of power problems have nothing to do with power supply failure. According to a contingency planning study quoted by American Power Conversion, more than 45 percent of data losses result from power failure and line surges. This compares to only 8 percent due to hardware or software error and 3 percent due to human errors. One IBM study estimated that as many as 120 power problems occur per month in a typical installation. American Power Conversion has grouped these problems into five categories:
 * Sags
 * Blackouts
 * Spikes
 * Surges
 * Noise

The impact of these events can be measured both in terms of lost productivity when a system is unavailable and the replacement costs for damaged equipment. A good uninterruptible power supply (UPS) is the most cost-effective first step toward increasing availability.

A well-designed power supply should protect against many types of line problems, but a good UPS can protect against line problems and permit an ordered shutdown in case of a prolonged power failure. By using a UPS to filter your A/C power line, you are protecting all the system and disk caches and giving the operating system time to finish its final processing before power is removed.

Once you have protected and conditioned the power reaching the system, the next component to consider is the power supply itself. Because power supplies have a lower mean time between failure (MTBF) than digital electronic circuits, building in redundancy is a good idea. In the standard method, called triplicated majority redundancy, the power supply is split into three components, any two of which are sufficient to power the machine. If one component fails, the system continues to function, and an alert is sent to a system management console so that the failed component can be replaced.
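
The benefit of a two-out-of-three arrangement can be estimated with a small calculation. The Python sketch below assumes that the three power modules fail independently and that each is working with the same probability; both are simplifying assumptions, and the function name is hypothetical.

```python
from math import comb


def k_of_n_availability(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent modules are working.

    p is the probability that any single module is working.
    """
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Two modules, both required (no redundancy) versus any two of three.
both_of_two = k_of_n_availability(2, 2, 0.99)
two_of_three = k_of_n_availability(2, 3, 0.99)
```

With an assumed per-module figure of 0.99, the non-redundant pair is up about 98 percent of the time, while the two-of-three design is up well above 99.9 percent. Real modules share a chassis and an A/C feed, so independent failure is an optimistic assumption.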

IBM offers a redundant power supply option for a wide range of its PC Server systems.

Cooling
The proper cooling of PC components received considerable attention when Pentium-based systems started shipping. The focus at that time was on ways to keep the processor chip within acceptable heat limits. In an effort to keep the temperatures down, some vendors even mounted fans aimed at the chip assembly; however, from an overall system viewpoint, extracting the heat from the CPU is only part of the solution. If the heat is removed from any one chip but not adequately forced out of the machine, your reliability problems can increase. Effective dissipation means extracting the heat from all sources, thereby reducing the machine's overall operating temperature.

Reducing the temperature is key to extending the life of the components inside. Statistics from the U.S. Department of Defense show that the MTBF is halved for every 20 degrees Celsius increase in operating temperature.
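
The halving rule quoted above can be written as a one-line formula. The sketch below assumes the rule holds smoothly between the 20-degree steps; the reference MTBF and temperatures in the usage note are made-up illustrative values.

```python
def mtbf_at_temperature(mtbf_ref_hours: float, ref_temp_c: float,
                        temp_c: float, halving_step_c: float = 20.0) -> float:
    """Scale a reference MTBF by one half for every halving_step_c degrees
    Celsius above the reference temperature (and double it for every step below).
    """
    steps = (temp_c - ref_temp_c) / halving_step_c
    return mtbf_ref_hours * 0.5 ** steps
```

For example, a component assumed to have an MTBF of 100,000 hours at 25 degrees Celsius would be estimated at 50,000 hours at 45 degrees and 25,000 hours at 65 degrees.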

IBM has implemented a design approach called FloThru cooling, which maximizes the air flow through the server box while keeping that flow primarily front to back. This front-to-back flow avoids problems created by vents that are blocked when multiple servers are stacked side by side or on top of each other.

The ambient temperature of the machine is also important for system reliability. For example, if you put a server into a rack that has inadequate cooling, your machine's reliability will be compromised. IBM's PC Server racks are designed with an extra level of cooling that permits a server to actually run cooler in the rack than it would if it were a stand-alone machine.

Disk Drive Design
As one of the few remaining mechanical devices within the server, disk drives are among the first components to consider when increasing availability. While a complete discussion of hard-disk technology is beyond the scope of this article, a few points regarding disk-drive design are mentioned here, along with examples that illustrate why IBM leads the field in disk-drive reliability.
 * Storage density - Open any large-capacity disk drive and you'll find a number of platters or magnetic disks. The fewer the disks, the fewer the read/write heads, and the lower the probability of failure. IBM Ultrastar XP hard disks use specially developed technologies that increase the density of each platter, thereby reducing the number of platters required.
 * Air filters - Dust can easily cause disk failures. Because disk drives generate heat, they cannot be made airtight, and therefore must employ air filters. IBM disk drives have chemical filters that stop not only dust, but also harmful gases that might damage the disk surface.
 * Head technology - Head crashes on disk drives occur when a read/write head comes into contact with a disk surface. As a result, the drive fails, and data is usually lost. IBM has pioneered methods to ensure that the head's flying height is correctly set to a level that minimizes the chances of a head crash.

These are just a few important technologies enabling the IBM Ultrastar line of hard-disk drives to be rated at an MTBF as high as one million hours.

While this seems like a minuscule failure rate, it is still too large for high-availability environments. For this reason, IBM developed disk-drive predictive failure analysis (PFA). PFA monitors key device performance statistics for trends that ultimately lead to device failure. The disk drive then notifies systems management software, such as IBM's TME 10 NetFinity, of imminent failure. When combined with RAID technology (see the next section), PFA can provide the basis for a high-availability solution.

Other PC server vendors have realized the advantage of these techniques and are working to develop a similar technique called SMART, which stands for self-monitoring, analysis, and reporting technology.

Improved Subsystem Availability
Before we focus on the popular topic of clustering multiple systems to provide system-level fault tolerance, keep in mind that a full system failover - the switchover to a hot-spare system - should be a last-resort option.

Beyond the previously discussed features that provide the base level of availability, other technologies implemented in a modern PC server can aid in achieving even higher availability of that single system. Most of these technologies concern eliminating single points of failure inside the machine. These approaches can protect against the majority of typical failures, as well as prevent full system failures. Many of these techniques are also cost-effective, since a good deal of protection can be provided for a comparatively small investment. The more important of these technologies are discussed next.

Redundant Disk Subsystems
Even with advances in technology, hard disks are still mechanical devices that fail more often than integrated circuits. Because of this, redundancy within a disk subsystem is often the easiest and cheapest way to increase a server's availability. The RAID technology, which stands for "redundant array of independent disks," has been developed to address these requirements.

RAID defines different levels of redundancy in the disk subsystem. For example, RAID 1 defines the most common method of providing disk redundancy, which is disk mirroring and/or duplexing. With this method, the amount of disk space required is doubled, since every piece of data is duplicated on separate drives. If one drive fails, data is still available on the other drive, and the system continues to operate without downtime.

For servers requiring a relatively small data space, this method provides good performance at a minimal cost. But as data spaces grow larger, this method becomes increasingly expensive, and eventually the cost of having twice the number of drives becomes prohibitive.

Higher levels of RAID, such as RAID 5, also protect against drive failure and avoid system downtime, but require only a single additional drive to record redundancy data. This requirement for only one additional drive keeps the overall disk cost lower than with duplexing. Most RAID implementations also provide for hot-spare drives and the ability to rebuild a failed drive using the hot spare, so that a failed drive can be automatically replaced without downtime. Even more effective is the use of hot-plug drives, which allow you to physically unplug a failed drive and plug in a new drive without bringing the system down.
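
The cost difference between mirroring and RAID 5, and the way a lost drive is rebuilt, can be sketched in a few lines of Python. This is a simplified illustration: it assumes equal-sized drives and uses byte-wise XOR parity, and it ignores the way real RAID 5 controllers rotate parity across the drives. All names are hypothetical.

```python
from functools import reduce


def usable_drives(raid_level: int, total_drives: int) -> int:
    """Number of drives' worth of capacity left for data."""
    if raid_level == 1:
        if total_drives < 2 or total_drives % 2:
            raise ValueError("mirroring needs an even number of drives")
        return total_drives // 2      # every drive has a mirror copy
    if raid_level == 5:
        if total_drives < 3:
            raise ValueError("RAID 5 needs at least three drives")
        return total_drives - 1       # one drive's worth holds redundancy data
    raise ValueError("only RAID 1 and RAID 5 are modeled here")


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks.

    Used both to compute parity and to rebuild a single missing block.
    """
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"\x01\x02", b"\x10\x20"]            # blocks on two data drives
parity = xor_blocks(data)                    # block on the redundancy drive
rebuilt = xor_blocks([data[1], parity])      # first drive lost, then rebuilt
```

With eight drives, mirroring leaves four drives of usable space while RAID 5 leaves seven. The rebuild step is the same operation a hot spare relies on: XOR of all surviving blocks reproduces the missing one.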

A key factor in choosing an approach for disk redundancy is whether to implement it in hardware or software. For example, NetWare, Windows NT, and OS/2 Warp Server all provide a facility within the network operating system that manages a mirrored or duplexed environment. SCO UNIX even offers the ability to implement a RAID array within the operating system.

Most server manufacturers offer RAID controller cards that offload redundant disk management to specialized processors on the adapter. This not only increases the efficiency of the system CPUs, it also greatly improves the speed of RAID operations over software RAID.

IBM PC Servers employ the hardware approach. Several PC Server models offer the RAID controller as a standard feature, while in other models it is optional. These high-performance adapters can implement multiple levels of RAID (0, 1, and 5) - even on the same set of hard disks.

IBM's latest RAID card, called the ServeRAID Adapter, offers additional protection by allowing arrays to be spread across up to three SCSI channels. It also permits adding disks to the array, thereby increasing the size of the logical drive without bringing the system down.

The ServeRAID adapter allows you to create high level protection by implementing three-drive RAID 5 arrays, with each disk of the array residing on a separate ServeRAID SCSI channel in a separate external storage enclosure. Using this configuration, an entire storage enclosure can fail and the system will still function.
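
Whether a layout survives the loss of an entire enclosure can be checked mechanically. The sketch below assumes a RAID 5 array tolerates the loss of exactly one member drive; the enclosure labels and the function name are hypothetical.

```python
from collections import Counter


def survives_any_enclosure_failure(drive_enclosures: list[str],
                                   tolerated_drive_losses: int = 1) -> bool:
    """True if no single enclosure holds more member drives than the array
    can afford to lose.

    drive_enclosures lists, for each drive in the array, the enclosure it sits in.
    """
    drives_per_enclosure = Counter(drive_enclosures)
    return max(drives_per_enclosure.values()) <= tolerated_drive_losses
```

A three-drive array laid out as `["A", "B", "C"]` passes the check, while `["A", "A", "B"]` does not, because losing enclosure A would remove two drives at once.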

Memory
The need to increase data integrity within the memory subsystem has received significant attention. Certainly, as we drive more mission-critical applications onto PC servers, it is imperative that data be protected from soft errors that can occur in memory systems.

To combat this problem, IBM PC Servers employ error-correcting code (ECC) memory, sometimes also called error checking and correction. ECC can detect and correct single-bit memory errors, detect double-bit memory errors, and detect some triple-bit memory errors.

ECC works like parity by generating extra check bits from the data and storing these extra bits with the data in memory; however, while parity uses only one check bit per byte of data, ECC uses seven check bits for a 32-bit word and eight check bits for a 64-bit word. Extra check bits, along with a special hardware algorithm, enable the detection and correction of single-bit errors in real time as data are read from memory.
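
The check-bit counts above match what a Hamming-style code with single-error correction and double-error detection would need. The sketch below computes them under that assumption; the article does not say which code the hardware actually uses, and the function names are hypothetical.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a Hamming code that corrects single-bit errors and
    detects double-bit errors (assumed scheme).

    r check bits can point at any one of data_bits + r positions, or report
    "no error", so r must satisfy 2**r >= data_bits + r + 1. One extra overall
    parity bit adds double-error detection.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


def parity_check_bits(data_bits: int) -> int:
    """Byte parity: one check bit per 8 data bits."""
    return data_bits // 8
```

Under this assumption a 32-bit word needs seven check bits and a 64-bit word needs eight. For a 64-bit word, byte parity also needs eight bits, so ECC adds correction capability for the same storage overhead.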

Statistical analysis estimates that a typical server-class machine with 64 to 128 MB of parity memory will fail about once every two years due to random memory bit errors. This may not seem like much risk, but a modest-sized installation of 24 servers will sustain a monthly failure due to memory bit errors. On the other hand, a typical server-class machine using ECC memory will fail only once every four thousand years due to memory bit errors. In addition, since ECC memory can detect multiple bit errors, data integrity of the machine is also improved.
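
The step from a per-server failure interval to a fleet-wide failure rate is simple division. The sketch below reproduces it, treating the intervals quoted above as averages; the function name is hypothetical.

```python
def fleet_failures_per_month(server_count: int, years_between_failures: float) -> float:
    """Average failures per month across a fleet, given the average interval
    between failures for a single server."""
    months_between_failures = years_between_failures * 12
    return server_count / months_between_failures


parity_fleet = fleet_failures_per_month(24, 2)      # parity memory
ecc_fleet = fleet_failures_per_month(24, 4000)      # ECC memory
```

Twenty-four servers that each fail once every 24 months average one failure per month. With the ECC figure, the same fleet averages one failure in roughly 2,000 months.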

IBM provides ECC memory as standard on most of its servers. For models that come with parity as standard, IBM provides ECC-on-SIMM (EOS) upgrades.

I/O Bus
Data integrity across the system I/O bus merits attention even though it's not specifically an availability issue.

Servers continuously check for errors and data corruption in memory using parity or ECC; on the disks using RAID and Predictive Failure Analysis; and on the processor and the cache using parity. Since the I/O bus is used to transfer data between these subsystems, it is equally important in ensuring reliability and availability.

ISA and EISA buses do not employ bus parity checking. That is, they have no logic to ensure that data put onto the bus by the sender are the same as data removed by the receiver. PCI and Micro Channel buses have parity checking that allows a receiving device to request that corrupted data be re-sent. This reduces data corruption, and most important, it stops the propagation of these errors across your business.
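
The idea of checking parity at the receiver and asking for a re-send can be shown with a toy model. The Python sketch below is not the PCI or Micro Channel signaling protocol; the channel function, the retry limit, and all names are assumptions made for illustration.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") % 2


def make_glitchy_channel(bad_transfers: int):
    """Hypothetical bus that flips the lowest data bit on the first few transfers."""
    remaining = [bad_transfers]

    def channel(word: int, parity: int) -> tuple[int, int]:
        if remaining[0] > 0:
            remaining[0] -= 1
            return word ^ 0b1, parity      # corrupted in transit
        return word, parity

    return channel


def transfer(word: int, channel, max_attempts: int = 3) -> tuple[int, int]:
    """Send a word; the receiver checks parity and requests a re-send on mismatch.

    Returns the received word and the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        received_word, received_parity = channel(word, parity_bit(word))
        if parity_bit(received_word) == received_parity:
            return received_word, attempt
    raise IOError("parity error persisted after all attempts")
```

A single parity bit catches any odd number of flipped bits but misses an even number, which is why memory subsystems use ECC rather than parity alone.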

IBM implements bus parity checking on the PCI bus and the MCA bus in all of its servers. The PC Server 720 implements parity protection on both the PCI and Micro Channel buses, as well as ECC protection on the internal SMP bus.

Network Interface Card
The server's network "on-ramp" is its network interface card (NIC). In many implementations, this adapter card remains a single point of failure; however, the task of implementing multiple NICs that access the same network is often easily supported by the network operating system without further hardware assistance.

IBM's OS/2 Warp Server, for example, supports multiple NICs that are active within the same server on the network. During normal operation, the network load is distributed among the active cards. This is a distinct advantage over hardware-based approaches that implement a "standby" card, which provides no value until a primary adapter card fails. In either case, if one card fails, the remaining card or cards are available to pick up the load. Additional protection against network component failures can be provided by attaching these cards via different physical paths to the network (separate hubs, for example).
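
The difference between active load sharing and a pure standby card can be illustrated with a small model. The sketch below assumes a simple round-robin policy; it is not how OS/2 Warp Server or any particular network operating system schedules traffic, and the class and adapter names are hypothetical.

```python
class NicTeam:
    """Round-robin selection across adapters, skipping any marked as failed."""

    def __init__(self, names: list[str]) -> None:
        self._order = list(names)
        self._healthy = {name: True for name in names}
        self._next = 0

    def mark_failed(self, name: str) -> None:
        self._healthy[name] = False

    def pick(self) -> str:
        """Return the adapter that should carry the next unit of traffic."""
        for _ in range(len(self._order)):
            name = self._order[self._next]
            self._next = (self._next + 1) % len(self._order)
            if self._healthy[name]:
                return name
        raise RuntimeError("no healthy NIC available")


team = NicTeam(["nic0", "nic1"])
before_failure = [team.pick() for _ in range(4)]   # load alternates between cards
team.mark_failed("nic0")
after_failure = [team.pick() for _ in range(3)]    # surviving card takes everything
```

Both cards carry traffic while healthy, so the second card earns its keep before any failure occurs; after a failure the survivor simply absorbs the full load.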

CPU
Machines that offer multiple processors are becoming commonplace. Typically, these machines are designed for high-performance applications that require more CPU power than is currently available from a single chip.

In some cases, such as the IBM PC Server 720, hardware watchdog timers are implemented to identify a failed CPU and initiate a system restart. In most cases, however, neither the operating system nor the hardware can manage such fault conditions in a way that would allow operation to continue if one CPU suddenly fails. In these instances, the operating system halts, and users on the network lose their connections. In some cases, the system must be manually reconfigured to take a failed CPU offline. Once the failed CPU is offline, the server can be restarted, and users can re-establish their connections.

Some industry offerings incorporate spare processors that take over in case of failure. These solutions are the exception, not the rule, of SMP implementations, and typically carry a significantly higher price.

If CPU failures are a concern, there are a number of other ways to protect the installation. The section on clustering multiple systems to improve fault tolerance gives more detail.

Systems Management
Another important aspect of improving availability is systems management. Regardless of how well availability is designed into the final system solution, it is critical to be able to manage that solution, predict problems before they occur, react to faults as they happen, and take appropriate recovery actions. The choice of an integrated management tool, not just for the PC server itself, but for all components of the chosen solution, is crucial.

IBM PC Servers come standard with TME 10 NetFinity, an award-winning systems management product with a great deal of function designed to keep the system up and running.

Clustering for Fault Tolerance
While single-system availability is important, it is not the sole factor in producing a highly available LAN. Many factors outside the scope of hardware reliability can cause unplanned system outages, including:
 * Software bugs
 * System configuration changes
 * Computer viruses
 * Environmental problems
 * Human error

You need to plan for these failures and design the appropriate recovery techniques if your LAN is going to meet or beat the availability numbers of the mainframe environment.

Recently, the idea of clustering multiple PC servers to achieve fault-tolerant solutions has become popular. A word of caution: the word cluster (as with many terms in our industry) can mean different things to different people. Here, we are not clustering machines in order to build a high-performance computer engine, as is the case with the IBM RS/6000 SP2 design. (Certainly this high-performance model is in the future for PC servers also, but for the present, the immaturity of the operating systems and their associated system interconnects precludes rapid progress.)

In the context of this discussion, clustering means linking two or more machines to provide system-level fault tolerance. A wide range of solutions in the marketplace provide such function, and for the most part, each solution works well in certain situations.

In the following sections, we review three popular solutions. After that, we consider (from a requirements point of view) different scenarios where each of these solutions applies.

IBM PC Server High-Availability Solution
IBM's PC Server High-Availability Solution, illustrated in Figure 1, uses StandbyServer software from Vinca Corporation and a dedicated 100 Mbps Ethernet link in conjunction with standard IBM PC Servers and a standard network operating system.



In this approach, Vinca software mirrors data in real time from a production server (called a primary server in Vinca terminology) to an online backup machine (called a standby server). Depending upon the operating environment, the backup server can be either dedicated to its standby function or can be an independent, active server on the network.

Systems management for this environment is crucial to successful operation. In the OS/2 Warp Server environment, for example, IBM's TME 10 NetFinity systems management software is configured to monitor the primary machine during normal operation via NetFinity's presence detect feature.

If the primary server fails, NetFinity signals the standby system to take over the primary role. This requires a slight change in the standby machine's configuration, followed by restarting the failed applications from standby. This automated sequence requires no operator intervention.
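
The monitor-then-take-over sequence can be sketched as a small state machine. The Python below is a generic illustration and not the NetFinity presence detect interface; the class name, the missed-check limit, and the callback are all assumptions.

```python
from typing import Callable


class PresenceMonitor:
    """Trigger a takeover once the primary misses several checks in a row."""

    def __init__(self, missed_limit: int, on_failover: Callable[[], None]) -> None:
        self.missed_limit = missed_limit
        self.on_failover = on_failover
        self.missed = 0
        self.failed_over = False

    def report(self, primary_responded: bool) -> None:
        """Record the result of one presence check."""
        if self.failed_over:
            return                      # takeover already happened; do nothing more
        if primary_responded:
            self.missed = 0             # a healthy reply resets the count
            return
        self.missed += 1
        if self.missed >= self.missed_limit:
            self.failed_over = True
            self.on_failover()          # e.g. reconfigure standby, restart applications


events: list[str] = []
monitor = PresenceMonitor(missed_limit=3, on_failover=lambda: events.append("takeover"))
for responded in [True, False, False, True, False, False, False, False]:
    monitor.report(responded)
```

Requiring several consecutive misses keeps a brief network hiccup from triggering a takeover, and the one-shot flag ensures the standby is promoted only once.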

After the standby machine takes over the primary's role, the failed machine can be diagnosed offline, repaired, then brought back online. The two systems are then re-mirrored to bring the disks back in sync. Depending upon the size of the mirrors, you may need to perform this operation after hours, because the mirroring operation can degrade performance.

It should be stressed that this is a high-availability solution rather than a continuous-availability solution, since there will be a short gap in service while the standby machine is brought online in the primary role. In most cases, users will not have to log off and then log on. The client code for Windows NT, OS/2, and the latest 32-bit requester code for NetWare will automatically reconnect.

The link used to mirror data between the primary and standby machines should be a dedicated connection, unless the amount of mirroring activity is very small. If a dedicated link is used, data are mirrored through separate adapters and a separate communications link. This removes the mirroring activity from the production LAN transport, thereby eliminating any potential performance impact to the network.

The IBM PC Server High Availability Solution ships standard with two IBM 100/10 PCI Ethernet Adapters and an interconnect cable to support a direct, dedicated link between the two servers. In addition, the Vinca StandbyServer software offers other options for linking the two systems. Under NetWare, for example, Vinca StandbyServer 2.0 supports link communications over any NetWare-supported NIC, including fiber, which offers additional advantages in speed and distance between the linked servers.

The IBM PC Server High Availability Solution is available for the OS/2 Warp Server, Novell NetWare, and Microsoft Windows NT environments. Its benefits are:
 * Full-server fault tolerance - Because data are mirrored to a completely separate computer, all server components are redundant. RAID can protect disk drives, and redundant power supplies can ensure continuous power, but the Vinca solution replicates all hardware components. In addition, since data are mirrored to a separate computer, the Vinca solution can even recover from a software failure on the primary server.
 * Primary server and standby machine do not need to be identical - A system administrator can configure a less expensive computer as the standby machine (perhaps using last year's 486 server) and a more expensive Pentium-based system as a primary server. When the primary server fails and the users are switched to the slower standby machine, they might notice a performance difference between the two servers, but they can continue to access their data.
 * Full automation of server failover - When the primary server fails and the standby machine is initialized as the main machine, no manual intervention is required. This is advantageous when the support staff is unavailable, or even nonexistent, such as at remote sites.
 * Remote notification - Using IBM's TME 10 NetFinity, alerts and other system notifications can be sent to a management console or to a digital pager. This makes it convenient to monitor system status in environments where the administrator is off site or is responsible for several networks at different locations.
 * Offline diagnosis and problem resolution - Implementation of a hot-spare server allows an administrator to recover quickly from a system failure by using the standby machine to restore service to users. Then, the failing machine can be taken offline and diagnosed without affecting users' level of service. After that, the support staff will have time to analyze why the server failed and determine how to prevent the problem in the future.

Novell NetWare SFT III
Like the Vinca solution, NetWare SFT III is a mirrored server solution that automatically takes information from one server and duplicates its disk and memory image to a standby server connected by a mirrored server link (MSL).

The Novell solution, however, has one important distinction; it is a continuous-availability solution. If the active server has a hardware failure, the second server begins serving clients immediately, without losing a connection to any client machine.

The faulty server can then be shut down and serviced. When service is completed, the server is returned online, and the two machines are re-mirrored to bring their disks back in sync. As is the case with the Vinca solution, the re-mirroring operation can degrade performance, so if you have large, mirrored disks, you may want to do it after hours.

This solution permits complete redundancy of all server components. Because each file transaction is duplicated on each server, both machines have complete, up-to-the-second versions of all the data necessary, in case they have to operate independently.

While this solution can provide higher reliability, it can be more expensive to implement, because all server hardware must be completely duplicated in an identical configuration. As application environments grow larger, the cost of duplicating the disk alone could make this solution prohibitively expensive. Also, as a NetWare-only solution, it does not work in OS/2 Warp Server or Windows NT environments. IBM provides support for NetWare SFT III across its server line in conjunction with OEM fiber adapters.

Shared External Disk
With this solution, each primary server uses external direct access storage devices (DASD) to store end-user applications and data. The external disks can be taken over by a standby machine if the primary machine fails.

One advantage of this solution is that multiple primary servers can be backed up with one standby machine. This solution requires a method of connecting the external disks to the primary machine in such a way that it can be easily switched over to the backup machine when a failure occurs. There are several ways to do this:
 * Manually disconnect the SCSI cables from the primary, then reconnect them to the backup machine
 * Through an A/B switch box for SCSI interfaces
 * Via the IBM ServeRAID Adapter
 * Using serial storage architecture (SSA)

Manual Reconfiguration
Manual reconfiguration requires the support staff to disconnect the SCSI cables from the primary server and reconnect them to the backup machine in the event of a failure. The problem with this approach is obvious - it requires a system administrator to be available immediately to reconfigure the system. In situations where every second counts, this is an unrealistic expectation. At remote locations, there may be no support personnel who can make the switch.

Another manual method is to remove the disk drives from the failing machine and place them in a spare server set aside for this purpose. With this method, one spare server could potentially back up many other servers. This can be accomplished without using external disks, although using a hot-swap chassis and disk drives would make this solution more feasible. This method has the same inherent problems as the previous method, but it is a low-cost solution if you have the personnel to make it happen.

ApCon PowerSwitch
The PowerSwitch is a high-performance, electronic, cross-point SCSI switch manufactured by ApCon, Inc. (formerly Applied Concepts). It enables multiple, independent SCSI buses to be electronically selected and connected via internal switching circuits. This setup allows two computers to access external SCSI devices in much the same way that a printer A/B switch allows two computers to share a single printer; therefore, the need to manually swap and reconfigure SCSI cables and bus terminators is eliminated.

Figure 2 shows an example of an ApCon solution. During normal operations, the primary server accesses data on the external SCSI enclosure through the "A" port connection. If a failure occurs, the switch is re-configured to connect the standby machine to the disks through the "B" port on the switch.

As in the Vinca solution, IBM's TME 10 NetFinity systems management software is configured to monitor the primary machine via NetFinity's presence detect feature. If the primary server fails, NetFinity sends simple ASCII commands out through the serial port to the PowerSwitch, telling it to switch the SCSI devices from the primary to the backup server. Next, NetFinity sends an alert to the backup machine, which has been configured to execute a reboot sequence upon receiving this alert. This sequence is automated and requires no operator intervention.

The backup server has a different boot configuration when it functions as a standby, versus when it switches roles to become the primary. When it boots as the backup machine, it boots from its internal hard disk. When it boots as the primary, it boots from the external disk. This switch is accomplished by changing the machine's startup sequence to boot first from the external disk, then from the internal drive.
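
One way to read this is that a single startup sequence yields both roles, depending on which disks the switch currently presents to the backup machine. The short sketch below illustrates that reading; the device names and the function are hypothetical.

```python
from typing import List, Optional, Set


def pick_boot_device(startup_sequence: List[str],
                     attached: Set[str]) -> Optional[str]:
    """Return the first device in the startup sequence that is reachable."""
    for device in startup_sequence:
        if device in attached:
            return device
    return None


SEQUENCE = ["external", "internal"]  # external disk first, then internal

# Standby role: the switch connects the external disks to the primary,
# so the backup machine sees only its internal drive.
standby_boot = pick_boot_device(SEQUENCE, {"internal"})

# After failover: the switch now connects the external disks to the backup.
primary_boot = pick_boot_device(SEQUENCE, {"internal", "external"})

print(standby_boot, primary_boot)
```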

As in the Vinca solution, service is disrupted slightly while the standby machine is brought online in the primary role; however, in most cases, users do not have to log off and then log on. The client code for Windows NT, OS/2, and the latest 32-bit requester code for NetWare will automatically reconnect to the backup when it comes online.

This approach has a few limitations. One is the physical distance that can be supported between the servers and the external SCSI devices. This distance is limited by the ability of the SCSI adapters to drive the SCSI bus signals. The supported cable length depends upon several factors, including the speed of the interface (for example, 5 MBps or 10 MBps) and whether it is single-ended or differential SCSI. Always check the product specifications.

If you need greater distances between the two servers than what is supported, SCSI extenders are available from several vendors.

This solution works best if both the primary and backup machines have identical hardware configurations, because each server boots from the 3518 external DASD when acting as the primary. This means that the network operating system image on the 3518 must see the same hardware configuration, regardless of whether the primary or the backup server is in control of the disks.

IBM ServeRAID Adapter
The IBM ServeRAID Adapter is IBM's newest RAID controller. It is a 32-bit PCI adapter that can burst data across the PCI bus at 132 MBps. On the SCSI side, it implements the UltraSCSI interface and is capable of a 40 MBps peak data-transfer rate between the SCSI device and the adapter. With three such independent SCSI channels, it can attach up to 45 devices on one adapter.
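
The figures quoted above can be checked with simple arithmetic. The sketch below assumes a 33 MHz PCI clock, a 16-bit (wide) UltraSCSI bus clocked at 20 MHz, and 15 attachable devices per wide channel (16 SCSI IDs less one for the adapter); none of these assumptions is stated in the article.

```python
PCI_WIDTH_BYTES = 4        # 32-bit bus
PCI_CLOCK_MHZ = 33         # assumption: standard 33 MHz PCI clock
pci_burst_mbytes_per_s = PCI_WIDTH_BYTES * PCI_CLOCK_MHZ

SCSI_WIDTH_BYTES = 2       # assumption: 16-bit (wide) UltraSCSI
SCSI_CLOCK_MHZ = 20        # assumption: UltraSCSI transfer clock
scsi_peak_mbytes_per_s = SCSI_WIDTH_BYTES * SCSI_CLOCK_MHZ

CHANNELS = 3
DEVICES_PER_CHANNEL = 15   # assumption: 16 SCSI IDs minus one for the adapter
max_devices = CHANNELS * DEVICES_PER_CHANNEL

print(pci_burst_mbytes_per_s, scsi_peak_mbytes_per_s, max_devices)  # 132 40 45
```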

The RAID controller on the adapter is based upon the PowerPC 403 RISC chip and supports RAID levels 0, 1, and 5 at a configurable interleave depth of 8, 16, 32, or 64 KB. The RAID support also includes several features suitable for use in high-availability situations, one of which is the ability of the adapter to read RAID configuration data from the array itself. It can also detect any changes to the array configuration and can automatically update the configuration data.

An example of where this feature helps is a drive failure. If a hot-spare drive exists in the array, the controller maps it into the array, then automatically rebuilds the array. When the failed drive is replaced, the new drive becomes the hot spare, and the array information is updated while the machine is up and running. (In contrast, on previous IBM RAID controllers, this information was stored on the adapter in electrically erasable programmable read-only memory [EEPROM] and had to be updated each time the RAID configuration was changed. This involved bringing the machine down and then booting a configuration utility to reconfigure the array.)
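
The hot-spare behavior can be modeled as a small state change on the array's membership list. The sketch below is a simplified illustration, not the ServeRAID firmware's logic; the class, the drive names, and the "MISSING" placeholder are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Array:
    members: List[str]
    hot_spare: Optional[str] = None
    events: List[str] = field(default_factory=list)

    def drive_failed(self, drive: str) -> None:
        """Map the hot spare into the failed member's slot and rebuild."""
        slot = self.members.index(drive)
        if self.hot_spare is None:
            # No spare available: the array keeps running in degraded mode.
            self.members[slot] = "MISSING"
            self.events.append("array degraded")
            return
        self.members[slot] = self.hot_spare
        self.hot_spare = None
        self.events.append(f"rebuilding onto {self.members[slot]}")

    def drive_replaced(self, new_drive: str) -> None:
        """A replacement fills a missing slot, or else becomes the hot spare."""
        if "MISSING" in self.members:
            slot = self.members.index("MISSING")
            self.members[slot] = new_drive
            self.events.append(f"rebuilding onto {new_drive}")
        else:
            self.hot_spare = new_drive
            self.events.append(f"{new_drive} is now the hot spare")


array = Array(members=["d0", "d1", "d2"], hot_spare="d3")
array.drive_failed("d1")     # d3 takes over d1's slot
array.drive_replaced("d4")   # d4 becomes the new hot spare
print(array.members, array.hot_spare)
```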

As for fault tolerance, the IBM ServeRAID adapter allows you to twin-tail two IBM ServeRAID adapters from different machines to the same external SCSI expansion tower. (Figure 3 shows this type of configuration.) Using this method, if the primary machine fails, the backup dynamically takes control of the DASD and updates the array configuration without requiring a reboot operation. Of course, the operating system must be able to dynamically mount new drives; Windows NT and NetWare already do that.

Serial Storage Architecture
Serial storage architecture (SSA) is already being used in the RS/6000 environment and is an emerging technology in the PC server arena. As Figure 4 shows, this architecture is based upon a two-way serial communication loop that interconnects controllers and devices over an 80 MBps daisy-chained link.

Because of the bi-directional nature of the links, fault tolerance is built into the architecture. If one device in the chain breaks, the data simply travels around the loop in the opposite direction to reach its destination. This works much the same way as the backup path on a token-ring network.
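
The rerouting idea can be shown with a toy model of a loop. The sketch below is not the SSA protocol itself; it simply numbers the nodes on a ring, tries one direction, and falls back to the other if a failed node blocks the way. The function names and node numbering are hypothetical.

```python
from typing import List, Optional, Set


def walk(n: int, src: int, dst: int, step: int,
         failed: Set[int]) -> Optional[List[int]]:
    """Follow the loop of n nodes from src to dst in one direction
    (step is +1 or -1). Return None if a failed node blocks the way."""
    path = [src]
    node = src
    while node != dst:
        node = (node + step) % n
        if node in failed:
            return None
        path.append(node)
    return path


def route(n: int, src: int, dst: int,
          failed: Set[int]) -> Optional[List[int]]:
    """Try one direction around the loop, then the other."""
    for step in (1, -1):
        path = walk(n, src, dst, step, failed)
        if path is not None:
            return path
    return None


print(route(6, 0, 2, set()))    # direct path through node 1
print(route(6, 0, 2, {1}))      # node 1 is down, so go the other way round
print(route(6, 0, 2, {1, 4}))   # both directions blocked
```

A single failure leaves every surviving node reachable; only a second failure on the remaining path isolates part of the loop.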

Since the architecture permits multiple controllers to participate in the loop, the implementation of a shared-disk approach using SSA is quite feasible. Figure 5 illustrates an SSA domain where all systems can access not only their own disks but the disks on the other systems as well.

Sample Scenarios Using High-Availability Solutions
The preceding section discussed a few methods available for clustering multiple servers into a fault-tolerant configuration. These methods were highlighted here due to their richness of function and general-purpose application; however, a vast number of other approaches can produce viable high-availability implementations in the PC server environment. These solutions range from basic alert notification provided by systems management tools to application-specific, fault-tolerant features integrated into database and groupware applications.

With so many possibilities to choose from, the system administrator faces the daunting task of sorting through the choices and selecting the right solution. Ultimately, there is no single best option, such as the best file server approach or the best database solution. Instead, you must evaluate the trade-offs, which depend upon the application and the environment that you want to protect.

This section presents several realistic scenarios that call for a range of approaches to fault tolerance. In each case, we outline the requirements and discuss the reasons for choosing the indicated solution. Our scenarios provide examples of how you can implement high availability. The solutions we present may or may not match your needs.

Scenario 1: The Establishment File Server
Requirements - 400 Windows-based clients attached to an IBM PC Server 520 running Windows NT 3.51 share access to data and to personal productivity applications (Lotus 1-2-3, WordPerfect, and Microsoft Project). The corresponding end-user data files associated with these products are stored on the server in both public and private areas. The server contains 6 GB of disk storage.

If the server is unavailable, users' productivity is immediately impacted; however, as long as access to the applications and shared data areas is returned within minutes of an outage, the productivity loss is minimized. Also, in this environment, it is quite acceptable to perform maintenance on the machine after normal working hours.

Solution - This situation is appropriate for the IBM PC Server High-Availability Solution for these reasons:
 * Recovery time from a failed server is minutes rather than hours. When 400 users are down, you need a recovery procedure that lets you restore service as quickly as possible. The cost of the additional hardware is justified when compared to the cost of lost productivity due to an inoperative file server.
 * The standby machine does not need the same configuration as the primary. In this case, a lower-cost IBM PC Server 320 serves as the backup. Both machines, however, have the same disk configuration.
 * With the Vinca software, the end-user data is mirrored in real time to the standby machine. This means if a failure does occur and the standby is activated, users still have access to their latest work. Contrast this to an approach that requires data to be reloaded from tape; the end users would lose any data they had saved since the last tape backup.
 * 6 GB is an acceptable amount of data to be mirrored. It might matter more if there were 60 GB, because the cost of providing 60 GB of mirrored data is substantial. Also, the time it takes to re-mirror 6 GB is not long, and it can be done after hours during the re-introduction of the primary machine.

Scenario 2: The Mission-Critical Application Server
Requirements - A NetWare 4.1 server provides file and application service for 500 OS/2-based clients. The application is a mission-critical, LAN-based, commodities-tracking system. Any outages could result in lost opportunities and lost business. Downtime can be scheduled during off-hours, but 100 percent availability during business hours is required. The disk-storage requirement is 10 GB.

Solution - This scenario is a good match for NetWare SFT III, for these reasons:
 * With the requirement for 100 percent uptime during operating hours, you need a fully redundant solution. SFT III provides continuous availability in the NetWare environment.
 * As in the first example, 10 GB is feasible for a mirroring solution. Again, the re-mirroring can be done after hours.

Scenario 3: A Customer Service Voice-Response Application
Requirements - A customer-service database running on OS/2 Warp Server is connected to a voice-response unit that customers access over the phone. Customers can place orders, check the status of existing orders, verify account balances, and update information such as address and billing information. Service has to be available 24 hours per day, although short disruptions can be tolerated at a slight inconvenience to customers. The size of the database is 30 GB.

Solution - This scenario is a good match for an external shared disk, for these reasons:
 * The 24x7 nature of the operation precludes server maintenance downtime. With a shared-disk approach, the standby server can be used during maintenance of the primary machine.
 * A 30 GB database is almost too large for a redundant solution. Complete redundancy requires twice the amount of disk space, so the main concern here is cost. Another concern is performance, because the server resources required to keep two databases of this size in sync can be substantial. With a shared-disk approach, the information can be stored in a RAID 5 array that protects the data but requires substantially less overhead than a mirrored approach.
 * If a mirrored solution were implemented and a machine failed, the two databases would have to be re-synchronized when the failed machine was re-introduced. With a 30 GB database, this would take a substantial amount of time, and either both machines would have to be taken offline for this operation or performance would significantly decrease during the re-mirroring.
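
The disk-space side of this trade-off can be made concrete. The sketch below compares the raw capacity needed for the 30 GB database under full mirroring and under RAID 5; the six-drive array is a hypothetical example, and the calculation ignores hot spares and formatting overhead.

```python
def mirrored_raw_gb(usable_gb: float) -> float:
    """Mirroring keeps two complete copies of the data."""
    return 2 * usable_gb


def raid5_raw_gb(usable_gb: float, drives: int) -> float:
    """RAID 5 gives up one drive's worth of capacity to parity."""
    if drives < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return usable_gb * drives / (drives - 1)


print(mirrored_raw_gb(30))   # 60
print(raid5_raw_gb(30, 6))   # 36.0
```

Under these assumptions, mirroring doubles the raw capacity required, while the RAID 5 array adds one-fifth.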

Scenario 4: The Multimedia File Server
Requirements - A NetWare 3.12 multimedia server is used for training. Workers on an assembly line use the server to call up video clips that explain manufacturing processes and show how to perform corrective actions when parts are out of spec. The multimedia content is stored on a CD-ROM, which is shared from the NetWare server. The application needs high availability to keep the assembly line running smoothly.

Solution - Because the digital content is on a CD-ROM, the easiest method of providing fault tolerance is to have another NetWare server with a CD-ROM drive that can be accessed in the event of a server failure. This backup server could normally be used to provide service to another user group, but it could also serve as a backup if the primary machine fails.

Scenario 5: The Enterprise Information Base
Requirements - A Lotus Notes-based product information database is replicated to 850 auto dealerships for access by local personnel. Product data is updated once a week from the corporate office. Disruptions in service of 30 minutes or less are acceptable. Outages of more than two hours can severely impact the business. The size of the database is 15 GB.

Solution - In this example, fault-tolerance features are built into the application. In Lotus Notes, database replication is done between servers automatically. The servers are configured such that, when the master database is updated each week, it contacts the other servers and sends only the changed data.

In this case, any of the servers can be used to retrieve the information. If one database fails, users can simply dial into another location via a communication server that has been configured to access an alternate site.

Scenario 6: A Branch-Office Application Server
Requirements - IBM PC Server 320s running NetWare 4.1 are used as a platform for a mission-critical client/server application. The environment consists of 40 enterprise branch offices spread across a major metropolitan area. At each location, ten OS/2 Warp-based clients access a server that is part of one NetWare NDS tree. The critical nature of the data requires each location to be backed up by another location that can provide processing and data access when necessary. The business leases FDDI links from the local telephone company to connect the branches.

Solution - The Vinca StandbyServer 2.0 for NetWare, with a fiber connection to link the servers, would satisfy the requirements presented in this scenario.

Scenario 7: A Departmental Database Server
Requirements - IBM PC Server 720s running Windows NT provide support for a large customer-service call center. A database holds customer account information, and the majority of the transactions retrieve current account balances and recent transaction histories. Account update transactions, such as address changes, are held until after hours and are executed in batch mode each night. In addition, the building has an annoying environmental problem that intermittently brings servers down while leaving the rest of the users' systems and network in the building operational.

Solution - This is another situation where the IBM PC Server High-Availability Solution could be used. A dedicated 100 Mbps Ethernet link between the primary and the standby will permit the backup server to be on a different floor with different environmental factors. Because the database is used mainly for read operations, there would not be excessive traffic on the mirrored link.

Scenario 8: A Corporate LAN Environment
Requirements - This example illustrates how a combination of all of the preceding techniques can provide high availability for an entire corporate LAN environment.

In this scenario, a company uses OS/2 Warp Server to deliver file and print services, as well as to provide a platform for communication, collaboration, software distribution, and database functions. Twelve servers participate in an OS/2 Warp Server domain that services 1,000 users in five different locations. The company also has an IBM mainframe and two LAN-attached AS/400s located at the corporate office.

Many of the business processes that used to run on mainframe systems have now been migrated to the client/server environment. The company's chief information officer has committed to senior management that the LAN environment will be no less reliable than the mainframe environment. This translates into a requirement that the overall availability of the LAN must be at least 99.5 percent during normal operating hours.
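
To see what a 99.5 percent target allows, the availability percentage can be turned into a yearly downtime budget. The sketch below assumes 16 scheduled hours a day (the 7 a.m. to 11 p.m. example used earlier in this article) for 365 days a year; the scenario itself does not state its operating hours.

```python
def downtime_budget_hours(scheduled_hours: float,
                          availability_pct: float) -> float:
    """Hours of unplanned outage allowed within the scheduled hours."""
    return scheduled_hours * (100.0 - availability_pct) / 100.0


# Assumption: 16 scheduled hours a day (7 a.m. to 11 p.m.), 365 days a year.
SCHEDULED_HOURS_PER_YEAR = 16 * 365

for target in (99.0, 99.5, 99.95):
    budget = downtime_budget_hours(SCHEDULED_HOURS_PER_YEAR, target)
    print(f"{target}% -> {budget:.1f} hours per year")
```

Under these assumptions, the 99.5 percent commitment leaves about 29 hours of unplanned outage per year, while a 99.95 percent high-availability target would leave only about 3 hours.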

Solution - This situation calls for using a combination of techniques. Some of the fault-tolerant requirements are satisfied by built-in features of the hardware and software products. Others have to be designed as part of the overall LAN structure.

Let's look at the built-in features first. Both OS/2 Warp Server and Lotus Notes provide fault-tolerant features to help achieve high-availability goals. OS/2 Warp Server uses the domain concept, which presents the LAN as a single-system image. The domain comprises a primary domain controller and other servers. The primary domain controller maintains the master domain control database. It is automatically backed up by one or more backup domain controllers.

In this solution, the primary domain controller is located at corporate headquarters, while one of the servers at each location is configured as a backup domain controller. In this way, the backup domain controller can provide logon authentication for all users at that location, which increases logon speed.

As the domain database is updated, the changes are automatically replicated to all servers that are members of the domain. If the primary domain controller fails, the backup domain controllers continue to provide authentication and security functions for the domain. The role of one backup controller is temporarily changed to the primary role, so that a master domain control database is still available to handle changes to the domain.

Lotus Notes provides the collaboration functions. As in a previous example, Notes database replication is handled automatically between all Notes servers. One server in each location is loaded with Notes, and it participates in the database replication with the other servers. If any machine fails, the clients have a backup Notes server that they can access over the network.

The communication server is located on the corporate backbone and provides a gateway to the host. It is backed up via a hot-spare machine dedicated to this purpose. The backup server is an older machine previously used as a file server at one of the remote locations.

The backup gateway can be implemented in several ways. In the first way, all client machines are configured to use the primary communication server at a specific token-ring address. If the primary machine fails, the backup machine is brought online with the same token-ring address. This is an automated process using TME 10 NetFinity to monitor the primary server. If the primary server fails, NetFinity sends an alert to the backup machine, instructing it to reboot with the configuration of the primary communication server. Users will have to re-establish their host sessions, but the change to the backup server is otherwise transparent.

Another, more sophisticated way is to use the high performance routing (HPR) function within the IBM Communications Server for OS/2 Warp. HPR is a feature of APPN that dynamically re-routes connections around network failures. In this case, both "gateways" are configured as APPN network nodes. The clients use the Communications Server's dependent LU requester (DLUR) feature to allow the 3270 sessions to use HPR. If one of the network nodes fails, HPR reroutes the sessions in a non-disruptive manner.

To meet the requirements for the file, print, and database functions, the shared-disk approach is implemented. At each location, two IBM PC Server model 310s in production are backed up by an additional PC Server 310. Each production server is attached to an IBM 3518 SCSI expansion tower with a redundant power supply and 40 GB of disks configured as RAID 5 arrays.

All servers and 3518s are connected through an ApCon PowerSwitch such that, if a primary machine fails, the backup machine takes control of its 3518 and temporarily assumes the primary machine's role. This gives the support staff time to diagnose the problem and correct it.

The RAID 5 arrays in the 3518 provide ample data protection. If a drive fails, the server continues to operate by re-creating the data from the failed drive on the fly from check bits that have been stored on the other drives. While this is totally transparent to users, the system is configured to send an alert to a NetFinity systems management console at the corporate support center to notify the staff that a drive has failed.
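
The on-the-fly reconstruction relies on parity: the check data for a stripe is the bitwise XOR of the data blocks, so any one missing block equals the XOR of everything that survives. The sketch below shows the principle on one stripe with made-up block contents; it ignores how a real RAID 5 controller rotates parity across the drives.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column)
                 for column in zip(*blocks))


data = [b"ACCT", b"BALN", b"HIST"]   # three data blocks in one stripe
parity = xor_blocks(data)            # check data held on a fourth drive

# The drive holding data[1] fails: rebuild its block from the
# surviving data blocks plus the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```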

Overall, the techniques applied in this scenario cost-effectively provide the appropriate level of protection. Each functional service provided by a PC Server has been analyzed, and the appropriate techniques have been employed to protect it.

The PC Server model 310 is a low-priced server that offers high performance, but more limited expansion capabilities; however, when used in conjunction with the external 3518s, the DASD expansion capabilities are more than adequate to handle file, print, and database services.

In this scenario, to further reduce costs, the machine redeployment technique was also used. As new hardware is purchased, existing machines can be redeployed to provide redundancy of network resources.

Managing LAN Availability
This article has presented some concepts and techniques that you can use to increase your LAN's availability. Many companies already use these techniques.

As the technology for availability management rapidly advances, the future will no doubt deliver many new and exciting techniques that will help satisfy your critical requirements for LAN availability.

Resources
Portions of this article come from the new IBM redbook PC Server High Availability Techniques, SG24-4858. This book can be ordered via the PUBORDER application on IBMLink or by calling (800) 879-2755. The redbook covers all of the concepts presented in this article in greater detail.

Tim Kearby is an advisory technical support specialist in the IBM International Technical Support Organization in Raleigh, North Carolina. He conducts residencies and writes redbooks about IBM PC Servers and network operating systems. Tim's various positions during his IBM career include assignments in product development, system engineering, and consulting. He holds a BS in Electrical Engineering from Purdue University.

David Laubscher is a senior systems engineer in the PC Server group, IBM PC Company, Research Triangle Park, North Carolina. Dave is team leader for the PC Server Competency Center and focuses on high-end servers and clustering solutions. His IBM career has included both development programming and field marketing. He holds a BS in Computer Science and an MS in Business Administration from the Pennsylvania State University.