Fault Tolerance for LAN Server

By Randall Johnson

''OS/2 product manager for Vinca Corporation, Randall Johnson discusses the importance of data security and availability in corporations today. The author shows how a small investment in a fault-tolerant system can significantly reduce lost productivity due to server downtime.''

For years, the computer industry has appreciated the importance of the billions of magnetic bits precariously stored on their departments' disk drives. Today, however, companies not directly involved in the computer industry are discovering their dependency upon software systems to manage databases and applications such as inventory, payroll, accounting, word processing, order entry, and contract management.

The day-to-day operational data is increasingly regarded as a significant corporate asset. Because the cost of storage media has plummeted to the point where hard-disk drives now cost fewer cents per megabyte than floppy disks, it is safe to say that the content is much more valuable than the container. The time cannot be far off when a corporation's stored data will be valued, insured, and given its own ledger entry on the company's balance sheet.

The past decade could be called the decade of the local area network (LAN). Distributing processing power to the corporate desktop has been a tremendous boon to productivity, but the focus on distributed computing has overshadowed the equally important issue of protecting corporate data assets. Many companies that moved strongly into LAN computing are living with a kind of information anarchy and are searching for ways to get control of their data. The call to arms in the decade of the LAN was "the network is the computer," but it is time to remember the purpose behind it all--that the data is the reason!

Are You Being Served?
As the dependence upon stored data increases, so does apprehension about the reliability of the systems that store, manage, and retrieve it. When data becomes unavailable, companies lose money. Consider these statistics from a survey of Fortune 1000 companies conducted by Stratus Computer Inc.:
 * The typical system outage lasts for an average of four hours and costs an average of $329,000 in lost revenue and worker productivity.
 * Computer downtime cost U.S. businesses more than $3.8 billion in lost revenue and worker productivity in 1992 (the last year for which such research data was available).
 * The average hourly revenue loss from downtime is $78,000.
 * Major businesses lost approximately 38 million worker hours annually, or $444 million in wages, due to downtime.

Networked storage systems serve tens, hundreds, and even thousands of connected clients. A file server failure idles thousands of workers, resulting in the revenue losses illustrated by the 1992 statistics. Maintaining data availability is clearly an information system (IS) priority. The future will bring even greater data re-centralization as corporations plug into data warehouses--large repositories of information that can be mined for important relationships and indicators.
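To see how the survey figures relate to one another, the short Python sketch below recombines them. The constants are the numbers quoted in the survey; the variable names, and the reading that the revenue share of a typical outage is simply four hours at the average hourly rate, are assumptions made for illustration.

```python
# Figures quoted from the Stratus survey (1992 data).
HOURLY_REVENUE_LOSS = 78_000    # dollars of revenue lost per hour of downtime
OUTAGE_HOURS = 4                # length of a typical outage, in hours
OUTAGE_COST = 329_000           # dollars per outage, revenue plus productivity
LOST_WORKER_HOURS = 38e6        # worker hours lost per year
LOST_WAGES = 444e6              # dollars of wages lost per year


def revenue_loss(hours, hourly_loss=HOURLY_REVENUE_LOSS):
    """Revenue lost during an outage of the given length."""
    return hours * hourly_loss


# Assumption: a typical outage loses revenue at the average hourly rate.
revenue_part = revenue_loss(OUTAGE_HOURS)
# Whatever remains of the quoted outage cost is attributed to productivity.
productivity_part = OUTAGE_COST - revenue_part
# Average wage implied by the annual lost-hours and lost-wages figures.
wage_per_hour = LOST_WAGES / LOST_WORKER_HOURS

print(f"Revenue lost in a typical outage: ${revenue_part:,}")
print(f"Remainder attributed to productivity: ${productivity_part:,}")
print(f"Implied wage per lost worker hour: ${wage_per_hour:.2f}")
```

Under that reading, revenue accounts for $312,000 of the $329,000 quoted for a typical outage, and the wage figures imply a cost of a little under $12 per lost worker hour.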

The IS manager must consider three distinct types of solutions to mitigate the effects of failures in networked storage systems: fault tolerance, backup, and disaster recovery. Of the three, fault tolerance is the first line of defense against system failures. Adding redundancy allows a fault-tolerant system to handle gracefully a failure in any component for which a spare is provided. Sophisticated systems allow the spare to be used to balance the load until a failure occurs. At that time, the remaining component picks up the full load with a concomitant decrease in performance, but little or no interruption in service.

Most file servers allow for multiple network interface cards (NICs) to provide multiple, redundant paths between server and clients. Disk mirroring keeps a copy of important data on a second disk drive. In the event of a drive failure, the file system continues to provide access to the data using the remaining healthy drive.
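The mirroring idea can be sketched in a few lines of Python. This is a toy model only: the `Disk` and `MirroredVolume` classes are hypothetical names invented for illustration, the "disks" are in-memory dictionaries, and no real file system or driver works this simply.

```python
class Disk:
    """Toy in-memory disk: maps a block number to its contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.failed = False

    def write(self, block_no, data):
        if self.failed:
            raise IOError(f"{self.name} has failed")
        self.blocks[block_no] = data

    def read(self, block_no):
        if self.failed:
            raise IOError(f"{self.name} has failed")
        return self.blocks[block_no]


class MirroredVolume:
    """Writes every block to all healthy disks; reads from the first healthy one."""

    def __init__(self, primary, mirror):
        self.disks = [primary, mirror]

    def write(self, block_no, data):
        healthy = [d for d in self.disks if not d.failed]
        if not healthy:
            raise IOError("no healthy disk left")
        for disk in healthy:
            disk.write(block_no, data)

    def read(self, block_no):
        for disk in self.disks:
            if not disk.failed:
                return disk.read(block_no)
        raise IOError("no healthy disk left")


volume = MirroredVolume(Disk("drive0"), Disk("drive1"))
volume.write(7, b"payroll record")
volume.disks[0].failed = True      # simulate a failure of the first drive
print(volume.read(7))              # still served, now from the mirror
```

Every write costs two disk operations, which is the price of being able to keep serving reads after one drive is lost.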

Duplexing extends the mirroring idea to host adapter cards. In this case, the mirrored data is kept on a drive connected to the server through a second host adapter. This guards against failures in the host adapters as well as the drives. A more sophisticated redundancy scheme--RAID level 5--stores parity information rather than a full second copy of the data, which wastes less disk capacity, but it is not as well suited for duplexing.
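RAID level 5 obtains its redundancy from parity rather than from a full second copy, which is why it uses less disk capacity. The Python sketch below shows the underlying XOR arithmetic on a single stripe of three data blocks. It is a simplified illustration, not a RAID implementation: real RAID 5 arrays rotate the parity block across the drives, and the function name `xor_blocks` is invented here.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks, each on its own drive, plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the drive that held data[1]. XOR-ing the surviving data
# blocks with the parity block reproduces the missing block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```

In this four-drive example, one drive's worth of capacity is spent on redundancy, whereas mirroring the same data would spend half of the total capacity on it.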

None of these techniques protects against problems with the server platform itself. In addition to memory parity, power supply, and other hardware-related faults, servers are also susceptible to a wide range of software-related errors. Server redundancy is accomplished by adding a second server with access to the network and an up-to-date copy of the data, ready to step in when needed. Even though complete server system redundancy is still a relatively new science, an IS manager can today provide server fault tolerance that is effective from both a cost and a performance perspective.

Stand By Your LAN
StandbyServer 32 for LAN Server from Vinca Corporation provides a fault tolerance solution for IBM LAN Server installations. To do this, a second, standby machine is set up and connected to the primary server with a dedicated, high-speed link. An up-to-date copy of important data is kept on the disk drives in the standby machine using LAN Server's native disk mirroring (duplexing). In the event of a failure in the primary server, the standby machine automatically steps in to run LAN Server and provide access to the networked data.
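The article does not describe how the standby machine decides that the primary has failed. One common approach in standby designs is a heartbeat with a timeout, and the Python sketch below illustrates that general idea only. The `StandbyMonitor` class, its method names, and the five-second timeout are all hypothetical and are not taken from the Vinca product.

```python
class StandbyMonitor:
    """Tracks heartbeats from a primary and triggers a takeover after a timeout."""

    def __init__(self, timeout_s, on_failover):
        self.timeout_s = timeout_s      # silence tolerated before takeover
        self.on_failover = on_failover  # callback that starts the takeover
        self.last_heartbeat = None
        self.active = False             # True once the standby has taken over

    def heartbeat(self, now):
        """Record that the primary was heard from at time `now` (seconds)."""
        self.last_heartbeat = now

    def check(self, now):
        """Take over if the primary has been silent for longer than the timeout."""
        if self.active or self.last_heartbeat is None:
            return self.active
        if now - self.last_heartbeat > self.timeout_s:
            self.active = True
            self.on_failover()
        return self.active


events = []
monitor = StandbyMonitor(timeout_s=5.0,
                         on_failover=lambda: events.append("standby took over"))
monitor.heartbeat(now=0.0)
monitor.check(now=3.0)    # 3 s of silence: primary still considered alive
monitor.check(now=9.0)    # 9 s of silence exceeds the 5 s timeout
print(events)             # prints ['standby took over']
```

Passing the current time in as an argument keeps the sketch deterministic; a real monitor would read a clock and would also need to guard against both machines believing they are the active server.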

StandbyServer 32 for LAN Server operates at the block level, below the file system, ensuring that the data on both servers are exact duplicates on an I/O-request-by-I/O-request basis. Systems that rely on file copying or replication suffer from long latencies and do not handle open files.

The persistent connection capability of LAN Server's client requester makes switching between servers painless for users. The impact of a failure in the primary server will vary from application to application, but with a switch-over time of just over one minute, work interruption is minimal, if it is noticed at all.

StandbyServer 32 for LAN Server uses NetFinity (an IBM network management tool set) to generate system alerts that can be used to notify the system administrators of selected events, by pop-up menu, pager, or even audio .WAV files. The switch-over and notification system is completely open and easily customized to fit any particular site installation.

While the standby machine does not provide load balancing in connection with the primary, it can be used for other, independent tasks. The standby machine can be configured to stand in for a domain controller, backup domain controller, or simply an additional server. The standby server platform need not be the same type of machine as the primary, offering a practical answer to the question of what to do with the old server after upgrading to a more powerful model.

Warp Server and StandbyServer 32 for LAN Server
IBM has recently taken an encouraging step in the promotion of its highly capable, yet invisible file server system. In creating OS/2 Warp Server, IBM has combined the three elements essential to building a highly effective and manageable networked server: OS/2 Warp, LAN Server 4.0 Advanced, and SystemView.

OS/2 Warp is an excellent foundation on which to build a server system. LAN Server 4.0 Advanced is a state-of-the-art file server program with a simple-to-use administrative front end and support for disk mirroring and duplexing. SystemView is a NetFinity-compatible derivative that manages networked components. By combining OS/2 Warp, LAN Server 4.0 Advanced, and SystemView into one package, IBM has created a potent competitor to NT Server and NetWare. All that is needed to build a complete fault-tolerant server solution is the addition of StandbyServer 32 for LAN Server from Vinca.

Fault Tolerance First
By some estimates, data requirements are increasing at a compound rate of 40 percent per year, resulting in increased pressure on the IS manager to keep data online and uncorrupted. Downtime costs are significant in terms of both lost revenue and goodwill.

Disk mirroring, duplexing, and RAID level 5 are effective disk-channel fault tolerance mechanisms, but they leave the file server platform unprotected. Vinca's StandbyServer 32 for LAN Server has extended duplexing to include the server platform, providing complete fault tolerance for the network file server system. Considering that the entire cost of implementing the standby solution could be recouped in a single hour of recaptured downtime, IS managers should take a closer look at server fault tolerance before rather than after the next downtime event.
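The break-even claim is simple division, sketched below in Python. The hourly loss is the survey's $78,000 average quoted earlier; the solution cost is a placeholder chosen purely for illustration, since the article does not state a price.

```python
HOURLY_REVENUE_LOSS = 78_000   # dollars per hour, from the survey quoted earlier


def hours_to_recoup(solution_cost, hourly_loss=HOURLY_REVENUE_LOSS):
    """Hours of avoided downtime needed to pay back the solution's cost."""
    return solution_cost / hourly_loss


# Hypothetical figure: the article gives no price for the standby solution.
hypothetical_cost = 20_000
minutes = hours_to_recoup(hypothetical_cost) * 60
print(f"Break-even after about {minutes:.0f} minutes of avoided downtime")
```

At the survey's average rate, any solution costing less than $78,000 pays for itself within the first hour of downtime it prevents.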

See also StandbyServer 32 for LAN Server.