Systems Monitoring

By Dave Wallenberg
 * System Monitoring in Multitasking Environments

This article discusses the value of monitoring system activity in multitasking and multithreading environments, as it pertains to operating systems such as OS/2, UNIX, and NT.

OS/2, UNIX, and NT are multitasking and multithreading operating systems. This means that more than one task can be performed at a time. Not only does this imply that the user can work on more than one thing at a time, but the operating system itself takes advantage of this as well and has several processes executing in the background. These processes are typically hidden from the user.

Where there are numerous activities, several resources and devices can be in use simultaneously. This adds a level of complexity that today's user needs to be aware of. The user may need, or want, the ability to monitor:
 * CPU usage, to identify and resolve performance bottlenecks. This can be invaluable when trying to understand why and when your system runs slowly, and which processes cause the degradation.
 * Memory utilization, to see how much memory is currently in use and gain insight into system tuning. The more components that are simultaneously active, the more memory is demanded, until excess swapping occurs.
 * Swap utilization, to identify excessive and frequent swap activity when memory is overcommitted. This can be a good indicator that more memory is needed. When the programs running simultaneously require more memory than is physically installed, the operating system responds by temporarily "moving" pieces (4KB pages) of less active memory out to disk. These pages are brought back into memory when they are referenced. A swap monitor can provide insight into excess swapping.
 * Program performance, to assess the impact of the running program(s). Today's diverse software and multimedia craze requires the high performance of at least an i486-based or equivalent processor to keep up with CPU and memory intensive applications. The ability to examine the CPU usage of a program can help identify demanding programs and perhaps ill-behaved programs. Program monitoring can also show the number of programs active and what state each program is in.
 * Disk space utilization. How many people out there have more disk space than they know what to do with? Most people are always "tight" on disk space. When you are downloading a large file, or excess swapping occurs, the ability to quickly view your disk space constraints can be very valuable.
 * Status of the print spooler and print queue(s), particularly when the computer can print to LAN printers, receive and send fax documents, and more.
 * The current date, time, disk drives online and available, and more.
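
As a small illustration of the disk space item in the list above, the following Python sketch reports how full a drive is. It is only a sketch: the function name `disk_report` and the choice of megabytes are assumptions made for this example, and a real monitor would use whatever call its operating system provides.

```python
import shutil


def disk_report(path="."):
    """Return total and free space (in MB) and percent used for the drive holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    # Guard against a zero-sized filesystem to avoid dividing by zero.
    percent_used = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {
        "total_mb": usage.total // 2**20,
        "free_mb": usage.free // 2**20,
        "percent_used": round(percent_used, 1),
    }


if __name__ == "__main__":
    print(disk_report("."))
```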

The three most important things in a computer have never changed. In order of importance, they remain:
 * 1) The speed of the processor(s)
 * 2) The amount, speed, and location of physical memory
 * 3) The amount and speed of external disk storage

 * Program and CPU Monitoring:

The multitasking and multithreading environment of OS/2 allows numerous activities to execute simultaneously. Although the executing processes appear to share the same CPU concurrently, each is actually given individual time slices by the OS/2 scheduler under what is called a priority-driven round-robin algorithm. This means the scheduler factors the priority of each thread within a process into the time-slice algorithm (among other items, such as whether the thread is I/O bound or CPU bound and the minimum and maximum time slice in milliseconds). The information that the operating system maintains and collects for each process, and for each thread within the process, is stored in the Process Tables. These tables are memory-resident data structures that only the operating system can access directly. OS/2 provides an (undocumented) API that allows access to the Process Table information. With this, the monitoring software can summarize and report the statistics in a meaningful presentation. Some of the data that can be monitored and presented to the user:
 * Number of processes and threads that are running
 * The priority and state of the thread in a process
 * The amount of CPU time each thread of a process has received
 * Which CPU each thread is attached to in the SMP (Symmetric Multi-Processing) version
 * and more...
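
The priority-driven round-robin idea described in this section can be illustrated with a toy simulation. The sketch below is not the OS/2 scheduler: it ignores I/O-bound versus CPU-bound adjustments and minimum and maximum time-slice tuning, and the function name, thread names, and numbers are invented for illustration. It only shows the core notion of serving the highest priority class first and rotating among threads of equal priority.

```python
from collections import deque


def priority_round_robin(threads, time_slice_ms, total_ms):
    """Toy scheduler: serve the highest-priority runnable class, round-robin within it.

    threads: dict of name -> (priority, cpu_needed_ms); a larger priority wins.
    Returns a dict of name -> CPU milliseconds received within total_ms.
    """
    if time_slice_ms <= 0:
        raise ValueError("time_slice_ms must be positive")
    queues = {}      # priority -> queue of thread names waiting to run
    remaining = {}   # thread name -> CPU milliseconds still wanted
    for name, (priority, needed) in threads.items():
        queues.setdefault(priority, deque()).append(name)
        remaining[name] = needed
    received = {name: 0 for name in threads}
    elapsed = 0
    while elapsed < total_ms:
        runnable = [p for p, q in queues.items() if q]
        if not runnable:
            break  # every thread has finished
        queue = queues[max(runnable)]
        name = queue.popleft()
        run = min(time_slice_ms, remaining[name], total_ms - elapsed)
        received[name] += run
        remaining[name] -= run
        elapsed += run
        if remaining[name] > 0:
            queue.append(name)  # back of the line within its priority class
    return received


if __name__ == "__main__":
    demo = {"editor": (2, 60), "spooler": (2, 30), "backup": (1, 100)}
    print(priority_round_robin(demo, time_slice_ms=20, total_ms=120))
```

In this simplified model the low-priority thread only runs once the higher-priority threads have nothing left to do, which is exactly why a monitor that shows per-thread priority and CPU time is useful when something appears starved.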

These items form the basis for monitoring which programs are executing, how they are executing, and how much CPU each is consuming. This information, if presented intuitively, can be useful to any audience. Whether the user is a novice or a software engineer, this real-time summary provides insight. It can be used to resolve overall system performance problems or performance problems right down to the thread level. For example, when the system appears to be "hung" or sluggish, a simple system-wide CPU monitor can inform the user as to the extent of unresponsiveness. The ability to then "zoom" into a bird's-eye view of all running programs and examine the CPU utilization of each process will usually identify the culprit program.

Without this insight, you're at a complete loss as to what happened. The CPU monitor is typically the first monitoring tool to obtain, and it is rather simple to develop. The Process Monitor, however, is difficult to develop and depends solely on the ability to "get at" the Process Table information of the operating system. The Process Table structure is operating-system specific, and using it requires a strong understanding of its layout. Most new operating systems provide this raw data to applications in the form of an API. OS/2 provides much of the data to the monitoring application via an API called DosQProcStatus.
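
However the raw per-process CPU times are obtained, a monitor typically turns them into utilization figures by comparing two snapshots. The Python sketch below assumes each snapshot is simply a mapping from process name to cumulative CPU milliseconds; that layout, the function name `cpu_share`, and the sample process names are assumptions for illustration and do not reflect the actual Process Table format.

```python
def cpu_share(before, after, interval_ms):
    """Return {process: percent of the interval spent on the CPU}, busiest first.

    before, after: dicts of process name -> cumulative CPU milliseconds.
    A process missing from `before` is treated as having started during the interval.
    """
    shares = {}
    for name, cpu_after in after.items():
        delta = cpu_after - before.get(name, 0)
        shares[name] = 100.0 * delta / interval_ms
    # Sort so the likely culprit appears first.
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    first = {"shell": 1200, "spreadsheet": 5000, "runaway": 300}
    second = {"shell": 1210, "spreadsheet": 5040, "runaway": 1200}
    for name, percent in cpu_share(first, second, interval_ms=1000).items():
        print(f"{name:12s} {percent:5.1f}%")
```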

At its best, the monitoring tool may provide an Early Warning mechanism. This would warn the user when certain thresholds have been exceeded. For example, abnormal and excessive CPU usage may indicate a runaway situation. Before this condition brings the entire system to its knees, the user has the opportunity not only to terminate the ill-behaved culprit but, more importantly, to determine the cause. Early Warning mechanisms are seen most often in real-time mission-critical systems.
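
One simple way to build such a warning is to require the threshold to be exceeded for several samples in a row, so that a brief spike does not raise an alarm. The sketch below is one possible approach under that assumption; the class name `EarlyWarning` and the 90 percent and three-sample settings are illustrative choices, not values taken from any particular product.

```python
class EarlyWarning:
    """Raise a warning when a reading stays above a threshold for N consecutive samples."""

    def __init__(self, threshold_percent, consecutive_samples):
        self.threshold = threshold_percent
        self.needed = consecutive_samples
        self.streak = 0  # how many samples in a row have exceeded the threshold

    def observe(self, cpu_percent):
        """Record one sample; return True while the warning condition holds."""
        if cpu_percent > self.threshold:
            self.streak += 1
        else:
            self.streak = 0  # any normal reading resets the count
        return self.streak >= self.needed


if __name__ == "__main__":
    warning = EarlyWarning(threshold_percent=90, consecutive_samples=3)
    for reading in [95, 97, 50, 92, 99, 100, 98]:
        if warning.observe(reading):
            print(f"WARNING: CPU at {reading}% for 3 or more samples")
```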

 * Memory Monitor:

The real-time memory monitor is valuable in providing the user with insight into the amount of memory that is currently in use and the amount that is idle and subject to being swapped out.

In a virtual memory model such as that of OS/2, the amount of actual free memory is very difficult to determine and can usually only be approximated. The virtual memory model simply treats available memory as the physical memory installed plus the disk space available. So free memory can mean two different things: actual free physical memory, or total virtual memory. Actual free physical memory is the figure most often worth monitoring.

Tracking free physical memory, along with swap activity, enables you to determine whether you have enough memory to meet your demands. This is primarily an issue of cost, although some may interpret it as an issue of "pain". What is your "threshold of pain" while you wait for the system to swap out memory pages to free up room for your next program?
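
The two meanings of "free memory" described above amount to simple arithmetic. The sketch below makes the distinction explicit; the function name and the sample figures are assumptions for illustration, and on a real system the in-use figure is itself only an approximation.

```python
def memory_views(physical_installed_kb, physical_in_use_kb, swap_disk_free_kb):
    """Return the two figures a memory monitor might report, in KB."""
    return {
        # What is left of the installed RAM right now.
        "free_physical_kb": physical_installed_kb - physical_in_use_kb,
        # The virtual memory view: installed RAM plus disk space available for swapping.
        "total_virtual_kb": physical_installed_kb + swap_disk_free_kb,
    }


if __name__ == "__main__":
    # A 16MB machine with 14MB in use and 40MB of free disk space for swapping.
    print(memory_views(16384, 14336, 40960))
```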


 * Disk and Swap Monitoring:

Swapping is very much related to the amount of physical memory available. At some point in the tuning process you determine your "threshold of pain". RAM costs money, but swapping is inevitable and normal, unless your application is time-critical and cannot afford ANY swap time.

The ability to monitor the frequency and degree of swapping may be the only thing you need in order to tune a system. With this alone, you can do a performance tune-up on a given system, or at least identify whether more memory is needed and how much.
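
The "frequency and degree" of swapping can be expressed as a rate and a total. The sketch below assumes the monitor can sample a cumulative count of pages swapped; that counter, the function names, and the sample data are hypothetical, while the 4KB page size is the one mentioned earlier in this article.

```python
PAGE_KB = 4  # page size in KB, as stated earlier in the article


def swap_rates(samples):
    """samples: list of (seconds, cumulative_pages_swapped), in time order.

    Returns the swap rate in pages per second for each interval between samples.
    """
    rates = []
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        rates.append((p1 - p0) / (t1 - t0))
    return rates


def swapped_kb(samples):
    """Total KB swapped between the first and last sample."""
    return (samples[-1][1] - samples[0][1]) * PAGE_KB


if __name__ == "__main__":
    log = [(0, 100), (10, 100), (20, 350), (30, 900)]
    print("pages/second per interval:", swap_rates(log))
    print("KB swapped over the period:", swapped_kb(log))
```

A rate that stays high across many intervals, rather than a single burst when a program starts, is the pattern that suggests more physical memory is needed.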

 * Monitor Logging:

Only surface-level tuning analysis can be accomplished without the ability to record monitoring activity. Periodic and continuous logging is a must for the monitoring tool.

After logging activity for a given amount of time, you have captured a sample set of statistics. This raw data can be summarized to produce minimum, maximum, and average performance reports over different time intervals, such as peak periods.
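
A minimal sketch of such a summary follows, assuming the log has already been read into a list of (timestamp in seconds, value) pairs. The function name `summarize`, the bucket size, and the sample readings are illustrative assumptions.

```python
from statistics import mean


def summarize(samples, bucket_seconds):
    """Group samples into fixed time intervals and report (min, max, average) for each.

    samples: list of (timestamp_seconds, value).
    Returns {interval_start_seconds: (minimum, maximum, average)} in time order.
    """
    buckets = {}
    for timestamp, value in samples:
        start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets.setdefault(start, []).append(value)
    return {
        start: (min(values), max(values), mean(values))
        for start, values in sorted(buckets.items())
    }


if __name__ == "__main__":
    cpu_log = [(0, 10), (20, 30), (59, 20), (60, 90), (100, 70)]
    for start, (low, high, average) in summarize(cpu_log, 60).items():
        print(f"from {start:4d}s  min={low}  max={high}  avg={average}")
```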

It is important that a time-stamp is placed on each log entry (preferably down to a hundredth of a second). This precision supports even the most demanding situations, such as real-time monitoring.
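
As an illustration, a log entry stamped to the hundredth of a second could be produced as follows. The line layout and the function name `log_line` are assumptions made for this sketch, not a format used by any specific tool.

```python
from datetime import datetime


def log_line(when, resource, value):
    """Format one log entry with a time-stamp truncated to hundredths of a second."""
    hundredths = when.microsecond // 10000  # 0..99
    stamp = when.strftime("%Y-%m-%d %H:%M:%S") + f".{hundredths:02d}"
    return f"{stamp} {resource} {value}"


if __name__ == "__main__":
    print(log_line(datetime.now(), "cpu", 42))
```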

 * Monitor Caveats:

The monitoring software itself must not become a source of overhead. To accomplish many of the monitoring tasks discussed in this article, the monitoring software is required to periodically sample the device or system resource. Obviously, the more often sampling occurs, the more system resources are used by the monitoring software. The ideal monitoring software:
 * Is designed to run ALL the time and causes absolutely no interruptions. The PC workstation is there for the user, and the typical user works at the keyboard. If the monitoring software causes keyboard and video delays, it is using too much CPU to accomplish its job.
 * Is designed to handle a stand-alone environment and is equally capable of monitoring resources in an intensive real-time environment.
 * Is designed to be aware of LAN connectivity.
 * Is designed to allow adjustable sampling rates for each resource to be monitored.
 * Staggers the sampling of each resource to avoid sampling ALL resources at the same time. The user can also accomplish this by adjusting the frequency of sampling.
 * Exercises multithreading capabilities of the operating system so resource intensive tasks can execute at idle priorities.
 * Is self-contained and integrated.
 * Contains a logging mechanism. This is very important for unattended operations. Once again, and this is easy to overlook, how the logging is accomplished matters: it should cause no extra overhead.
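
The adjustable and staggered sampling described in the list above can be sketched as a small scheduler that gives each resource its own period and offsets each resource's first sample. This is only one possible design; the function name, the `stagger` parameter, and the resource names are assumptions for illustration.

```python
import heapq


def staggered_schedule(periods, stagger, horizon):
    """Build the list of (time_seconds, resource) sampling events before `horizon`.

    periods: dict of resource name -> sampling period in seconds (must be positive).
    stagger: delay added between the first samples of successive resources.
    """
    if any(period <= 0 for period in periods.values()):
        raise ValueError("every sampling period must be positive")
    heap = []
    for index, name in enumerate(periods):
        # Offset each resource's first sample so they do not all fire together.
        heapq.heappush(heap, (index * stagger, name))
    schedule = []
    while heap and heap[0][0] < horizon:
        when, name = heapq.heappop(heap)
        schedule.append((when, name))
        heapq.heappush(heap, (when + periods[name], name))  # next sample for this resource
    return schedule


if __name__ == "__main__":
    rates = {"cpu": 2, "memory": 2, "disk": 4}
    for when, name in staggered_schedule(rates, stagger=0.5, horizon=4):
        print(f"{when:4.1f}s  sample {name}")
```

With these sample settings no two resources are ever sampled at the same instant; with other combinations of periods and offsets, samples can still coincide occasionally, so the offsets are worth choosing with the periods in mind.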

Run one CPU monitor at a time. Attempting to run multiple CPU monitors at the same time defeats the purpose and causes each monitor to absorb the other's CPU load.

Depending on how robust and feature rich the monitoring software is, the software may not be able to guarantee 100% compatibility with future versions of the operating system. Due to the nature of monitoring and the uniqueness of the application, most monitoring software requires slight modification per operating system release. Contact the vendor before you migrate to a new release of the operating system. The monitoring software is not usually affected by minor OS upgrade versions.

 * About the Author:

Mr. Wallenberg is a senior consultant with Computer Sciences Corporation and specializes in OS/2 and UNIX based systems programming. Independently, he develops OS/2 and UNIX system monitoring tools. His latest creation is PM Patrol for OS/2, which provides many of the items discussed in this article. Mr. Wallenberg can be contacted via CompuServe (72702,2320) or the Internet (72702.2320@CompuServe.COM).