Using TeamQuest Model for Mainframe Capacity Planning
This article is taken from a presentation at the TeamQuest Technology Summit, highlighting how TeamQuest software was used on mainframe systems in a large enterprise environment. In particular, TeamQuest Model proved invaluable in forecasting mainframe capacity against anticipated growth.
Quantifying Performance
Management agreed that it would be wise to quantify the performance and capacity metrics of two mainframes while they ran a core business application. To do so, IT analyzed data from a test conducted under stressful system conditions (a stress test) and built a baseline analytic model for each processor representing the stress test workload. IT also saw value in forecasting the application's capacity requirements and performance using the many what-if scenarios that can be modeled with TeamQuest software.
TeamQuest Model for mainframe lets analysts change hardware, workload, and software configurations in what-if scenarios, identify performance gaps, and produce projections through a user-friendly interface backed by comprehensive, concise documentation.
IT conducted the modeling under several assumptions. The stress test had to be representative of the application running during its peak period. The application had to be scalable, and the systems had to be tuned properly so that the application suffered from no significant bottlenecks. Storage was assumed to be an unconstrained resource. Projections were carried out using different growth scenarios.
This department's primary application runs on two IBM mainframe processors. Each processor runs three logical partitions (LPARs), one of which supports the primary application. The application is split between the two mainframes for availability and load balancing, and it uses the mainframe for DB2 access only. CPU utilizations during the stress test were translated into MIPS (millions of instructions per second).
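The translation from CPU utilization to MIPS can be sketched as a simple proportion of a processor's rated capacity. The article does not state the MIPS rating of either machine, so the rating below is a hypothetical placeholder, and the function name is an illustration rather than anything provided by TeamQuest.

```python
def utilization_to_mips(cpu_utilization_pct: float, rated_mips: float) -> float:
    """Convert a CPU utilization percentage into consumed MIPS.

    Assumes consumed capacity scales linearly with utilization.
    """
    if not 0.0 <= cpu_utilization_pct <= 100.0:
        raise ValueError("utilization must be between 0 and 100 percent")
    if rated_mips <= 0.0:
        raise ValueError("rated MIPS must be positive")
    return rated_mips * cpu_utilization_pct / 100.0


# Hypothetical rating: the article does not give the processors' MIPS capacity.
HYPOTHETICAL_RATED_MIPS = 2000.0

# 81 percent is the utilization reported at the time of data capture.
consumed = utilization_to_mips(81.0, HYPOTHETICAL_RATED_MIPS)
print(f"Consumed capacity: {consumed:.0f} MIPS of {HYPOTHETICAL_RATED_MIPS:.0f}")
```

Expressing load in MIPS rather than percent makes workloads comparable across processors with different ratings, which matters when a what-if scenario swaps or upgrades hardware.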
Model Revelations
TeamQuest Model revealed that the stress test workload could be supported on the processors in one of the mainframes. With 30 percent growth in the application, however, those processors would not be able to provide acceptable performance to all LPARs. A planned upgrade to one system, adding one general purpose processor, proved valuable: it would reduce utilization and lower response times through shorter queue delays. TeamQuest Model also showed that growth in the primary application workload was best handled either by moving that workload off the original mainframe or by upgrading the mainframe's processing capabilities.
The mainframe running the primary application was modeled with a 30 percent reduction and a 30 percent increase in the application workload, with all other workloads held constant. Processor utilization went from 81 percent at the time of data capture down to 74 percent with a 30 percent reduction in the application workload, and rose to 89 percent with a 30 percent increase. In absolute terms, however, the effect on response time was a change from approximately .004 seconds in the baseline to .012 seconds with 30 percent growth.
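The utilization figures above are consistent with a simple linear what-if, in which only the primary application's share of the processor scales with growth. The sketch below is not TeamQuest Model's algorithm. It assumes linear scaling, backs out the application's share from the two reported endpoints (74 and 89 percent), and checks the result against the reported 81 percent baseline. The function names are illustrative.

```python
def split_utilization(util_low_pct, util_high_pct, low_factor, high_factor):
    """Solve util = app * factor + other for the two unknowns.

    Assumes the application's utilization scales linearly with its workload
    and that all other workloads stay constant.
    """
    app_pct = (util_high_pct - util_low_pct) / (high_factor - low_factor)
    other_pct = util_low_pct - app_pct * low_factor
    return app_pct, other_pct


def project_utilization(app_pct, other_pct, growth_factor):
    """Project total utilization for a growth factor, capped at 100 percent."""
    return min(100.0, app_pct * growth_factor + other_pct)


# Reported endpoints: 74 percent at -30 percent, 89 percent at +30 percent.
app_pct, other_pct = split_utilization(74.0, 89.0, 0.7, 1.3)
baseline_pct = project_utilization(app_pct, other_pct, 1.0)

print(f"Application share:  {app_pct:.1f} percent")
print(f"Other workloads:    {other_pct:.1f} percent")
print(f"Implied baseline:   {baseline_pct:.1f} percent (reported: 81)")
```

Under this assumption the primary application accounts for roughly 25 points of utilization, and the implied baseline lands within a point of the reported figure. A full analytic model also captures queuing effects, which a linear utilization sketch does not.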
TeamQuest Model includes a handy metric known as stretch factor, which we used heavily in our what-if scenarios. Stretch factor is the ratio of service time plus queuing time to service time. In other words, it measures how much waiting inflates the total time relative to the time spent doing work. It is available for each workload or process accessing an active or passive resource. The ideal value is 1, meaning no time is spent queuing, whereas a value greater than 1.8 indicates a constraint. When modeled, the stretch factor on two systems exceeded 2. The team used TeamQuest to drill down and isolate the queuing delays and other factors inhibiting performance under a steadily rising workload.
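As a minimal sketch, the stretch factor definition and the 1.8 threshold described above can be written as follows. The function names and the sample service and queuing times are hypothetical, chosen only to illustrate the arithmetic.

```python
CONSTRAINT_THRESHOLD = 1.8  # value above which the article treats a resource as constrained


def stretch_factor(service_time_s: float, queue_time_s: float) -> float:
    """Return (service + queuing) / service for one workload at one resource."""
    if service_time_s <= 0.0:
        raise ValueError("service time must be positive")
    if queue_time_s < 0.0:
        raise ValueError("queuing time cannot be negative")
    return (service_time_s + queue_time_s) / service_time_s


def is_constrained(factor: float) -> bool:
    """Flag a stretch factor that exceeds the constraint threshold."""
    return factor > CONSTRAINT_THRESHOLD


# Hypothetical timings, in seconds, for illustration only.
no_wait = stretch_factor(service_time_s=0.01, queue_time_s=0.0)
equal_wait = stretch_factor(service_time_s=0.01, queue_time_s=0.01)

print(f"No queuing:         {no_wait:.1f}, constrained: {is_constrained(no_wait)}")
print(f"Wait equals service: {equal_wait:.1f}, constrained: {is_constrained(equal_wait)}")
```

A workload that never waits scores exactly 1, and one that waits as long as it works scores 2, which is already past the 1.8 threshold.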
Additional what-ifs addressed other response time components. The largest component turned out to be disk service time, which made sense because DB2 I/Os show up in this workload. For the purposes of modeling, disks were combined into two groups of over 3,000 low-activity devices each, which kept the model smaller and more manageable. Two tape units were also part of the model.
The modeling was designed to detect even the slightest deterioration in response times. As a result, a queuing delay that appeared to increase significantly in percentage terms may not have gone up much in absolute terms. For example, in one model, response time increased 75 percent from the baseline under 30 percent growth. In the real world, this represented a difference of less than .001 seconds, which was deemed an acceptable increase.
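The gap between relative and absolute change can be shown with a short calculation. The article gives only the 75 percent increase and the bound of less than .001 seconds, so the baseline response time used here is a hypothetical value chosen to be consistent with both.

```python
def response_change(baseline_s: float, projected_s: float):
    """Return (absolute change in seconds, relative change in percent)."""
    if baseline_s <= 0.0:
        raise ValueError("baseline response time must be positive")
    absolute_s = projected_s - baseline_s
    relative_pct = absolute_s / baseline_s * 100.0
    return absolute_s, relative_pct


# Hypothetical baseline: the article does not report the actual response times
# behind the 75 percent figure.
HYPOTHETICAL_BASELINE_S = 0.0012
projected_s = HYPOTHETICAL_BASELINE_S * 1.75  # a 75 percent increase

absolute_s, relative_pct = response_change(HYPOTHETICAL_BASELINE_S, projected_s)
print(f"Relative change: {relative_pct:.0f} percent")
print(f"Absolute change: {absolute_s:.4f} seconds")
```

Reporting both numbers side by side keeps a large percentage on a very small baseline from being mistaken for a user-visible slowdown.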
The other mainframe underwent similar modeling, again using the -30 percent and +30 percent variations in the primary application workload, with all other workloads held constant. Processors here were upgraded, too. As a result of the upgrade, processor utilization went from a range of 74 to 89 percent to a more manageable 67 to 73 percent. TeamQuest Model also demonstrated that all stretch factors remained below 2 for this mainframe and that response times remained in an acceptable range. With better processors, task response time under 30 percent growth decreased from .025 to .020 seconds.
IT then modeled moving the primary application from one LPAR to another in the event of a failure. Initially, adding the primary application to this LPAR pushed CPU utilization above 90 percent. At 30 percent growth, processor utilization hit 100 percent and work began to queue. Some stretch factors reached as high as 10. TeamQuest Model was used to determine the best way to deal with a failure on the LPAR running the primary application. Modeling a processor upgrade showed that the LPAR could comfortably handle the primary application even with a 30 percent increase in its workload.
Next Steps
The modeling process will continue to be used to provide IT with early warnings of trends in increased utilizations and response times, allowing IT to be proactive in either researching and implementing performance improvements or adding capacity where required. In addition, IT will not need to write and maintain code for continued reduction and analysis of its mainframe capacity data.