When applied in an appropriate context, application monitoring is more than just the data that shows how an application is performing technically. This article presents a discussion on application monitoring methods, tools, and justification, and also provides a useful overview of what metrics to collect, for which components of a Web application, and when to collect them.
Monitoring applications to detect and respond to problems - before an end user is even aware that a problem exists - is a common systems requirement, especially for revenue-generating production environments. Most administrators understand the need for application monitoring. Infrastructure teams, in fact, typically monitor the basic health of application servers by keeping an eye on CPU utilization, throughput, memory usage and the like. However, there are many parts to an application server environment, and understanding which metrics to monitor for each of these pieces differentiates those environments that can effectively anticipate production problems from those that might get overwhelmed by them.
When applied in an appropriate context, application monitoring is more than just the data that shows how an application is performing technically. Information such as page hits, frequency, and related statistics contrasted against each other can also show which applications, or portions thereof, have consistently good (or bad) performance. Management reports generated from the collected raw data can provide insights on the volume of users that pass through the application. An online store, for example, could compare the dollar volume of a particular time segment against actual page hits to expose which pages are participating in higher or lower dollar volumes.
Justifying proactive application monitoring
There are fundamentally two ways to approach problem solving in a production environment:
- One is continual data collection using application monitoring tools that typically provide up-to-date performance, health, and status information.
- The other is trial-and-error theorizing, often limited to whatever data happens to be available from script files and ad hoc log parsing.
Not surprisingly, the latter approach is less efficient, but it's important to understand its other drawbacks as well. Introducing several levels of logging to provide various types of information has long been a popular approach to in-house application monitoring, and for good reason: logging was a trusted methodology of the client-server era for capturing events on remote workstations to help determine application problems. Today, with browsers dominating the thin client realm, there is little need for collecting data on the end user's workstation, so user data is now collected at centralized server locations instead. However, even granting the generous assumption that all possible logging points are anticipated and appropriately coded, data collection on the server remains problematic. More often than not, logging is applied inconsistently within an application, often added only as problems are encountered and more information is needed.
In contrast, application monitoring tools offer the ability to quickly add new data - without application code changes - to information that is already being collected, as the need for different data changes with the ongoing analysis.
While logging worked well in the single user environment, there are some inherent problems with logging in the enterprise application server environment:
- Clustered environments are not conducive to centralized logs. This is a systemic problem for large environments with multiple servers and multiple instances of an application. Beyond the question of exactly how to administer the multiple logs, users can bounce between application servers when an application does not use HTTP Session objects, and coordinating and consolidating events for the same user spread across multiple logs is extremely difficult and time consuming.
- Multiple instances of applications and their threads writing to the same set of logs impose a heavy penalty on applications, which essentially spend time synchronized inside some logging framework. High volume Web sites are an environment where synchronization of any kind must be avoided in order to reduce potential bottlenecks that could result in poor response times and, subsequently, a negative end user experience. (A sketch of this contention pattern follows this list.)
- Different levels of logging require additional attention: when a problem occurs, the next level of logging must be turned on, which means valuable data from the first occurrence of the problem is lost. With problems that are not readily reproducible, it's difficult to predict when logging should be on or off.
- Logs on different machines can have significant timestamp differences, making correlation of data between multiple logs nearly impossible.
- Beyond the impact of actually adding lines of code to an application for monitoring, additional development impacts include:
- Code maintenance: The functionality, logical placement, and data collected will need to be maintained, ideally by developers who understand the impact of the code change that was introduced.
- Inconsistent logging: Different developers may have drastically different interpretations of what data to collect and when to collect it. Such inconsistencies are not easily corrected.
- Developer involvement: Involving developers in problem determination becomes a necessity with log-based approaches, since the developer is usually the best equipped to interpret the data.
- Application monitoring accomplished through coding is rarely reused. Certainly the framework itself can be reused, but probably not the lines of code inserted to capture specific data.
- When logging to a file, the impact on the server's file I/O subsystem is significant. Few things will slow down an enterprise application more than writing to a file. While server caches and other mechanisms can be configured to minimize such a hit, this is still a serious and unavoidable bottleneck, especially in high volume situations where the application is continually sending data to the log.
- While Aspect-Oriented Programming is proving to be a valuable technology for logging, it has yet to be widely embraced by the technical community.
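To illustrate the synchronization penalty described above, here is a minimal sketch of a naive file-based logger. The class itself is hypothetical, but real logging frameworks serialize concurrent writers in much the same way: every request thread must acquire the same lock before touching the file, so under load, threads queue at the logger instead of doing useful work.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical logger illustrating the contention problem; not taken
// from any particular logging framework.
public class NaiveFileLogger {
    private final PrintWriter out;

    public NaiveFileLogger(String path) throws IOException {
        // Autoflush forces a file I/O hit on every single log call.
        this.out = new PrintWriter(new FileWriter(path, true), true);
    }

    // Every request thread funnels through this one lock. On a high
    // volume site, this method becomes the bottleneck the article warns
    // about: time is spent synchronized in the logging framework.
    public synchronized void log(String message) {
        out.println(System.currentTimeMillis() + " " + message);
    }
}
```

The more threads an application server runs, the worse this queuing effect becomes, which is why the per-log-call cost matters far more in an enterprise environment than it ever did on a single-user workstation.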
Not surprisingly, it is also common for development teams to try to collect basic performance data using their logging framework, capturing data such as servlet response times or the timings of specific problematic methods in order to better understand how the application performs. This activity falls victim to the same disadvantages mentioned above, and it assumes that all suspected problem points have been correctly identified and instrumented. If new data points are identified, then the application must be modified to accommodate the additional data collection, retested, and then redeployed to the production environment. Naturally, such code also requires continual maintenance for the life of the application.
Proactive Application Monitoring Tools
The benefits of a proactive, tool-based approach to application monitoring are many:
- No code: This is by far the single most valuable benefit of a tools-based approach. Application monitoring tools, through classloader instrumentation and other Java techniques, allow for the seamless and invisible collection of data without writing a single line of code. (A sketch of this technique follows this list.)
- Fewer developer distractions: With application monitoring no longer a focal point, developers can instead concentrate on the logic of the application.
- Non-application specific: Application monitoring tools are not developed for anything more specific than the Java language and the WebSphere Application Server environment.
- Reusability: Application monitoring tools are written to generically capture data from any application, so a tremendous amount of reuse is built into the tooling itself. Without doing anything extraordinary, an application monitoring tool can capture data for a variety of applications as they come online.
- Reliability: While you should still perform due diligence to ensure that a tool works properly in your environment, application monitoring tools from major vendors are generally subject to extensive testing and quality assurance for high volume environments.
- Understandable results: Data is consolidated at a central console, and the results can be readily understood by a systems administrator. Only when the system administrator has exhausted all resources would developers need to assist in troubleshooting by examining data from a variety of subsystems.
- Cost: Yes, there is the initial expenditure of procuring such a tool, but there is also the very real possibility of eventual cost savings - particularly in terms of time.
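As an illustration of how a tool can intercept classes without any application code changes, here is a minimal sketch using the standard java.lang.instrument API (available from Java 5 onward). This is not how any particular vendor's product works; a real monitoring tool would rewrite bytecode at this point, for example to time method calls, whereas this skeleton merely observes class loading. The com/example/ package filter is a placeholder.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

public class MonitoringAgent {
    // Invoked by the JVM before main() when the JVM is started with
    // -javaagent:agent.jar (the jar manifest must declare Premain-Class).
    public static void premain(String args, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                    Class<?> classBeingRedefined, ProtectionDomain domain,
                    byte[] classfileBuffer) {
                // Hypothetical filter: only watch application classes.
                if (className != null && className.startsWith("com/example/")) {
                    System.out.println("Loaded: " + className);
                }
                return null; // null = leave the bytecode unchanged
            }
        });
    }
}
```

Because the interception happens at class-load time, the application itself is never edited, which is exactly the "no code" property described above.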
A WebSphere Application Server-based application uses, at a minimum, two or more of the components identified in Figure 1:
- servlet container
- EJB container
- HTTP Session objects
- connection pool to database(s)
- JVM memory
Each one of these components has a variety of metrics that can be collected and monitored. When monitoring an application, specific components are identified for monitoring, depending on what you want to watch for; thresholds are then set to alert the team of people who can work on the particular problem. For example, if the connection pool is experiencing slower SQL timings than normal, the back end database and network administrators would be contacted so they can investigate the cause.
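As a hedged sketch of how such threshold alerting might be wired up, the following polls a connection pool metric over JMX and raises an alert when it exceeds a threshold. The service URL, the MBean ObjectName, and the AvgWaitTime attribute are all illustrative assumptions, not the actual names exposed by WebSphere Application Server.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PoolWatcher {
    public static void main(String[] args) throws Exception {
        // Assumed JMX endpoint for the application server.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://appserver:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Hypothetical MBean name and attribute for a connection pool.
            ObjectName pool = new ObjectName("example:type=ConnectionPool,name=orders");
            while (true) {
                Number avgWaitMs = (Number) mbs.getAttribute(pool, "AvgWaitTime");
                if (avgWaitMs.longValue() > 1600) { // threshold from the tables below
                    System.err.println("ALERT: SQL wait time " + avgWaitMs
                            + " ms exceeds threshold");
                    // Here one would notify the database/network administrators.
                }
                Thread.sleep(60_000); // poll once a minute
            }
        }
    }
}
```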
Figure 1. Basic components of a WebSphere Application Server environment

Application monitoring can be divided into the following categories:
- Fault: This type of monitoring primarily detects major errors related to one or more components. Faults can consist of errors such as the loss of network connectivity, a database server going offline, or the application suffering a Java out-of-memory condition. Faults are important events to detect in the lifetime of an application because they negatively affect the user experience.
- Performance: Performance monitoring is specifically concerned with detecting less than desirable application performance, such as degraded servlet, database, or other back end resource response times. Generally, performance issues arise in an application as the user load increases. Performance problems, like fault events, are important to detect because they negatively affect the user experience.
- Configuration: Configuration monitoring is a safeguard designed to ensure that configuration variables affecting the application and the back end resources remain at predetermined settings. Incorrect configurations, such as a maximum JVM heap size that is set too low or a misconfigured DB2 maxapplheapsz, can negatively affect application performance. Large environments with several machines, or environments where administration is performed manually, are candidates for mistakes and inconsistent configurations. Understanding the configuration of the applications and resources is critical for maintaining stability.
- Security: Security monitoring detects intrusion attempts by unauthorized system users.
- Accounting: Some installations charge application owners maintenance and administration fees. This type of monitoring measures usage so that, for example, an organization with a centralized IT division carrying profit/loss responsibilities can appropriately bill its customers based on their usage.
Each of these five categories can also be integrated into daily or weekly management reports for the application. If multiple application monitoring tools are used, the individual subsystems should be capable of either providing or exporting the collected data in different file formats that can then be fed into a reporting tool. Some of the more powerful application monitoring tools can not only monitor a variety of individual subsystems, but can also provide some reporting or graphing capabilities.
One of the major side benefits of application monitoring is being able to establish the historical trends of an application. Applications experience generational cycles, where each new version may provide more functionality and/or fixes to previous versions. Proactive application monitoring provides a way to gauge whether changes to the application have affected performance and, more importantly, how. If a fix to a previous issue shows slower response times, one has to question whether the fix was properly implemented. Likewise, if new features prove noticeably slower than others, one can focus the development team on understanding the differences.
Historical trending is achieved by defining a baseline from some predefined performance test and then re-executing that test when new application versions become available. The baseline must be established at some point in time and can be superseded by a new baseline once performance goals are met. Changes to the application are then measured directly against the baseline as a measurable quantity. Performance statistics also help resolve misconceptions about how an application is (or has been) performing, offsetting subjective observations not based on fact. When performance data is not collected, subjective observations often lead to erroneous conclusions about application performance.
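A minimal sketch of this baseline comparison, assuming response-time metrics keyed by name and an illustrative 10% regression tolerance (both the metric names and the tolerance are assumptions, not prescriptions):

```java
import java.util.Map;

public class BaselineCheck {
    // Compare a new release's timings against the stored baseline and
    // flag any metric that regressed beyond the tolerance.
    public static void compare(Map<String, Double> baselineMs,
                               Map<String, Double> currentMs) {
        for (Map.Entry<String, Double> e : baselineMs.entrySet()) {
            Double current = currentMs.get(e.getKey());
            if (current == null) continue; // metric not measured this run
            double deltaPct = (current - e.getValue()) / e.getValue() * 100.0;
            if (deltaPct > 10.0) { // illustrative 10% regression tolerance
                System.out.printf("%s regressed %.1f%% (%.0f ms -> %.0f ms)%n",
                        e.getKey(), deltaPct, e.getValue(), current);
            }
        }
    }
}
```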
The following sections define a collection of metrics applicable to a typical WebSphere Application Server environment. In the vein of extreme programming, collect the bare minimum of metrics and thresholds that you feel are needed for your application, selecting those that will provide the data points necessary to assist in the problem determination process. Start with methods that access back end systems and with servlet/JSP response timings. Be prepared to change the set of collected metrics or thresholds as your environment evolves and grows.
Keep in mind that the collection of metrics available will depend on your infrastructure. Some components, such as network switches and routers, have built-in SNMP capabilities to send traps when faults occur. Other back end resources are easily monitored by Tivoli® Distributed Monitor tools. Monitoring of the application and JVM environment is available through tools such as Wily's Introscope, which is capable of emitting SNMP traps to a Tivoli console. The mix of tools in every environment will be different, based on technical and business requirements; what is an effective tool in one environment may fall short in others.
Not unexpectedly, the single most comprehensive collection of metrics from the application environment is for fault monitoring. These metrics involve not only detecting application-related faults, but also those faults related to the physical server the application is running on, the back end resources being accessed, and the network connectivity components (switches, routers, etc.). Many of the metrics described in the fault grouping correlate to threshold metrics in other categories.
| Component | Type of Monitoring | Applicable Metric | Threshold |
|---|---|---|---|
| Hardware and Network | Server availability | Heartbeat/ping all servers | UP/DOWN |
| | Error report | Monitor error report logs for hard errors | ERRORS |
| | Network latency | Ping time between network components | UP/DOWN/SNMP traps |
| | CPU utilization | CPU utilization, all servers | > 99% over x minutes |
| | Memory utilization | Memory utilization, all servers | > 99% over x minutes |
| | Paging/swapping | OS-level metric, all servers | In process of paging/swapping |
| | File system | Available file space, all servers | Out of space |
| | Network components | Capture SNMP traps | UP/DOWN/ERROR |
| WebSphere Application Server | Admin server process | Monitor admin server process | UP/DOWN |
| | Application server process | Monitor application server process | UP/DOWN |
| | Java naming server | Scripts to run JNDI queries | UP/DOWN/ERROR |
| | Web application | Running | STARTED/STOPPED |
| | EJB container | Running | STARTED/STOPPED |
| | Datasources | Available | UP/DOWN |
| Gateways | CTG client process | Available | UP/DOWN/ERROR |
| | SNA | Available | UP/DOWN/ERROR |
| | DB2 Connect | Available | UP/DOWN/ERROR |
| Web Server | HTTPD processes | Available | UP/DOWN/ERROR |
| | Timed-out connection | Connection timeout | Occurred |
| Databases | DB2 process | Available | UP/DOWN/ERROR |
| | Oracle process | Available | UP/DOWN/ERROR |
| MQSeries | Queue Manager | Available | UP/DOWN/ERROR |
| | MQ Broker | Available | UP/DOWN |
| | Queue Manager listener | Available | UP/DOWN |
| | Queue depth | Depth exceeds threshold | > 3500 |
| Application | Functional | End-to-end application test | PASSED/FAILED |
| | Error logs | Search for errors emitted by the application | ERROR OCCURRED |
Note that some tools provide error messages only in log files that must be monitored.
- DB2: monitor `db2diag.log`
- CTG: monitor `CICSCLI.LOG`
- SNA: monitor `sna.err`
- Application log files are per application. If the environment is clustered, then the log files from each application clone must be monitored.
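A simple sketch of this kind of log watching: scan a log file for error markers and report the matching lines. The default path and the ERROR/SEVERE patterns are assumptions; substitute the actual markers used by db2diag.log, CICSCLI.LOG, sna.err, or your application logs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LogScanner {
    public static void main(String[] args) throws IOException {
        // Log path passed on the command line, or an assumed default.
        String path = args.length > 0 ? args[0] : "db2diag.log";
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
            String line;
            int lineNo = 0;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                // Illustrative markers; real products use their own formats.
                if (line.contains("ERROR") || line.contains("SEVERE")) {
                    System.out.println(path + ":" + lineNo + ": " + line);
                }
            }
        }
    }
}
```

In a clustered environment, a scanner like this would have to run against the log of every application clone, which is precisely the administrative burden described earlier.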
The metrics in the performance monitoring grouping are specific to detecting degraded behavior by any of the resources related to the application.
| Component | Type of Monitoring | Applicable Metric | Threshold |
|---|---|---|---|
| Hardware and Network | Network latency | Ping time and network bandwidth measurements | Timings > 1000 ms or network bandwidth maxed |
| | CPU utilization | CPU utilization, all servers | > 80% over x minutes |
| | Memory utilization | Memory utilization, all servers | > 80% over x minutes |
| | Paging/swapping | OS-level metric, all servers | In process of paging/swapping |
| | File system | Available file space, all servers | > 80% used |
| | Network components | Capture SNMP traps | Degraded counters |
| WebSphere Application Server | Java naming server | Scripts to run JNDI queries | Response time > 3 secs |
| | Servlet engine | Average servlet and JSP response times | Response time > 8 secs |
| | EJB container | Average response time | Response time > 900 ms |
| | JDBC | Average response time by SQL INSERT, UPDATE, DELETE | Response time > 1600 ms |
| Gateways | CTG client | Average response time | Response time > 900 ms |
| | MQ client | Average response time | Response time > 400 ms |
| | SNA | Average response time | Response time > x secs |
| | DB2 Connect | Average response time | Response time > 1000 ms |
| Web Server | HTTP response | Average response time retrieving 1K GIF | Response time > 1000 ms |
| Databases | DB2 | Average response time | Response time > 1000 ms |
| | Oracle | Average response time | Response time > 1000 ms |
| MQSeries | Queue Manager | Average response time | Response time > 200 ms |
| | Queue Manager listener | Available | UP/DOWN |
| | Queue depth | Depth exceeds threshold | > 500 |
| Application | Complex page requests | Average response time | > 10 secs, or lower for specific functions |
| | Error logs | Search for warnings emitted by the application | Warnings occur |
Metrics specific to an application can involve a number of complex page requests, used to gauge application performance for specific functions. Some functions may have lower thresholds than others. How often metrics need to be collected depends on the tool and the metric being collected. For example, metrics such as average servlet response time and CPU utilization should be collected at least every minute or two, whereas complex page requests may be executed only once every 10 to 20 minutes.
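A hedged sketch of such a complex page request probe: time a full HTTP round trip against a representative page and compare it to the threshold from the table above. The URL is a hypothetical placeholder, not a page from any actual application.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PageProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical representative page exercising a complex function.
        URL url = new URL("http://appserver/store/checkoutSummary");
        long start = System.currentTimeMillis();
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        int status = conn.getResponseCode();
        if (status == HttpURLConnection.HTTP_OK) {
            try (InputStream in = conn.getInputStream()) {
                byte[] buf = new byte[8192];
                while (in.read(buf) != -1) { /* drain the full response body */ }
            }
        }
        long elapsed = System.currentTimeMillis() - start;
        if (status != HttpURLConnection.HTTP_OK || elapsed > 10_000) {
            // 10 sec threshold from the table; lower it for faster functions.
            System.err.println("ALERT: status=" + status + ", time=" + elapsed + " ms");
        }
    }
}
```

Run on a 10-to-20 minute schedule, a probe like this doubles as both a performance metric and the end-to-end functional test listed in the fault table.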
The variety of back end resources that can exist in a WebSphere Application Server configuration is non-trivial, and there are also a variety of configurations specific to the application itself. However, configuration changes occur infrequently in the production environment, making them ideal candidates for periodic monitoring on a less frequent basis.
| Component | Type of Monitoring | Applicable Metric |
|---|---|---|
| Hardware and Network | Network | Each network component configuration |
| | Server | OS-level configuration |
| | File system | JFS configurations |
| WebSphere Application Server | Java naming server | JNDI values |
| | Servlet engine | Configurations |
| | EJB container | Configurations |
| | JDBC/Connection pool | Configurations |
| Gateways | CTG client/server | Configurations |
| | MQ client/server | Configurations |
| | SNA | Configurations |
| | DB2 Connect | Configurations |
| Web Server | HTTP server | Configurations |
| Databases | DB2 server | Configurations |
| | Oracle server | Configurations |
| MQSeries | Queue Manager | Configurations |
| | Queue Manager listener | Configurations |
| | Queue depth | Configurations |
| Application | Application-specific | Configurations |
Taking configuration snapshots with the XMLConfig tool must be handled with some forethought. XMLConfig is a performance intensive application, especially in large WebSphere® Application Server environments; therefore, scheduling XMLConfig exports during low volume periods or maintenance windows is recommended.
Security monitoring is concerned with detecting intrusion and denial-of-service attacks. It can be complex, since each network component (e.g., firewall, router, third party authentication software, etc.) has its own security protocols and detection capabilities. There are a number of good authoritative references on the subject of security that can help you with specific details, such as setting the appropriate monitoring points. Due to the nature of this type of monitoring, you will want a third party who is competent in security to audit your installation and make sure your monitoring points are adequately set for comprehensive threat detection.
In environments where it is necessary to charge application owners fees based on usage, most accounting data can be derived from the Web server access logs (a capability of WebSphere Site Analyzer). Applications with Java fat clients that do not communicate via a Web server may require the application to provide additional logging that captures usage data. Data mining techniques can be used by large, high volume installations, but this also requires the ability to store large amounts of data for some minimum period of time.
Monitoring a variety of application metrics in production can help you understand the status of the components within an application server environment, from both a current and a historical perspective. As more back end resources and applications are added to the mix, you need only instruct the application monitoring tool to collect additional metrics. With judicious planning and the right set of data, proactive monitoring can help you quickly correct negative application performance, if not avoid it altogether.
Interpreting raw data within a business context can help management understand how applications are performing; depending on the raw data you collect, correlating volume statistics with, say, total revenue may be easily done. Understanding how a site is generating revenue can help guide future changes to an application.
Perhaps it's inevitable that some application errors will occur. At the very least, proactive monitoring provides you with the ability to detect problems as they happen, and fix them before anyone notices. If problems are going to happen, it's better that you find them before your customers do.