Sunday, June 21, 2009

Perf_Ad


What is Transaction Response Time?


Transaction Response Time represents the time taken for the application to complete a defined transaction or business process.


Why is it important to measure Transaction Response Time?

The objective of a performance test is to ensure that the application is working perfectly under load. However, the definition of "perfectly" under load may vary with different systems.
By defining an initial acceptable response time, we can benchmark the application to determine whether it is performing as anticipated.

The importance of Transaction Response Time is that it gives the project team/application team an idea of how the application is performing, measured in time. With this information, they can tell users/customers the expected time for processing a request, and understand how their application performed.


What does Transaction Response Time encompass?


The Transaction Response Time encompasses the time taken for the request to reach the Web Server, be processed by the Web Server, and be passed to the Application Server, which in most instances then makes a request to the Database Server. The same path is then traversed in reverse, from the Database Server through the Application Server and Web Server, and back to the user. Take note that the time the request or data spends in network transmission is also factored in.

To simplify, the Transaction Response Time comprises the following:
1. Processing time on Web Server
2. Processing time on Application Server
3. Processing time on Database Server
4. Network latency between the servers, and the client
The following diagram illustrates Transaction Response Time.



Figure 1: Transaction Response Time = (t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8 + t9) × 2

Note:
The multiplication by 2 factors in the time taken for the response to return to the client.
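
To make the arithmetic concrete, here is a minimal Java sketch with assumed (hypothetical) values for t1 through t9:

```java
public class ResponseTimeExample {
    public static void main(String[] args) {
        // t1..t9: assumed one-way times (in ms) for each hop/server in Figure 1.
        double[] t = {1, 5, 1, 8, 1, 20, 1, 8, 1};
        double oneWay = 0;
        for (double ti : t) {
            oneWay += ti;
        }
        // Multiply by 2 to model the response travelling the same path back.
        System.out.println("Transaction Response Time = " + (oneWay * 2) + " ms");
    }
}
```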


How do we measure?

Measurement of the Transaction Response Time begins when the defined transaction makes its request to the application, and stops when the transaction completes, before the next subsequent request (in terms of transactions) begins.
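
In code, the idea looks like the following minimal Java sketch; placeOrder() is a hypothetical stand-in for the defined transaction:

```java
public class TransactionTimer {
    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();   // measurement begins with the request
        placeOrder();                     // the defined transaction / business process
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Transaction Response Time: " + elapsedMs + " ms");
    }

    // Hypothetical stand-in for a real business transaction.
    private static void placeOrder() throws InterruptedException {
        Thread.sleep(250);
    }
}
```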

Differences from Hits per Second
Hits per Second measures the number of "hits" made on a web server. These "hits" could be requests made to the web server for data or graphics. However, this counter does not tell users much about how well their application is performing, as it only measures the number of times the web server is accessed.


How can we use Transaction Response Time to analyze performance issue?
Transaction Response Time allows us to identify abnormalities when performance issues surface. These appear as slow transaction responses that differ significantly (or slightly) from the average Transaction Response Time.
With this, we can drill down further by correlating with other measurements, such as the number of virtual users accessing the application at that point in time and system-related metrics (e.g. CPU utilization), to identify the root cause.
By bringing together all the data collected during the load test, we can correlate the measurements to find trends and bottlenecks between the response time, the amount of load generated, and the payload of all the components of the application.
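
As a simple illustration of spotting such abnormalities, here is a small Java sketch that flags samples deviating from the average by more than a chosen factor (the sample data and the 2x threshold are assumptions):

```java
import java.util.List;

public class ResponseTimeOutliers {
    public static void main(String[] args) {
        // Hypothetical per-transaction response times from a load test, in seconds.
        List<Double> samples = List.of(1.2, 1.3, 1.1, 1.4, 4.8, 1.2, 1.3);
        double mean = samples.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double threshold = 2.0; // flag anything more than 2x the average (an assumed rule)
        for (double s : samples) {
            if (s > mean * threshold) {
                System.out.printf("Abnormal sample: %.1fs (mean %.2fs)%n", s, mean);
            }
        }
    }
}
```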

How is it beneficial to the Project Team?


Using Transaction Response Time, the project team can better relate to their users by using transactions as a common language that their users can comprehend. Users will be able to know whether transactions (or business processes) are performing at an acceptable level in terms of time.
Users may be unable to understand the meaning of CPU utilization or memory usage, so using the common language of time is ideal for conveying performance-related issues.

===========================================================

Before starting a load test we have to make sure of the following checklist: :)

1. End users, customers, and project members have been notified in advance of the execution dates and hours for the capacity test.
2. All service level agreements and response time requirements have been agreed upon by all stakeholders.
3. A contact list with names and phone numbers has been drafted for support personnel (onsite and remote).
4. Functional testing of the application has been completed.
5. Restart the Controller machine.
6. Ramp Up / Duration / Ramp Down are configured correctly.
7. All Load Generators are in a "Ready" status.
8. All Load Generators are assigned to the appropriate scripts.
9. All scripts have the correct number of Vusers assigned to them.
10. All scripts have the correct number of iterations assigned to them.
11. Correct pacing has been agreed upon and configured for all appropriate scripts.
12. Logging is set to "Send messages only when an error occurs" for all scripts.
13. Think times have been enabled/disabled in the test scripts.
14. "Generate snapshot on error" is enabled for all appropriate scripts.
15. Timeout values have been set to the appropriate values.
16. All content checks have been updated for the appropriate scripts.
17. Rendezvous points have been enabled/disabled for the appropriate scripts.
18. All necessary data has been prepared/staged and is updated in all scripts.
19. Any scripts with unique data requirements have been verified.
20. All scripts have been refreshed in the Controller and reflect the most recent updates.
21. IP spoofing has been enabled in the Controller.
22. IP spoofing has been configured on all appropriate Load Generators.
23. All LoadRunner monitors have been identified, configured, and tested.
24. "Auto Collate Results" should be enabled.
25. The results directory and file name should be updated.

==========================================

Max. Running Vuser

It defines, at any point of time, the maximum number of Vusers running together concurrently (in the Run state). This is usually the requirement of a load test: to reach "X" number of concurrent users. If a load test is required to run 100 concurrent users, then the Max. Running Vuser must be 100. This is different from Vuser Quantity, explained below.
Vuser Quantity

In the Controller, the Vuser Quantity is the total number of Vusers participating in the load test; it is different from Max. Running Vuser, which is explained above. To view the total number of Vusers participating in the load test (Vuser Quantity), open the Vuser Summary graph. Please take note that the Summary Report in the Analysis session displays the maximum number of Vusers running in the scenario, not the total number of Vusers participating.

Underlying Operating System and Network Improvements
If you control the OS and hardware where the application will be deployed, there are a number of changes you can make to improve performance. Some changes are generic and affect most applications, while some are application-specific. This article applies to most server systems running Java applications, including servlets, where you usually specify (or have specified for you) the underlying system, and where you have some control over tuning the system. Client and standalone Java programs are likely to benefit from this material only if you have some degree of control over the target system, but some of the tips apply to all Java programs.
It is usually best to target the OS and hardware as a last tuning choice. Tuning the application itself generally provides far more significant speedups than tuning the systems on which the application is running. Application tuning also tends to be easier (though buying more powerful hardware components is easier still and a valid choice for tuning). However, application and system tuning are actually complementary activities, so you can get speedups from tuning both the system and the application if you have the skills and resources. Here are some general tips for tuning systems:

• Constantly monitor the entire system with any monitoring tools available and keep records. This allows you to get a background usage pattern and also lets you compare the current situation with situations previously considered stable.
• You should run offline work during off-hours only. This ensures that there is no extra load on the system when users are executing online tasks, and enhances the performance of both online and offline activities.
• If you need to run extra tasks during the day, try to slot them into times with low user activity. Office activity usually peaks at 9am and 2:30pm and has a lull between noon and 1pm or at shift changeovers. You should be able to determine the user-activity cycles appropriate to your system by examining the results of normal monitoring. The reduced contention for system resources during periods of low activity improves performance.
• You should specify timeouts for all processes under the control of your application (and others on the system, if possible) and terminate processes that have passed their timeout value.
• Apply any partitioning available from the system to allocate determinate resources to your application. For example, you can specify disk partitions, memory segments, and even CPUs to be allocated to particular processes.

CPU
Disk
Memory
Network

The above is taken from the publication "Java Performance Tuning" written by Jack Shirazi. I would recommend reading this book, as it provides tuning and bottleneck concepts that are not bounded by Java alone. A simplified version (a summary of the chapter) can be found here [Coming soon].
Basics: CPU Bottlenecks
Java provides a virtual machine runtime system that is just that: an abstraction of a CPU that runs in software. (Note that this section is taken from "Java Performance Tuning" written by Jack Shirazi, and therefore a lot of the discussion centers around Java technologies.) These virtual machines run on a real CPU, and in this section the book discusses the performance characteristics of those real CPUs.

CPU Load


The CPU and many other parts of the system can be monitored using system-level utilities. On Windows, the Task Manager and Performance Monitor can be used for monitoring. On UNIX, a performance monitor (such as perfmeter) is usually available, as well as utilities such as vmstat. Two aspects of the CPU are worth watching as primary performance points: the CPU utilization (usually expressed in percentage terms) and the runnable queue of processes and threads (often called the load or the task queue). The first indicator is simply the percentage of the CPU (or CPUs) being used by all the various threads. If this is up at 100% for significant periods of time, you may have a problem. On the other hand, if it isn't, the CPU is under-utilized, but that is usually preferable. Low CPU usage can indicate that your application may be blocked for significant periods on disk or network I/O. High CPU usage can indicate thrashing (lack of RAM) or CPU contention (indicating that you need to tune the code and reduce the number of instructions being processed to reduce the impact on the CPU).

A reasonable target is 75% CPU utilization (which, from what I have read, varies between authors from 75% to 85%). This means that the system is being worked toward its optimum, but that you have left some slack for spikes due to other system or application requirements. However, note that if more than 50% of the CPU is used by system processes (i.e. administrative and IS processes), your CPU is probably under-powered. This can be identified by looking at the load of the system over some period when you are not running any applications.

The second performance indicator, the runnable queue, indicates the average number of processes or threads waiting to be scheduled for the CPU by the OS. They are runnable processes, but the CPU has no time to run them and is keeping them waiting for some significant amount of time. As soon as the run queue goes above zero, the system may display contention for resources, but there is usually some value above zero that still gives acceptable performance for any particular system. You need to determine what that value is in order to use this statistic as a useful warning indicator. A simplistic way to do this is to create a short program that repeatedly does some simple activity. You can then time each run of that activity. You can run copies of this process one after the other so that more and more copies are simultaneously running. Keep increasing the number of copies being run until the run queue starts increasing. By watching the times recorded for the activity, you can graph that time against the run queue. This should give you some indication of when the runnable queue becomes too large for useful responses on your system, so that an administrator can be alerted if the threshold is exceeded. A guideline from Adrian Cockcroft is that performance starts to degrade if the run queue grows bigger than four times the number of CPUs.
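
A minimal Java sketch of such a probe program is shown below; the workload and iteration counts are arbitrary assumptions. Run more and more copies of it simultaneously and graph the reported times against the run queue:

```java
public class RunQueueProbe {
    public static void main(String[] args) {
        for (int run = 0; run < 30; run++) {
            long start = System.nanoTime();
            double sink = 0;
            for (int i = 0; i < 10_000_000; i++) { // some simple, repeatable activity
                sink += Math.sqrt(i);
            }
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("run " + run + ": " + ms + " ms (sink=" + sink + ")");
        }
    }
}
```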

If you can upgrade the CPU of the target environment, doubling the CPU speed is usually better than doubling the number of CPUs. And remember that parallelism in an application doesn't necessarily need multiple CPUs. If I/O is significant, the CPU will have plenty of time for many threads.

Process Priorities

The OS also has the ability to prioritize processes in terms of providing CPU time by allocating process priority levels. CPU priorities provide a way to throttle high-demand CPU processes, thus giving other processes a greater share of the CPU. If there are other processes that need to run on the same machine, but it does not matter if they run slowly, you can give your application processes a (much) higher priority than those other processes, thus allowing your application the lion's share of CPU time on a congested system. If your application consists of multiple processes, you should also consider the possibility of giving your various processes different levels of priority.
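
Process priorities themselves are set at the OS level (for example, with nice on UNIX), but as a rough in-process analogue, Java exposes thread priorities. A minimal sketch, noting that the actual effect is platform-dependent:

```java
public class PrioritySketch {
    public static void main(String[] args) throws InterruptedException {
        Runnable busyWork = () -> {
            long count = 0;
            while (!Thread.currentThread().isInterrupted()) {
                count++; // spin until interrupted
            }
        };
        Thread important = new Thread(busyWork, "important");
        Thread background = new Thread(busyWork, "background");
        // Hint to the scheduler: favor one thread over the other on a congested system.
        important.setPriority(Thread.MAX_PRIORITY);
        background.setPriority(Thread.MIN_PRIORITY);
        important.start();
        background.start();
        Thread.sleep(2000);
        important.interrupt();
        background.interrupt();
    }
}
```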

Being tempted to adjust the priority levels of processes, however, is often a sign that the CPU is underpowered for the tasks you have given it.
The above is taken from the publication "Java Performance Tuning" written by Jack Shirazi. I would recommend reading this book, as it provides tuning and bottleneck concepts that are not bounded by Java alone.
Disk Bottlenecks
In most cases, applications can be tuned so that disk I/O does not cause any serious performance problems. But if, after application tuning, you find that disk I/O is still causing a performance problem, your best bet may be to upgrade the system disks. Identifying whether the system has a problem with disk utilization is the first step. Each system provides its own tools to identify disk usage (Windows has a performance monitor, and UNIX has the sar, vmstat and iostat utilities). At minimum, you need to identify whether paging is an issue (look at disk-scan rates) and assess the overall utilization of your disks (e.g. the performance monitor on Windows, or the output from iostat -D on UNIX). It may be that the system has a problem independent of your application (e.g. unbalanced disks), and correcting that problem may resolve the performance issue.

If the disk analysis does not identify an obvious system problem that is causing the I/O overhead, you could try making a disk upgrade or a reconfiguration. This type of tuning can consist of any of the following:
• Upgrading to faster disks
• Adding more swap space to handle larger buffers
• Changing the disks to be striped (where files are striped across several disks, thus providing parallel I/O, e.g. with a RAID system)
• Running the data on raw partitions when this is shown to be faster
• Distributing simultaneously accessed files across multiple disks to gain parallel I/O
• Using memory-mapped disks or files

If you have applications that run on many systems and you do not know the specification of the target system, bear in mind that you can never be sure that any particular disk is local to the user. There is a significant possibility that the disk being used by the application is a network-mounted disk. This doubles the variability in response times and throughput. The weakest link will probably not even be constant. A network disk is a shared resource, as is the network itself, so performance is hugely and unpredictably affected by other users and network load.


Disk I/O
Do not underestimate the impact of disk writes on the system as a whole. For example, all database vendors strongly recommend that the system swap files be placed on a separate disk from their databases. The impact of not doing so can decrease database throughput (and system activity) by an order of magnitude. This performance decrease comes from not splitting the I/O of two disk-intensive applications (in this case, OS paging and database I/O).

Identifying that there is an I/O problem is usually fairly easy. The most basic symptom is that things take longer than expected, while at the same time the CPU is not at all heavily worked. The disk-monitoring utilities will also tell you that there is a lot of work being done to the disks. At the system level, you should determine the average peak requirements on the disks. Your disks will have some statistics that are supplied by the vendor, including:

The average and peak transfer rates, normally in megabytes (MB) per second, e.g. 5MB/sec. From this, you can calculate how long an 8K page takes to be transferred from disk; for example, 5MB/sec is about 5K/ms, so an 8K page takes just under 2ms to transfer.

Average seek time, normally in milliseconds (ms), e.g. 10ms. This is the time required for the disk head to move radially to the correct location on the disk.

Rotational speed, normally in revolutions per minute (rpm), e.g. 7200rpm. From this, you can calculate the average rotational delay in moving the disk under the disk-head reader, i.e., the time taken for half a revolution. For example, at 7200rpm one revolution takes 60,000ms (60 seconds) divided by 7200, which is about 8.3ms. So half a revolution takes just over 4ms, which is consequently the average rotational delay.

This list allows you to calculate the actual time it takes to load a random 8K page from the disk, this being seek time + rotational delay + transfer time. Using the examples given in the list, you have 10 + 4 + 2 = 16ms to load a random 8K page (almost an order of magnitude slower than the raw disk throughput would suggest). This calculation gives you a worst-case scenario for the disk-transfer rates for your application, allowing you to determine whether the system is up to the required performance. Note that if you are reading data stored sequentially on disk (as when reading a large file), the seek time and rotational delay are incurred less than once per 8K page loaded. Basically, these two times are incurred only at the beginning of opening the file and whenever the file is fragmented. But this calculation is confounded by other processes also executing I/O to the disk at the same time. This overhead is part of the reason why swap and other intensive I/O files should not be put on the same disk.
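
Here is the same worst-case arithmetic as a small Java sketch, using the example figures from the list above (10ms seek, 7200rpm, 5MB/sec transfer):

```java
public class DiskPageLoadTime {
    public static void main(String[] args) {
        double transferKBperMs = 5;                        // 5MB/sec is about 5K/ms
        double transferMs = 8 / transferKBperMs;           // 8K page: just under 2ms
        double seekMs = 10;                                // average seek time
        double rotationalDelayMs = (60_000.0 / 7200) / 2;  // half a revolution at 7200rpm: just over 4ms
        double totalMs = seekMs + rotationalDelayMs + transferMs;
        System.out.printf("Random 8K page load: about %.0f ms%n", totalMs); // roughly 16ms
    }
}
```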

One mechanism for speeding up disk I/O is to stripe disks. Disk striping allows data from a particular file to be spread over several disks. Striping allows reads and writes to be performed in parallel across the disks without requiring any application changes. This can speed up disk I/O quite effectively. However, be aware that the seek and rotational overhead previously listed still applies, and if you are making many small random reads, there may be no performance gain from striping disks.

Finally, note again that using remote disks adversely affects I/O performance. You should not be using remote disks mounted from the network with any I/O-intensive operations if you need good performance.


Clustering Files

Reading many files sequentially is faster if the files are clustered together on the disk, allowing the disk-head reader to flow from one file to the next. This clustering is best done in conjunction with defragmenting the disks. The overhead in finding the location of a file on the disk (detailed in the previous section) is also minimized for sequential reads if the files are clustered.

If you cannot specify clustering files at the disk level, you can still provide similar functionality by putting all the files together into one large file (as is done with ZIP file systems). This is fine if all the files are read-only, or if there is just one file that is writable (which you place at the end). However, when there is more than one writable file, you need to manage the location of the internal files in your system as one or more of them grow. This becomes a problem and is not usually worth the effort. (If the files have a known bounded size, you can pad the files internally, thus regaining the single-file efficiency.)
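
As an illustration of the many-files-in-one approach, Java's own zip support can read many small read-only files from a single archive; a minimal sketch, where resources.zip is a hypothetical archive:

```java
import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ClusteredRead {
    public static void main(String[] args) throws Exception {
        // One large archive standing in for many small read-only files.
        try (ZipFile zip = new ZipFile("resources.zip")) { // hypothetical file
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                try (InputStream in = zip.getInputStream(entry)) {
                    byte[] data = in.readAllBytes(); // sequential-friendly read
                    System.out.println(entry.getName() + ": " + data.length + " bytes");
                }
            }
        }
    }
}
```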
Cached File Systems (RAM Disks, tmpfs, cachefs)

Most OSs provide the ability to map a file system into the system memory. This ability can speed up reads and writes to certain files, where you control your target environment. Typically, this technique has been used to speed up the reading and writing of temporary files. For example, some compilers (of languages in general, not specifically Java) generate many temporary files during compilation. If these files are created and written directly to the system memory, the speed of compilation is greatly increased. Similarly, if you have a set of external files that are needed by your application, it is possible to map these directly into the system memory, thus allowing their reads and writes to be sped up greatly.

But note that these types of file systems are not persistent. In the same way the system memory of the machine gets cleared when it is rebooted, so these file systems are removed on reboot. If the system crashes, anything in a memory-mapped file system is lost. For this reason, these types of file systems are usually suitable only for temporary files or read-only versions of disk-based files (such as mapping a CD-ROM into a memory-resident file system).

Remember that you do not have the same degree of fine control over these file systems that you have over your application. A memory-mapped file system does not use memory resources as efficiently as working directly from your application. If you have direct control over the files you are reading and writing, it is usually better to optimize this within your application rather than outside it. A memory-mapped file system takes space directly from system memory. You should consider whether it would be better to let your application grow in memory instead of letting the file system take up that system memory. For multi-user applications, it is usually more efficient for the system to map shared files directly into memory, as a particular file then takes up just one memory location rather than being duplicated in each process. Note that from SDK 1.4, memory-mapped files are directly supported through the java.nio package. Memory-mapped files are slightly different from memory-mapped file systems. A memory-mapped file uses system resources to read the file into system memory, and that data can then be accessed from Java through the appropriate java.nio buffer. A memory-mapped file system does not require the java.nio package and, as far as Java is concerned, files in that file system are simply files like any others. The OS transparently handles the memory mapping.
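
Since the text mentions java.nio's direct support, here is a minimal sketch of memory-mapping a single file for reading; data.bin is a hypothetical file:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedFileRead {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("data.bin", "r"); // hypothetical file
             FileChannel channel = file.getChannel()) {
            // Map the whole file into memory; reads then go through the buffer
            // rather than through explicit read() calls.
            MappedByteBuffer buffer =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long sum = 0;
            while (buffer.hasRemaining()) {
                sum += buffer.get(); // byte-by-byte scan of the mapped region
            }
            System.out.println("checksum: " + sum);
        }
    }
}
```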

The creation of memory-mapped file systems is completely system-dependent, and there is no guarantee that it is available on any particular system (though most modern OSs do support this feature). On UNIX systems, the administrator needs to look at the documentation of the mount command and its subsections on cachefs and tmpfs. Under Windows, you should find details by looking at the documentation on how to set up a RAM disk, a portion of memory mapped to a logical disk drive.

In a similar way, there are products available that pre-cache shared libraries (DLLs) and even executables in memory. This usually means only that an application starts or loads more quickly, and so may not be much help in speeding up a running system.

But you can apply the technique of memory-mapping file systems directly and quite usefully for applications in which processes are frequently started. Copy the Java distribution and all class files (all JDK, application, and third-party class files) onto a memory-mapped file system and ensure that all executions and classloads take place from the file system. Since everything (executables, DLLs, class files, resources, etc.) is already in memory, the startup time is much faster. Because only the startup (and class loading) time is affected, this technique gives only a small boost to applications that are not frequently starting processes, but can be usefully applied if startup time is a problem.

Disk Fragmentation

When files are stored on disk, the bytes in the files are not necessarily stored contiguously: their storage depends on file size and the contiguous space available on the disk. This non-contiguous storage is called fragmentation. Any particular file may have some chunks in one place, and a pointer to the next chunk that may be quite a distance away on the disk.

Hard disks tend to get fragmented over time. This fragmentation delays both reads from files (including loading applications into computer memory on startup) and writes to files. This delay occurs because the disk head must wind on to the next chunk with each fragmentation, and this takes time.

For optimum performance on any system, it is a good idea to periodically defragment the disk. This reunites files that have been split up so that disk heads do not spend so much time searching for data once the file-header locations have been identified, thus speeding up data access. Defragmenting may not be effective on all systems, however.
Disk Sweet Spots

Most disks have a location from which data is transferred faster than from other locations. Usually, the closer the data is to the outside edge of the disk, the faster it can be read from the disk. Most hard disks rotate at constant angular speed. This means that the linear speed of the disk under a point is faster the farther away the point is from the center of the disk. Thus, data at the edge of the disk can be read from (and written to) at the fastest possible rate commensurate with the maximum density of data storable on the disk.

This location with faster transfer rates is usually termed the disk sweet spot. Some (commercial) utilities provide mapped access to the underlying disk and allow you to reorganize files to optimize access. On most server systems, the administrator has control over how logical partitions of the disk map onto the physical layout, and over how to position files in the disk sweet spots. Experts for high-performance database systems sometimes try to position the index tables of the database as close as possible to the disk sweet spot. These tables consist of relatively small amounts of data that affect the performance of the system in a disproportionately large way, so any speed improvement in manipulating these tables is significant.

Note that some of the latest OSs are beginning to include "awareness" of disk sweet spots, and attempt to move executables to sweet spots when defragmenting the disk. You may need to ensure that the defragmentation procedure does not disrupt your own use of the disk sweet spot.
The above is taken from the publication "Java Performance Tuning" written by Jack Shirazi. I would recommend reading this book, as it provides tuning and bottleneck concepts that are not bounded by Java alone.

============================================================
Memory Bottlenecks
Maintaining a watch directly on the system memory (RAM) is not usually that helpful in identifying performance problems. A better indication that memory might be affecting performance can be gained by watching for paging of data from memory to the swap files. Most current OSs have a virtual memory that is made up of the actual (real) system memory using RAM chips, and one or more swap files on the system disks. Processes that are currently running operate in real memory. The OS can take pages from any of the processes currently in real memory and swap them out to disk. This is known as paging. Paging leaves free space in real memory to allocate to other processes that need to bring in a page from disk.

Obviously, if all the processes currently running can fit into real memory, there is no need for the system to swap out any pages. However, if there are too many processes to fit into real memory, paging allows the system to free up system memory to run more processes. Paging affects system performance in many ways. One obvious way is that if a process has had some pages moved to disk and the process becomes runnable, the OS has to pull back the pages from disk before that process can run. This leads to delays in performance. In addition, both the CPU and the disk I/O spend time doing the paging, reducing available processing power and increasing the load on the disks. This cascading effect, involving both the CPU and I/O, can degrade the performance of the whole system in such a way that it may be difficult even to recognize that paging is the problem. The extreme version of too much paging is thrashing, in which the system is spending so much time moving pages around that it fails to perform any other significant work. (The next step is likely to be a system crash.)

As with runnable queues (see the CPU section), a little paging of the system does not affect performance enough to cause concern. In fact, some paging can be considered good. It indicates that the system's memory resources are fully utilized. But at the point where paging becomes a significant overhead, the system is overloaded.

Monitoring paging is relatively easy. On UNIX, the utilities vmstat and iostat provide details as to the level of paging, disk activity and memory levels. On Windows, the performance monitor has categories to show these details, as well as being able to monitor the system swap files.

If there is more paging than is optimal, the system's RAM is insufficient or the processes are too big. To improve this situation, you need to reduce the memory being used by reducing the number of processes or the memory utilization of some processes. Alternatively, you can add RAM. Assuming that it is your application that is causing the paging (otherwise, either the system needs an upgrade, or someone else's processes may also have to be tuned), you need to reduce the memory resources you are using.

When the problem is caused by a combination of your application and others, you can partially address the situation by using process priorities (see the CPU section). The equivalent to priority levels for memory usage is an all-or-nothing option, where you can lock a process in memory. This option is not available on all systems and is more often applied to shared memory than to processes, but it is nevertheless useful to know. If this option is applied, the process is locked into real memory and is not paged out at all. You need to be aware that using this option reduces the amount of RAM available to all other processes, which can make overall system performance worse. Any deterioration in system performance is likely to occur at heavy system load, so make sure you extrapolate the effect of reducing the system memory in this way.
Network Bottlenecks
At the network level, many things can affect performance. The bandwidth (the amount of data that can be carried by the network) tends to be the first culprit checked. However, assuming you have determined that bad performance is attributable to the network component of an application, there are more likely causes of bad network performance than network bandwidth. The most likely cause of bad network performance is the application itself and how it is handling distributed data and functionality.
The overall speed of a particular network connection is limited by the slowest link in the connection chain and the length of the chain. Identifying the slowest link is difficult, and the result may not even be consistent: it can vary at different times of the day or for different communication paths. A network communication path leads from an application through a TCP/IP stack (which adds various layers of headers, possibly encrypting and compressing data as well), then through the hardware interface, through a modem, over a phone line, through another modem, over to a service provider's router, through many heavily congested data lines of various carrying capacities and multiple routers with different maximum throughputs and configurations, to a machine at the other end with its own hardware interface, TCP/IP stack, and application. A typical web download route is just like this. In addition, there are dropped packets, acknowledgments, retries, bus contention, and so on.
Because so many possible causes of bad network performance are external to an application, one option you can consider including in an application is a network speed testing facility that reports to the user. This should test the speed of data transfer from the machine to various destinations: to itself, to another machine on the local network, to the Internet Service Provider, to the target server across the network, and to any other appropriate destinations. This type of diagnostic report can tell users that they are obtaining bad performance from something other than your application. If you feel that the performance of your application is limited by the actual network communication speed, and not by other (application) factors, this facility will report the maximum possible speeds to your users.

Latency
Latency is different from the load-carrying capacity (bandwidth) of a network. Bandwidth refers to how much data can be sent down the communication channel in a given period of time and is limited by the link in the communication chain that has the lowest bandwidth. Latency is the amount of time a particular data packet takes to get from one end of the communication channel to the other. Bandwidth tells you the limits within which your application can operate before performance becomes affected by the volume of data being transmitted. Latency often affects the user's view of the performance even when bandwidth isn't a problem.
In most cases, especially Internet traffic, latency is an important concern. You can determine the basic round-trip time for data packets between any two machines using the ping utility. This utility provides a measure of the time it takes a packet of data to reach another machine and be returned. However, the time measured is for a basic underlying protocol (an ICMP packet) to travel between the machines. If the communication channel is congested and the overlying protocol requires re-transmissions (often the case for Internet traffic), one transmission at the application level can actually be equivalent to many round trips.
It is important to be aware of these limitations. It is often possible to tune the application to minimize the number of transfers by packing data together, caching and redesigning the distributed application protocol to aim for a less conversational mode of operation. At the network level, you need to monitor the transmission statistics (using the ping and netstat utilities and packet sniffers) and consider tuning any network parameters that you have access to in order to reduce re-transmissions.
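
ping is the standard tool for this, but as a rough in-application approximation you can time a reachability check from Java. A minimal sketch, assuming a hypothetical target host (note that isReachable may use ICMP or fall back to a TCP probe depending on privileges, so treat the result as indicative only):

```java
import java.net.InetAddress;

public class LatencyProbe {
    public static void main(String[] args) throws Exception {
        InetAddress host = InetAddress.getByName("www.example.com"); // hypothetical target
        long start = System.nanoTime();
        boolean reachable = host.isReachable(3000); // 3-second timeout
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(host + " reachable=" + reachable + " in ~" + ms + " ms");
    }
}
```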
TCP/IP Stacks
The TCP/IP stack is the section of code that is responsible for translating each application-level network request (send, receive, connect, etc.) through the transport layers down to the wire and back up to the application at the other end of the connection. Because the stacks are usually delivered with the operating system and performance-tested before delivery (since a slow network connection on an otherwise fast machine and fast network is pretty obvious), it is unlikely that the TCP/IP stack itself is a performance problem.
In addition to the stack itself, stacks include several tunable parameters. Most of these parameters deal with transmission details beyond the scope of this book. One parameter worth mentioning is the maximum packet size. When your application sends data, the underlying protocol breaks the data into packets that are transmitted. There is an optimal size for packets transmitted over a particular communication channel, and the packet size actually used by the stack is a compromise. Smaller packets are less likely to be dropped, but they introduce more overhead, as data probably has to be broken up into more packets with more header overhead.
If your communication takes place over a particular set of endpoints, you may want to alter the packet sizes. For a LAN segment with no router involved, the packets can be big (e.g. 8KB). For a LAN with routers, you probably want to set the maximum packet size to the size the routers allow to pass unbroken. (Routers can break up the packets into smaller ones; 1500 bytes is the typical maximum packet size and the standard for the Ethernet. The maximum packet size is configurable by the router's network administrator.) If your application is likely to be sending data over the Internet and you cannot guarantee the route and quality of routers it will pass through, 500 bytes per packet is likely to be optimal.
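
Application code cannot usually change the packet size (MTU) directly, but Java does expose related per-socket knobs that trade off buffering and packet batching. A hedged sketch, with a hypothetical endpoint; the OS treats these values as hints:

```java
import java.net.Socket;

public class SocketTuning {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("server.example.com", 8080)) { // hypothetical endpoint
            // Hints only: the OS may round or ignore these values.
            socket.setSendBufferSize(8 * 1024);    // suggest an 8KB send buffer
            socket.setReceiveBufferSize(8 * 1024); // suggest an 8KB receive buffer
            socket.setTcpNoDelay(true);            // disable Nagle batching for small writes
            System.out.println("effective send buffer: " + socket.getSendBufferSize());
        }
    }
}
```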
Network Bottlenecks
Other causes of slow network I/O can be attributed directly to the load or configuration of the network. For example, a LAN may become congested when many machines are simultaneously trying to communicate over the network. The potential throughput of the network could handle the load, but the algorithms used to provide communication channels slow the network down, resulting in a lower maximum throughput. A congested Ethernet network has an average throughput of approximately one third of the potential maximum throughput. Congested networks have other problems too, such as dropped network packets. If you are using TCP, the communication rate on a congested network is much slower as the protocol automatically resends the dropped packets. If you are using UDP, your application must resend multiple copies for each transfer. Dropping packets in this way is common for the Internet. For LANs, you need to coordinate closely with the network administrators to alert them to the problem. For single machines connected by a service provider, it is worth speaking to the provider and suggesting improvements. The phone line to the service provider may be noisier than expected: if so, you also need to speak to the phone line provider. It is also worth checking with the service provider, who should have optimal configurations they can demonstrate.
Dropped packets and re-transmissions are a good indication of network congestion problems, and you should be on constant lookout for them. Dropped packets often occur when routers are overloaded and find it necessary to drop some of the packets being transmitted as their buffers overflow. This means that the overlying protocol will request the packets to be resent. The netstat utility lists re-transmission and other statistics that can identify these sorts of problems. Re-transmissions may indicate that the maximum packet size is too large.
DNS Lookup
Looking up network addresses is an often-overlooked cause of bad network performance. When your application tries to connect to a network address such as foo.bar.something.org (e.g. downloading a web page from http://foo.bar.something.org), your application first translates foo.bar.something.org into a four-byte network IP address such as 10.33.6.45. This is the actual address that the network understands and uses for routing network packets. The way this translation works is that your system is configured with some seldom-used files that can specify the translation, and with a more frequently used Domain Name System (DNS) server that can dynamically provide the address for a given name. DNS translation works as follows:
1. The machine running the application sends the text string of the hostname (e.g. foo.bar.something.org) to the DNS server.
2. The DNS server checks its cache to find an IP address corresponding to that hostname. If the server does not find an entry in the cache, it asks its own DNS server (usually further up the Internet domain-name hierarchy) until ultimately the name is resolved. (This may be done by components of the name being resolved, e.g. first .org, then something.org, etc., each time asking another machine as the search request is successively resolved.) The resolved IP address is added to the DNS server's cache.
3. The IP address is returned to the original machine running the application.
4. The application uses the IP address to connect to the desired destination.
The address lookup does not need to be repeated once a connection is established, but any other connections (within the same session of the application, or in other sessions at the same time or later) need to repeat the lookup procedure to start another connection.
You can improve this situation by running a DNS server locally on the machine, or on a local server if the application uses a LAN. A DNS server can be run as a "caching-only" server that resets its cache each time the machine is rebooted. There would be little point in doing this if the machine used only one or two connections per hostname between successive reboots. For more frequent connections, a local DNS server can provide a noticeable speedup to connections. nslookup is useful for investigating how a particular system does translations.
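
To observe the lookup cost from Java, you can time InetAddress.getByName, which performs the resolution described above. A minimal sketch using the text's example hostname (which will only resolve if it actually exists); note that the JVM also caches successful lookups:

```java
import java.net.InetAddress;

public class DnsLookupTiming {
    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        InetAddress address = InetAddress.getByName("foo.bar.something.org"); // example hostname
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(address.getHostAddress() + " resolved in " + ms + " ms");

        // A second lookup is typically much faster because the result is cached.
        start = System.nanoTime();
        InetAddress.getByName("foo.bar.something.org");
        System.out.println("cached lookup: " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```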
