For my research, I have been carefully measuring network latency. The simplest case is when one application sends a single byte via TCP over the network to another application, which reads the byte and writes a reply. This round-trip time represents the minimum latency for a request to be sent to a server and a response to come back. When I measured this using netperf over a 100 Mb Ethernet switch, the latency was 250 µs. When I measured it over a gigabit Ethernet switch, the latency fell exactly in half, to 125 µs. That is when I became suspicious that something strange was going on. It turns out that the cause is interrupt coalescing, which many Ethernet adapters use to improve performance under load, at the cost of latency.
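For readers who want to reproduce the measurement without netperf, the sketch below is a rough stand-in for netperf's single-byte request-response test (TCP_RR), not netperf itself. The host, port, and round count are made up for illustration, and TCP_NODELAY is set so single-byte writes are not delayed by Nagle's algorithm.

```python
# Minimal single-byte TCP ping-pong latency measurement (rough TCP_RR stand-in).
# Run "python pingpong.py server" on one host, "python pingpong.py" on the other
# (change HOST accordingly); the values below are illustrative, not netperf defaults.
import socket
import sys
import time

HOST, PORT, ROUNDS = "127.0.0.1", 9999, 10000

def server():
    with socket.create_server((HOST, PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            while True:
                b = conn.recv(1)
                if not b:
                    break
                conn.sendall(b)          # echo the single byte back

def client():
    with socket.create_connection((HOST, PORT)) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(ROUNDS):
            sock.sendall(b"x")           # one-byte request
            sock.recv(1)                 # wait for the one-byte reply
        elapsed = time.perf_counter() - start
        print(f"mean round trip: {elapsed / ROUNDS * 1e6:.1f} us")

if __name__ == "__main__":
    server() if sys.argv[1:] == ["server"] else client()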
Typically, when a network device receives a packet, it copies the packet into system memory using DMA, then raises an interrupt to signal that a packet has arrived. This works well at low loads. However, for Gigabit or 10G Ethernet, the maximum packet rate is extremely high, and handling one interrupt per packet can be very inefficient. Interrupt coalescing, also called interrupt moderation, is a feature where the network adapter raises a single interrupt for a group of packets. My problem was that the old version of the e1000 driver on my Linux systems enforced a fixed minimum inter-interrupt interval of 125 µs. Thus, the client would send the packet, the server would process it and reply, and the response would then sit in the client's memory until the interrupt timer expired. The true round-trip latency was lower than 125 µs, but the interrupt throttle timer imposed a 125 µs floor.
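As a sanity check on that number: if the driver expresses its throttle as a maximum interrupt rate rather than an interval (the e1000 family exposes an InterruptThrottleRate module parameter of this form, though the exact behavior varies by driver version), a rate of 8000 interrupts per second corresponds to exactly a 125 µs floor.

```python
# Back-of-the-envelope check: a throttle expressed as a maximum interrupt rate
# implies a minimum gap between interrupts. 8000/s is an assumed figure that
# happens to match the 125 us observed here, not a value read from my driver.
throttle_rate = 8000                       # interrupts per second (assumed)
min_interval_us = 1_000_000 / throttle_rate
print(f"minimum inter-interrupt interval: {min_interval_us:.0f} us")  # -> 125 us
```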
This interrupt timer does strange things to the performance of a client and server that exchange many small requests. For example, netperf will frequently measure very close to 8000 round trips per second (one per 125 µs timer period), but it will occasionally measure a smaller value. The reason is that sometimes the interrupt timers on the two ends become closely synchronized, which causes two timer periods to elapse between message receptions: one for the transmit interrupt, and another for the receive interrupt. This anomaly would probably be rare in practice, since real applications will likely do more than 125 µs of work per request, so the interrupt timer matters less. For simple benchmarks, however, it can make the difference between reliable, low-latency performance and performance with extra delay and unpredictable variation.
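Putting numbers on that: one 125 µs timer period per round trip gives the roughly 8000 round trips per second above, and two periods per round trip would cut that in half. The halved figure below is the idealized two-period case, not a measurement.

```python
# Round-trip rate implied by the 125 us interrupt throttle timer: one round
# trip per timer period in the common case, one per two periods when the
# transmit and receive interrupts land in separate periods.
timer_us = 125

one_period = 1_000_000 / timer_us         # ~8000 round trips per second
two_periods = 1_000_000 / (2 * timer_us)  # ~4000 when the timers line up badly
print(f"one timer period per round trip:  {one_period:.0f}/s")
print(f"two timer periods per round trip: {two_periods:.0f}/s")
```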
In my more realistic benchmark, where the server does approximately 110 µs of processing for each request, tuning this parameter makes only a small difference overall. It significantly increases throughput with a small number of clients (nearly doubling it), so fewer clients are needed before the server saturates (3 instead of 5). However, it decreases the peak throughput with large numbers of clients. This is exactly what one would expect, since improving performance under high load is the entire purpose of interrupt coalescing.
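One way to see why the saturation point moves is a rough Little's-law estimate: the number of clients needed to keep the server busy is roughly the round-trip time a single client sees divided by the ~110 µs of server work per request. The round-trip times in the sketch below are illustrative guesses chosen only to show the shape of the argument, not measured values.

```python
import math

# Rough Little's-law estimate: clients needed to saturate the server is about
# (round-trip time seen by one client) / (server work per request).
# The round-trip times below are illustrative assumptions, not measurements.
service_us = 110  # approximate server processing time per request

for label, rtt_us in [("interrupt throttling tuned down", 250),
                      ("default interrupt throttling", 500)]:
    clients = math.ceil(rtt_us / service_us)
    print(f"{label}: ~{rtt_us} us round trip -> about {clients} clients to saturate")
```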