Troubleshooting Load Spikes: When Top Shows Nothing but Load Average Says Otherwise

Server Scout

When the Numbers Don't Match

You get an alert at 2 AM: load average has spiked to 8.0 on your quad-core server. You SSH in, run top, and see... nothing unusual. CPU usage sits at 15%, memory looks fine, and no single process is consuming excessive resources. Yet uptime stubbornly reports that astronomical load figure.

This scenario trips up even experienced sysadmins because we instinctively equate high load with high CPU usage. But on Linux, load average measures more than just CPU demand - it counts every task that is runnable (on a CPU or waiting for one) plus every task in uninterruptible sleep, which usually means blocked on I/O. The real culprit often lurks in places top doesn't immediately reveal.
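To make that accounting concrete, here is a toy sketch of which process states feed the load figure. The snapshot is invented for illustration, and the function name load_contributors is hypothetical; the kernel itself samples these counts periodically and smooths them into the 1, 5 and 15 minute averages.

```python
from collections import Counter

def load_contributors(states):
    """Count the tasks that Linux includes in the load average.

    Runnable tasks (state R) and tasks in uninterruptible sleep (state D)
    are counted; ordinary interruptible sleepers (state S) are not.
    Only the first character of each state string is used, so ps-style
    values such as "Ss" or "D+" work too.
    """
    counts = Counter(s[0] for s in states if s)
    return {
        "runnable": counts["R"],
        "uninterruptible": counts["D"],
        "total": counts["R"] + counts["D"],
    }

# Hypothetical snapshot: one busy task, seven blocked on I/O, 120 asleep.
snapshot = ["R"] + ["D"] * 7 + ["S"] * 120
print(load_contributors(snapshot))
# {'runnable': 1, 'uninterruptible': 7, 'total': 8}
```

A snapshot like this would account for a load near 8 while only one task is actually using a CPU, which is the mismatch described above.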

The Hidden I/O Bottleneck

Start with iostat -x 1 to check if storage is the problem. Look for devices showing high %util values (approaching 100%) or elevated await times. A single overwhelmed disk can push load averages sky-high whilst barely registering in CPU metrics.
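If iostat is not installed, roughly the same two figures can be derived from /proc/diskstats. The sketch below is an approximation under stated assumptions, not a reimplementation of iostat: it takes two text samples of that file and the interval between them, and the sample lines are made up for illustration. The function names are hypothetical.

```python
def parse_diskstats(text):
    """Map device name -> (ios_completed, ms_spent_on_ios, ms_device_busy)."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue  # skip short or malformed lines
        reads, read_ms = int(f[3]), int(f[6])
        writes, write_ms = int(f[7]), int(f[10])
        busy_ms = int(f[12])  # time the device had I/O in flight
        stats[f[2]] = (reads + writes, read_ms + write_ms, busy_ms)
    return stats

def disk_pressure(before, after, interval_s):
    """Return device -> (util_percent, await_ms) between two samples."""
    result = {}
    for dev, (ios1, ms1, busy1) in after.items():
        if dev not in before:
            continue
        ios0, ms0, busy0 = before[dev]
        ios = ios1 - ios0
        util = 100.0 * (busy1 - busy0) / (interval_s * 1000.0)
        await_ms = (ms1 - ms0) / ios if ios else 0.0
        result[dev] = (util, await_ms)
    return result

# Made-up samples taken one second apart, in /proc/diskstats layout.
sample_1 = "   8       0 sda 1000 0 80000 5000 2000 0 160000 20000 0 9000 25000"
sample_2 = "   8       0 sda 1010 0 80800 5600 2040 0 163200 23400 3 9980 29000"
print(disk_pressure(parse_diskstats(sample_1), parse_diskstats(sample_2), 1.0))
# {'sda': (98.0, 80.0)}
```

On a live host, read /proc/diskstats twice with a short sleep in between and pass both strings through parse_diskstats. Here the disk was busy for 98% of the interval and each request took 80 ms on average, the pattern to look for in the iostat output.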

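Network I/O is more subtle. A process waiting on an ordinary socket sleeps interruptibly and does not count towards load, but network-backed storage such as NFS does: a slow or unreachable server leaves processes blocked in uninterruptible sleep while they use no CPU at all. Check active connections with ss -tun (ss -tuln lists only listening sockets) and monitor network throughput with iftop or nethogs.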

The Process State Detective Work

Run ps axl and examine the STAT column. Look for processes marked with 'D' - these are stuck in uninterruptible sleep, usually waiting for I/O operations to complete. Unlike sleeping processes (marked 'S'), these D-state processes count towards load average.

A handful of processes stuck in D-state can inflate your load average dramatically. Common causes include failing storage devices, NFS mounts with connectivity issues, or applications making blocking system calls to unresponsive resources.
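The same check can be scripted by reading the state field from /proc/<pid>/stat directly. This is a minimal sketch that assumes a Linux /proc filesystem; the function names parse_stat and d_state_tasks are illustrative, not part of any tool.

```python
import glob

def parse_stat(line):
    """Return (pid, command, state) from one /proc/<pid>/stat line."""
    # The command sits in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = line.rpartition(")")
    pid, _, comm = head.partition(" (")
    return int(pid), comm, tail.split()[0]

def d_state_tasks():
    """List (pid, command) for every process currently in state D."""
    blocked = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were looking
        if state == "D":
            blocked.append((pid, comm))
    return blocked

if __name__ == "__main__":
    for pid, comm in d_state_tasks():
        print(pid, comm)
```

D-state is often transient, so a single empty result proves little. Running the check several times a few seconds apart and looking for PIDs that keep reappearing gives a more reliable picture.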

Memory Pressure and Swapping

Even with 'free' memory showing, your system might be swapping heavily. Check /proc/meminfo for SwapTotal and SwapFree values, or use vmstat 1 to monitor swap activity in real-time. The 'si' and 'so' columns show swap in/out rates - any sustained activity here will elevate load averages, because processes block in uninterruptible sleep while their pages are read back from disk.

Applications with memory leaks often trigger this behaviour. They consume available RAM gradually, forcing the kernel to swap out other processes. The result: acceptable memory usage figures but terrible performance characteristics.
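Swap activity can also be measured from the cumulative pswpin and pswpout counters in /proc/vmstat. The sketch below reports pages per second rather than the KiB per second that vmstat prints, to avoid assuming a page size. The sample values are invented and the function names are hypothetical.

```python
def swap_counters(vmstat_text):
    """Extract cumulative pswpin/pswpout page counts from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name in ("pswpin", "pswpout"):
            counters[name] = int(value)
    return counters

def swap_rates(before, after, interval_s):
    """Pages swapped in and out per second between two samples."""
    return {
        name: (after[name] - before[name]) / interval_s
        for name in ("pswpin", "pswpout")
    }

# Made-up excerpts of /proc/vmstat taken five seconds apart.
first = "nr_free_pages 51234\npswpin 1000\npswpout 5000"
second = "nr_free_pages 50110\npswpin 1600\npswpout 9500"
print(swap_rates(swap_counters(first), swap_counters(second), 5.0))
# {'pswpin': 120.0, 'pswpout': 900.0}
```

Rates that stay above zero across several consecutive intervals are the sustained activity worth investigating; an occasional non-zero sample is usually harmless.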

Kernel and Driver Issues

Occasionally, kernel modules or drivers malfunction and create artificial load. Check dmesg for recent error messages, particularly around storage controllers, network interfaces, or virtualisation components. On virtual machines, issues with the hypervisor or VM tools can manifest as unexplained load spikes.

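The /proc/loadavg file also shows, in its fourth field, the number of currently runnable tasks over the total number of tasks (for example 2/345). These are kernel scheduling entities, so every thread counts individually; compare against ps -eLf, which lists threads, rather than plain ps. If the number still seems disproportionately high compared to what ps reveals, you might be dealing with kernel-level issues that require deeper investigation.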
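For scripting, the five fields of /proc/loadavg can be split apart as follows. This is a small sketch with a made-up sample line; parse_loadavg is an illustrative name.

```python
def parse_loadavg(text):
    """Split the contents of /proc/loadavg into named fields."""
    one, five, fifteen, tasks, last_pid = text.split()
    runnable, total = tasks.split("/")
    return {
        "load_1m": float(one),
        "load_5m": float(five),
        "load_15m": float(fifteen),
        "runnable_tasks": int(runnable),  # scheduling entities, threads included
        "total_tasks": int(total),
        "last_pid": int(last_pid),
    }

# Made-up sample in the format of /proc/loadavg.
sample = "8.02 6.51 3.20 2/345 12345\n"
info = parse_loadavg(sample)
print(info["load_1m"], info["runnable_tasks"], info["total_tasks"])
# 8.02 2 345
```

A sample like this one, with a 1-minute load of 8 but only two runnable tasks, suggests that most of the load comes from tasks in uninterruptible sleep rather than from CPU demand.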

Monitoring for Pattern Recognition

These ghost load scenarios are much easier to diagnose with historical data. Effective lightweight monitoring tracks load averages alongside I/O wait times and memory pressure, helping you correlate spikes with their actual causes rather than scrambling to piece together evidence after the fact.

Detailed process and I/O monitoring reveals patterns that spot checks miss - like gradual memory leaks that only trigger swapping at specific times, or I/O bottlenecks that coincide with backup schedules or batch processing jobs.
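As one example of a metric worth recording next to load average, the share of CPU time spent in I/O wait can be computed from two samples of the aggregate cpu line in /proc/stat. This is a rough sketch under the assumption that the first eight counters (user through steal) make up the total; the sample lines are invented and the function names are hypothetical.

```python
def cpu_times(stat_text):
    """Return the aggregate CPU counters from the 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # user, nice, system, idle, iowait, irq, softirq, steal
            return [int(v) for v in fields[1:9]]
    raise ValueError("no aggregate cpu line found")

def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0

# Made-up samples of the first line of /proc/stat.
earlier = "cpu  1000 0 500 8000 200 0 50 0 0 0"
later = "cpu  1100 0 550 8500 540 0 60 0 0 0"
print(iowait_percent(cpu_times(earlier), cpu_times(later)))
# 34.0
```

Logging this value with a timestamp alongside the load average makes it possible to see afterwards whether a spike coincided with I/O wait, for instance during a backup window.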

Finding the Real Problem

Next time your load average spikes without obvious CPU culprits, work through I/O statistics, process states, and memory pressure systematically. The kernel documentation at kernel.org provides detailed information about interpreting /proc filesystem data for deeper analysis.

Most mysterious load issues have logical explanations - they just require looking beyond the obvious metrics. Try Server Scout's monitoring to build the historical context that makes these investigations much more straightforward.
