
Debugging Disk Performance When iostat Looks Normal: The Hidden Metrics That Matter

· Server Scout

Your web application starts timing out. Database queries that normally complete in milliseconds now take seconds. Users are complaining, but when you check iostat -x 1, everything looks perfectly normal. Utilisation sits at 15%, await times are under 10ms, and queue depth barely registers. Yet something is clearly throttling your disk performance.

This scenario hits more administrators than you'd expect, especially as storage hardware becomes more complex. The traditional metrics we rely on don't always tell the complete story.

The Kernel's Hidden Storage Metrics

Linux tracks dozens of storage-related counters that iostat never shows you. Start by examining /proc/diskstats directly rather than relying on summarised tools:

grep -E '(nvme0n1|sda)' /proc/diskstats

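On kernels 4.18 and later, each line carries discard statistics after the classic eleven I/O counters. Counting from the start of the line, column 15 shows discards completed and column 18 shows milliseconds spent on discards (the kernel documentation numbers fields after the device name, so it calls these fields 12 and 15). These counters track TRIM requests the kernel sends to the device, and older iostat releases do not display them. Heavy discard traffic, which often coincides with aggressive garbage collection inside the SSD, can stall regular I/O on some drives. Applications then experience intermittent freezes whilst utilisation and await still look healthy.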
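As a rough sketch of how to pull those counters out, the function below prints the discard count, total discard time, and a derived average for each matching device. The column positions (15 and 18, counted from the start of the line) assume kernel 4.18 or later, and discard_stats is a hypothetical helper name rather than a standard tool.

```shell
# Print discard counters for matching devices from a diskstats-format file.
# Assumes Linux 4.18+ layout: $15 = discards completed, $18 = ms spent discarding.
# discard_stats is a hypothetical helper name, not part of any package.
discard_stats() {
    pattern=${1:-'^(nvme0n1|sda)$'}   # regex matched against the device name
    file=${2:-/proc/diskstats}        # can be overridden with a saved snapshot
    awk -v pat="$pattern" '
        $3 ~ pat && NF >= 18 {
            # Average time per discard since boot, guarded against divide-by-zero
            avg = ($15 > 0) ? $18 / $15 : 0
            printf "%s discards=%d discard_ms=%d avg_ms=%.2f\n", $3, $15, $18, avg
        }' "$file"
}
```

Running discard_stats twice a few seconds apart and comparing the output shows whether discards are climbing during a stall. Lines with fewer than 18 columns (older kernels) are skipped rather than misread.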

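Check /sys/block/*/stat for additional context. The ninth value, in_flight, is the number of requests issued to the device but not yet completed at the instant you read the file. Because iostat reports averages over its sampling interval, short bursts of queued I/O that barely move the average queue size can still show up here.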
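Since in_flight is an instantaneous value, a single read can easily miss a burst. One way to catch short spikes, sketched here under the assumption that in_flight is the ninth field of the stat file, is to sample it repeatedly and keep the peak. The helper name sample_inflight and its defaults are illustrative only.

```shell
# Repeatedly read a sysfs stat file and report the highest in_flight value seen.
# Assumes in_flight is the 9th whitespace-separated field.
# sample_inflight is a hypothetical helper name.
sample_inflight() {
    stat_file=$1        # e.g. /sys/block/nvme0n1/stat
    samples=${2:-50}    # how many times to read the file
    interval=${3:-0.1}  # seconds between reads (fractional sleep needs GNU or BSD sleep)
    peak=0
    i=0
    while [ "$i" -lt "$samples" ]; do
        cur=$(awk '{ print $9 }' "$stat_file")
        if [ "$cur" -gt "$peak" ]; then
            peak=$cur
        fi
        sleep "$interval"
        i=$((i + 1))
    done
    echo "peak_in_flight=$peak"
}
```

For example, sample_inflight /sys/block/nvme0n1/stat 100 0.05 watches the device for roughly five seconds. A peak far above the average queue size that iostat reports suggests bursty queuing that interval averages are smoothing away.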

When the Storage Stack Lies

Modern NVMe drives implement their own internal queuing and caching mechanisms. A drive might report completion to the kernel whilst still processing writes internally. Your iostat shows excellent performance, but the drive's internal buffers are saturated.

Use nvme smart-log /dev/nvme0n1 to examine the drive's internal perspective. Pay attention to the "Percentage Used" and "Available Spare" fields. Some enterprise SSDs throttle performance dramatically when spare capacity drops below certain thresholds.
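To make that check repeatable, the sketch below reads nvme smart-log text on standard input and flags a drive whose available spare has fallen to its threshold. It assumes the human-readable output uses "name : value" lines with the keys available_spare, available_spare_threshold, and percentage_used. Layout can differ between nvme-cli versions, so treat the parsing as an assumption to verify against your own output. check_nvme_spare is a hypothetical name.

```shell
# Summarise spare capacity and wear from `nvme smart-log` text on stdin.
# Assumes "key : value" lines; key names may vary between nvme-cli versions.
# check_nvme_spare is a hypothetical helper name.
check_nvme_spare() {
    awk -F: '
        {
            key = $1; gsub(/[ \t]/, "", key)    # strip padding around the key
            val = $2; gsub(/[ \t%]/, "", val)   # strip padding and the % sign
        }
        key == "available_spare"           { spare = val }
        key == "available_spare_threshold" { thresh = val }
        key == "percentage_used"           { used = val }
        END {
            printf "spare=%s%% threshold=%s%% used=%s%%\n", spare, thresh, used
            if (spare != "" && thresh != "" && spare + 0 <= thresh + 0)
                print "WARNING: available spare is at or below threshold"
        }'
}
```

Typical usage would be nvme smart-log /dev/nvme0n1 | check_nvme_spare, run as root.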

For traditional SATA drives, smartctl -A /dev/sda reveals pending sector reallocations and other hardware-level issues that create invisible delays.
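A small filter can pull the relevant rows out of that attribute table. The sketch below assumes the classic ten-column smartctl -A layout, with the raw value in the last column, and the common attribute names Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable. Vendors do not all use the same names, so the list is an assumption, and smart_sector_health is a hypothetical helper name.

```shell
# Extract sector-health attributes from `smartctl -A` output on stdin.
# Assumes the 10-column attribute table; $2 = attribute name, $10 = raw value.
# smart_sector_health is a hypothetical helper name.
smart_sector_health() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ {
        printf "%s=%s\n", $2, $10
    }'
}
```

Used as smartctl -A /dev/sda | smart_sector_health, any non-zero raw value is worth investigating, since a drive retrying reads on marginal sectors can add delays without raising utilisation.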

The Container Storage Problem

If you're running containers, the situation becomes murkier. Docker's overlay filesystems add layers of abstraction between your application and the underlying storage. A simple file write might trigger copy-on-write operations across multiple filesystem layers.

Check docker system df -v to identify bloated containers consuming excessive space. More importantly, examine the storage driver's specific metrics. For overlay2, look at /var/lib/docker/overlay2/*/diff to spot containers creating excessive temporary files.
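One simple way to inspect those diff directories is to rank them by size. The sketch below assumes the default overlay2 location mentioned above. The function name largest_overlay_diffs and its parameters are illustrative, not part of Docker.

```shell
# List the largest overlay2 writable layers, biggest first.
# largest_overlay_diffs is a hypothetical helper name.
largest_overlay_diffs() {
    root=${1:-/var/lib/docker/overlay2}  # overlay2 data root (default Docker path)
    top=${2:-10}                         # how many entries to show
    # du -sk prints the size in KiB followed by the path
    du -sk "$root"/*/diff 2>/dev/null | sort -rn | head -n "$top"
}
```

Reading /var/lib/docker normally requires root. Each directory name is a layer ID, and docker inspect on a running container shows the same ID under its GraphDriver data, so a large diff can be traced back to the container that owns it.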

Finding the Real Bottleneck

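When standard tools fail, bpftrace can trace storage operations at the kernel level. The block_io_start tracepoint used below exists only on recent kernels (it was added around 6.5); on older kernels, block:block_rq_issue is the usual substitute. Run as root, this one-liner counts block I/O requests by the name of the process that issued them, which shows who is actually generating the disk load: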

bpftrace -e 'tracepoint:block:block_io_start { @[comm] = count(); }'

Combine this with filesystem-level tracing to identify whether the problem lies in the block layer, filesystem metadata operations, or application-level inefficiencies.

Many administrators focus exclusively on throughput and IOPS, but filesystem metadata operations often become the limiting factor. An application creating thousands of small files can saturate the journal whilst showing minimal activity in traditional monitoring tools.

Monitoring Beyond iostat

Production monitoring should capture these deeper metrics automatically. Server Scout's lightweight monitoring approach can track custom storage metrics through bash-based plugins, letting you monitor drive-specific SMART data and NVMe telemetry alongside traditional performance counters.

The kernel exposes storage performance data through dozens of /proc and /sys interfaces. The Linux kernel documentation lists the complete set of available counters, many of which reveal problems that iostat completely misses.

Next time your storage feels sluggish but the obvious metrics look fine, remember that the kernel tracks far more than most tools display. Sometimes the most important performance data hides in plain sight.
