The Mystery of the Slow Database
Your MySQL server is crawling. Queries that normally finish in milliseconds are taking seconds. Users are complaining, but when you run iotop, everything looks fine. Disk utilisation sits at a reasonable 40%, no processes are hammering the storage, and iostat shows modest queue depths.
Yet something is clearly wrong.
This scenario plays out more often than you'd expect, especially on modern servers with SSDs. The problem isn't that your monitoring tools are broken - it's that they're measuring the wrong things.
What Standard Tools Actually Measure
Most administrators rely on iotop, iostat, and similar utilities to diagnose storage bottlenecks. These tools excel at showing throughput metrics: how many megabytes per second you're reading or writing, how busy your disks appear to be.
But databases don't care much about throughput. They care about latency.
A single slow I/O operation can block dozens of queries, even if your overall disk usage looks perfectly normal. When MySQL waits 200ms for a single page read, it doesn't matter that your SSD can theoretically handle 500MB/s - that one query is stuck.
The Latency Blind Spot
Here's what typically happens: your database server handles mixed workloads throughout the day. Most I/O operations complete quickly, but a few take much longer - perhaps due to background maintenance, filesystem fragmentation, or the occasional large sequential scan.
Standard monitoring shows averages. If 95% of your I/O completes in 2ms but 5% takes 100ms, the average might look acceptable. Meanwhile, your application grinds to a halt whenever it hits one of those slow operations.
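To make that gap concrete, here is a small Python sketch (using only the standard library) of the exact distribution described above: 95% of operations at 2ms, 5% at 100ms. The mean looks perfectly healthy while the tail is awful.

```python
import statistics

# Hypothetical latency sample: 95% fast I/O at ~2 ms, 5% slow at ~100 ms,
# mirroring the distribution described above.
latencies_ms = [2.0] * 95 + [100.0] * 5

mean = statistics.mean(latencies_ms)                 # 6.9 ms: looks fine
p95 = statistics.quantiles(latencies_ms, n=100)[94]  # ~95 ms: the tail
p99 = statistics.quantiles(latencies_ms, n=100)[98]  # 100 ms

print(f"mean={mean:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms")
# mean=6.9ms p95=95.1ms p99=100.0ms
```

A dashboard plotting only that 6.9ms mean would never show you the 100ms stalls your users are hitting.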
To see what's really happening, run iostat -x 1 and focus on the await column, which reports the average time each I/O request spends waiting plus being serviced. Better yet, use biolatency from the BCC toolkit to get proper latency histograms. You'll often discover that whilst your median I/O latency is fine, your 95th or 99th percentile latency is terrible.
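If you're curious where iostat's await figure comes from, it can be derived from the per-device counters in /proc/diskstats on Linux. The sketch below computes it over a one-second interval (the device name 'sda' is just an example). Note the limitation: this is still an average, so it cannot expose the tail.

```python
import time

def disk_counters(device):
    """Return (ops_completed, ms_spent_on_io) for one block device,
    parsed from Linux /proc/diskstats."""
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == device:
                reads, read_ms = int(parts[3]), int(parts[6])
                writes, write_ms = int(parts[7]), int(parts[10])
                return reads + writes, read_ms + write_ms
    raise ValueError(f"device {device!r} not found")

def average_await(device, interval=1.0):
    """Average ms per I/O over the interval, like iostat's await column.
    Still an average -- it hides the slow 1% entirely."""
    ops0, ms0 = disk_counters(device)
    time.sleep(interval)
    ops1, ms1 = disk_counters(device)
    ops = ops1 - ops0
    return (ms1 - ms0) / ops if ops else 0.0

# Example (device name is illustrative):
# print(f"await: {average_await('sda'):.2f} ms")
```

For per-request latency rather than interval averages, you need tracing, which is exactly what biolatency does with eBPF.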
Finding the Real Culprit
The usual suspects for latency spikes include:
Background filesystem operations: fstrim, automatic defragmentation, or journal commits can cause periodic slowdowns that barely register in throughput metrics but devastate latency.
Memory pressure: When the kernel starts evicting database buffer cache to make room for other processes, previously fast queries suddenly require disk reads. The symptom appears as I/O latency, but the root cause is memory allocation.
Storage controller issues: RAID rebuilds, firmware bugs, or thermal throttling can introduce latency without significantly impacting throughput.
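One way to catch these intermittent culprits in the act is to sample storage latency directly. Here's an illustrative probe that times small write-plus-fsync cycles, roughly analogous to a database page flush, and reports percentiles. It's a sketch for spotting tail latency, not a substitute for proper tracing with biolatency.

```python
import os
import tempfile
import time

def probe_fsync_latency(samples=200, dir=None):
    """Time write+fsync cycles to sample real storage latency,
    including the tail that averages hide. Illustrative only."""
    fd, tmp = tempfile.mkstemp(dir=dir)
    latencies = []
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, b"x" * 4096)   # one 4 KiB page, like a DB page
            os.fsync(fd)                # force it to stable storage
            latencies.append((time.perf_counter() - start) * 1000)
            os.lseek(fd, 0, os.SEEK_SET)
    finally:
        os.close(fd)
        os.unlink(tmp)
    latencies.sort()
    return {
        "p50_ms": latencies[samples // 2],
        "p95_ms": latencies[min(int(samples * 0.95), samples - 1)],
        "p99_ms": latencies[min(int(samples * 0.99), samples - 1)],
    }
```

Run it periodically and a RAID rebuild or fstrim pass shows up as a sudden jump in p95/p99 while p50 barely moves.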
The key insight is that production database performance depends on consistent low latency, not high throughput. A monitoring system that only tracks the latter will miss the problems that actually affect users.
Better Monitoring for Real Problems
Effective database server monitoring needs to track latency percentiles, not just averages. You want alerts when your 95th percentile I/O latency exceeds reasonable thresholds, even if your average metrics look fine.
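The alerting rule itself is simple once you have the percentiles. A minimal sketch (the 10ms threshold is illustrative; tune it to your storage and workload):

```python
def latency_alert(latencies_ms, p95_threshold_ms=10.0):
    """Fire when tail latency breaches the threshold,
    even if the mean looks fine. Threshold is illustrative."""
    ordered = sorted(latencies_ms)
    p95 = ordered[min(int(len(ordered) * 0.95), len(ordered) - 1)]
    return p95 > p95_threshold_ms

# The distribution from earlier: mean ~6.9 ms, yet the alert fires.
sample = [2.0] * 95 + [100.0] * 5
print(latency_alert(sample))  # True
print(latency_alert([2.0] * 100))  # False
```

A mean-based alert at the same 10ms threshold would stay silent on that first sample.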
Modern monitoring solutions like Server Scout capture these nuanced performance patterns without the overhead of heavyweight agents that might themselves contribute to your latency problems. The goal is visibility into what actually impacts your applications, not just impressive-looking throughput graphs.
If your database performance doesn't match what your standard I/O tools suggest, you're probably measuring the wrong metrics. Start tracking latency percentiles and you'll likely find the real bottleneck hiding in plain sight.
Want to see how latency-focused monitoring changes your perspective on server performance? Server Scout's free trial includes the kind of detailed metrics that reveal these hidden bottlenecks.