When Hypervisors Lie: Why Your VM Shows Low CPU but Feels Slow

· Server Scout

The Mystery of the Well-Behaved VM

Your application team is complaining about slow response times, but every metric looks fine. top shows 15% CPU usage, sar reports plenty of idle time, and memory is nowhere near exhausted. Yet database queries that should complete in milliseconds are taking seconds.

You're probably looking at steal time - the silent performance killer that most monitoring tools completely ignore.

Steal time measures how long your VM waited for the hypervisor to give it actual CPU cycles. When the physical host is overloaded or poorly configured, your "idle" VM might actually be queuing for resources it thinks it has.

Spotting the Symptoms

The first clue is the disconnect between resource utilisation and application performance. Your VM reports low CPU usage because it can only measure what the hypervisor tells it about. But if the hypervisor is juggling too many VMs on limited physical cores, your "available" CPU becomes a polite fiction.

Check /proc/stat directly:

head -1 /proc/stat
cpu  2049724 12 447901 187696443 14529 0 23723 89542 0 0

The eighth number (89542 in this example) is steal time in jiffies - the field order is user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice, so steal sits third from the end. If it's climbing steadily, you've found your problem.
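Because the counter only ever goes up, a single reading tells you little; what matters is the delta over an interval as a share of all CPU time. Here's a minimal sketch - the `steal_pct` helper and the canned sample lines are illustrative, not anything standard:

```shell
#!/bin/sh
# Compute steal as a percentage of total CPU time between two /proc/stat
# samples. Fields 2-9 on the "cpu" line are user nice system idle iowait
# irq softirq steal (guest/guest_nice are excluded, as guest time is
# already folded into user on Linux).
steal_pct() {
  # $1 and $2 are two "cpu ..." lines taken some interval apart
  printf '%s\n%s\n' "$1" "$2" | awk '
    { for (i = 2; i <= 9; i++) sum[NR] += $i; steal[NR] = $9 }
    END { printf "%.1f\n", 100 * (steal[2] - steal[1]) / (sum[2] - sum[1]) }'
}

# On a live system you would sample like this:
#   s1=$(grep "^cpu " /proc/stat); sleep 1; s2=$(grep "^cpu " /proc/stat)
s1='cpu  1000 0 500 8000 100 0 50 200 0 0'
s2='cpu  1100 0 550 8700 110 0 55 285 0 0'
steal_pct "$s1" "$s2"
# prints 8.9
```

With the canned samples, 85 of the 950 elapsed jiffies were stolen, hence 8.9% - well past the point where you'd want to investigate.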

Running vmstat makes this easier to read:

vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 3924312  83524 891644    0    0     2     8   45   78  4  1 93  0  2

That final column (st) shows steal time as a percentage. Anything consistently above 5% suggests resource contention at the hypervisor level.
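Since st is the last field of every vmstat data line, a small filter can flag samples over your threshold. A sketch, assuming the default vmstat layout shown above; the `check_steal` name and the 5% limit are illustrative (pipe in live data with `vmstat 1 | sh check_steal.sh`):

```shell
#!/bin/sh
# Flag vmstat samples whose steal column (last field, "st") exceeds a
# threshold. Header lines don't start with a digit, so they're skipped.
check_steal() {
  awk -v limit=5 '/^ *[0-9]/ { if ($NF > limit) print "steal high:", $NF "%" }'
}

# Canned input so the logic can be shown deterministically:
printf '%s\n' \
  ' 1  0      0 3924312  83524 891644    0    0     2     8   45   78  4  1 93  0  2' \
  ' 3  0      0 3920104  83524 891700    0    0     0    12  210  540 22  5 61  0 12' \
  | check_steal
# prints: steal high: 12%
```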

The Hypervisor Resource Crunch

Modern hypervisors oversell physical resources by design. A host with 16 physical cores might run 50 VMs, each allocated 2 vCPUs. This works brilliantly when workloads are complementary, but falls apart when multiple VMs hit peak usage simultaneously.
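For the numbers above, the arithmetic is worth spelling out:

```shell
# 50 VMs x 2 vCPUs scheduled onto 16 physical cores - every physical
# core is promised to more than six vCPUs.
awk 'BEGIN { printf "%.2f vCPUs per physical core\n", (50 * 2) / 16 }'
# prints: 6.25 vCPUs per physical core
```

A 6.25x overcommit is survivable only as long as most of those vCPUs stay idle most of the time.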

The problem compounds with memory pressure. When the hypervisor starts swapping VM memory to disk, steal time explodes as VMs wait for pages to be read back from storage.

Similar issues occur with high context switches that standard tools miss, but steal time specifically indicates problems one layer up at the hypervisor level.

Beyond the Obvious Metrics

Steal time often correlates with network and disk latency issues that don't show up in traditional I/O statistics. When the hypervisor is resource-starved, everything slows down - including the virtual hardware that presents storage and network interfaces to your VM.

This is why lightweight monitoring that tracks the metrics that matter becomes crucial in virtualised environments. Heavy monitoring agents make the problem worse by consuming resources that are already constrained.

The Linux kernel documentation explains how /proc/stat calculates these values, but the key point is that steal time represents real performance impact that application-level monitoring completely misses.

Fighting Back

If you control the hypervisor, the solution is straightforward: reduce the overcommit ratio or migrate VMs to less loaded hosts. But most of us rent VMs from providers who won't discuss their infrastructure details.

In those cases, document the steal time patterns and correlate them with application performance issues. Most cloud providers will eventually address consistent resource contention if you can prove it's affecting your workload.
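One low-effort way to build that evidence is a cron job that appends the raw steal counter with a timestamp. A sketch - the file name and a once-a-minute cadence are assumptions, adjust for your setup:

```shell
#!/bin/sh
# Append a timestamped steal reading to a CSV so you can show your
# provider a pattern, not an anecdote. The value is a cumulative
# counter in jiffies: diff consecutive rows to get a rate.
LOG=${LOG:-steal.csv}
steal=$(awk '/^cpu /{print $9}' /proc/stat)
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ),$steal" >> "$LOG"
```

Run it from crontab (`* * * * * /usr/local/bin/log_steal.sh`, path hypothetical) and you'll have a timeline to lay alongside your application's latency graphs.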

Server Scout tracks steal time alongside traditional CPU metrics, so you can spot hypervisor issues before they impact your applications. The free trial includes all monitoring features - no need to guess whether your performance problems are local or infrastructure-related.

Ready to Try Server Scout?

Start monitoring your servers and infrastructure in under 60 seconds. Free for 3 months.

Start Free Trial