Understanding CPU Steal Time and IO Wait
When monitoring server performance, CPU utilisation tells only part of the story. Two critical metrics that often reveal hidden performance bottlenecks are CPU steal time and IO wait. Server Scout provides optional monitoring for these essential metrics, helping you identify when your server's performance is constrained by virtualisation overhead or disk I/O operations.
What Are These Metrics?
IO Wait represents the percentage of time your CPU cores spend idle whilst waiting for disk I/O operations to complete. When this value is high, it indicates that your applications are frequently blocked waiting for data to be read from or written to storage devices.
CPU Steal Time occurs in virtualised environments when your virtual machine has work ready to run but the hypervisor is using the physical CPU for something else, typically another VM on the same physical host. Essentially, it's time that your VM should have had access to the CPU but was "stolen" by other virtual machines competing for the same physical resources.
Enabling These Metrics in Server Scout
Both metrics are available as optional monitoring features in Server Scout, as they require additional parsing of /proc/stat data.
- Access your Server Scout dashboard and navigate to your server's configuration
- Locate the "Optional Metrics" section
- Enable cpu_iowait to monitor IO wait percentages
- Enable cpu_steal to track CPU steal time
- Save your configuration changes
The agent will begin collecting these metrics within the next monitoring cycle, typically within 60 seconds.
# Confirm the kernel exposes the underlying counters (after the "cpu" label, the fifth value is iowait and the eighth is steal):
head -n 1 /proc/stat
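The first line of /proc/stat holds cumulative counters since boot, so a percentage only makes sense as the difference between two samples. The following is a minimal sketch of that calculation, not Server Scout's actual implementation; the function name cpu_pct is hypothetical, and the field order assumed is the Linux one (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice).

```shell
#!/bin/sh
# Print iowait and steal as a percentage of all CPU time that elapsed
# between two aggregate "cpu" lines taken from /proc/stat.
# Hypothetical helper for illustration only.
cpu_pct() {
    # $1 = earlier "cpu ..." line, $2 = later "cpu ..." line
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) prev[i] = $i }
        NR == 2 {
            # Fields 2..9 are user..steal; guest time is already
            # counted inside user, so it is not added again.
            total = 0
            for (i = 2; i <= 9; i++) total += $i - prev[i]
            if (total <= 0) { print "iowait=0.0 steal=0.0"; exit }
            printf "iowait=%.1f steal=%.1f\n",
                100 * ($6 - prev[6]) / total,
                100 * ($9 - prev[9]) / total
        }'
}

# Usage on a live Linux host: sample twice, five seconds apart.
# a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat)
# cpu_pct "$a" "$b"
```

A longer gap between the two samples smooths out short spikes; a shorter one makes them visible.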
Why These Metrics Matter for Cloud and Virtual Environments
In cloud environments and virtualised infrastructure, these metrics become particularly crucial:
- Cloud instances often share physical hardware with other tenants, making CPU steal time a key indicator of resource contention
- Virtual machines may experience unpredictable I/O performance depending on the underlying storage architecture
- Container environments can mask these issues at the application level whilst the host experiences significant resource pressure
High values in either metric can explain seemingly mysterious performance degradation that doesn't appear in standard CPU or memory monitoring.
Recommended Alert Thresholds
Based on industry best practices and real-world experience, consider implementing these alert thresholds:
IO Wait:
- Warning: >10% sustained for 5 minutes
- Critical: >20% sustained for 3 minutes
CPU Steal Time:
- Warning: >5% sustained for 5 minutes
- Critical: >10% sustained for 3 minutes
These thresholds should be adjusted based on your specific workload characteristics and service level requirements.
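To make "sustained" concrete, the sketch below checks whether the most recent readings have all stayed above a threshold. It assumes one reading per minute (matching the 60-second collection cycle mentioned earlier), so "5 minutes" becomes five consecutive readings. The function name sustained_breach is hypothetical and this is not how Server Scout evaluates alerts internally.

```shell
#!/bin/sh
# sustained_breach THRESHOLD COUNT
# Reads one percentage per line on stdin, oldest first.
# Exits 0 if the last COUNT readings are all above THRESHOLD, else 1.
# Hypothetical helper for illustration only.
sustained_breach() {
    awk -v limit="$1" -v need="$2" '
        # Track the length of the current run of readings above the limit;
        # any reading at or below the limit resets the run to zero.
        { run = ($1 + 0 > limit + 0) ? run + 1 : 0 }
        END { exit (run >= need + 0 ? 0 : 1) }'
}

# Usage: IO wait warning (>10% for 5 one-minute samples)
# printf '4\n12\n11\n13\n15\n12\n' | sustained_breach 10 5 && echo "warning"
```

Requiring an unbroken run avoids alerting on a single spike, at the cost of resetting whenever one reading dips below the limit.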
Troubleshooting High Values
High IO Wait (>20%)
When IO wait consistently exceeds 20%, consider these solutions:
- Upgrade storage performance - Move from standard to SSD-backed storage
- Optimise database queries - Review slow query logs and add appropriate indices
- Implement caching - Reduce disk reads with Redis or Memcached
- Check for disk errors using dmesg and smartctl
# Check current I/O statistics
iostat -x 1 5
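Device-level statistics show which disk is busy but not which process is waiting on it. One common complement, sketched here, is to list processes in uninterruptible sleep (state D), which on Linux usually means they are blocked on I/O. The function name blocked_on_io is hypothetical, and the ps invocation in the usage comment assumes the procps version of ps.

```shell
#!/bin/sh
# Filter "state pid command" lines down to processes in state D
# (uninterruptible sleep) and print their PID and command name.
# Hypothetical helper for illustration only; only the first word of
# the command name is printed.
blocked_on_io() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Usage on a live host (header lines suppressed with "name="):
# ps -e -o state= -o pid= -o comm= | blocked_on_io
```

A process appearing here once is normal; the same process appearing across repeated runs is a stronger hint that it is driving the IO wait.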
High CPU Steal Time (>10%)
Persistent steal time issues require different approaches:
- Upgrade instance type - Move from burstable or shared-CPU instances to compute-optimised instances with dedicated vCPUs
- Switch to dedicated hosts - Eliminate noisy neighbour problems entirely
- Migrate to different availability zones - Sometimes resource contention varies by location
- Consider bare metal options - For consistently high-performance requirements
# Monitor steal time in real-time
top -d 1
# Look for the "st" value in the %Cpu(s) summary line
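For logging or scripting, top can also run in batch mode. The sketch below pulls the steal value out of a CPU summary line. It assumes the summary format used by recent procps releases ("... 0.2 si, 3.4 st") and a locale that uses a dot as the decimal separator; other versions of top may format the line differently. The function name steal_from_top is hypothetical.

```shell
#!/bin/sh
# Print the number that immediately precedes "st" on each input line
# that contains one. Lines without a matching value print nothing.
# Hypothetical helper for illustration only.
steal_from_top() {
    sed -n 's/.*[ ,]\([0-9][0-9.]*\) st.*/\1/p'
}

# Usage on a live host: two batch iterations, one second apart.
# The first iteration reports averages since boot, so read the second value.
# top -b -n 2 -d 1 | grep 'Cpu(s)' | steal_from_top
```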
Monitoring Best Practices
- Establish baselines during normal operations to understand typical values for your workload
- Correlate with application metrics - High steal time often coincides with increased response times
- Monitor trends over time rather than focusing solely on instantaneous values
- Document patterns - Note whether issues occur at specific times or correlate with backup operations or batch jobs
By actively monitoring CPU steal time and IO wait alongside traditional metrics, you'll gain valuable insights into your server's true performance characteristics and can proactively address bottlenecks before they impact your users.