Understanding CPU Metrics in Linux Server Monitoring
CPU metrics are fundamental to understanding server performance, yet they're often misinterpreted. Server Scout collects 11 comprehensive CPU metrics every 5 seconds, providing detailed insight into how your processor time is being consumed. This article explains each metric and how to use them for effective performance analysis.
How Server Scout Collects CPU Data
Server Scout's agent reads CPU statistics from /proc/stat, which exposes cumulative counters for each CPU time category, measured in clock ticks (often called "jiffies"). Every 5 seconds, the agent calculates the delta between consecutive readings and converts each category's delta into a percentage of the total, giving you near-real-time insight into CPU utilisation patterns.
This fast-tier collection (every 5 seconds) ensures you capture short-lived CPU spikes that slower monitoring intervals might miss entirely.
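The delta calculation described above can be sketched in Python. This is a minimal illustration, not Server Scout's actual agent code: the names `parse_cpu_line`, `cpu_percentages`, and `read_cpu_counters` are hypothetical, and the field order is the one used by the aggregate `cpu` line of /proc/stat on current Linux kernels.

```python
# Field order of the aggregate "cpu" line in /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Use only the first eight counters; guest time is already included in user/nice.
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Very old kernels report fewer columns, so pad the missing ones with zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_percentages(prev, curr):
    """Convert two counter snapshots into per-category percentages."""
    deltas = {name: curr[name] - prev[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        # No ticks elapsed (or counters reset); report zeros rather than divide by zero.
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def read_cpu_counters(path="/proc/stat"):
    """Read the current aggregate counters from a live Linux host."""
    with open(path) as f:
        return parse_cpu_line(f.readline())
```

On a live Linux host, calling `read_cpu_counters()` twice, 5 seconds apart, and passing both snapshots to `cpu_percentages` gives a breakdown by category. Overall utilisation, as defined in this article, would then be 100 minus the `idle` value.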
The CPU Time Breakdown
Linux divides CPU time into several categories. Understanding these categories is crucial for diagnosing performance issues effectively.
| Metric | Description | Typical Range | What High Values Indicate |
|---|---|---|---|
| cpu_percent | Overall CPU utilisation (100% minus idle time) | Varies by workload | Sustained >85% warrants investigation |
| cpu_user | Time spent executing user-space application code | 20-60% under load | Applications are compute-bound |
| cpu_system | Time spent in kernel space (system calls, drivers) | 5-20% typically | I/O bottlenecks or excessive system calls |
| cpu_iowait | Time CPU was idle but waiting for I/O operations | <5% ideally | Disk or network I/O bottlenecks |
| cpu_steal | Time stolen by hypervisor for other VMs | <2% on good hosts | Overcommitted virtualisation host |
| cpu_nice | Time running low-priority (niced) processes | Usually 0% | Background jobs or batch processing |
| cpu_irq | Time handling hardware interrupts | <2% normally | Busy network cards or storage controllers |
| cpu_softirq | Time handling software interrupts | <5% normally | High network throughput or timer activity |
Core CPU Metrics Explained
cpu_percent - Overall Utilisation
This is your primary CPU health indicator, calculated as 100% minus idle time. It represents the total percentage of time your CPU cores are actively working on something.
Normal behaviour: Highly variable depending on workload. A web server might average 20-40% with spikes to 80%, while a batch processing server might sustain 90%+ during jobs.
Investigation threshold: Sustained utilisation above 85% typically indicates either insufficient CPU capacity or inefficient application behaviour.
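One way to separate sustained utilisation from momentary spikes is to require every sample in a rolling window to exceed the threshold. The sketch below assumes 5-second samples and a 5-minute window (60 samples); the window length and the `SustainedThreshold` class are illustrative assumptions, not Server Scout settings.

```python
from collections import deque


class SustainedThreshold:
    """Report True only when a full window of samples all exceed the threshold."""

    def __init__(self, threshold=85.0, window=60):
        self.threshold = threshold
        # deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def add(self, cpu_percent):
        """Record one sample and return whether the load is currently sustained."""
        self.samples.append(cpu_percent)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and min(self.samples) > self.threshold
```

A single low sample inside the window resets the result to False, so a brief spike to 95% never triggers on its own.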
cpu_user - Application Processing Time
User CPU time represents the percentage spent executing your applications' code - web servers, databases, calculation tasks, and user programs.
High user CPU typically indicates:
- Applications performing intensive calculations
- Efficient, compute-bound workloads
- Well-optimised applications that don't require excessive system calls
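This is generally "good" CPU usage - your applications are doing productive work. Inefficient code also shows up as user time, though, so profile the application before concluding that more capacity is needed.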
cpu_system - Kernel Processing Time
System CPU time represents the percentage spent in kernel space, handling system calls, device drivers, and core operating system functions.
Sustained system CPU above 20% may indicate:
- Inefficient I/O patterns (many small reads/writes instead of larger operations)
- Applications making excessive system calls
- Driver issues or hardware problems
- Memory pressure causing frequent page management
cpu_iowait - The Most Misunderstood Metric
Critical misconception: Many administrators think iowait represents "CPU time spent waiting for I/O." This is incorrect.
Reality: iowait is CPU time that was idle (not busy) but during which the system had outstanding I/O operations. It's a subset of idle time, not busy time.
Think of it this way: if your CPU has nothing to do but your application is waiting for disk reads, that idle time gets classified as iowait instead of regular idle.
High iowait (>10% sustained) indicates:
- Disk I/O bottlenecks (slow storage, insufficient IOPS)
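- Network-backed storage stalls (for example NFS); time spent waiting on ordinary network sockets is generally not counted as iowait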
- Applications blocked on file operations
- Storage subsystem cannot keep up with demand
Key insight: A server showing 50% iowait is I/O bound, not CPU bound. The CPU has spare capacity, so the solution is faster storage, not more CPU cores.
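To make the accounting concrete, the following sketch splits a percentage breakdown into time spent executing code and time spent not executing, with iowait counted on the not-executing side. The function name `summarise_breakdown` and the dictionary keys are hypothetical; the 10% cut-off is the sustained iowait figure quoted in this section.

```python
# Categories in which the CPU is actually executing instructions for this system.
BUSY_FIELDS = ("user", "nice", "system", "irq", "softirq")


def summarise_breakdown(breakdown, iowait_threshold=10.0):
    """Split a CPU percentage breakdown into busy and spare time.

    iowait is treated as idle time with outstanding I/O, not as busy time.
    """
    busy = sum(breakdown.get(name, 0.0) for name in BUSY_FIELDS)
    iowait = breakdown.get("iowait", 0.0)
    idle = breakdown.get("idle", 0.0)
    return {
        "busy": busy,
        "spare": idle + iowait,  # CPU capacity that was available but unused
        "likely_io_bound": iowait > iowait_threshold,
    }
```

For a sample with 30% user, 10% system, 50% iowait, and 10% idle, the CPU was executing code for only 40% of the interval. The remaining 60% was available capacity, which is why adding cores would not help in that case.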
cpu_steal - Virtualisation Overhead
CPU steal only applies to virtualised environments (cloud VMs, VPS hosting). It represents CPU time that your VM should have received but was "stolen" by the hypervisor to service other VMs on the same physical host.
Sustained steal >5% indicates:
- The physical host is overcommitted
- "Noisy neighbour" VMs consuming excessive resources
- Your VM is being starved of CPU cycles
Important: This isn't something you can fix from within your VM. It requires action from your hosting provider (migrating to a less contended host) or upgrading to a dedicated/less shared hosting tier.
cpu_nice - Low Priority Processing
Nice CPU represents time spent running user-space processes whose priority has been lowered (given a positive nice value, typically with the nice command). The scheduler gives these processes a smaller share of CPU time whenever higher-priority tasks are runnable.
Common sources:
- Backup compression jobs
- Batch processing tasks
- Background maintenance scripts
- Processes started with `nice -n 10 command`
High nice CPU is generally benign - these are background tasks designed not to interfere with interactive performance.
cpu_irq - Hardware Interrupt Handling
Hardware interrupts occur when devices need immediate CPU attention - network cards receiving packets, storage controllers completing operations, timers firing.
Typically <2% on most systems. Higher values may indicate:
- Very busy network interfaces
- High-throughput storage operations
- Hardware issues causing excessive interrupts
- Poorly configured interrupt balancing
cpu_softirq - Software Interrupt Processing
Software interrupts handle deferred work from hardware interrupts, particularly network packet processing and timer management.
Typically <5% on most systems. Spikes often correlate with:
- High network throughput (packet processing)
- Many concurrent connections
- Timer-heavy applications
- Network-intensive workloads
Additional CPU Context Metrics
| Metric | Collection Frequency | Purpose |
|---|---|---|
| cpu_temp | Every 5 seconds | Processor temperature in Celsius |
| cpu_cores | Daily | Number of logical CPU cores |
| cpu_model | Daily | Processor model identification |
cpu_temp - Temperature Monitoring
Processor temperature monitoring helps identify thermal throttling issues. Values above 80°C sustained may indicate:
- Inadequate cooling
- Dust accumulation in heat sinks
- Failing fans
- Thermal throttling reducing performance
Note: Many cloud VMs don't expose temperature sensors, so this value may be null in virtualised environments.
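A null reading can be illustrated with a small sketch that returns `None` when no sensor file is present. It assumes the Linux thermal sysfs interface, where a zone's `temp` file holds millidegrees Celsius; which zone corresponds to the CPU varies by hardware, so the default path below is only a guess, and `read_cpu_temp` is a hypothetical helper rather than Server Scout's own method.

```python
from pathlib import Path


def read_cpu_temp(zone_dir="/sys/class/thermal/thermal_zone0"):
    """Return the zone temperature in degrees Celsius, or None if unavailable."""
    temp_file = Path(zone_dir) / "temp"
    try:
        raw = temp_file.read_text().strip()
        # The sysfs value is in millidegrees Celsius.
        return int(raw) / 1000.0
    except (OSError, ValueError):
        # Missing sensor (common on cloud VMs) or unreadable content.
        return None
```

Returning `None` rather than 0 keeps a missing sensor distinguishable from a genuinely cold reading.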
Reading CPU Charts and Diagnostic Patterns
Server Scout's dashboard displays CPU metrics as stacked area charts, where all the components sum to approximately 100% of total CPU time. Here are common patterns and their interpretations:
High User + Low System = Compute-Bound Applications
- Pattern: 60-80% user, 5-15% system, low iowait
- Interpretation: Applications efficiently using CPU for calculations
- Action: Scale horizontally or upgrade to faster processors
High System + High iowait = I/O-Bound Workload
- Pattern: 20-40% system, 10-30% iowait, moderate user
- Interpretation: Applications bottlenecked by storage performance
- Action: Optimise I/O patterns, upgrade storage, increase buffer sizes
High Steal = Overcommitted Host
- Pattern: Variable steal >5%, inconsistent performance
- Interpretation: Virtualisation host cannot provide consistent resources
- Action: Contact hosting provider or migrate to dedicated resources
High Softirq = Network-Heavy Load
- Pattern: 5-15% softirq, correlates with network throughput spikes
- Interpretation: High network packet processing overhead
- Action: Optimise network configuration, consider interrupt balancing
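The four patterns above can be expressed as a simple rule-based classifier. The sketch below uses the lower bounds quoted in each pattern; treating "low iowait" as under 5% is an assumption, and `classify_pattern` is a hypothetical helper, not part of Server Scout.

```python
def classify_pattern(b):
    """Return the diagnostic patterns matched by a CPU percentage breakdown.

    `b` maps category names (user, system, iowait, steal, softirq) to percentages.
    Missing categories are treated as 0.
    """
    user = b.get("user", 0.0)
    system = b.get("system", 0.0)
    iowait = b.get("iowait", 0.0)
    steal = b.get("steal", 0.0)
    softirq = b.get("softirq", 0.0)

    matches = []
    if steal > 5:
        matches.append("overcommitted host")
    if system >= 20 and iowait >= 10:
        matches.append("I/O-bound")
    # "Low iowait" is assumed here to mean below 5%.
    if user >= 60 and system <= 15 and iowait < 5:
        matches.append("compute-bound")
    if softirq >= 5:
        matches.append("network-heavy")
    return matches
```

A single sample can match more than one pattern, and an empty list means none of the rules applied. As with any threshold rule, the result is more meaningful when applied to averages over a sustained period than to individual 5-second samples.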
Monitoring Best Practices
- Monitor trends, not snapshots: CPU metrics naturally fluctuate. Focus on sustained patterns rather than momentary spikes.
- Context matters: A 90% CPU spike during a scheduled backup is normal; the same spike during low traffic hours warrants investigation.
- Correlate with other metrics: High CPU often correlates with increased memory usage, network activity, or disk I/O. Server Scout's dashboard helps identify these relationships.
- Set appropriate thresholds: Generic alerts like "CPU >80%" often generate false positives. Establish baselines specific to your workload patterns.
- Consider load averages: CPU percentage shows current utilisation; load averages (collected every 5 minutes by Server Scout) show sustained demand over time.
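Load averages are easier to compare across servers when divided by the number of logical cores. The sketch below uses Python's standard `os.getloadavg()` and `os.cpu_count()`; the helper names are hypothetical, and the idea that a per-core value near or above 1.0 suggests saturation is a common rule of thumb rather than a Server Scout threshold.

```python
import os


def load_per_core(load_avg, cores):
    """Normalise a load average by the number of logical cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / cores


def current_load_per_core():
    """Return the 1, 5, and 15 minute load averages divided by the core count."""
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return tuple(load_per_core(value, cores) for value in os.getloadavg())
```

For example, a 5-minute load average of 8.0 on a 4-core server gives 2.0 per core, meaning roughly twice as much work was queued as the cores could run at once.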
Common Troubleshooting Scenarios
Scenario 1: High overall CPU but low user time
- Likely cause: I/O bottleneck causing high system and iowait
- Investigation: Check disk I/O metrics and storage performance
Scenario 2: Intermittent performance issues with CPU steal
- Likely cause: Noisy neighbour or overcommitted virtualisation
- Investigation: Monitor steal patterns; contact hosting provider if sustained
Scenario 3: High CPU with normal application load
- Likely cause: Inefficient code, memory pressure, or resource contention
- Investigation: Profile applications, check memory metrics, review recent changes
Understanding these CPU metrics enables proactive performance management and faster problem resolution. Server Scout's 5-second collection interval ensures you capture the full picture of your server's CPU behaviour, from brief spikes to sustained utilisation patterns.