CPU Metrics Explained

Understanding CPU Metrics in Linux Server Monitoring

CPU metrics are fundamental to understanding server performance, yet they're often misinterpreted. Server Scout collects 11 CPU metrics, most of them every 5 seconds, showing in detail how your processor time is being consumed. This article explains each metric and how to use it for performance analysis.

How Server Scout Collects CPU Data

Server Scout's agent reads CPU statistics from /proc/stat, which exposes cumulative counters for each CPU time category, measured in clock ticks (often called "jiffies"). Every 5 seconds, the agent calculates the delta between consecutive readings and converts each category's delta into a percentage of the total, giving you near-real-time insight into CPU utilisation patterns.

This fast-tier collection (every 5 seconds) ensures you capture short-lived CPU spikes that slower monitoring intervals might miss entirely.
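The delta-and-percentage step can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not Server Scout's agent code: the function names are hypothetical, and the two snapshots use made-up counter values rather than real readings.

```python
# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the cumulative tick counters from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percentages(prev, curr):
    """Convert two counter snapshots into per-category percentages."""
    deltas = {name: curr[name] - prev[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        # No ticks elapsed between the snapshots; report zeros.
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def read_proc_stat(path="/proc/stat"):
    """Read the live counters (Linux only)."""
    with open(path) as f:
        return parse_cpu_line(f.read())


# Two hypothetical snapshots taken 5 seconds apart (values are made up).
before = parse_cpu_line("cpu  1000 0 500 8000 300 50 150 0 0 0")
after = parse_cpu_line("cpu  1200 0 550 8200 330 55 165 0 0 0")
pct = cpu_percentages(before, after)
# 500 ticks elapsed in total: user = 200/500 = 40%, idle = 200/500 = 40%,
# system = 10%, iowait = 6%, softirq = 3%, irq = 1%.
```

On a live system, calling read_proc_stat() twice with a 5-second sleep in between would replace the two hard-coded snapshots.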

The CPU Time Breakdown

Linux divides CPU time into several categories. Understanding these categories is crucial for diagnosing performance issues effectively.

Metric	Description	Typical Range	What High Values Indicate
`cpu_percent`	Overall CPU utilisation (100% minus idle time)	Varies by workload	Sustained >85% warrants investigation
`cpu_user`	Time spent executing user-space application code	20-60% under load	Applications are compute-bound
`cpu_system`	Time spent in kernel space (system calls, drivers)	5-20% typically	I/O bottlenecks or excessive system calls
`cpu_iowait`	Time CPU was idle while I/O operations were outstanding	<5% ideally	Storage I/O bottlenecks (local disk or network-backed storage)
`cpu_steal`	Time stolen by hypervisor for other VMs	<2% on good hosts	Overcommitted virtualisation host
`cpu_nice`	Time running low-priority (niced) processes	Usually 0%	Background jobs or batch processing
`cpu_irq`	Time handling hardware interrupts	<2% normally	Busy network cards or storage controllers
`cpu_softirq`	Time handling software interrupts	<5% normally	High network throughput or timer activity

Core CPU Metrics Explained

cpu_percent - Overall Utilisation

This is your primary CPU health indicator, calculated as 100% minus idle time. It represents the total percentage of time your CPU cores are actively working on something.

Normal behaviour: Highly variable depending on workload. A web server might average 20-40% with spikes to 80%, while a batch processing server might sustain 90%+ during jobs.

Investigation threshold: Sustained utilisation above 85% typically indicates either insufficient CPU capacity or inefficient application behaviour.
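To make "sustained" concrete, one possible check is to require a run of consecutive samples above the threshold. The sketch below is an assumption for illustration: the function name is hypothetical, and the default of 12 consecutive samples (one minute at a 5-second interval) is an arbitrary choice, not a Server Scout setting.

```python
def is_sustained(samples, threshold=85.0, min_consecutive=12):
    """Return True if `samples` contains a run of at least
    `min_consecutive` values strictly above `threshold`."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# A brief spike (three samples, 15 seconds) does not count as sustained...
spike = [30.0] * 10 + [100.0] * 3 + [30.0] * 10
# ...but a full minute above 85% does.
busy = [40.0] * 5 + [92.0] * 12
```

Here is_sustained(spike) is False and is_sustained(busy) is True, so only the second series would warrant investigation under this rule.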

cpu_user - Application Processing Time

User CPU time represents the percentage spent executing your applications' code - web servers, databases, calculation tasks, and user programs.

High user CPU typically indicates:

Applications performing intensive calculations
Efficient, compute-bound workloads
Well-optimised applications that don't require excessive system calls

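This is generally "good" CPU usage - your applications are doing productive work. The caveat is that a busy loop or an inefficient algorithm also shows up as user time, so unexpectedly high user CPU under normal load is still worth profiling.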

cpu_system - Kernel Processing Time

System CPU time represents the percentage spent in kernel space, handling system calls, device drivers, and core operating system functions.

Sustained system CPU above 20% may indicate:

Inefficient I/O patterns (many small reads/writes instead of larger operations)
Applications making excessive system calls
Driver issues or hardware problems
Memory pressure causing frequent page management

cpu_iowait - The Most Misunderstood Metric

Critical misconception: Many administrators read iowait as CPU time consumed by I/O, as though the processor were busy doing the waiting. This is incorrect.

Reality: iowait is time during which the CPU was idle (not busy) while the system had at least one outstanding I/O operation. It's a subset of idle time, not busy time.

Think of it this way: if your CPU has nothing to do but your application is waiting for disk reads, that idle time gets classified as iowait instead of regular idle.

High iowait (>10% sustained) indicates:

Disk I/O bottlenecks (slow storage, insufficient IOPS)
Slow network-backed storage (for example NFS)
Applications blocked on file operations
Storage subsystem cannot keep up with demand

Key insight: A server showing 50% iowait is I/O bound, not CPU bound. The CPU is sitting idle while it waits for data, so the solution is faster storage, not more CPU cores.
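A small worked example makes the arithmetic explicit. The breakdown below is hypothetical, and the helper name is an assumption for illustration; it simply treats both idle and iowait as time the CPU was not executing code.

```python
def executing_share(breakdown):
    """Percentage of time the CPU was actually running code,
    treating both idle and iowait as time spent not executing."""
    not_running = breakdown.get("idle", 0.0) + breakdown.get("iowait", 0.0)
    return 100.0 - not_running


# Hypothetical sample: half of all CPU time is classified as iowait.
sample = {"user": 20.0, "system": 10.0, "iowait": 50.0, "idle": 20.0}
share = executing_share(sample)  # 100 - (20 + 50) = 30.0
```

With only 30% of CPU time spent executing, adding cores would just add more idle capacity; the 50% iowait points at the storage path instead.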

cpu_steal - Virtualisation Overhead

CPU steal only applies to virtualised environments (cloud VMs, VPS hosting). It represents CPU time that your VM should have received but was "stolen" by the hypervisor to service other VMs on the same physical host.

Sustained steal >5% indicates:

The physical host is overcommitted
"Noisy neighbour" VMs consuming excessive resources
Your VM is being starved of CPU cycles

Important: This isn't something you can fix from within your VM. It requires action from your hosting provider (migrating your VM to a less contended host) or a move to a dedicated or less heavily shared hosting tier.

cpu_nice - Low Priority Processing

Nice CPU represents user-space time spent running processes whose priority has been lowered (a positive nice value, typically set with the nice command). The scheduler gives these processes less CPU time whenever higher-priority tasks are competing for it.

Common sources:

Backup compression jobs
Batch processing tasks
Background maintenance scripts
Processes started with nice -n 10 command

High nice CPU is generally benign - these are background tasks designed not to interfere with interactive performance.

cpu_irq - Hardware Interrupt Handling

Hardware interrupts occur when devices need immediate CPU attention - network cards receiving packets, storage controllers completing operations, timers firing.

Typically <2% on most systems. Higher values may indicate:

Very busy network interfaces
High-throughput storage operations
Hardware issues causing excessive interrupts
Poorly configured interrupt balancing

cpu_softirq - Software Interrupt Processing

Software interrupts handle deferred work from hardware interrupts, particularly network packet processing and timer management.

Typically <5% on most systems. Spikes often correlate with:

High network throughput (packet processing)
Many concurrent connections
Timer-heavy applications
Network-intensive workloads

Additional CPU Context Metrics

Metric	Collection Frequency	Purpose
`cpu_temp`	Every 5 seconds	Processor temperature in Celsius
`cpu_cores`	Daily	Number of logical CPU cores
`cpu_model`	Daily	Processor model identification

cpu_temp - Temperature Monitoring

Processor temperature monitoring helps identify thermal throttling issues. Sustained values above 80°C may indicate:

Inadequate cooling
Dust accumulation in heat sinks
Failing fans
Thermal throttling reducing performance

Note: Many cloud VMs don't expose temperature sensors, so this value may be null in virtualised environments.
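The null behaviour is easy to reproduce with a short reader for the kernel's thermal sysfs interface. This is a sketch under assumptions rather than the agent's implementation: the function name is hypothetical, it only looks at /sys/class/thermal (not hwmon), and it takes the first readable zone, which on some hardware is not the CPU package sensor.

```python
from pathlib import Path


def read_cpu_temp(zone_root="/sys/class/thermal"):
    """Return the first readable thermal zone temperature in Celsius,
    or None when no sensor is exposed (common on virtual machines)."""
    root = Path(zone_root)
    if not root.is_dir():
        return None
    for temp_file in sorted(root.glob("thermal_zone*/temp")):
        try:
            raw = temp_file.read_text().strip()
            # The kernel reports millidegrees Celsius, e.g. "45000".
            return int(raw) / 1000.0
        except (OSError, ValueError):
            continue  # unreadable or malformed zone; try the next one
    return None
```

Returning None rather than 0 keeps "no sensor" distinguishable from a genuine reading, which matches the null value described above.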

Reading CPU Charts and Diagnostic Patterns

Server Scout's dashboard displays CPU metrics as stacked area charts, where the components, together with idle time, sum to approximately 100% of total CPU time. Here are common patterns and their interpretations:

High User + Low System = Compute-Bound Applications

Pattern: 60-80% user, 5-15% system, low iowait
Interpretation: Applications efficiently using CPU for calculations
Action: Scale horizontally or upgrade to faster processors

High System + High iowait = I/O-Bound Workload

Pattern: 20-40% system, 10-30% iowait, moderate user
Interpretation: Applications bottlenecked by storage performance
Action: Optimise I/O patterns, upgrade storage, increase buffer sizes

High Steal = Overcommitted Host

Pattern: Variable steal >5%, inconsistent performance
Interpretation: Virtualisation host cannot provide consistent resources
Action: Contact hosting provider or migrate to dedicated resources

High Softirq = Network-Heavy Load

Pattern: 5-15% softirq, correlates with network throughput spikes
Interpretation: High network packet processing overhead
Action: Optimise network configuration, consider interrupt balancing
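The four patterns above can be expressed as a simple rule-based check. This is an illustrative sketch only: the function name is hypothetical, the cut-offs are taken from the lower bounds of the ranges quoted above, and "low iowait" is assumed to mean under 5%.

```python
def diagnose(breakdown):
    """Map a CPU breakdown (percentages keyed by metric name)
    to the diagnostic patterns it matches."""
    user = breakdown.get("user", 0.0)
    system = breakdown.get("system", 0.0)
    iowait = breakdown.get("iowait", 0.0)
    steal = breakdown.get("steal", 0.0)
    softirq = breakdown.get("softirq", 0.0)

    findings = []
    if user >= 60 and system <= 15 and iowait < 5:
        findings.append("compute-bound")
    if system >= 20 and iowait >= 10:
        findings.append("io-bound")
    if steal > 5:
        findings.append("overcommitted-host")
    if softirq >= 5:
        findings.append("network-heavy")
    return findings


# Hypothetical samples for each pattern.
compute = diagnose({"user": 70.0, "system": 10.0, "iowait": 1.0})
storage = diagnose({"user": 25.0, "system": 30.0, "iowait": 20.0})
noisy = diagnose({"user": 30.0, "system": 8.0, "steal": 9.0, "softirq": 7.0})
```

A sample can match more than one pattern, as the last example does (steal and softirq are both elevated), so the function returns a list rather than a single label.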

Monitoring Best Practices

Monitor trends, not snapshots: CPU metrics naturally fluctuate. Focus on sustained patterns rather than momentary spikes.

Context matters: A 90% CPU spike during a scheduled backup is normal; the same spike during low traffic hours warrants investigation.

Correlate with other metrics: High CPU often correlates with increased memory usage, network activity, or disk I/O. Server Scout's dashboard helps identify these relationships.

Set appropriate thresholds: Generic alerts like "CPU >80%" often generate false positives. Establish baselines specific to your workload patterns.
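One simple way to derive a workload-specific threshold is to take the historical mean plus a multiple of the standard deviation. The sketch below is an assumption for illustration, not a Server Scout feature: the function name, the multiplier of 3, and the 50% floor are all arbitrary choices you would tune for your own environment.

```python
import statistics


def baseline_threshold(history, k=3.0, floor=50.0, ceiling=100.0):
    """Suggest an alert threshold from historical cpu_percent samples:
    mean + k standard deviations, clamped to [floor, ceiling]."""
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)
    return min(ceiling, max(floor, mean + k * spread))


# A quiet web server: the raw result (~33.8%) is raised to the floor
# so that trivial fluctuations do not trigger alerts.
web_threshold = baseline_threshold([30.0, 32.0, 28.0, 30.0, 30.0])

# A batch server that normally runs hot keeps a high threshold.
batch_threshold = baseline_threshold([90.0] * 10)
```

The floor stops very quiet servers from alerting on noise, while the ceiling keeps the threshold reachable for servers that routinely run close to saturation.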

Consider load averages: CPU percentage shows current utilisation; load averages (collected every 5 minutes by Server Scout) show sustained demand over time.
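Load averages are easiest to interpret when normalised by core count. The following sketch uses Python's standard os.getloadavg() and os.cpu_count(); the helper names are hypothetical, and the rule of thumb that a per-core value above 1.0 means tasks are queuing is a common convention rather than something Server Scout defines. On Linux, load averages also count tasks in uninterruptible sleep (typically blocked on I/O), so a high value does not always mean CPU demand.

```python
import os


def load_per_core(load_avg, cores):
    """Normalise a load average by the number of logical cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


def current_load_per_core():
    """Return the 1-, 5- and 15-minute load averages per core
    for the machine this runs on (Unix-like systems only)."""
    cores = os.cpu_count() or 1
    return tuple(load_per_core(value, cores) for value in os.getloadavg())


# Hypothetical reading: a load average of 8.0 on a 4-core server
# means roughly two runnable (or I/O-blocked) tasks per core.
ratio = load_per_core(8.0, 4)  # 2.0
```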

Common Troubleshooting Scenarios

Scenario 1: High overall CPU but low user time

Likely cause: I/O bottleneck causing high system and iowait
Investigation: Check disk I/O metrics and storage performance

Scenario 2: Intermittent performance issues with CPU steal

Likely cause: Noisy neighbour or overcommitted virtualisation
Investigation: Monitor steal patterns; contact hosting provider if sustained

Scenario 3: High CPU with normal application load

Likely cause: Inefficient code, memory pressure, or resource contention
Investigation: Profile applications, check memory metrics, review recent changes

Understanding these CPU metrics enables proactive performance management and faster problem resolution. Server Scout's 5-second collection interval ensures you capture the full picture of your server's CPU behaviour, from brief spikes to sustained utilisation patterns.

Frequently Asked Questions

What is a healthy CPU usage percentage for a Linux server?

Healthy CPU usage depends on the server's role. Sustained cpu_percent above 85% warrants investigation for most workloads. Brief spikes to 100% during peak processing are normal. The key is whether the server can still respond to requests within acceptable timeframes. Monitor trends rather than reacting to momentary spikes.

What does high CPU iowait mean?

High cpu_iowait indicates the CPU was idle while I/O operations were still outstanding. A common misconception is that iowait is busy time; it is actually a subset of idle time. Sustained iowait above 10% typically points to a storage I/O bottleneck (local disk or network-backed storage) rather than a CPU problem. Investigate storage performance or consider faster disks.

What is CPU steal time and why does it matter?

CPU steal (cpu_steal) measures time the hypervisor took from your virtual machine to serve other VMs on the same physical host. Sustained steal above 5% means your host is overcommitted and your VM is being resource-starved. This is not fixable from inside the VM. Contact your hosting provider or migrate to a less contended host.

How do CPU breakdown percentages relate to each other?

The CPU breakdown percentages (user, system, iowait, steal, nice, irq, softirq, and idle) sum to approximately 100%. A stacked CPU chart shows how total CPU time is divided. High user + low system means applications are compute-bound. High system + high iowait indicates I/O-bound workloads. High softirq points to network-heavy processing.

Why is cpu_temp showing as null on my server?

CPU temperature (cpu_temp) is read from /sys/class/thermal or /sys/class/hwmon. Virtual machines typically lack virtual thermal sensors, so cpu_temp will be null on most cloud VMs and VPS instances. This is normal and expected. Temperature monitoring is most relevant on physical (bare-metal) servers where cooling issues can cause throttling.
